CWI: Centrum voor Wiskunde en Informatica
P.O. Box 94079
1090 GB Amsterdam
Network browsers such as Mosaic and Netscape have revolutionized on-line access to electronic documents. Using established protocols and the existing World Wide Web, these graphics-based interfaces have transformed the colorless activities of remote access and file transfer into (inter)national pastimes. In so doing, they have sparked an interest in pushing the types of data that are fetched and presented over the Internet far beyond the dreams of its creators in the late 1960's and early 1970's. At present, this `desire for more' manifests itself in the form of time-sensitive, high-volume data such as audio and video.
Unfortunately, the architecture of the Internet and the basic structure of HTML documents fetched via the http protocol are not suited to the delivery and presentation of general audio and video data. Where the current WWW provides access to documents containing multiple types of media, it does not support the delivery of true multimedia documents: documents containing one or more isochronous media streams. The problem of supporting multimedia documents goes beyond adding information bandwidth to the Internet or increasing the number of data formats supported by a browser; it has to do with the synchronization of information and the content-based scheduling of complex media objects.
In this position paper, a perspective is given on an appropriate architecture for a multimedia-based WWW, based on experience supporting the definition and delivery of networked multimedia in the CMIF project at CWI in Amsterdam.(1) Clearly, much work needs to be done before a general solution is in place for world-wide use, but the approach described can be a useful basis for structuring information and processing support.
In order to provide a basis for discussion, assume the following networked multimedia problem: we would like to play a three-minute video that is stored on a remote server in `real time' on our local presentation host. (By `real time,' we mean that the presentation should occur at the video-defined frame rate, with no extra inter-frame delay.) The video presentation contains the following components:
The entire video is `packaged' as part of a Web-like document; that is, the video is shown as if it were available on someone's web page. (We call this type of video embedded video: the video is not the central purpose of the presentation, but something that is included within some larger context.)
Solutions to the problem of supporting such a video depend on how one chooses to view the problem. Fig. 1 shows two such views: the composite data view and the discrete data view. In the composite data view, all of the component media types are encoded as a single media stream; presenting the video means accessing the video server that contains the video file, transporting that file across the network, and displaying it locally. This task is not trivial, but it is well understood: depending on the characteristics of the server, of the network and of the (local) presentation host, we can easily determine if the video can be presented by controlling a single data stream. (In effect, this approach is no different from that used to display an image inside an HTML document now, albeit a rather complex image!)
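The contrast between the two views can be sketched in a few lines of code. This is an illustrative model only; the field names, file names, and the `select` helper are assumptions, not CMIF structures.

```python
# Composite view: one opaque stream, fetched and played as a single unit.
composite = {"type": "mpeg-stream", "url": "video.mpg", "duration_s": 180}

# Discrete view: the same presentation as separately addressable streams.
discrete = [
    {"channel": "video",     "url": "scene.video", "duration_s": 180},
    {"channel": "music",     "url": "score.audio", "duration_s": 180},
    {"channel": "voice",     "url": "talk.audio",  "duration_s": 180},
    {"channel": "subtitles", "url": "text.sub",    "duration_s": 180},
]

def select(streams, wanted):
    """Tailor a discrete presentation by keeping only the wanted channels."""
    return [s for s in streams if s["channel"] in wanted]

# A viewer who wants subtitles but no voice-over:
tailored = select(discrete, {"video", "music", "subtitles"})
```

In the composite view the only operations available are fetch, play, and stop; in the discrete view each stream can be selected, replaced, or dropped independently.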
Figure 1. Two extremes in fetching complex multimedia objects.
The problem with the composite data view is that it does not offer the user many alternatives beyond those available with broadcast television: you can tune in a page, you can watch a `program', and you can turn the presentation off. The discrete data model provides other possibilities. If the video, music, voice-over and sub-title media streams all were separate objects, the presentation could be tailored to meet the needs of the viewer. For example:
Note that the selection of the desired alternatives for the cases above could be done as part of the access protocol or by specific user action. The authoring environment also has a critical role to play.
While discrete data presentations require complex data management, the benefits of the approach are considerable. Philosophically, the unbundling of presentations into component parts provides a clear added value over composite videos for the user and for the support environment. The user is given the potential for customized presentations, and the support environment is given a number of independent streams of information that can be separately managed by optimized presentation components. We consider the support issues in the following sections.
In the CMIF project at CWI, we have investigated discrete data models for supporting networked multimedia since 1991. The basic principles in this work have been:
1) multi-sourced data: we have assumed that pieces of a presentation can be located across a network infrastructure. At each site an `appropriate' storage server (ranging from a database system to a local file system) provides access to the data objects;
2) multi-stream presentation: a single presentation can be divided into a number of logical streams (which we call channels); each channel has a particular information service policy associated with it and a number of channel-wide attributes that dictate how information is presented on a particular class of platforms;
3) heterogeneous presentation platforms: we live in a heterogeneous world. On any given day, depending on where I am, I may use three or four different information presentation platforms. On each platform, I want to maximize the presentation of information rather than being restricted to a lowest-common-denominator machine. (This is why I use multiple computers!)
4) inter- and intra-stream synchronization support: it is clear that information in a single stream (or channel) may need to be synchronized, such as in an audio stream. It is also clear that two separate streams may need to be synchronized, such as in sub-titled audio data. This synchronization needs to be flexible so that it can be adapted to the capabilities of different platforms. (Note that this adaptive behavior usually needs to be defined by a presentation author rather than a generic browser.)
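The principles above can be made concrete with a small data-model sketch. The class and attribute names here are assumptions for illustration, not the actual CMIF representation.

```python
from dataclasses import dataclass, field

@dataclass
class Channel:
    name: str                    # logical stream, e.g. "video", "subtitles"
    media_kind: str              # class of data carried on this channel
    attributes: dict = field(default_factory=dict)  # channel-wide settings
    active: bool = True          # a platform may switch a channel off

@dataclass
class MediaItem:
    channel: str                 # channel the item is played on
    source: str                  # storage server holding the data object
    start_s: float               # position on the channel's timeline
    duration_s: float

# A presentation: a set of channels plus the items scheduled on them,
# possibly fetched from different servers (multi-sourced data).
channels = {
    "video": Channel("video", "video/mpeg"),
    "subtitles": Channel("subtitles", "text/plain",
                         attributes={"font": "helvetica", "size": 14}),
}
items = [
    MediaItem("video", "server-a:/scene1.mpg", 0.0, 60.0),
    MediaItem("subtitles", "server-b:/scene1.en.sub", 0.0, 60.0),
]

def playable(items, channels):
    """Items on active channels only; inactive channels are skipped."""
    return [i for i in items if channels[i.channel].active]
```

Because each channel carries its own attributes, a small-screen platform can deactivate or restyle a channel without touching the document itself.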
The CMIF document specification is split into three separate but closely communicating views: the hierarchy view, the channel view and the runtime (or player) view. The hierarchy view provides structure-based control of the presentation. The channel view provides a modified timeline specification, which describes timing information and logical resource usage on a per-stream basis. Control flow information is overlaid directly onto the channel view by using a series of synchronization constraint descriptors (called synchronization arcs). The runtime view is actually a series of presentation controls and data rendering operations; it allows user navigation using the link structure of the Amsterdam Hypermedia Model. Note that no explicit programming-based view is used within the CMIF approach.
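A synchronization arc can be thought of as a timing constraint between an event on one channel and an event on another. The representation below is a minimal sketch under that assumption; the names and the slack-interval encoding are illustrative, not the CMIF syntax.

```python
from dataclasses import dataclass

@dataclass
class SyncArc:
    src: tuple           # (channel, event_time_s) the arc starts from
    dst: tuple           # (channel, event_time_s) the arc constrains
    min_delay_s: float   # smallest allowed delay between the two events
    max_delay_s: float   # largest allowed delay (the permitted slack)

def satisfied(arc, actual_src_s, actual_dst_s):
    """Check whether observed event times respect the arc's slack interval."""
    delay = actual_dst_s - actual_src_s
    return arc.min_delay_s <= delay <= arc.max_delay_s

# A sub-title must appear between 0 and 0.5 s after the phrase it translates:
arc = SyncArc(("voice", 12.0), ("subtitles", 12.0), 0.0, 0.5)
```

Expressing the constraint with an interval rather than a fixed offset is what lets a player adapt the same document to platforms with different timing capabilities.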
A presentation is composed by defining the structure of the presentation, and assigning the appropriate media items to the structure. Media items are created using external editor(s), available directly from within the authoring environment. The purpose of the authoring phase is to build a presentation specification that defines the alternatives available to the user (in terms of content choices) and the runtime environment (in terms of performance support). By having multiple views of a document, the runtime environment can schedule current and future media events while the document is playing. This means that information can be prefetched at the rate required at runtime (based on network and local system performance).
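Because the specification exposes future media events, a player can compute when each transfer must begin so that the data arrives before its scheduled start. The sketch below illustrates this; the bandwidth figure and item sizes are invented for the example.

```python
def prefetch_times(items, bandwidth_bytes_s):
    """items: list of (name, start_s, size_bytes) in presentation order.
    Returns (name, fetch_start_s): the latest moment each transfer may begin
    so the item is fully available at its scheduled start time."""
    schedule = []
    for name, start_s, size in items:
        transfer_s = size / bandwidth_bytes_s
        schedule.append((name, start_s - transfer_s))
    return schedule

items = [("scene1.mpg", 0.0, 1_000_000), ("scene2.mpg", 60.0, 2_000_000)]
plan = prefetch_times(items, bandwidth_bytes_s=100_000)
# A negative fetch time means the item must be transferred before playback
# starts; scene2 takes 20 s to transfer, so fetching must begin by t = 40 s.
```

A real scheduler would also account for competing streams and variable network performance, but the principle is the same: the document specification, not the data stream, drives the fetch schedule.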
The use of channels is illustrated in Fig. 2. Here we see a screen with multiple independent channels. Not all channels need to be active at the same time, but all possible data streams are defined by the document's author. The key aspect of this approach is that a higher-level document describes the alternatives available within the document, which allows a local scheduler to use a system-dependent policy to present the information as efficiently as possible.
Figure 2. The use of channels for the video page.
The relevant aspects of our CMIF work for networked multimedia can be summarized as follows:
By cracking documents open, the potential exists for more reasonable forms of user navigation and selection, and a real alternative to the analog video player is defined. While there will always be reasons to produce special-purpose single stream videos, this special case should not be used as the architecture of the general case!
A final thought: a complicating factor in electronic presentations is the presence of hyperlinks in the documents. Having a separate specification doesn't remove this problem, but it does allow informed choices to be made before a particular link is taken. If 90% of the readers follow a particular link pattern, then the scheduler can use this to decide which data objects to fetch and pre-load. This does not mean that all jumps will be smooth, but it does mean that the delay inherent in a hypermedia system can be managed.
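One way such a scheduler might act on link-following statistics is to spend a limited prefetch budget on the most probable destinations first. The greedy policy, probabilities, and sizes below are illustrative assumptions, not a measured workload.

```python
def choose_preloads(links, budget_bytes):
    """links: list of (dest, follow_probability, size_bytes).
    Greedily preload the most probable destinations that fit the budget."""
    chosen = []
    for dest, prob, size in sorted(links, key=lambda l: -l[1]):
        if size <= budget_bytes:
            chosen.append(dest)
            budget_bytes -= size
    return chosen

links = [("intro", 0.9, 400_000), ("credits", 0.05, 100_000),
         ("outtakes", 0.4, 700_000)]
# With a 1 MB budget, "intro" is taken first (p=0.9, 400 KB); "outtakes"
# (700 KB) no longer fits, so the small "credits" page fills the remainder.
preloads = choose_preloads(links, budget_bytes=1_000_000)
```

Jumps to destinations outside the chosen set still incur the full fetch delay, which is why managing, rather than eliminating, hypermedia latency is the realistic goal.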
Information on CWI's network-based multimedia research can be found at our anonymous FTP site: ftp.cwi.nl:pub/mmpapers.
Our work forms the basis for the ESPRIT-IV project CHAMELEON (project 20597). Information on this three-year research effort on multimedia authoring can be found at: