Towards Synchronized Multimedia on the Web

Philipp Hoschka

1. Introduction

Web technology is limited today when it comes to creating continuous multimedia presentations. For these applications, content authors need to express things like "five minutes into the presentation, show image X and keep it on the screen for ten seconds". More generally speaking, there must be a way to describe the synchronization between the different media (audio, video, text, images) that make up a continuous multimedia presentation.

There is an imminent danger that a plethora of non-interoperable solutions for integrating real-time multimedia content into the Web architecture will emerge. These different solutions will most likely not result from a healthy competition advancing technological progress. Instead, they will result from a simple lack of communication between the three very different communities involved, namely the Web community, the CD-ROM community and the community working on Internet-based audio/video-on-demand.

Representatives from each community participated in the recent W3C workshop on "Real-Time Multimedia and the Web". In the feedback we received after this event, members of all communities, including several key players, reported that they see W3C as a promising forum for exchanging ideas and for finding consensus on common solutions for integrating synchronized multimedia presentations into the Web.

A synergy of their orthogonal expertise holds the promise that a single, sound technical solution can be found for many of the issues of real-time multimedia content on the Web. Such agreements are the necessary signal for independent content providers to start creating synchronized multimedia content for the Web, and, thus, for market growth in this area.

2. W3C Workshop "Real-Time Multimedia and the Web"

Many of the observations in this paper are based on the results of the W3C workshop on "Real-Time Multimedia and the Web". This workshop took place at the W3C site at INRIA Sophia-Antipolis on October 24 and 25, 1996. All position papers and detailed minutes are available on the Web via the URL http://www.w3.org/pub/WWW/AudioVideo/RTMW96.html. The workshop was transmitted live on the MBone using IP multicast.

About 70 participants registered, among them the project manager of Macromedia Director and Shockwave (Jonathan Grayson, Macromedia), the chair of the IETF working group on conference control (Mark Handley, ISI), a chief architect of Apple Quicktime (Peter Hoddie, Apple), the chair of the MHEG-5 working group (Klaus Hofmeister, GMD), a co-author of RTSP (Rob Lanphier, Progressive Networks), one of the editors of the RTP standard (Henning Schulzrinne, Columbia University), the former chair of the MPEG Systems group (Jan van der Meer, Philips), the chair and the deputy director of W3C (Vincent Quint, INRIA). 25% of the participants came from the US; 53% came from industry and 47% from research organizations.

The first day of this two-day event dealt with multimedia formats. Morning presentations both from industry (Alcatel, Macromedia, Apple) and from research organizations (GMD, University of Massachusetts/Lowell, INRIA Grenoble) set the stage for the discussions in the breakout sessions in the afternoon. The day ended with a session on "wild ideas and strong opinions" featuring talks on integrating games into the Web (University of Oslo), a caching infrastructure (Oracle) and a call for action to browser developers for better integration of live media (University of Ulm).

The hot topic of the second day was audio and video transmission on the Internet. It included a presentation of RTSP (Real-Time Streaming Protocol) by one of the co-authors of this standard. RTSP has been proposed by Netscape and Progressive Networks as a standard protocol for accessing real-time multimedia sources on the Internet. The first set of presentations discussed the feasibility of A/V transmission on the Internet (University of Lancaster, Columbia University, Nemesys). This was followed by a session in which interactive television experts (Philips) discussed with Internet experts (Lulea University, UCL) whether digital television services could be realized using Web technology. The second day finished with a discussion on the future directions of W3C work in the area of real-time multimedia on the Web.

2. Synchronized Multimedia Applications

Many industry analysts predict that the Web will be turned into a distribution system for content that is both interactive and continuous, that is, synchronized multimedia content. Typical examples of synchronized multimedia are presentations created with tools like Microsoft's Powerpoint or Macromedia's Authorware. Other examples can be found on CD-ROM products like training courses, lexica, "virtual" art galleries or CDs of pop groups enhanced by text, images and video. For instance, in a "guided tour", the screen shows a sequence of different images, text and graphs, which are explained in an audio stream that is played in parallel.

Actually, many of today's television programs are multimedia presentations, and could be produced by using techniques known from multimedia CD-ROMs. Consider a television news broadcast: many parts of the television screen contain text, images, graphs or other static elements. These static elements could be sent as separate files, together with a schedule that determines at which point in time the content of these files should be displayed on the screen. This may require much less bandwidth than sending the same content as full motion video.

Interestingly enough, the graphic design of some television programs already employs elements of CD-ROM and computer user interfaces - windows pop up in a news broadcast or program items are presented in menu form. The much-heralded integration of television and computers thus seems to be already on its way, at least on the level of graphic design of television programs. Of course, television programs lack the interactivity known from Web and CD-ROM content. For example, a user cannot simply interrupt the transmission of a news program by clicking on a photograph to retrieve more information about a person.

Providers of synchronized multimedia content are interested in using the Web for several reasons. First, the Web greatly facilitates updating content. Therefore, more up-to-date information can be offered via the Web than on a CD-ROM. Moreover, distribution of content via the Web is cheaper than distributing CD-ROMs. Finally, consumers who use CD-ROMs do not need to buy a new end-user appliance for accessing the same content via the Internet. Both technologies require the same end-user appliance, namely a personal computer. With the forthcoming high-speed Internet links, it seems likely that in the near future Web technology will assimilate the content that is distributed on CD-ROMs today.

3. Requirements

A comparison between CD-ROM products and typical Web sites reveals that there are already many similarities. First of all, image and text media types are very important for both Web and CD-ROM content. Moreover, the user interfaces are also very similar. On CD-ROMs, the user navigates through the multimedia content using a "point-and-click" interface. On the Web, this sort of interface is achieved by using hyperlinks (URLs), possibly associated with image maps.

However, Web technology is clearly limited today when it comes to creating continuous multimedia presentations. For this, two new components need to be added to Web technology: a format for authoring synchronized multimedia content, and support for network transmission of this type of content.

Authors of continuous multimedia presentations need to express things like "five minutes into the presentation, show image X at the lower right corner of the screen. For the next ten seconds, move it to the upper right corner of the screen. While doing this, play the last ten seconds of Vivaldi's 'Spring time'". More generally speaking, the format must make it possible to describe the positioning and the synchronization of the different media (text, images, audio, video) that make up a continuous multimedia presentation.
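The authoring example above can be written down as data rather than as a program. The following is only a sketch of what such a description might contain, not a proposed format: the class names, the field names, the file names and the assumed 200-second length of the audio file are all hypothetical. Positions are fractions of the screen size, with (0, 0) at the upper left corner.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Point = Tuple[float, float]  # (x, y) as fractions of the screen; (0, 0) is upper left


@dataclass(frozen=True)
class ImageCue:
    """Hypothetical description of an image that is shown and moved."""
    src: str
    begin: float      # seconds from the start of the presentation
    dur: float        # seconds the image stays on screen (must be > 0)
    start_pos: Point
    end_pos: Point

    def position_at(self, t: float) -> Optional[Point]:
        """Return the image position at presentation time t, or None if hidden."""
        if not (self.begin <= t <= self.begin + self.dur):
            return None
        f = (t - self.begin) / self.dur  # fraction of the movement completed
        (x0, y0), (x1, y1) = self.start_pos, self.end_pos
        return (x0 + f * (x1 - x0), y0 + f * (y1 - y0))


@dataclass(frozen=True)
class AudioCue:
    """Hypothetical description of an audio clip played from an offset."""
    src: str
    begin: float       # seconds from the start of the presentation
    clip_begin: float  # offset into the audio file, in seconds
    dur: float         # seconds of audio to play

    def media_offset_at(self, t: float) -> Optional[float]:
        """Return the offset into the audio file at time t, or None if silent."""
        if not (self.begin <= t <= self.begin + self.dur):
            return None
        return self.clip_begin + (t - self.begin)


# "Five minutes in, show image X at the lower right corner and move it to the
# upper right corner over ten seconds, while playing the last ten seconds of
# the audio file." The audio file is ASSUMED to be 200 seconds long.
IMAGE = ImageCue("x.gif", begin=300.0, dur=10.0,
                 start_pos=(1.0, 1.0), end_pos=(1.0, 0.0))
AUDIO = AudioCue("spring.au", begin=300.0, clip_begin=190.0, dur=10.0)

if __name__ == "__main__":
    for t in (299.0, 300.0, 305.0, 310.0):
        print(t, IMAGE.position_at(t), AUDIO.media_offset_at(t))
```

Because both cues share the same begin time and duration, a player or an authoring tool can see that they run in parallel without executing any author-written code.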

Transmitting synchronized multimedia content over the Internet requires meeting the real-time constraints of the presentation, i.e. a particular piece of a file has to be there when it is needed in the presentation. There are two solutions for addressing this issue: prefetching of files and using a streaming protocol. The current transport protocol used by the Web (HTTP) is usable for prefetching files, but not as a streaming protocol.
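For the prefetching case, the real-time constraint can be made concrete with a little arithmetic. The sketch below assumes a simplified model that is not taken from any existing product: files are fetched one after another, in the order in which they are needed, over a link with a fixed bandwidth. The function names, file names, sizes and the bandwidth of 4000 bytes per second are hypothetical.

```python
def latest_fetch_starts(files, bandwidth):
    """Compute the latest time at which each download may start.

    files: list of (name, size_in_bytes, needed_at_seconds)
    bandwidth: link bandwidth in bytes per second (assumed constant)
    Returns a list of (name, latest_start_seconds) ordered by deadline.
    """
    ordered = sorted(files, key=lambda f: f[2])
    starts = []
    next_start = float("inf")
    # Work backwards: each download must finish before its own deadline
    # and before the following download has to start.
    for name, size, needed_at in reversed(ordered):
        finish = min(needed_at, next_start)
        start = finish - size / bandwidth
        starts.append((name, start))
        next_start = start
    starts.reverse()
    return starts


def startup_delay(files, bandwidth):
    """Seconds of prefetching needed before the presentation can begin."""
    starts = latest_fetch_starts(files, bandwidth)
    if not starts:
        return 0.0
    return max(0.0, -starts[0][1])


# Hypothetical presentation: three images needed at 0, 30 and 40 seconds.
FILES = [
    ("intro.gif", 20000, 0.0),
    ("chart.gif", 40000, 30.0),
    ("photo.jpg", 80000, 40.0),
]

if __name__ == "__main__":
    print(latest_fetch_starts(FILES, bandwidth=4000))
    print(startup_delay(FILES, bandwidth=4000))
```

A negative start time for the first file means that the download has to begin before the presentation does. In this model, that amount is the minimum startup delay the user has to wait.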

Streaming protocols are particularly important for audio and video files. Popular applications for Internet-based audio/video-on-demand such as RealAudio (http://www.realaudio.com) and VDOLive (http://www.vdo.com) rely on streaming protocols to cope with the real-time constraints of audio and video replay. Standard protocols for streaming applications are developed within the IETF (RTP, RTSP).

4. The Case for a Web-Based Solution

One possible solution for enabling synchronized multimedia presentations on the Web is to use a format that restricts the overlap with existing Web technology to the use of URL-based addressing. This approach has been followed by many products currently on the market. Examples are Macromedia's Shockwave (http://www.macromedia.com), Apple's Quicktime plugin (http://www.apple.quicktime.com), Microsoft's NetShow (http://www.microsoft.com/netshow) and Pointcast (http://www.pointcast.com/) when operating in "screen saver" mode.

From the point of view of a content provider, this approach has the advantage that existing authoring tools for some of these formats can be used for creating synchronized multimedia content for the Web. However, it also has a number of drawbacks.

First, the potential audience will have a harder time finding the content, since the text part of the content is in a format unknown to Web search engines, and thus cannot be indexed.

Moreover, the Web allows the creation of new content by reusing existing raw material, which can be stored on different servers all over the Internet. Two of the most successful Web applications - search engines and Web directories - only work because today's Web content is reusable. In contrast, content encoded in CD-ROM formats is generally not reusable, since the image, text and other files used in CD-ROM content are often wrapped into a single file when put on the Web.

Internet-based audio/video-on-demand applications are currently being extended to "stream" other media types such as text and images in addition to audio and video. Examples are Netscape's MediaServer (http://www.netscape.com/comprod/announce/media.html) and work at Progressive Networks on extensions of the RealAudio product (RealMedia Architecture, http://www.realaudio.com/prognet/rm/index.html). These applications require a format for synchronizing different media types contained in a presentation, and standard formats for basic data types such as text and images. Work has already started on designing application-specific formats.

The Web community is using open formats that can duplicate some of the functionality of CD-ROM formats today (image maps, HTML, URLs), and is working on formats that will bring the capabilities of the Web even closer to those of CD-ROM formats (layout control, stylesheets, fonts, scripting). Content providers tend to prefer open formats, since they can be relatively sure that their content will be "durable". If content providers start to produce synchronized multimedia content for the Web, the market for products to author, display and serve this type of Web content will also grow.

5. A Plan of Attack

The key to a solution is to make individual elements of an HTML page addressable. For example, each paragraph in a document could have an individual identifier. This makes it possible to express things like "ten seconds into the presentation, remove the second paragraph from the screen". It also makes it possible to control audio and video files, which can be included in the HTML file using the object element (http://www.w3.org/pub/WWW/MarkUp/Group/9612/WD-object-961216.html).
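To illustrate the idea, the sketch below assumes that paragraphs carry an "id" attribute and that a separate schedule refers to these identifiers. It uses the HTMLParser class from Python's standard library to collect the identifiers; the schedule layout and the "show"/"hide" action names are hypothetical and not part of any existing proposal.

```python
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collect the values of all id attributes in document order."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.ids.append(value)


PAGE = """<html><body>
<p id="p1">First paragraph.</p>
<p id="p2">Second paragraph.</p>
</body></html>"""

# Hypothetical schedule: (time in seconds, action, element identifier).
# "Ten seconds into the presentation, remove the second paragraph."
SCHEDULE = [(10.0, "hide", "p2")]


def visible_ids(page, schedule, t):
    """Return the identifiers of the elements visible at presentation time t."""
    collector = IdCollector()
    collector.feed(page)
    # Reject schedule entries that address an element the page does not have.
    for _, _, target in schedule:
        if target not in collector.ids:
            raise ValueError("unknown element id: " + target)
    visible = list(collector.ids)  # all addressable elements start out visible
    for when, action, target in sorted(schedule):
        if when > t:
            break
        if action == "hide" and target in visible:
            visible.remove(target)
        elif action == "show" and target not in visible:
            visible.append(target)
    return visible


if __name__ == "__main__":
    print(visible_ids(PAGE, SCHEDULE, 5.0))
    print(visible_ids(PAGE, SCHEDULE, 10.0))
```

The check for unknown identifiers shows one benefit of addressable elements: a tool can verify that a schedule matches a page before the presentation is ever played.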

Approaches to address HTML elements in order to use them in synchronized multimedia presentations have already been developed (see e.g. Microsoft's proposal for an object model for HTML (http://www.w3.org/pub/WWW/MarkUp/Group/webpageapi/) or the animation based on the "layer" tag in Netscape 4.0 (http://home.netscape.com/comprod/products/communicator/index.html)).

However, in contrast to the other Web formats, which are declarative (HTML, CSS), media synchronization on the Web can currently only be expressed by using a scripting language such as JavaScript or Visual Basic.

Declarative formats have a set of well-known advantages over scripting languages. For instance, script-based content is often hard to produce and maintain. Having to write a program in order to, say, express the synchronization of an audio track with a video track seems overly complicated. Moreover, it is easier to build automated tools for a declarative format than for a scripting language.
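The point about automated tools can be illustrated with a small sketch. It assumes a hypothetical declarative description built from three kinds of nodes: "media" (a file with a known duration), "seq" (children played one after another) and "par" (children played in parallel). These node names and the file names are assumptions made for the example only. A tool can compute the length of the presentation, or list the files it needs, by inspecting the description, which is not possible in general for an arbitrary script.

```python
def duration(node):
    """Total playing time of a presentation description, in seconds."""
    kind = node[0]
    if kind == "media":          # ("media", source, duration_in_seconds)
        return node[2]
    children = node[1]           # ("seq", [children]) or ("par", [children])
    if kind == "seq":            # children play one after another
        return sum(duration(child) for child in children)
    if kind == "par":            # children play at the same time
        return max((duration(child) for child in children), default=0.0)
    raise ValueError("unknown node kind: " + str(kind))


def sources(node):
    """All media files referenced by the description, in document order."""
    if node[0] == "media":
        return [node[1]]
    found = []
    for child in node[1]:
        found.extend(sources(child))
    return found


# A title image for five seconds, followed by an audio track and a video
# track played in parallel.
TALK = ("seq", [
    ("media", "title.gif", 5.0),
    ("par", [
        ("media", "talk.au", 60.0),
        ("media", "talk.mpg", 58.0),
    ]),
])

if __name__ == "__main__":
    print(duration(TALK))
    print(sources(TALK))
```

The list of sources is also what a prefetching player or a format converter would need as input.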

Therefore, CD-ROM authoring tools use declarative formats such as Apple's Quicktime as an alternative approach to well-known scripting languages such as Macromedia's Lingo or Apple's Hypercard. Both declarative and scripting approaches have succeeded in the CD-ROM marketplace, and are often used side-by-side in a particular multimedia product. Some of the most important contributions in this field have been made by several W3C member organisations.

The ease of building converters can be very important for bringing synchronized multimedia presentations to the Web. A declarative approach makes it possible to build format converters that put existing content on the Web, similar to today's HTML converters for well-known word processors. In the other direction, a simple conversion from a declarative Web format into an existing format may make it possible to reuse existing players as display engines for Web-based multimedia presentations.

In summary, experience from both the CD-ROM community and from the Web community suggests that it would be beneficial to adopt a declarative format for expressing media synchronization on the Web as an alternative and complement to scripting languages.