Real Time Multimedia Activity, February 10, 1997

Briefing Package
For Working Group on Synchronized Multimedia

1. Executive Summary

Web technology is limited today when it comes to creating continuous multimedia presentations. For these applications, content authors need to express things like "five minutes into the presentation, show image X and keep it on the screen for ten seconds". More generally speaking, there must be a way to describe the synchronization between the different media (audio, video, text, images) that make up a continuous multimedia presentation.

There is an imminent danger that a plethora of non-interoperable solutions for integrating real-time multimedia content into the Web architecture will emerge. These different solutions will most likely not result from a healthy competition advancing technological progress. In contrast, they will result from a simple lack of communication between the three very different communities involved, namely the Web community, the CD-ROM community and the community working on Internet-based audio/video-on-demand.

Representatives from each community participated at the recent W3C workshop on "Real Time Multimedia and the Web". In the feedback we received after this event, members of all communities, including several key players, reported that they see W3C as a promising forum for exchanging ideas and for finding consensus on common solutions for integrating synchronized multimedia presentations into the Web.

A synergy of their orthogonal expertise holds the promise that a single, sound technical solution can be found for many of the issues of real-time multimedia content on the Web. Such agreements are the necessary signal for independent content providers to start creating synchronized multimedia content for the Web, and, thus, for market growth in this area.

The working group proposed in this briefing package should devise a format for describing synchronized multimedia presentations. This includes both specifying such a format and developing several interoperable implementations. Submission of a W3C Recommendation is planned for the end of 1997.

2. Background

Synchronized Multimedia Applications

Many industry analysts predict that the Web will be turned into a distribution system for content that is both interactive and continuous, i.e. synchronized multimedia content. A typical example of synchronized multimedia is a presentation created with tools like Microsoft's PowerPoint or Macromedia's Authorware. Other examples can be found on CD-ROM products such as training courses, encyclopedias, "virtual" art galleries or CDs of pop groups enhanced with text, images and video. In a "guided tour", for instance, the screen shows a sequence of different images, text and graphs, which are explained by an audio stream played in parallel.

Actually, many of today's television programs are multimedia presentations, and could be produced by using techniques known from multimedia CD-ROMs. Consider a television news broadcast: many parts of the television screen contain text, images, graphs or other static elements. These static elements could be sent as separate files, together with a schedule that determines at which point in time the content of these files should be displayed on the screen. This may require much less bandwidth than sending the same content as full motion video.

Interestingly enough, the graphic design of some television programs already employs elements of CD-ROM and computer user interfaces - windows pop up in a news broadcast or program items are presented in menu form. The much-heralded integration of television and computers thus seems to be already on its way, at least on the level of graphic design of television programs. Of course, television programs lack the interactivity known from Web and CD-ROM content. For example, a user cannot simply interrupt the transmission of a news program by clicking on a photograph to retrieve more information about a person.

Providers of synchronized multimedia content are interested in using the Web for several reasons. First, the Web greatly facilitates updating content. Therefore, more up-to-date information can be offered via the Web than on a CD-ROM. Moreover, distribution of content via the Web is cheaper than distributing CD-ROMs. Finally, consumers that use CD-ROMs do not need to buy a new end-user appliance for accessing the same content via the Internet. Both technologies require the same end-user appliance, namely a personal computer. With the forthcoming high-speed Internet links, it seems likely that in the near future Web technology will assimilate the content that is distributed on CD-ROMs today.

Requirements

A comparison between CD-ROM products and typical Web sites reveals that there are already many similarities. First of all, image and text media types are very important for both Web and CD-ROM content. Moreover, the user interfaces are also very similar. On CD-ROMs, the user navigates through the multimedia content using a "point-and-click" interface. On the Web, this sort of interface is achieved by using hyperlinks (URLs), possibly associated with image maps.

However, Web technology is clearly limited today when it comes to creating continuous multimedia presentations. For this, two new components need to be added to Web technology: a format for authoring synchronized multimedia content, and support for network transmission of this type of content.

Authors of continuous multimedia presentations need to express things like "five minutes into the presentation, show image X at the lower right corner of the screen. For the next ten seconds, move it to the upper right corner of the screen. While doing this, play the last ten seconds of Vivaldi's 'Spring'". More generally speaking, the format must make it possible to describe the positioning of, and the synchronization between, the different media (text, images, audio, video) that make up a continuous multimedia presentation.
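As a purely illustrative sketch of what such a declarative format might look like (no such format exists yet; all element and attribute names below are invented for this example), the requirement above could be written as:

```html
<!-- Hypothetical declarative syntax: "par", "move", "begin", "dur",
     "region" and "clip" are invented names, not part of any
     existing Web format. -->
<par begin="5min">
  <img src="imageX.gif" region="lower-right"/>
  <move target="imageX" to="upper-right" begin="0s" dur="10s"/>
  <audio src="vivaldi-spring.au" clip="last-10s"/>
</par>
```

A format along these lines would let an authoring tool generate and parse the schedule directly, without having to interpret program code.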

Transmitting synchronized multimedia content over the Internet requires meeting the real-time constraints of the presentation, i.e. a particular piece of a file must be available at the moment it is needed in the presentation. There are two solutions for addressing this issue: prefetching files and using a streaming protocol. The current transport protocol used by the Web (HTTP) is usable for prefetching files, but not as a streaming protocol.

Streaming protocols are particularly important for audio and video files. Popular applications for Internet-based audio/video-on-demand such as RealAudio (http://www.realaudio.com) and VDOLive (http://www.vdo.com) rely on streaming protocols to cope with the real-time constraints of audio and video replay. Standard protocols for streaming applications are being developed within the IETF (RTP, RTSP).

The Case for a Web-based Solution

One possible solution for enabling synchronized multimedia presentations on the Web is to use a format that restricts the overlap with existing Web technology to the use of URL-based addressing. This approach has been followed by many products currently on the market. Examples are Macromedia's Shockwave (http://www.macromedia.com), Apple's Quicktime plugin (http://www.apple.quicktime.com), Microsoft's NetShow (http://www.microsoft.com/netshow) and Pointcast (http://www.pointcast.com/) when operating in "screen saver" mode.

From the point of view of a content provider, this approach has the advantage that existing authoring tools for some of these formats can be used for creating synchronized multimedia content for the Web. However, it also has a number of drawbacks.

First, the potential audience will have a harder time finding the content, since its text portion is in a format unknown to Web search engines and thus cannot be indexed.

Moreover, the Web allows the creation of new content by reusing existing raw material, which can be stored on different servers all over the Internet. Two of the most successful Web applications - search engines and Web directories - only work because today's Web content is reusable. In contrast, content encoded in CD-ROM formats is generally not reusable, since the image, text and other files used in CD-ROM content are often wrapped into a single file when put on the Web.

Internet-based audio/video-on-demand applications are currently being extended to "stream" not only audio and video, but also other media types such as text and images. Examples are Netscape's MediaServer (http://www.netscape.com/comprod/announce/media.html) and work at Progressive Networks on extensions of the RealAudio product (RealMedia Architecture, http://www.realaudio.com/prognet/rm/index.html). These applications require a format for synchronizing the different media types contained in a presentation, and standard formats for basic data types such as text and images. Work has already started on designing application-specific formats.

The Web community is using open formats that can duplicate some of the functionality of CD-ROM formats today (image maps, HTML, URLs), and is working on formats that will bring the capabilities of the Web even closer to those of CD-ROM formats (layout control, stylesheets, fonts, scripting). Content providers tend to prefer open formats, since they can be relatively sure that their content will be "durable". If content providers start to produce synchronized multimedia content for the Web, the market for products to author, display and serve this type of Web content will also grow.

A Plan of Attack

The key to a solution is to make individual elements of an HTML page addressable. For example, each paragraph in a document could have an individual identifier. This makes it possible to express things like "ten seconds into the presentation, remove the second paragraph from the screen". It also allows control of audio and video files, which can be included in an HTML file using the object element (http://www.w3.org/pub/WWW/MarkUp/Group/9612/WD-object-961216.html). This has already been realized in part (see e.g. Microsoft's proposal for an object model for HTML (http://www.w3.org/pub/WWW/MarkUp/Group/webpageapi/) or the animation based on the "layer" tag in Netscape 4.0 (http://home.netscape.com/comprod/products/communicator/index.html)).
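A sketch of how such identifiers might look in an HTML page (the id values and file names are of course arbitrary; the object element follows the working draft cited above):

```html
<!-- Each element carries an identifier, so that a schedule or a
     script can address it individually. -->
<p id="par1">First paragraph of the presentation.</p>
<p id="par2">Second paragraph, to be removed from the screen
   ten seconds into the presentation.</p>
<object id="clip1" data="interview.mpeg" type="video/mpeg">
  Text shown if the video object cannot be rendered.
</object>
```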

However, in contrast to the rest of the Web formats, which are declarative (HTML, CSS), media synchronization on the Web can currently only be expressed by using a scripting language such as JavaScript or Visual Basic.

Declarative formats have a set of well-known advantages over scripting languages. For instance, script-based content is often hard to produce and maintain. Having to write a program in order to, say, express the synchronization of an audio track with a video track seems overly complicated. Moreover, it is easier to build automated tools for a declarative format than for a scripting language.
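To illustrate the contrast: even the simple instruction "ten seconds into the presentation, remove the second paragraph" must today be spelled out as timer-handling code. A sketch in JavaScript embedded in HTML (the exact object model for addressing HTML elements is still under discussion, so the addressing syntax below is only one possibility):

```html
<script type="text/javascript">
// Hide the element with identifier "par2" ten seconds (10000 ms)
// after this script runs. "document.all" is the addressing scheme
// from Microsoft's object model proposal; other object models use
// different names.
setTimeout(function () {
  document.all["par2"].style.visibility = "hidden";
}, 10000);
</script>
```

Every timing relationship in a presentation would need its own piece of code of this kind, which a declarative format could express in a single attribute.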

Accordingly, CD-ROM authoring tools offer declarative formats such as Apple's QuickTime as an alternative to well-known scripting languages such as Macromedia's Lingo or Apple's HyperCard. Both declarative and scripting approaches have succeeded in the CD-ROM marketplace, and are often used side by side in a single multimedia product. Some of the most important contributions in this field have been made by several W3C member organisations.

The ease of building converters can be very important for bringing synchronized multimedia presentations to the Web. A declarative approach makes it possible to build format converters that put existing content on the Web, similar to today's HTML converters for well-known word processors. In the other direction, a simple conversion from a declarative Web format into an existing format may make it possible to reuse existing players as display engines for Web-based multimedia presentations.

In summary, experience from both the CD-ROM community and from the Web community suggests that it would be beneficial to adopt a declarative format for expressing media synchronization on the Web as an alternative and complement to scripting languages.

3. Status of Activity within W3C

This is a new activity within W3C. However, it has interactions with many other W3C Activities, since W3C already works on open standards that can serve as powerful building blocks for real-time multimedia applications (HTML (image maps, object element), HTTP, URLs, CSS1 style sheets (layout control, font management), graphics).

4. Proposal: Working Group on Synchronized Multimedia

Scope of the Activity

Charter

The working group should develop a declarative format for describing synchronized multimedia presentations. The functionality of this format should be comparable to today's declarative CD-ROM formats. This includes both the specification of such a format, and the development of two client implementations and two server implementations which are interoperable. These four implementations must be developed independently, i.e. by four different organisations.

This working group effort should be developed and stabilized within nine months, since considerable prior work has been done in this area. This prior body of work includes: ActiveMovie, CMIF, existing Web technology (HTML, image maps, URLs, layout control, style sheets, fonts), HyTime, Madeus, MHEG-1, MHEG-5 and QuickTime.

The working group will focus on harmonizing, evolving and developing a format based on this work. The expiration date of the working group is end of December '97.

Relations with other groups

An overlap already exists between the functionality found in Web technology and the functionality found in synchronized multimedia presentations. Therefore, a reasonable working hypothesis for W3C work in this area is that synchronized multimedia will be integrated into the Web by using extensions and additions to the basic Web content technologies for producing content, such as HTML, Web image formats, image maps and URLs.

Other groups working on real-time multimedia make quite different assumptions. For instance, the Java community uses Java byte-code as the distribution format for real-time multimedia content, and Java code generators for producing this content. DAVIC (Digital Audio Video Council, http://www.davic.org), a standards body for interactive TV, uses MHEG-5 as its base format, and only a subset of HTML for implementing hypertext functionality.

The primary purpose of this working group is not to develop new network transmission protocols. Given that Web content is primarily transported over the Internet, it will be assumed that network transmission will mainly be achieved via protocols developed within the IETF, in particular RTP, RTSP and IP multicast, but also TCP. However, the study of application scenarios and the design of the format may reveal new protocol requirements. In this case, the working group will coordinate its work with the IETF. This is important for two reasons: first, the RTSP specification is still under development at the IETF, and it can be expected that applications using the stream synchronization format developed by this group will be prime users of RTSP. Second, errors like the "abuse" of IETF protocols in the first version of HTML/HTTP should be avoided.

Resource Statement

W3C Resource Commitment

This activity will consume 75% or about six person-months of the time of one W3C staff member for project management and coordination. This will be funded out of the current W3C budget.

Member Resource Commitment

The working group should comprise about fifteen engineers of key players in the CD-ROM, Web and Internet-Audio/Video-on-Demand communities. Participants should be people that actually implement the specification developed by this group. Participation will require 20% of the time of an engineer at minimum. The current working group schedule is for eight months. The overall resource consumption of this working group will thus be 2.75 person-years. If these conditions cannot be met, the Activity will be delayed, or will even have to be cancelled.

Projected Schedule

Window of Opportunity

There is an imminent danger that a plethora of non-interoperable solutions for integrating real-time multimedia content into the Web architecture will emerge. These different solutions will most likely not result from a healthy competition advancing technological progress. In contrast, they will result from a simple lack of communication between the three very different communities involved, namely the Web community, the CD-ROM community and the community working on Internet-based audio/video-on-demand.

A synergy of their orthogonal expertise holds the promise that a single, sound technical solution can be found for many of the issues of real-time multimedia content on the Web. Such agreements are the necessary signal for independent content providers to start creating real-time multimedia content for the Web, and, thus, for market growth in this area.

Since several W3C members are shipping or plan to ship products in the near future, this working group should come to a conclusion relatively quickly. In addition, considerable prior work has been done in this area. The working group effort should therefore be developed and stabilized within nine months.

Goals and Milestones

5. Annex

Existing Information

Intellectual Property

Any intellectual property brought into this Activity, as well as intellectual property created in the Activity must at least be available for licensing for a reasonable fee and in a non-discriminatory manner. More stringent requirements are at the discretion of the Director of W3C.

Acknowledgments

Many of the ideas contained in this document are based on presentations and discussions at the W3C workshop on "Real-time multimedia and the Web". Special thanks go to the workshop participants who replied to our request and gave us extensive written feedback on the future directions of W3C work:


Philipp Hoschka