This activity statement presents an analysis and summary of the results of a series of meetings organized by W3C in this area (a BOF at the WWW4 conference, Developer's Day at the WWW5 conference, and the workshop on "Real Time Multimedia and the Web"). It also proposes directions for future work. It is part of the W3C activity list.
Since the Web was invented, the target audience and the content of a typical Web site have changed dramatically. Originally, the Web was designed as a means for physicists to share the results of their research. For this purpose, simple text documents with hyperlinks were sufficient. Today, Web sites are typically designed for a large consumer audience. The Web is used for catching up on the latest news from Hollywood, or for selecting the next personal computer to buy.
This sort of content puts the Web into direct competition with traditional print media like journals or brochures. To present content on the Web in the same way as in sophisticated print media, technologies are needed that go beyond the simple hypertext of the original Web design. Many current W3C activities, such as graphics, fonts, stylesheets, layout and HTML, are working towards the goal of bringing the quality of Web publication to the level of today's print media. Many industry analysts predict that the content of next-generation Web sites will go far beyond simulating the features of print media. They expect that Web content will become similar to the content of today's multimedia CD-ROMs or even to today's television programs. In other words, they believe that the Web will be turned into a distribution system for interactive and continuous multimedia content, also referred to here as real-time multimedia content.
Typical CD-ROM products include educational courses, lexica, "virtual" art galleries, or CDs of pop groups which are enhanced by text, images and video. Many CD-ROMs contain sections with continuous multimedia. One example is a "guided tour". In such an application, the screen shows a sequence of different images, text and graphs, which are explained in an audio stream that is played in parallel. Other examples are videos with overlaid text (subtitles or scrolling text) or multiple videos shown in parallel on the screen.
In many respects, continuous multimedia presentations are similar to many television programs. Interestingly enough, the graphic design of some television programs already employs elements of CD-ROM and computer user interfaces - windows pop up in a news broadcast, graphics appear on the screen, or program items are presented in menu form. The much-heralded integration of television and computers thus seems to be already under way, at least on the level of the graphic design of television programs.
Moreover, many television programs are actually multimedia presentations, and could be implemented by using techniques known from multimedia CD-ROMs. Consider the example of a news broadcast: many parts of the television screen contain text, images, graphs or other static elements. The content of some of today's television programs could thus be transmitted by sending static information such as background images as separate files, together with a schedule that determines when the content of these files should be displayed on the screen. Obviously, this requires much less bandwidth than sending the same content as full-motion video. Of course, television programs lack the interactivity known from Web and CD-ROM content. For example, a user cannot simply interrupt a news transmission by clicking on a photograph to search for more information about a person being interviewed.
Making interactive and continuous multimedia content available on the Web requires a solution to the following problems:
Several different communities are currently working independently on the integration of real-time multimedia into the Internet, namely the Web community, the CD-ROM community and the community working on Internet-based audio/video-on-demand. Lacking both a forum for discussion and standardisation and a common reference implementation, there is a danger of a plethora of non-interoperable solutions. Competing solutions developed by the three communities will each reflect the strengths and weaknesses of the community that produced them. An example might be an excellent authoring tool that produces a format that is hard to transmit over the Internet. In such a situation, content providers will hesitate to produce content, and it will be difficult to reach a critical mass of real-time multimedia content on the Internet.
Therefore, cooperation and standardisation appear to be in the best interest of the different communities working on the integration of real-time multimedia into the Internet. They can combine their respective expertise, and come up with a single, coherent solution. W3C has members from all three communities. Many representatives of each community participated in the recent W3C workshop on "Real Time Multimedia and the Web". In the feedback we received after this event, members of all communities reported that they look to W3C as a promising forum for exchanging ideas and for finding consensus on common solutions for integrating real-time multimedia into the Web. To produce interactive and continuous multimedia content, two things need to be added to Web technology: support for creating real-time multimedia content, and support for network transmission of this type of content. In the following, we analyse the strengths and weaknesses of each of the communities in each of these areas, give conclusions, describe current work items and discuss ways in which they can be addressed.
A comparison between CD-ROM products and typical Web sites reveals many similarities. On CD-ROMs, the user navigates through the multimedia content using "point-and-click". On the Web, this sort of interface is achieved by using hyperlinks (URLs), possibly associated with image maps. Image and text media types are of equal importance for both Web and CD-ROM content. One point much criticised by CD-ROM designers is the limited layout control on the Web. However, formats for layout control on the Web are currently emerging (e.g. the "Layout" tag introduced by Netscape, or work done in the W3C Activity on Style Sheets).
Web technology is limited today when it comes to creating continuous multimedia presentations. For these applications, content authors need to express things like "five seconds into the presentation, show image X and keep it on the screen for ten seconds". More generally speaking, there must be a way to describe the synchronisation between the different media that make up a continuous multimedia presentation. On the Web, media synchronisation is currently expressed using a scripting language such as JavaScript or VisualBasic. However, scripting has a set of well-known disadvantages. Script-based content is often hard to produce and maintain. Moreover, it is hard to build search engines and other automated tools for scripting languages. To address these disadvantages, CD-ROM technology uses declarative formats such as Apple's QuickTime as an alternative to well-known scripting languages such as Macromedia's Lingo or Apple's HyperCard. In a declarative format, the "events" in a multimedia presentation are simply expressed using a timeline, instead of writing a script program. Both approaches have succeeded in the CD-ROM marketplace, and are often used side-by-side in a particular multimedia product. Thus, experience from the CD-ROM community suggests that it would be beneficial to adopt a declarative format for expressing media synchronisation on the Web as an alternative to scripting. After all, HTML is a declarative language, and so is the stylesheet language CSS. HTML content could equally well be generated by a scripting language; this is not happening, because doing so would have the same disadvantages as using scripting for expressing media synchronisation.
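The timeline idea can be sketched in a few lines of code. The sketch below is purely illustrative - the tuple-based "format" and the names `timeline` and `visible_at` are our own invention, not part of any existing declarative format - but it shows how a declarative timeline lets a player derive the state of the presentation at any instant, something that is much harder to do with an imperative script.

```python
# Illustrative sketch of a declarative timeline (hypothetical format):
# each entry states when a media item appears and for how long,
# instead of scripting the events imperatively.

timeline = [
    # (start_seconds, duration_seconds, media_item)
    (0,  5,  "intro.txt"),
    (5,  10, "imageX.gif"),   # "five seconds in, show image X for ten seconds"
    (5,  30, "narration.au"), # audio plays in parallel with the image
]

def visible_at(timeline, t):
    """Return the media items active at presentation time t."""
    return [m for (start, dur, m) in timeline if start <= t < start + dur]

# A player can seek to any point and reconstruct the screen contents:
# visible_at(timeline, 7) -> ["imageX.gif", "narration.au"]
```

Because the timeline is plain data, an indexing tool or search engine can inspect it just as easily as a player can, which is exactly the property that script-based content lacks.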
CD-ROM content providers are interested in making their content available on the Internet for several reasons. First, the Internet greatly facilitates updating content. Therefore, more up-to-date content can be offered via the Internet than on a CD-ROM. Moreover, distribution of content via the Internet is cheaper than distributing CD-ROMs. Also, consumers that use CD-ROMs do not need to buy a new end-user appliance for receiving the same content via the Internet: both technologies require the same end-user appliance, namely a personal computer. Lost revenue from CD-ROM sales can be recovered either by using an advertisement-based revenue model, or by emerging technologies for electronic payment (see the W3C Joint Electronic Payment Initiative). One approach that has been followed for providing CD-ROM content on the Internet is to package an existing format so that it can be transmitted over the net. Repackaging a CD-ROM format has the advantage that existing authoring tools can be reused to create content for the Web as well. The only Web technology commonly reused in most of these approaches is URL-based addressing of Web content. One disadvantage of this approach is that most of the CD-ROM formats are proprietary. In contrast, the Web community is already using open formats that can duplicate some of the functionality of CD-ROM formats (image maps, HTML, URLs), and is working on formats that will bring the capabilities of the Web even closer to those of CD-ROM formats (layout control, stylesheets, fonts, scripting). More sophisticated authoring tools for Web content are also emerging. Another advantage of Web content is that it is relatively easy to locate information on a particular subject using search engines. Moreover, the Web allows content to be stored on many different servers distributed over the whole Internet. This facilitates the creation of new content by reusing existing content.
Two of the most successful Web applications - search engines and Web directories - work because today's Web content is reusable.
Taking these factors together with the forthcoming high-speed Internet links, it seems likely that Web technology will assimilate proprietary CD-ROM content in the near future.
In the last two years, applications that allow users to retrieve audio and video content over the Internet have become quite popular (e.g. RealAudio and VDOLive). These applications are currently being extended to also "stream" media types other than audio and video, such as text and images. Examples are Microsoft's NetShow, Netscape's MediaServer and work at Progressive Networks on enhancing their RealAudio product. These applications require a format for synchronising the different media types contained in a presentation, and standard formats for basic data types such as text and images. Work has already started on designing application-specific formats. However, it appears more attractive to reuse existing solutions from the Web community for text, images and layout, and solutions from the CD-ROM community for expressing media synchronisation. This makes it possible to leverage existing authoring tools and existing content.
In the area of support for real-time multimedia content, both the Web community and the community working on Internet-based audio/video-on-demand can profit from the experience gained by the CD-ROM community.
An overlap already exists between the functionality found in Web technology and that found in real-time multimedia technology. Therefore, a reasonable working hypothesis for W3C work in this area is that real-time multimedia will be integrated into the Web by using extensions and additions to the basic Web technologies for producing content, such as HTML, JPEG, GIF and PNG images, image maps and URLs.
Other groups working on real-time multimedia make quite different assumptions. For instance, the Java community uses Java byte-code as the distribution format for real-time multimedia content, and Java code generators for producing this content. DAVIC (the Digital Audio-Visual Council) - a standards body for interactive TV - uses MHEG-5 as its base format, and uses only a subset of HTML for implementing hypertext functionality.
This work item requires an analysis of standards such as MHEG and the synchronisation mechanisms in HyTime. However, given that both technologies have been criticised for being relatively complex, and have so far failed to gain widespread market acceptance, it might turn out that another solution is more appropriate for the Web and its design philosophy. The QuickTime file format appears to be a good starting point, since it is freely available and implemented on a wide range of platforms. Constraint-based approaches are another interesting solution that should be evaluated further.
Addressing Subparts of Audio/Video Files via URLs
In many applications, it is useful to have random access to large audio and video files, in a manner similar to the "edit lists" used in audio/video editing systems. This requires that subparts (or "clips") of audio/video files be addressable. On the Web, such addresses could be constructed by using URLs that describe a time range, e.g. "the clip in audio file F starting at minute 5 and ending at minute 10".
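As an illustration of what such a time-range URL might look like, the sketch below encodes the start and end of a clip as query parameters. The "start"/"end" parameter names and the use of query syntax are assumptions made for the example only; no standard syntax for temporal addressing exists yet.

```python
# Hypothetical clip-addressing syntax: the start and end of the clip
# (in seconds) are carried as query parameters of an ordinary URL.
from urllib.parse import urlparse, parse_qs

def parse_clip_url(url):
    """Extract (path, start_seconds, end_seconds) from a clip URL."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    start = float(params["start"][0])
    end = float(params["end"][0])
    return parsed.path, start, end
```

Under this hypothetical syntax, "http://example.org/sounds/F.au?start=300&end=600" would denote the clip of audio file F starting at minute 5 and ending at minute 10, and a server could use the parsed range to seek directly into the file instead of transmitting it from the beginning.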
This includes, for example, formats for synthesized sound such as MIDI, and formats for sprite-based animation.
The current transport protocol for Web content, HTTP, is not the first choice when it comes to transporting audio and video content, because HTTP delivers content via TCP. For numerous reasons, it is difficult to deliver audio and video content over TCP without degrading the user experience significantly. Given the importance of audio and video for continuous multimedia applications, support for a protocol that has been designed with the transport of audio and video in mind is required.
A current trend on the Internet is "push-based" applications like Pointcast. In these applications, a collection of Web-like pages is sent out on "channels". Rather than retrieving each page explicitly, consumers tune into a channel and receive the whole collection of content without further interaction, similar to a TV program. For this sort of "push" or "broadcast" application, which transmits to large audiences, much experience exists in the part of the community that developed streaming of audio and video content over the Internet. IP multicast can solve many of the network problems occurring in such applications, and is an attractive alternative to establishing and maintaining a separate network of "repeaters" or "reflectors" for each particular broadcast application.
A difference between Web and CD-ROM technology is the medium over which content is transmitted. The bandwidth delivered by a CD-ROM drive is much higher than the average bandwidth in wide parts of the current Internet. Moreover, the Internet is a harsh environment for transmitting real-time data streams. Its "best-effort" service implies that there is no guaranteed bandwidth, that the transmission delay of a packet can vary arbitrarily, and that packets can be lost in transmission.
However, the problem of using the Internet may not be as serious as it seems at first sight. The bandwidth limitation appears solvable. Even today, many CD-ROM applications do not require continuous access to the CD-ROM drive. Moreover, CD-ROM designers are already experienced in designing their content around bandwidth limitations, using techniques such as prefetching of content. Higher Internet bandwidth is becoming available in the form of IP access over television cable networks or satellites. Finally, the community working on streaming of audio and video over the Internet has developed solutions to overcome the problems of variable transmission delay (playout buffers), packet loss (forward error correction) and variable bandwidth (adaptive coding).
Web-based audio/video-on-demand, i.e. access to stored audio and video resources, has become much more practical with the advent of "streaming protocols", which replace the standard Web transport protocol HTTP when audio and video data must be transferred. Many technologies developed in this area can be reused to support the transmission of real-time multimedia over the Internet.
In the area of support for transporting real-time multimedia content over the Internet, both the Web community and the CD-ROM community can profit from the experience gained by the community working on Internet-based audio/video-on-demand.
Given that Web content is primarily transported over the Internet, W3C work on support for transmitting real-time multimedia concentrates on this network. Other groups working on real-time multimedia follow different assumptions. For instance, DAVIC recommendations for delivering interactive TV content are primarily targeted at distribution networks based on ATM and on MPEG system streams distributed via cable and satellite systems.
Moreover, given the difficulties of changing the basic Internet infrastructure, it seems safe not to base solutions on novel techniques such as the bandwidth reservation protocol RSVP or IPv6. An exception is IP multicasting, which seems close to being used in commercial products, as indicated by the fact that both Microsoft's NetShow and Progressive Networks' extensions to RealAudio allow content multicasting, and by the recent announcement of multicast support by the Microsoft Network.
Addressing "Streaming" Audio/Video Resources via URLs
Switching between HTTP and a real-time protocol currently requires one additional network round trip in order to retrieve a "session file" or "metafile". This file is then typically passed to a helper application. The extra round trip for retrieving the description file increases network delay, and thus the response time observed by the user. Moreover, configuring helper applications and MIME types is cumbersome for the end user.
In principle, this problem can be solved by defining a URL scheme for the respective real-time transport protocol (e.g. "rtp://"), analogous to the URL schemes for ftp, news etc. The problem becomes tricky, however, when moving from monomedia streams, such as audio, to multimedia streams. Here, several separate transport connections must be opened in parallel to receive the stream, which is difficult to express using a single URL. If the content is stored on a single server, the "wildcard" approach used in RTSP can be reused. If the content is stored on multiple servers, retrieving a session file can probably not be avoided, unless one wants to allow URL schemes that address several servers in "parallel".
Application Level Framing for Web Data Formats
The IETF has developed RTP (Real-time Transport Protocol) as the standard for carrying data for real-time multimedia applications over the Internet. This protocol is also used to transport audio and video in the H.323 conferencing standard.
One of the design principles behind this protocol is that real-time data should be split into packets in such a way that each packet can be processed independently by the receiver application (application-level framing). This greatly facilitates synchronising real-time multimedia streams in the presence of packet loss on the Internet. As an example, consider transmitting an HTML page that is synchronised with an audio stream. With application-level framing, packet loss in the HTML transmission will not lead to an interruption of the real-time output, but only to a "hole" in the displayed HTML page.
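The effect of application-level framing can be illustrated with a small sketch. The code below is hypothetical (the one-fragment-per-packet granularity and the function names are our own assumptions, not any proposed packetization scheme): each packet carries a self-contained HTML fragment plus its sequence number, so the receiver can place every packet that does arrive, and a lost packet produces only a visible hole rather than stalling the presentation.

```python
# Sketch of application-level framing for an HTML page (hypothetical
# scheme): each packet is independently decodable, so loss of one
# packet leaves a "hole" without blocking the rest of the page.

def packetize(fragments):
    """Pair each self-contained fragment with a sequence number."""
    return list(enumerate(fragments))

def reassemble(packets, total):
    """Rebuild the page from whatever packets arrived; missing
    packets show up as holes instead of stalling the output."""
    page = ["<!-- hole -->"] * total
    for seq, frag in packets:
        page[seq] = frag
    return "".join(page)
```

Contrast this with TCP delivery, where a single lost segment delays everything behind it: here the audio stream and the remaining HTML fragments can continue to be rendered on schedule while the hole is repaired or simply left visible.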
Packetization schemes for HTML that allow displaying HTML using application-level framing techniques can be derived from similar techniques developed for shared editing of SGML documents. W3C has recently received a concrete proposal for solving this problem, which will very probably be turned into a W3C Note. A similar packetization scheme is required for GIF images. For JPEG images, it should be possible to reuse the existing RTP payload format for MJPEG ("moving JPEG").
Potential W3C products in the area of real-time multimedia are:
Many of the ideas contained in this document are based on presentations and discussions at the W3C workshop on "Real Time Multimedia and the Web". Special thanks go to the workshop participants who replied to our request and gave us extensive written feedback on the future directions of W3C work: