Roy H. Campbell
The easy-to-use, point-and-click user interfaces of WWW browsers, first popularized by NCSA Mosaic, have been the key to the widespread adoption of HTML and the World Wide Web by the entire Internet community. Although traditional WWW browsers perform commendably in the static information spaces of HTML documents, they are ill-suited for handling continuous media, such as real time audio and video. Research in our laboratory has resulted in Vosaic, short for Video Mosaic, a tool that extends the architecture of vanilla NCSA Mosaic to encompass the dynamic, real time information space of video and audio. Vosaic incorporates real time video and audio into standard Web pages and the video is displayed in place. Video and audio transfers occur in real time; there is thus no retrieval latency. The user accesses real time sessions with the familiar ``follow-the-link'' point and click method. Mosaic was chosen as the software platform for our work because it is a widely available tool for which the source code is available, and with which we have significant experience.
Vosaic is a vehicle for exploring the integration of video with hypertext documents, allowing one to embed video links in hypertext. In Vosaic, sessions on the Multicast Backbone (Mbone) can be specified using a variant of the Universal Resource Locator (URL) syntax. For example, the URL
encodes an Mbone transmission with a Time To Live (TTL) factor of 127, a multicast address of 22.214.171.124, a port address of 4739, and an nv video transmission format.
While our original intent was for Vosaic to support the navigation of the Mbone's information space, we have extended Vosaic to include real time retrieval of data from arbitrary video servers. Vosaic now supports the streaming and display of real time video, video icons and audio within a WWW hypertext document display. The Vosaic client adapts to the received video rate by discarding frames that have missed their arrival deadline. Early frames are buffered, reducing playback jitter. Periodic resynchronization adjusts the playback to accommodate network congestion. The result is real time playback of video data streams.
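The client-side policy described above can be illustrated with a minimal sketch. It assumes each frame carries an arrival time and a playback deadline; the function name `select_playable` and the tuple layout are hypothetical and are not taken from the Vosaic source.

```python
def select_playable(frames):
    """Split frames into those played and those discarded.

    frames: iterable of (seq, arrival_ms, deadline_ms) tuples (assumed layout).
    A frame that arrives at or before its deadline is buffered and played;
    a frame that arrives after its deadline is discarded.
    """
    played, dropped = [], []
    for seq, arrival_ms, deadline_ms in frames:
        if arrival_ms <= deadline_ms:
            played.append(seq)   # early or on time: held in the buffer until due
        else:
            dropped.append(seq)  # missed its arrival deadline: discarded
    return played, dropped


if __name__ == "__main__":
    sample = [(0, 10, 100), (1, 250, 200), (2, 290, 300)]
    print(select_playable(sample))
```

Buffering early frames rather than displaying them on arrival is what smooths out variation in network delay; the sketch only shows the accept/discard decision, not the resynchronization step.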
Present day httpd servers exclusively use the TCP protocol for transfers of all document types. Our experiments indicate that real time video and audio data can be effectively served over the present day Internet with the proper choice of transmission protocols. The server uses an augmented Real Time Protocol (RTP) called VDP with built-in fault tolerance for video transmission. Section 6 describes VDP. Feedback within VDP from the client allows the server to control the video frame rate in response to client CPU load or network congestion. The server also dynamically changes transfer protocols, adapting to the request stream and the meta-information in requested documents. Our initial experiments show a forty-four-fold increase in the received video frame rate (0.2 frames per second (fps) to 9 fps) with VDP in lieu of TCP, with a commensurate improvement in observed video quality. We describe the implementation and results of our experiments below.
In section 2, we enumerate the reasons why the current WWW architecture is highly unsuitable for continuous media, and discuss how video and audio may be effectively incorporated into the current WWW in section 3. Then in section 4, we describe the architecture of our prototype WWW client Vosaic, and in section 5, we describe the extended HTTP server that we have constructed. Our specialized video datagram protocol VDP is discussed in section 6. Experimental results are presented in section 7. Related work is discussed in section 8 and we give our conclusions in section 9.
However, the real time transfer of multimedia data streams introduces new problems of maintaining adequate playback quality in the face of network congestion and client load. As the WWW is based on the Internet, resource reservation to guarantee bandwidth, delay or jitter is not currently possible (we discuss our work in relation to the RSVP Internet resource reservation protocol in section 8). The delivery of IP packets across the international Internet is typically best effort, and subject to network variability outside the control of any video server or client.
Our initial effort focuses on supporting real time video in the Web. Inter-frame jitter greatly affects video playback quality across the network (for our purposes, we take jitter as the variance in inter-arrival time between subsequent frames of a video stream). High jitter typically causes the video playback to appear ``jerky.'' In addition, network congestion may cause frame delays or losses. Transient load at the client side may prevent it from handling the full frame rate of the video. We created a specialized real time transfer protocol for handling video across the Internet. Our experiments indicate that the protocol successfully handles real time Internet video by reducing jitter and incorporating dynamic adaptation to the client CPU load and network congestion.
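As a concrete reading of the jitter definition above, the following sketch computes the variance of the gaps between consecutive frame arrivals. The function name `interframe_jitter` and the sample timestamps are illustrative assumptions, not measurements from our experiments.

```python
from statistics import pvariance


def interframe_jitter(arrival_times_ms):
    """Variance of the inter-arrival times between consecutive frames.

    arrival_times_ms: frame arrival timestamps in milliseconds, in order.
    Returns 0.0 when there are fewer than two frames (no gap to measure).
    """
    if len(arrival_times_ms) < 2:
        return 0.0
    gaps = [later - earlier
            for earlier, later in zip(arrival_times_ms, arrival_times_ms[1:])]
    return pvariance(gaps)


if __name__ == "__main__":
    print(interframe_jitter([0, 100, 200, 300]))  # evenly spaced arrivals
    print(interframe_jitter([0, 50, 200, 300]))   # uneven arrivals
```

The result is in squared milliseconds; taking its square root gives a figure in milliseconds that is easier to compare with a perceptual threshold.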
One argument holds that the functions of Mosaic should be kept at a minimum while additional functionality is left to specialized external viewers. Work along this line has led to the development of the Common Gateway Interface, a standard for harnessing external programs to view files according to their MIME[9,8] types. One key reason favoring the success of Mosaic is its incorporation of inlined images with text, resulting in a semantically rich environment. Leaving the display of audio and video to external viewers causes a loss of semantic content in a hypertext document that includes these media types. Real time video and audio convey more information if directly incorporated into the display of a hypertext document. For example, we have implemented real time video menus and video icons as an extension of HTML in Vosaic. Figure 1 is a typical four-item video menu one may construct with Vosaic. Video menus present the user with several choices. Each choice is in the form of a moving video. One may, for example, click on a video menu item to follow the link, and watch the clip in full size. Video icons show a video in a small, unobtrusive icon-sized rectangle within the HTML document. Our experience indicates that embedded real time video within WWW documents greatly enhances the look and feel of a Vosaic page. Video menu items convey more information about the choices available than simple textual descriptions or static images.
Figure 1: A four-item video menu in Vosaic.
HTML documents with video and audio integrated are characterized by a variety of data transmission protocols, data decoding formats, and device control mechanisms (e.g., graphical display, audio device control, and video board control). Vosaic has a layered structure to meet these requirements. The layers are depicted in figure 2. They are:
As discussed in the previous section, TCP is only suitable for static document transfers, such as text and image transfers. Real time playback of video and audio requires other protocols. The current implementation in the Vosaic document transmission layer includes TCP, VDP and RTP. Vosaic is configured to use TCP for text and image transmission. Real time playback of video and audio uses VDP. RTP is the protocol used by most Mbone conferencing transmissions. A fourth protocol under consideration is for interactive communication (used for virtual reality, video games and interactive distance learning) between the web client and server.
The decoding formats currently implemented include:
The display layer includes traditional HTML formatting and inline image display. We have extended the display to incorporate real time video display and audio device control.
Examples are given below:
The first URL encodes an Mbone transmission on the address 126.96.36.199, on port 4739, with a time to live factor of 127, using nv format video. The second and third URLs encode continuous media transmissions of MPEG video and audio respectively.
Incorporating inline video and audio in HTML necessitates the addition of two more constructs to the HTML syntax. The additions follow the syntax of inline images closely. Inlined video and audio segments are specified as follows:
The syntax for both video and audio is made up of a src part and an options part.
If the control keyword is given, a control panel is presented to the user. The control interface allows users to browse and control video clips. We provide the following user control buttons.
Vosaic works in conjunction with an extended HTTP server. The server uses the same set of transmission protocols as does Vosaic. It spawns a child to handle the transmission of each video stream. Video and audio are transmitted with VDP. Frames are transmitted at the originally recorded frame rate of the video. The server uses a feed forward and feedback scheme to detect network congestion and automatically delete frames from the stream in response to congestion.
The main components of the server are shown in figure 4. They are:
Traditional HTTP servers can do without admission control because document sizes are small, and request streams are bursty. Requests are simply queued before service, and most documents can be handled quickly. In contrast, with continuous media transmissions in a video server, file sizes are large, and real time data streams have stringent time constraints. The server must ensure that it has enough network bandwidth and processing power to maintain service qualities to current requests. The criteria used to evaluate requests may be based on the requested bandwidth, server available bandwidth, and system CPU load. Our current system simply limits the number of concurrent streams to a fixed number. However, the admission control policy is flexible and can be made more sophisticated.
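The fixed-limit policy described above can be sketched as follows. This is a minimal illustration under assumed names (`AdmissionController`, `try_admit`, `release`); it is not the server's actual code, and a more sophisticated policy could replace the single counter with checks on bandwidth and CPU load.

```python
class AdmissionController:
    """Admit a new stream only while fewer than max_streams are active."""

    def __init__(self, max_streams):
        self.max_streams = max_streams  # fixed limit on concurrent streams
        self.active = 0                 # streams currently being served

    def try_admit(self):
        """Return True and count the stream if there is capacity, else False."""
        if self.active >= self.max_streams:
            return False
        self.active += 1
        return True

    def release(self):
        """Call when a stream finishes so that its slot can be reused."""
        if self.active > 0:
            self.active -= 1


if __name__ == "__main__":
    controller = AdmissionController(max_streams=2)
    print([controller.try_admit() for _ in range(3)])
```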
Protocols for the transmission of continuous media over the Internet must preserve network bandwidth as much as possible. If a client does not have enough processor capacity, it may not be fast enough to decode video and audio data. Network connections may also impose constraints on the frame rate at which video data can be sent. In such cases, the server must gracefully degrade the quality of service. The server learns of the status of the connection from client feedback. Feedback messages are of two types:
The frame drop rate corresponds to frames received by the client but dropped because the client did not have enough CPU power to keep up with decoding the frames.
The packet drop rate corresponds to frames lost in the network due to network congestion.
In response to a video request, the server begins by sending out frames using the recorded frame rate. The server inserts a special packet in the data stream indicating the number of packets sent out so far. On receiving the feed forward message from the server, the client may then calculate the packet drop rate. The client returns the feedback message to the server on the control channel. In our implementation, feedback occurs every 30 frames. Our experiments indicate that adaptation occurs very quickly in practice --- on the order of a few seconds.
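The two feedback quantities can be computed along the following lines. The function names and the dictionary layout of the feedback message are assumptions made for illustration; the actual VDP message format is not reproduced here.

```python
def packet_drop_rate(packets_sent, packets_received):
    """Fraction of packets lost in the network.

    packets_sent comes from the server's feed forward packet;
    packets_received is counted locally by the client.
    """
    if packets_sent <= 0:
        return 0.0
    lost = max(packets_sent - packets_received, 0)
    return lost / packets_sent


def frame_drop_rate(frames_received, frames_decoded):
    """Fraction of received frames the client could not decode in time."""
    if frames_received <= 0:
        return 0.0
    dropped = max(frames_received - frames_decoded, 0)
    return dropped / frames_received


def build_feedback(packets_sent, packets_received, frames_received, frames_decoded):
    """Assemble a feedback message (hypothetical layout) for the control channel."""
    return {
        "packet_drop_rate": packet_drop_rate(packets_sent, packets_received),
        "frame_drop_rate": frame_drop_rate(frames_received, frames_decoded),
    }


if __name__ == "__main__":
    print(build_feedback(200, 170, 30, 24))
```

Separating the two rates lets the server distinguish network congestion from an overloaded client, although both lead it to lower the frame rate.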
In VDP, the responsibility of determining which frames are resent is put on the client based on its knowledge of the encoding format used by the video stream. In an MPEG stream, it may thus choose to request retransmissions of only the I frames, or of both the I and P frames, or all frames. VDP employs a buffer queue at least as large as the number of frames required during one round trip time between the client and the server. The buffer is full before the protocol begins handing frames to the client from the queue head. New frames enter at the queue tail. A demand resend algorithm is used to generate resend requests to the server in the event a frame is missing from the queue tail. Since the buffer queue is large enough, it is highly likely that resent frames can be correctly inserted into the queue before the application requires it.
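A minimal sketch of the client-side decision follows. It assumes the client knows the MPEG frame type of every sequence number it expects; the names `min_queue_frames` and `resend_requests` and the data layout are hypothetical and are not taken from the VDP implementation.

```python
import math


def min_queue_frames(round_trip_s, frame_rate_fps):
    """Smallest queue length covering one round trip time of video."""
    return math.ceil(round_trip_s * frame_rate_fps)


def resend_requests(frame_types, received, resend_types=frozenset("I")):
    """Sequence numbers the client should ask the server to resend.

    frame_types: dict mapping sequence number to 'I', 'P' or 'B'.
    received: set of sequence numbers already in the buffer queue.
    resend_types: frame types worth re-requesting under the chosen policy.
    """
    return [seq for seq in sorted(frame_types)
            if seq not in received and frame_types[seq] in resend_types]


if __name__ == "__main__":
    types = {0: "I", 1: "B", 2: "B", 3: "P", 4: "B", 5: "I"}
    arrived = {1, 2, 4}
    print(resend_requests(types, arrived))                     # I frames only
    print(resend_requests(types, arrived, frozenset("IP")))    # I and P frames
    print(min_queue_frames(0.5, 30))
```

Restricting resends to I frames, or to I and P frames, spends retransmission bandwidth on the frames that other frames depend on.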
Table 1: MPEG test movies.
The videos ranged from a short 14 second segment to one of several minutes duration.
In order to observe the playback video quality, we based the client side of all our tests in our laboratory in Urbana, Illinois, USA. In order to cover the widest possible range of configurations, we set up servers corresponding to local, regional and international sites relative to our geographical location. We used a server at the National Center for Supercomputing Applications (NCSA) for the local case. NCSA is connected to the local campus network via Ethernet. For the regional case, we used a server at the University of Washington on the west coast of North America, in Washington State. Finally, a copy of our server was set up at the University of Oslo in Norway to cover the international case. Table 2 lists the names and IP addresses of the hosts used for our experiments.
Table 2: Hosts used in our tests.
Table 3: Local test.
Table 4: Regional test.
Table 5: International test.
Tables 3, 4, and 5 show the results for sample runs using the test videos by our Web client accessing the local, regional and international servers respectively. Each test involved the Web client retrieving a single MPEG video clip. We used an unloaded SGI Indy as the client workstation. The numbers give the average frame drop percentage and average application-level inter-frame jitter in milliseconds for thirty test runs. Frame rate changes due to the adaptive algorithm were seen in only one run. That run used the puffer.mpg test video in the international configuration (Oslo, Norway to Urbana, USA). The frame rate dropped from 5 fps to 4 fps at frame number 100, then increased from 4 fps to 5 fps at frame number 126. The rate change indicated that transient network congestion caused the video to degrade for a 5.2 second period during the transmission.
The results indicate that the Internet can in fact support a video-enhanced Web service. Inter-frame jitter in the local configuration is negligible, and below the threshold of human observability (usually 100 ms) in the regional case. Except for the puffer.mpg runs, the same holds true for the international configuration. In that case, the adaptive algorithm was invoked because of dropped frames and the video quality was degraded for a 5.2 second interval. The VDP buffer queue efficiently reduces frame jitter at the application level.
Our last test exercised the adaptive algorithm more strongly. We used the local configuration and retrieved a version of smalllogo.mpg recorded at 30 fps at a pixel resolution of 320 by 240. This is a high quality video clip at a medium size, requiring significant computing resources for playback. Figure 5 shows a graph of frame rate versus frame sequence number for the server transmitting the video.
Figure 5: Frame rate adaptation for smalllogo.mpg.
The client side buffer queue was set at 200 frames, corresponding to about 6.67 seconds of video. The buffer at the client side first fills up, and the first frame is handed to the application at frame number 200. The client workstation does not have enough processing capacity to decode the video stream at the full 30 fps rate. The client side protocol detects a frame loss rate severe enough to report to the server at frame number 230. Our current criterion for degrading the transmission is a frame loss rate exceeding 15%. The transmission is upgraded if the loss rate is below 5%.
The server begins degrading its transmission at frame number 268, that is, within 1.3 seconds from the client detecting that its CPU was unable to keep up. The optimal transmission level was reached in 7.8 seconds, corresponding to a 9 frame per second transmission rate. Stability was reached in a further 14.8 seconds. The deviation from optimal did not exceed 3 frames per second in either direction during that period. The results show a fundamental tension between large buffer queue sizes that reduce jitter and server response times.
The test with very high quality video at 30 fps with a frame size of 320 by 240 represents a pathological case. However, the results show that the adaptive algorithm is an attractive way to reach optimal frame transmission rates for video in the WWW. The test implementation changes the video quality by 1 frame per second at each iteration. We are experimenting with other schemes based on more sophisticated policies, such as multiplicative-decrease/additive-increase.
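The adaptation step of the test implementation can be sketched as below, using the 15% and 5% loss thresholds and the 1 frame per second step reported above. The function name `next_frame_rate` and the floor of 1 fps are assumptions made for this illustration.

```python
DEGRADE_THRESHOLD = 0.15  # degrade when the loss rate exceeds 15%
UPGRADE_THRESHOLD = 0.05  # upgrade when the loss rate is below 5%
STEP_FPS = 1              # change of 1 frame per second per iteration


def next_frame_rate(current_fps, loss_rate, recorded_fps, min_fps=1):
    """Frame rate for the next iteration, given the reported loss rate.

    recorded_fps caps the rate at the originally recorded frame rate;
    min_fps is an assumed lower bound so the stream never stops entirely.
    """
    if loss_rate > DEGRADE_THRESHOLD:
        return max(current_fps - STEP_FPS, min_fps)
    if loss_rate < UPGRADE_THRESHOLD:
        return min(current_fps + STEP_FPS, recorded_fps)
    return current_fps  # between the thresholds: hold the current rate


if __name__ == "__main__":
    rate = 30
    for loss in (0.40, 0.30, 0.20, 0.10, 0.02):
        rate = next_frame_rate(rate, loss, recorded_fps=30)
        print(loss, rate)
```

The gap between the two thresholds gives the controller a band in which it holds the rate steady. A multiplicative-decrease variant would replace the subtraction with a scaling of `current_fps`.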
Sun Microsystems' HotJava product introduced a novel method for the inclusion of animated multimedia in a Web browser. HotJava allows the browser to download executable scripts written in the Java programming language. The execution of the script at the client end enables the animation of graphic widgets within a Web page. In contrast, we have concentrated on the streaming of real time video and audio over the WWW.
Smith's CM Player is a networked continuous media toolkit. It is used as part of an experimental video-on-demand (VOD) system at UC Berkeley. CM Player is a stand alone system. In contrast, we have cast our work in a WWW context, incorporating continuous media into Web browsers and servers. CM Player employs Cyclic-UDP for the transport of video streams between the VOD server and the CM Player client. Frames are prioritized at the server, and clients request resends on detecting frame losses. Cyclic-UDP repeatedly resends high priority frames in order to give them a better chance of reaching the destination. VDP's demand resend algorithm is similar to Cyclic-UDP, except that the client decides which frames get retransmitted. In an MPEG transmission, the client can decide to tolerate the loss of B frames but require the resend of all I frames.
The proposed RSVP protocol will allow resource reservation in Internet routers and thus guarantee quality of service to the data flows of requesting applications. VDP, in contrast, is designed to preserve network bandwidth in response to both network congestion and client CPU load. VDP attempts to reach the best level of service possible under current conditions; thus, minimum guaranteed bandwidths are neither necessary nor appropriate.
Since the completion of the initial draft of this paper and the release of the Vosaic browser, we have learnt of the Distributed Real-Time MPEG Video Audio Player from Cen, Pu, Staehli, Cowan and Walpole. While their system is a stand alone distributed MPEG player like Berkeley's CM Player, we share many ideas in common regarding feedback in order to preserve network bandwidth. It is a pleasure to cite them in this paper.
Our work indicates that a video-enhanced World Wide Web is indeed possible.
As continuing work, we are implementing annotations in the browser and server. Annotations are a series of polygons that are added to a video stream with the aid of an annotation editor. The annotations accompany a video stream when it is transmitted. The system interpolates between polygons and users are thus able to click on parts of a moving video in order to follow hyperlinks.
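One way such interpolation and hit testing could work is sketched below. It assumes linear interpolation between two annotated polygons that have the same number of vertices, and a standard ray casting test for the click; the function names are hypothetical, and this does not describe the annotation system under development.

```python
def interpolate_polygon(poly_a, poly_b, t):
    """Linearly interpolate matching vertices of two polygons, 0 <= t <= 1.

    poly_a, poly_b: lists of (x, y) vertices of equal length (assumed).
    """
    return [(ax + (bx - ax) * t, ay + (by - ay) * t)
            for (ax, ay), (bx, by) in zip(poly_a, poly_b)]


def point_in_polygon(x, y, polygon):
    """Ray casting test: True if (x, y) lies inside the polygon."""
    inside = False
    count = len(polygon)
    for i in range(count):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % count]
        # Does this edge cross the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


if __name__ == "__main__":
    start = [(0, 0), (10, 0), (10, 10), (0, 10)]
    end = [(20, 0), (30, 0), (30, 10), (20, 10)]
    halfway = interpolate_polygon(start, end, 0.5)
    print(halfway, point_in_polygon(15, 5, halfway))
```

A click at a given frame would be tested against the polygon interpolated for that frame, and a hit would follow the hyperlink attached to the annotation.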
2. World Wide Web Consortium. Hypertext Markup Language. <URL:http://www.w3.org/hypertext/WWW/MarkUp>.
3. World Wide Web Consortium. Hypertext Transfer Protocol. <URL:http://www.w3.org/hypertext/WWW/Protocols>.
4. World Wide Web Consortium. Uniform Resource Locators. <URL:http://www.w3.org/hypertext/Addressing/URL>.
5. D. Le Gall. MPEG: A Video Compression Standard for Multimedia Applications. Communications of the ACM, 34(4):46--58, April 1991.
6. S. Deering and D. Cheriton. Multicast Routing in Datagram Internetworks and Extended LANS. ACM Transactions on Computer Systems, pages 85--110, May 1990.
7. J.C. Mogul. Operating Systems Support for Busy Internet Services. In Fifth Workshop on Hot Topics in Operating Systems, Orcas Island, WA, May 1995. IEEE Computer Society.
8. K. Moore. MIME (Multipurpose Internet Mail Extensions) Part Two: Message Header Extensions for Non-ASCII Text. Internet RFC 1522, September 1993.
9. N. Borenstein and N. Freed. MIME (Multipurpose Internet Mail Extensions) Part One: Mechanisms for Specifying and Describing the Format of Internet Message Bodies. Internet RFC 1521, September 1993.
10. Netscape Communications, Inc. Netscape. <URL:http://www.w3.org/hypertext/Addressing/URL>.
11. R. E. McGrath. What We Do and Don't Know About the Load on the NCSA WWW Server. <URL:http://www.ncsa.uiuc.edu/InformationServers/Colloquia/28.Sep.94/Begin.html>, September 1994.
12. Ron Frederick. Experiences with software real time video compression. Technical report, Xerox Palo Alto Research Center, July 1992. <URL:ftp://parcftp.xerox.com/pub/net-research/nv-paper.ps>.
13. H. Schulzrinne and S. Casner. RTP: A Transport Protocol for Real time Applications. Internet Draft, October 1993.
14. Brian Smith. Implementation Techniques for Continuous Media System and Applications. PhD thesis, University of California, Berkeley, 1993.
15. Sun Microsystems, Inc. Hot Java. <URL:http://www.sun.com>.
16. Thomas T. Kwan and Robert E. McGrath and Daniel A. Reed. User Access Patterns to NCSA's World Wide Web Server. Technical report, University of Illinois at Urbana-Champaign, 1995. <URL:http://www-pablo.cs.uiuc.edu/Projects/Mosaic/mosaic.html>.
17. Trans-European Research and Education Networking Association. WAIS. <URL:http://www.earn.net/gnrt/wais.html>.
18. World Wide Web Consortium. Common Gateway Interface. <URL:http://www.w3.org/hypertext/WWW/Overview.html>.
19. Shanwei Cen, Calton Pu, Richard Staehli, Crispin Cowan and Jonathan Walpole. A Distributed Real-Time MPEG Video Audio Player. In Fifth International Workshop on Network and Operating System Support of Digital Audio and Video (NOSSDAV'95). April 18-21, 1995. Durham, New Hampshire, USA.
20. L. Zhang, S. Deering, D. Estrin, and D. Zappala. RSVP: A New Resource ReSerVation Protocol. IEEE Network, September 1993.
See-Mong Tan [http://choices.cs.uiuc.edu/srg/stan/stan.html] is a Ph.D. candidate at the University of Illinois at Urbana-Champaign. He received a B.S. from the University of California at Berkeley in 1989 and an M.S. from the University of Illinois at Urbana-Champaign in 1991. He is a member of the Systems Research Group. His research interests include high speed networks and multimedia operating systems. His email address is email@example.com.
Roy H. Campbell [http://www.cs.uiuc.edu/CS_INFO_SERVER/DEPT_INFO/CS_FACULTY/FAC_HTMLS/campbell.html] is a professor at the University of Illinois at Urbana-Champaign and director of the Systems Research Group in the Department of Computer Science. He received a B.Sc. in Mathematics and Physics from the University of Sussex in 1969, an M.Sc. and Ph.D. in Computing from the University of Newcastle upon Tyne in 1972 and 1976 respectively. The focus of Professor Campbell's research is complex systems software. He has been with the University of Illinois at Urbana-Champaign since 1976. His email address is firstname.lastname@example.org.
Yongcheng Li [http://choices.cs.uiuc.edu/srg/ycli/public_html/self.html] is a Ph.D. candidate at the University of Illinois at Urbana-Champaign. He received his B.S. and M.S. in Computer Science from Tsinghua University, the People's Republic of China. He is a member of the Systems Research Group. His current research interests include operating systems, multimedia, and distributed systems. His email address is email@example.com.