Why Are Live Media Still Uncommon? Are Browser Developers Blind?

Heiner Wolf
wolf@informatik.uni-ulm.de
Distributed Systems Department
University of Ulm, 89069 Ulm, Germany

In the beginning everything was static: static pages and static images, loaded one after the other. Netscape introduced the first dynamic elements. Netscape's client decoded objects while the data was still arriving. Incremental presentation seemed the obvious way to shorten the waiting time, or at least its subjective perception. And even more: incremental presentation seemed to me to be the future of the web. What would happen if an incrementally decoded object could replace parts of itself? An animation fed by the server would be presented in place of a static object.

HTTP uses reliable transmission, and this does not go well with live media. On the other hand, it is possible to get reasonable throughput for megabytes of FTP data, which means that on network connections with low packet loss at least graphics animations do work over TCP. Nevertheless, UDP should be used for truly real-time multimedia data.

External applications, and some time later plug-ins, were implemented which simply opened a UDP port, then decoded and presented the data arriving at this port. Lots of such additions to web clients appeared for audio and video. This was no surprise, because anyone who ever implemented the transmission of multimedia started at exactly the same point: a UDP port and a decoder behind it. It seemed to be only a question of time until this port and its decoders would be built into the clients, just as decoders for static objects are built in.

But it didn't happen. Progressive Networks' RealAudio took over the market for audio transmission. They 'forgot' to publish all of their protocols, so their widely deployed client software could not become a de facto standard. They got rich, and that's all that matters. VDOLive and others tried the same for video, but they failed. Then the Java hype began. Java promised to solve every problem. It also promised to make any platform-dependent, proprietary, non-standard, hardcoded handling of multimedia data obsolete. What Java applets really do is open a UDP port, receive data and feed it into the platform-dependent, hardcoded multimedia decoders, which are merely wrapped by Java classes.

At least Netscape has always been open to extensions and experiments. They implemented the server-push mechanism for a poor man's TCP-based animation. I dare say they would have created a new internet standard for realtime transmission if they had instead built in the UDP port mentioned above.

Was it just that browser developers were busy fighting each other in more important fields? Whatever prevented them from doing the right thing for multimedia, the result is that live media transmission is still uncommon. Or can your browser be used as a picturephone? I think it should be.

Lessons from the Short History of Animations

There are already a number of ways to integrate multimedia presentations or animations into web pages. Shockwave from Macromedia was the earliest one. It is used by many sites, including some big commercial ones, but it is far from being a standard. For a long time Shockwave was not suitable for live media, because presentations had to be downloaded completely first. The lesson: incremental decoding is essential.

Movies can be embedded into pages. The QuickTime plug-in for Netscape even promises to start presenting before the entire file has been loaded. This is what we need for live media. One could encode video and audio from live sources into a QuickTime data stream, but this approach has major drawbacks. Data is transmitted over TCP, because every bit has to be in the right place. And it just did not work from the start: the plug-in told me that it was missing QuickTime for Windows and asked me to download some megabytes from the Apple site over the modem line. So I went on without watching the embedded movie. The lesson: software has to be ready, integrated into or deployed with the client.

GIF is the most popular way to include small animations in web pages, and this will remain so for a while. Initially GIF was only used for static images, because Netscape, like all other browser developers, had not implemented the entire GIF standard. But after we convinced them to implement the last bit, which essentially was only a minor change, GIF fulfilled all requirements: it is decoded incrementally, and its decoder is integrated into every deployed client.

Media Handlers, not Architectures

We don't need a realtime multimedia architecture as currently developed by several big players. We don't need frameworks such as LiveMedia or ActiveMovie. What we need is a simple way to request media streams and to decode them with media handlers. We may add some sort of feedback from the data sink to the source, if necessary. Doesn't this sound familiar? Remember that HTTP specifies a simple way to request and retrieve objects over reliable connections, and the feedback capability is built into TCP.

What is the difference between static and realtime multimedia data? Some people point out that realtime is totally different. But the main difference is that the retransmission and backoff mechanisms of TCP hurt multimedia data streams. So why not just replace the reliable transmission by its unreliable sibling UDP? Nothing else is required for an internet-style, best-effort approach. We need

- a protocol and a syntax for requesting media streams,
- a packet header for the media data,
- common data formats,
- redundancy against packet loss, and
- feedback from the data sink to the source.

We should keep all of these components simple and flexible. Looking at HTTP and the first ASCII-based browser three years ago, would you have imagined that the web would develop as it did? HTTP and its associated standards are a striking example of the capacity of a simple and open approach.
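To illustrate how little is required on the receiving side, here is a minimal sketch of a datagram-based media receiver in Python. The port number and the decode_frame() handler are assumptions for illustration; the point is that lost datagrams are simply skipped, never retransmitted.

    import socket

    MEDIA_PORT = 5004   # arbitrarily chosen port, not a standard

    def decode_frame(payload):
        pass            # hand the payload to a media decoder (hypothetical)

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", MEDIA_PORT))
    while True:
        payload, source = sock.recvfrom(2048)   # one datagram per media frame
        decode_frame(payload)                    # lost datagrams are skipped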

I Propose

- to transmit live media data over UDP,
- to request media streams with a simple, HTTP-like request protocol,
- to mark media packets with a short header carrying payload type, sequence number and timestamp,
- to reuse MBONE mechanisms and data formats where possible,
- to add forward error correction against packet loss, and
- to send feedback from the data sink to the source over UDP.

The next section shows possible solutions to these points. Far from presenting a complete solution here, I am going to show examples in order to explain this proposal and how simple things could be.

Examples

Request Protocol

Requests should be reliable. TCP is the first choice here. On the other hand, the simple HTTP way of requesting single objects proved to be disadvantageous, because the way people use the web has changed since the early days. We cannot predict what will happen to multimedia in the future, but we have learned from HTTP that requests must be cheaper than a transport system connection. What about UDP-based requests? They may disappear while being transmitted. In this case we could send a few more requests, say three. If they all fail, it is probably a bad connection, and one would not get multimedia data through it anyway: maybe video, if the frames are self-contained, but not highly compressed audio, which relies on a proper decoder state. Even TCP would not do the job at such a high IP packet loss. I don't want to discuss this further, but UDP is definitely a candidate. In past video conferencing projects we got positive results with UDP-based control protocols for multimedia exchange [Dermler94].
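A sketch of such a UDP-based request with retries, in Python. The three-tries policy follows the reasoning above; the request string and the reply handling are assumptions for illustration:

    import socket

    REQUEST = b"GET /stream/audio/gsm\r\n\r\n"   # request syntax: see below

    def request_stream(host, port, tries=3, timeout=1.0):
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.settimeout(timeout)
        for _ in range(tries):          # requests may disappear in transit
            sock.sendto(REQUEST, (host, port))
            try:
                reply, addr = sock.recvfrom(2048)
                return sock, reply      # keep the socket for the media data
            except socket.timeout:
                pass                    # lost request or reply: try again
        return None, None               # three losses in a row indicate a
                                        # connection too bad for live media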

Request Syntax

The advantage of TCP is that web clients can already issue requests over TCP to any port. TCP combined with an HTTP-compatible request syntax would make adoption easier, because the request side would not have to be implemented in clients, only the reception of the media stream. Schulzrinne proposed such a protocol, which is similar to HTTP [Schulzrinne96].

In a first implementation for the EU Telematics project CoBrow we are using plain HTTP to access media streams, e.g. a "GET /stream/audio/gsm" request to a live audio source. A simple "GET /stream/audio" with an "Accept: audio/x-gsm" attribute would be more appropriate, but the header fields of HTTP requests usually cannot be manipulated.
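For illustration, the two request forms just mentioned might look like this on the wire (the paths and the media type are taken from above; the rest is plain HTTP):

    GET /stream/audio/gsm HTTP/1.0        (format encoded in the path)

    GET /stream/audio HTTP/1.0            (format negotiated by attribute)
    Accept: audio/x-gsm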

Protocols that consist of collections of attributes are in general very flexible and open to extensions. One could even drop the request line from HTTP and put its information into additional request header fields.
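A purely attribute-based request could then look like the following; the field names are hypothetical and only illustrate the idea:

    Method: GET
    Object: /stream/audio
    Accept: audio/x-gsm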

Packet Header

The RTP data packet header can be adopted for this purpose. It is used by many of the MBONE tools and by some internet phones. It is as good as any other packet header that contains the payload type and a timestamp and/or a sequence number. However, a packet header should be as short as possible to reduce the overhead, especially for audio streams. Highly compressed audio streams with small (20-50 ms) frames consist of very short packets (down to 12 bytes), and the UDP and IP headers alone already add up to 28 bytes. This is already too much.
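As a toy illustration of such a minimal header, not of RTP itself, the following Python sketch packs a payload type, a 16-bit sequence number and a 32-bit timestamp into 8 bytes; the field layout is an assumption:

    import struct

    HEADER = "!BxHI"   # payload type, pad byte, sequence number, timestamp

    def pack_packet(payload_type, seq, timestamp, payload):
        return struct.pack(HEADER, payload_type, seq, timestamp) + payload

    def unpack_packet(packet):
        payload_type, seq, timestamp = struct.unpack(HEADER, packet[:8])
        return payload_type, seq, timestamp, packet[8:]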

Data Formats

I think a WWW multimedia platform should be compatible with the MBONE: with its mechanisms for media transmission and, more importantly, with some of the data types used on the MBONE. GSM is a candidate for audio compression, although it has been designed to survive bit errors rather than packet loss. One still has to agree on a stream format.

But data types are not the key issue here. The important point is that realtime media handlers act exactly like their non-realtime counterparts. The only difference is that realtime media handlers get their data from datagrams, while non-realtime handlers process data from reliable streams.
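This symmetry can be made concrete with a small sketch. The handler interface below is hypothetical; only the loop that feeds it differs between the two cases:

    class MediaHandler:
        def feed(self, data):
            pass        # decode and present the data (hypothetical handler)

    def run_static(handler, stream):     # reliable byte stream, e.g. a file
        chunk = stream.read(2048)
        while chunk:
            handler.feed(chunk)
            chunk = stream.read(2048)

    def run_live(handler, sock):         # unreliable datagrams from UDP
        while True:
            datagram, source = sock.recvfrom(2048)
            handler.feed(datagram)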

Data Redundancy

On the internet bit errors are very unlikely; packet loss is the big problem. Forward error correction therefore must supply the data stream with additional information which allows the reconstruction of lost packets. We achieved very good results with different levels of redundancy. The first level is an additional packet for every two data packets which carries the bitwise XOR of both. This adds 50 % to the data rate (say 12 kbit/s instead of 8 kbit/s audio), but makes a stream very robust against small packet loss rates of up to 5 % (99 % probability of successfully reconstructing the two data packets). Higher packet loss demands more redundancy. A fourth packet carrying the orthogonal combination promises 97 % data preservation at 20 % packet loss [Lamparter93]. Higher-order combinations can be used to fine-tune the redundancy between one and two times the original data rate.
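The 99 % figure follows directly: at 5 % loss, the probability that at most one of the three packets is lost is 0.95^3 + 3 * 0.05 * 0.95^2, about 0.993. A minimal sketch of this first redundancy level, assuming equal-sized packets, might look as follows:

    def xor_bytes(a, b):
        # bitwise XOR of two equal-sized packets
        return bytes(x ^ y for x, y in zip(a, b))

    def add_parity(p1, p2):
        # for every two data packets, send a third carrying their XOR
        return [p1, p2, xor_bytes(p1, p2)]

    def recover(p1, p2, parity):
        # any single lost packet (None) of the three can be reconstructed
        if p1 is None:
            return xor_bytes(p2, parity), p2
        if p2 is None:
            return p1, xor_bytes(p1, parity)
        return p1, p2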

Statistics show that on the internet packets are very often lost in bursts. This requires additional mechanisms to improve the probability of reconstructing lost packets. Many media stream formats contain bit sequences of different importance; if these are split into separate packets, forward error correction can be applied selectively.

Feedback

Feedback from the media sink to the source must indicate lost packets. Transmitting the received packet numbers or timestamps from the sink to the source is a simple feedback mechanism. The source can then adapt the data rate, add redundancy packets or even retransmit selected packets.

I propose to use UDP for the feedback information as well. This avoids stalling of the feedback connection, and it allows simple sources to ignore the feedback stream easily. The sink is able to recognize packet loss if sequence numbers are missing. Therefore feedback, too, can be protected by forward error correction.
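A sketch of such a feedback report in Python; the port and the report layout (a count followed by 16-bit sequence numbers) are assumptions for illustration:

    import socket
    import struct

    FEEDBACK_PORT = 5005   # arbitrarily chosen port, not a standard

    def send_report(sock, source_host, received_seqs):
        # report the sequence numbers received since the last report;
        # a simple source may just ignore these datagrams
        report = struct.pack("!H%dH" % len(received_seqs),
                             len(received_seqs), *received_seqs)
        sock.sendto(report, (source_host, FEEDBACK_PORT))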

Implementation

Costs

Sample implementations of all of these components have been done in our working group in some form. The implementation of a simple HTTP-style request protocol takes just a day or two. Integration of available decoders requires some weeks of work, while efficient methods for feedback and for redundancy that survives packet loss take a little more work. But many of these components are already available in the MBONE tools. Just one example: a recently finished student project was the implementation of a GSM- and UDP-based phone service for MS Windows. It took only a few weeks, although the student was inexperienced when he started.

A Picturephone

What is required for a picturephone? A client with realtime media handlers for audio and video, of course; small programs acting as live audio and video sources on the personal workstation; and a compage, the dynamic equivalent of the still static homepage. A live video will appear in place of the image which people have on their homepage. A picturephone call is then as simple as accessing a page on the web: a two-way connection is established when both participants watch each other's compage. Replying to a call can of course be simplified and automated by supplying the URL of the caller's compage as an attribute of the requests for live media.

Figure 1: Picturephone built of two parties' compages.

References

[Dermler94]
G. Dermler, T. Gutekunst, E. Ostrowski, F. Ruge: "Sharing Audio/Video Applications Among Heterogeneous Platforms", Proceedings, 5th IEEE COMSOC Int. Workshop on Multimedia Communications, Kyoto, 1994.
[Lamparter93]
B. Lamparter, O. Böhrer, W. Effelsberg, V. Turau: "Adaptable Forward Error Correction for Multimedia Data Streams", Reihe Informatik 9/93, University of Mannheim, 1993.
[Schulzrinne96]
H. Schulzrinne: "Personal Mobility for Multimedia Services in the Internet", Lecture Notes in Computer Science 1045, Springer 1996.