Outlook of Standards for Multimedia Applications

Pietro Marchisio

CSELT – Centro Studi e Laboratori Telecomunicazioni S.p.A.

 

1. Introduction

The convergence of the Internet and digital TV is attracting major interest at the level of multimedia applications, which are expected to bridge the gap between the interactive and broadcast domains by offering new services to users.

The EPG (Electronic Program Guide) is one of the reference applications that will be available in this converging scenario. A Web-based EPG enables the user to browse an up-to-date overview of current and future program schedules. When the user selects a TV program for viewing, he enters the broadcast domain, where he can also easily zap across different programs. While watching a program, he can request complementary information, which means moving back to the Web to retrieve pages containing text, graphics, animations and so on. The distinction between the interactive and broadcast paradigms will be transparent to the user, since the same "multimedia scene" would blend broadcast video with other information available from the Web.

Applications such as the EPG consist of sets of self-contained objects based on the synchronization and spatio-temporal relationships of multiple media formats, structural composition, "event-action" associations, navigation and user interaction. Controlling the playback of time-dependent content, like streams of multiplexed audiovisual data, requires specific support. These streams demand VCR control functions (play, pause, and so on), as well as the capability to manage events generated during presentation. For example, the playback of a stream might synchronize the rendering of text subtitles and graphics animations. More advanced features might be required in specific contexts, including virtual reality, 3D animations, and real-time text-to-speech synthesis.

Standards are essential for delivering applications across heterogeneous platforms. The concept of a "multimedia standard" implies a declarative format to represent an audiovisual scene as a composition of audio and visual objects with specific properties and behavior, in both the spatial and temporal dimensions. There are at the moment three major "standard solutions": SMIL, MHEG-5, and MPEG-4 Systems. They originate from different communities and address somewhat different "multimedia profiles". However, the bundle of concepts they share makes them suitable to convey the EPG and similar Interactive TV applications to users.

CSELT - the Telecom Italia Group's Research Center - is playing an active role in the specification of multimedia standards. The efforts include development and testing of tools and applications to be used in the realm of new services, such as those that will emerge from the integration between TV and the Web.

This contribution provides an outlook on the three major multimedia standards, and highlights the need for an adequate authoring tool to generate compliant applications.

2. SMIL

SMIL (Synchronized Multimedia Integration Language) is a proposed recommendation developed by the W3C Synchronized Multimedia (SYMM) Working Group. It aims to bring synchronized multimedia to the Web without requiring expensive authoring tools.

2.1. Features

Using SMIL, authors can create a multimedia presentation in terms of layout, temporal behavior, and events associated with timers and anchor selections. A SMIL document essentially includes a "layout" and a "body".

The layout specifies a set of "regions" on the screen. A region is referred to from elements within the body. The author can choose between the SMIL "basic layout" and expressing the layout in CSS2 (Cascading Style Sheets).

The body contains information related to the temporal and linking behavior of the document. A document's structure relies on a tree-like aggregation of "synchronization elements", whose type can be sequential or parallel. These elements synchronize the presentation of their children. At the leaves there are "media object elements", which manage the presentation of animation, audio, image, video, text and text-stream elements. An element's activation can be deferred to the occurrence of an event (e.g. generated by a video clip). Links are actuated only by the user, and support navigation between objects. A link can activate a new document or a single object within a document.
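A minimal sketch of a SMIL document can illustrate these elements; the file names and region geometry below are hypothetical, not part of the specification:

```xml
<smil>
  <head>
    <layout>
      <region id="video-area" left="0" top="0" width="320" height="240"/>
      <region id="caption-area" left="0" top="240" width="320" height="40"/>
    </layout>
  </head>
  <body>
    <par>
      <video src="movie.mpg" region="video-area"/>
      <seq>
        <text src="subtitle1.html" region="caption-area" dur="5s"/>
        <text src="subtitle2.html" region="caption-area" dur="5s"/>
      </seq>
    </par>
  </body>
</smil>
```

Here the "par" element plays the video in parallel with a "seq" element that presents two subtitles one after the other, each in the region declared in the layout.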

A further feature, helpful in the Web environment, is the "switch" element. It allows the author to indicate a list of alternatives to the player, which selects only the first alternative satisfying a given test (e.g. system language, system bitrate, and so on). This enables authors to indicate that an audio track is available in several languages, that it is encoded for different bitrates, and so on.
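For instance, an audio track offered in several languages might be sketched as follows (the source URLs are invented for illustration); the player picks the first child whose test attribute matches its settings, falling back to the last, unconditioned alternative:

```xml
<switch>
  <audio src="narration-it.aiff" system-language="it"/>
  <audio src="narration-fr.aiff" system-language="fr"/>
  <audio src="narration-en.aiff"/>
</switch>
```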

SMIL documents are XML 1.0 documents, and the syntax is defined using the DTD (Document Type Definition) notation. The formalism is human-readable.
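A practical consequence of SMIL being XML 1.0 is that any standard XML parser can read a SMIL document. The sketch below, using Python's standard library, parses a small hypothetical document and extracts its regions and media objects:

```python
# Minimal sketch: since SMIL 1.0 is XML, a generic XML parser can read it.
# The document and file names below are invented for illustration.
import xml.etree.ElementTree as ET

smil_doc = """<smil>
  <head>
    <layout>
      <region id="video-area" width="320" height="240"/>
    </layout>
  </head>
  <body>
    <par>
      <video src="clip.mpg" region="video-area"/>
      <text src="caption.txt" region="video-area"/>
    </par>
  </body>
</smil>"""

root = ET.fromstring(smil_doc)

# Collect the region identifiers declared in the layout.
regions = [r.get("id") for r in root.iter("region")]

# Collect the media object elements scheduled in parallel.
media = [el.tag for el in root.find("body/par")]

print(regions)  # ['video-area']
print(media)    # ['video', 'text']
```

This textual, parseable format is also what allows search engines to index SMIL content, as discussed in the conclusions.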

2.2. Current status

SMIL 1.0 is currently a Proposed W3C Recommendation. For most of the features, interoperability has been demonstrated by means of independently developed implementations.

3. MHEG-5

The ISO MHEG-5 standard provides a common application layer for the interactive and the broadcast domains. It was designed to run also on terminals with minimal resources, such as the set-top boxes that will be used to deliver digital TV.

3.1. Features

MHEG-5 allows shaping page-oriented applications that consist of a set of scene objects. At most one scene is active at a time, and navigation within an application is performed in terms of transitions between scenes. Inter-application navigation is also possible. Scenes provide support for the spatially and temporally coordinated presentation of audiovisual content comprising elementary graphics, bitmaps, text and audiovisual streams. Interaction is performed via graphic controls such as buttons, sliders, text entry boxes and hypertext selections.

Every scene, as well as an entire application, is a self-contained entity that represents its local behavior by means of links, which are event-action associations. Events can be generated by the user, by the expiration of timers, by the playback of streams, as well as by other conditions internal to the execution process.

Since MHEG-5 adopts an object-oriented approach, all features are specified in terms of classes of objects. There are two major superclasses from which the other classes inherit properties: Group and Ingredient. The Group class handles the grouping of Ingredient objects as a single entity of access and interchange. A Group is specialized into the Application and Scene classes. A Group cannot include another Group.

The "ingredient" class provides the common behavior for all objects that can be included in an application or a scene. There are two major kinds of ingredients: "presentable" and "link". Presentable ingredients support rendition of: bitmap, line-art, text, stream, audio, video, run-time graphics, and so on. The list also includes some specific user interaction classes: entry-field, slider, hotspot, and so on.

A link is used to represent dynamic behavior as an event-action association. The standard offers more than 100 actions to accomplish different types of behavior: running a presentation, changing the speed of a stream, its direction or its current time code position (random access), modifying the scene's layout, managing user interactions, controlling link activation, and so on.
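As an illustration, a scene containing a push button and a link that reacts to its selection might be sketched in the MHEG-5 textual notation roughly as follows; the object numbers, positions and file names are invented for this example:

```
{:Scene ("scene1.mhg" 0)
  :Items (
    {:PushButton 1
      :OrigBoxSize 100 40
      :OrigPosition 50 50
    }
    {:Link 2
      :EventSource 1
      :EventType IsSelected
      :LinkEffect (:TransitionTo (("scene2.mhg" 0)))
    }
  )
}
```

When the button (object 1) is selected by the user, the link fires and its effect performs a transition to another scene, which is the basic navigation mechanism described above.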

Two encoding formalisms are supported: a binary notation (ASN.1) and a textual notation.

3.2. Current status

MHEG-5 reached IS status in November 1996, and is now in a "technical corrigendum" phase, based on comments from ongoing implementations. MHEG-5 was recently complemented with MHEG-6, which specifies an MHEG-5 API to allow the use of MHEG-5 functionality from within a Java-based script layer. Both MHEG-5 and MHEG-6 have been selected by DAVIC (Digital Audio Visual Council), and are under consideration by the DVB/MHP (Multimedia Home Platform) group.

4. MPEG-4

MPEG-4 is a multipart ISO standard under development, whose aim is "to bridge the interactive natural and synthetic virtual worlds". The following refers to the "Systems" part, which addresses multimedia representation.

4.1. Features

MPEG-4 provides standardized ways to represent units of aural, visual or audiovisual content, called "audio/visual objects" (AVOs); to compose these objects into audiovisual scenes; to multiplex and synchronize the data associated with them; and to interact with the scene at the receiving end.

Audiovisual scenes are composed of several AVOs, organized in a hierarchical fashion. The leaves are "primitive AVOs", such as a 2D background, the picture of a talking person (without the background), the voice associated with that person, and so on. MPEG-4 standardizes a number of primitive AVOs, capable of representing both natural and synthetic content types, which can be either 2D or 3D. The range includes: text and graphics; talking heads and associated text to be used at the receiver's end to synthesize the speech and animate the head; and animated human bodies.

The scene composition borrows several concepts from VRML, in terms of both its structure and the functionality of its object composition nodes. The standard comes with a set of profiles: Simple, 2D, VRML, Audio and Complete (the latter intended for use in conjunction with VRML).

MPEG-4 Scenes are encoded in BIFS (Binary Format for Scenes), which is a compact format. An isomorphic "textual representation" is being used at the authoring level.
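The VRML heritage is visible in the textual representation: a scene is a tree of nodes with named fields. A 2D scene placing a colored rectangle might be sketched roughly as follows; the exact field values are invented for illustration:

```
Group {
  children [
    Transform2D {
      translation 100 50
      children [
        Shape {
          appearance Appearance {
            material Material2D { emissiveColor 1 0 0 filled TRUE }
          }
          geometry Rectangle { size 120 60 }
        }
      ]
    }
  ]
}
```

At delivery time a representation of this kind is compressed into the binary BIFS format, while authors work on the isomorphic textual form.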

4.2. Current Status

The MPEG-4 FCD (Final Committee Draft) for Version 1 has recently been released, and IS status is scheduled for January 1999. Early prototypes of MPEG-4 players exist and are being improved for both the 2D and 3D profiles. Work on Version 2 has already started, with a mandate to specify, among other things, an advanced user interaction model, a Java API (known as MPEG-J) and an intermedia file format.

5. Authoring tools

Since the above standards do not specify how applications are generated, there are ample opportunities to compete in the marketplace with tools based on different approaches. The range of solutions goes from tools that require writing specialized programs in ordinary programming languages, up to "authoring tools" that allow creating applications by means of visual environments.

Authoring tools help create applications interactively: the author starts from a simple skeleton and incrementally adds features. Generally, he is not constrained to follow a predefined sequence of tasks, and normally obtains the final result after a cycle of successive refinements involving some basic operations: placing media objects, arranging the spatial layout, specifying temporal relations, and defining event-action associations.

6. Conclusions

The emergence of different multimedia standards that - more or less emphatically - stake their claim on Interactive TV makes it evident that the scenario is being approached from different perspectives.

SMIL represents the Web perspective, which demands the capability to synchronize streams with other elements within a single application. Native support by Web browsers will allow widespread availability of applications, and, thanks to the textual format, search engines are able to find SMIL documents on the Web. The simple syntax allows authors to build applications without the need for a sophisticated authoring tool.

According to announcements made by a number of TV players, MHEG-5 stands a good chance of being used in the upcoming ITV trials. However, this standard requires an adequate authoring tool to generate professional applications.

MPEG-4 targets high-end applications. The specification is still being validated by "verification model" implementations, in both the 2D and 3D domains. Hardware add-ons are required to perform 3D playback on ordinary PCs. The availability of adequate authoring tools for MPEG-4 will be a big challenge, even for the simpler 2D profile.

There are reasons to envisage strong competition at the level of authoring tools. Besides usability, the correctness and consistency of the generated applications must also be guaranteed. ITV users are probably more familiar with standard television than with the PC/Internet world, and are therefore not accustomed to failures such as user interface crashes or unresolved references when navigating across applications.