MPEG-4 Systems Version 1


MPEG-4 Systems CD Version 1 contains the basic set of tools to reconstruct a synchronous, interactive, and streamed audiovisual scene: timing and buffer model (Systems Decoder Model, SDM), identification and association of scenes and streams (Object and Elementary Stream Descriptors), scene description (BIFS), synchronization of streams (Sync Layer), and efficient multiplexing of streams (FlexMux). In addition to these basic tools, MPEG-4 Systems provides for the coding of Object Content Information (OCI) and back-channel functionality. A partition of the set of BIFS nodes defines the five Systems profiles: Simple, 2D, "VRML", Audio, and Complete. The Version 1 set of tools is described below.


1. Systems Decoder Model (SDM)

The Systems Decoder Model provides an abstract view of the behavior of an MPEG-4 terminal. Its purpose is to allow a sender to predict how the receiver will behave in terms of buffer management and synchronization when reconstructing the audiovisual information that comprises the session. The Systems Decoder Model includes a timing model and a buffer model.

1.1. Timing Model

The Timing Model defines the mechanisms through which a receiver establishes a notion of time and processes time-dependent events. This allows the receiver to maintain synchronization both across and within particular media types, as well as with user interaction events. The Timing Model requires that the transmitted data streams contain implicit or explicit timing information. Two sets of timing information are defined: clock references and time stamps. The former convey the sender's time base to the receiver, while the latter convey the time (in units of the sender's time base) of specific events, such as the desired decoding or composition time for portions of the encoded audiovisual information.
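As a rough illustration, the sketch below shows how a receiver might re-anchor the sender's time base from incoming clock references and then map decoding or composition time stamps onto its local clock. The class and field names, and the 90 kHz resolution, are illustrative assumptions, not a normative API; the actual clock resolution is signaled per stream.

```python
# A minimal sketch of time-base recovery and time-stamp mapping.

class TimeBase:
    """Reconstructs the sender's clock at the receiver."""

    def __init__(self, resolution_hz):
        self.resolution_hz = resolution_hz  # sender clock ticks per second
        self.last_ocr = None                # last clock reference received
        self.local_at_ocr = None            # local time when it arrived

    def on_clock_reference(self, ocr_ticks, local_time_s):
        # Each clock reference re-anchors the sender's time base locally.
        self.last_ocr = ocr_ticks
        self.local_at_ocr = local_time_s

    def local_time_for(self, stamp_ticks):
        # Map a time stamp (in sender ticks) onto the receiver's clock.
        return self.local_at_ocr + (stamp_ticks - self.last_ocr) / self.resolution_hz


tb = TimeBase(resolution_hz=90_000)  # assumed resolution
tb.on_clock_reference(ocr_ticks=1_000_000, local_time_s=12.0)
# Decode when the decoding time stamp is reached, compose at the
# composition time stamp; here the stamp maps to 0.5 s after the reference.
print(tb.local_time_for(stamp_ticks=1_045_000))  # 12.5
```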

1.2. Buffer Model

The Buffer Model enables the sender to monitor and control the buffer resources needed to decode each individual Elementary Stream in the session. The required buffer resources are conveyed to the receiver by means of Elementary Stream Descriptors at the beginning of the session, so that the receiver can decide whether or not it is capable of handling the session. The model allows the sender to specify when information is removed from these buffers and to schedule data transmission so that buffer overflow does not occur.
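The sketch below illustrates the kind of schedule check this model enables on the sender side: given the buffer size signaled in the ES Descriptor, the sender can verify that its planned transmission times never overflow the receiver's decoding buffer. The function and its event representation are illustrative assumptions, not part of the specification.

```python
# A minimal sketch of a sender-side overflow check, assuming the receiver
# removes each unit from its decoding buffer at the unit's decoding time
# (the model lets the sender control this).

def check_schedule(buffer_size, arrivals, removals):
    """arrivals/removals: lists of (time_s, num_bytes) events.
    Returns True if the receiver's buffer never overflows."""
    events = [(t, +n) for t, n in arrivals] + [(t, -n) for t, n in removals]
    events.sort(key=lambda e: (e[0], e[1]))  # at equal times, remove first
    occupancy = 0
    for _, delta in events:
        occupancy += delta
        if occupancy > buffer_size:
            return False  # sender must reschedule its transmission
    return True


ok = check_schedule(buffer_size=4000,
                    arrivals=[(0.0, 3000), (0.1, 3000)],
                    removals=[(0.1, 3000)])
print(ok)  # True: the first unit leaves just as the second arrives
```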

2. Object Descriptor Framework

The purpose of the object descriptor framework is to identify, describe and associate elementary streams with the various components of an audiovisual scene.

An object descriptor is a collection of one or more Elementary Stream descriptors that provide configuration and other information for the streams that relate to a single object (media object or scene description). Object Descriptors are themselves conveyed in elementary streams. Each object descriptor is assigned an identifying number (Object Descriptor ID), which is unique within the current session. This identifier is used to associate media objects in the Scene Description with a particular object descriptor, and thus the elementary streams related to that particular object.

Elementary Stream Descriptors include information about the source of the stream data, in the form of either a unique numeric identifier (the Elementary Stream ID) or a URL pointing to a remote source for the stream. ES Descriptors also include information about the encoding format, configuration information for the decoding process and the Sync Layer packetization, as well as quality-of-service requirements for the transmission of the stream and intellectual property identification. Dependencies between streams can also be signaled, for example to indicate the dependence of an enhancement stream on its base stream in scalable audio or visual object representations, or the availability of the same speech content in various languages.
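A minimal sketch of this descriptor hierarchy, with a heavily abridged attribute set, might look as follows. The field names are illustrative; the normative descriptors carry many more fields.

```python
# Abridged, illustrative model of the Object Descriptor framework.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ESDescriptor:
    es_id: int                              # unique Elementary Stream ID
    url: Optional[str] = None               # or a remote source instead
    decoder_config: bytes = b""             # encoding format, decoder setup
    sl_config: bytes = b""                  # Sync Layer packetization setup
    depends_on_es_id: Optional[int] = None  # e.g. enhancement -> base layer

@dataclass
class ObjectDescriptor:
    od_id: int  # Object Descriptor ID, unique within the session; the
                # scene description references objects through this number
    es_descriptors: List[ESDescriptor] = field(default_factory=list)

od = ObjectDescriptor(od_id=5, es_descriptors=[ESDescriptor(es_id=101)])
```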

3. Interactive Scene Description (BIFS)

Scene description addresses the organization of audiovisual objects in a scene, in terms of both spatial and temporal positioning. This information allows the composition and rendering of individual audiovisual objects after their respective decoders reconstruct them. This specification, however, does not mandate particular composition or rendering algorithms or architectures since they are implementation-dependent.

The scene description is represented using a parametric methodology (BIFS - Binary Format for Scenes). The description consists of an encoded hierarchy (tree) of nodes with attributes and other information (including event sources and targets). Leaf nodes in this tree correspond to particular audio or visual objects (media nodes), whereas intermediate nodes perform grouping, transformation, and other operations (scene description nodes). The scene description can evolve over time by using scene description updates.
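A toy model of such a tree is sketched below: grouping nodes transform their children, and media leaf nodes reference their streams through an Object Descriptor ID. The classes are illustrative stand-ins for BIFS nodes, not the coded binary format.

```python
# Illustrative stand-ins for BIFS scene description nodes.

class MediaNode:
    """Leaf node bound to a media object via its Object Descriptor ID."""
    def __init__(self, object_descriptor_id):
        self.object_descriptor_id = object_descriptor_id

class GroupingNode:
    """Intermediate node that groups and transforms its children."""
    def __init__(self, children=None, translation=(0.0, 0.0)):
        self.children = children or []
        self.translation = translation

# A tiny scene: one grouping node positioning a single video object
# whose streams are found through object descriptor 5.
scene = GroupingNode(children=[MediaNode(object_descriptor_id=5)],
                     translation=(100.0, 50.0))
```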

In order to allow active user involvement with the presented audiovisual information, this specification provides support for interactive operation. Interactivity mechanisms are integrated with the scene description information, in the form of linked event sources and targets (routes) as well as sensors (special nodes that can trigger events based on specific conditions). These event sources and targets are part of scene description nodes, and thus allow close coupling of dynamic and interactive behavior with the specific scene at hand. The MPEG-4 standard, however, does not specify a particular user interface or a mechanism that maps user actions (e.g., keyboard key presses or mouse movements) to such events.

Local or client-side interactivity is provided via the routes and sensors mechanism of BIFS. Such an interactive environment does not need an upstream channel. The MPEG-4 Standard also provides means for client-server interactive sessions with the ability to set up upchannel elementary streams.
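The sketch below illustrates the client-side routes-and-sensors idea: a sensor node raises an event, and a route forwards the event value to a field of a target node, entirely without an upstream channel. The node and event names follow VRML/BIFS conventions (TouchSensor, isActive, eventOut/eventIn), but the code itself is an illustrative assumption.

```python
# Illustrative sketch of event flow through routes and sensors.

class Route:
    """Links an event source (eventOut) to a target node's field (eventIn)."""
    def __init__(self, event_out, target, event_in):
        self.event_out, self.target, self.event_in = event_out, target, event_in

class TouchSensor:
    """Sensor node: raises an event when the user activates it. How a
    user action reaches the sensor is implementation-dependent."""
    def __init__(self):
        self.routes = []

    def activate(self):
        for r in self.routes:
            if r.event_out == "isActive":
                setattr(r.target, r.event_in, True)  # forward the event

class Shape:
    def __init__(self):
        self.visible = False

sensor, shape = TouchSensor(), Shape()
sensor.routes.append(Route("isActive", shape, "visible"))
sensor.activate()
print(shape.visible)  # True: the event flowed along the route
```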

4. Synchronization of Streams (Sync Layer)

Elementary Streams are the basic abstraction for any streaming data source. They are conveyed as SL-packetized (Sync Layer-packetized) streams at the Stream Multiplex Interface. This packetized representation additionally provides timing and synchronization information, as well as fragmentation and random access information. The receiver extracts this timing information from the Sync Layer to enable synchronized decoding and, subsequently, composition of the Elementary Stream data.
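As a rough picture of what travels with each packet, the record below lists the kinds of side information an SL packet can carry alongside its payload. The real SL packet header is a configurable bit-level structure defined per stream; this flat record is an assumed simplification.

```python
# An illustrative flat record of SL packet side information.

from dataclasses import dataclass
from typing import Optional

@dataclass
class SLPacket:
    payload: bytes
    access_unit_start: bool = False        # fragmentation: first packet of an AU
    random_access_point: bool = False      # safe point to begin decoding
    decoding_time_stamp: Optional[int] = None      # in time-base ticks
    composition_time_stamp: Optional[int] = None   # in time-base ticks
    object_clock_reference: Optional[int] = None   # conveys the time base
```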

5. Multiplexing of Elementary Streams (FlexMux)

MPEG-4 defines the concept of a TransMux Layer, a generic abstraction of the transport protocol stacks of existing delivery layers that may be used to transmit and store content complying with the MPEG-4 Standard. The functionality of this layer is not in the scope of this specification; only the interface to this layer is defined. A wide variety of delivery mechanisms exists below this interface; for example, a file is considered a particular instance of a TransMux. For applications where the desired transport facility does not fully address the needs of an MPEG-4 service, a simple multiplexing tool (FlexMux) is defined that provides low delay and low overhead.
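The sketch below shows FlexMux-style interleaving in the spirit of its simple mode, where each packet carries a one-byte channel index and a one-byte payload length, which is where the low overhead comes from. Treat the exact index ranges and framing as illustrative assumptions rather than the bit-exact normative syntax.

```python
# Illustrative FlexMux-style framing: a channel index byte, a length
# byte, then the payload. Index limits below are assumptions.

def flexmux_pack(channel, payload):
    assert 0 <= channel < 240 and len(payload) <= 255  # simple-mode-like limits
    return bytes([channel, len(payload)]) + payload

def flexmux_unpack(stream):
    """Split concatenated packets back into (channel, payload) pairs."""
    pos = 0
    while pos < len(stream):
        channel, length = stream[pos], stream[pos + 1]
        yield channel, stream[pos + 2:pos + 2 + length]
        pos += 2 + length

muxed = flexmux_pack(3, b"audio AU") + flexmux_pack(7, b"video AU")
print(list(flexmux_unpack(muxed)))  # [(3, b'audio AU'), (7, b'video AU')]
```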

6. OCI Data Stream

An Object Content Information (OCI) stream carries descriptive information about audiovisual objects. The stream is organized as a sequence of small, synchronized entities called events that contain information descriptors. The main content descriptors are: content classification descriptors, keyword descriptors, rating descriptors, language descriptors, textual descriptors, and descriptors about the creation of the content. These streams can be associated with other media objects through the mechanisms provided by the Object Descriptor. When Object Content Information is not time-variant (and therefore does not need to be carried in an elementary stream by itself), it can be included directly in the related ES Descriptor(s).
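For illustration, an OCI event can be pictured as a time-bounded record of such descriptors. The sketch below is an assumed simplification, not the normative binary event syntax.

```python
# An assumed simplification of an OCI event: a validity window plus a
# bag of descriptors mirroring the categories listed above.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class OCIEvent:
    start_time: int       # when the description becomes valid (ticks)
    duration: int
    descriptors: Dict[str, List[str]] = field(default_factory=dict)

event = OCIEvent(start_time=0, duration=60_000,
                 descriptors={"keyword": ["news", "weather"],
                              "language": ["en"],
                              "rating": ["PG"]})
```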