W3C

SMIL 2.0 Extension for Professional Multimedia Authoring - Preliminary Investigation

W3C Note 12 May 2003

This version:
http://www.w3.org/TR/2003/NOTE-SMIL2-AuthExt-20030512
Latest version:
http://www.w3.org/TR/SMIL2-AuthExt/
Author:
Yoshihisa Gonno, Sony Corporation ygonno@sm.sony.co.jp
Contributors:
See Acknowledgement

Abstract

We are investigating the possibilities of SMIL 2.0 for professional multimedia content authoring. In this document we itemize requirements for the content description capabilities of professional multimedia production and assess how far SMIL 2.0 can satisfy them. The requirements are categorized in terms of material identification and content description, and cover not only media synchronization but also audiovisual special effects. Finally, we propose several possible extensions to SMIL 2.0 function modules, clarifying the limitations of the current SMIL 2.0.

Status of this Document

This document is a Note that is the result of an acknowledged Submission from Sony Corporation, made available by the W3C for discussion only. Please read the Submission request and W3C Staff Comment. Publication of this Note by W3C indicates no endorsement of its content by W3C, nor that W3C has, is, or will be allocating any resources to the issues addressed by the Note. The list of acknowledged W3C Submissions is available at the W3C Web site.

This submission is a proposal for resuming the SYMM Working Group to improve the SMIL 2.0 Specification. The proposal seeks out the possibility of using SMIL for professional multimedia authoring. Introducing SMIL-based content authoring into professional multimedia production is also expected to expand the possibilities of SMIL-based content on the Web.

This document is a work in progress and may be updated, replaced, or rendered obsolete by other documents at any time.

A list of current W3C technical documents can be found at the Technical Reports page.

Table of contents

1. Introduction
2. Background of Professional Production Workflow
3. Requirements for Professional Multimedia Authoring
4. Extension of SMIL 2.0
5. Summary
References
Acknowledgment

1. Introduction

Most professional content production environments have been replaced with digital systems and networked with one another. Parts of the consumer environment have also already been replaced with digital systems in quite a number of regional services. Such change raises expectations that both content production and delivery can be connected seamlessly from professional systems to consumer systems. With this in mind, we have reported basic ideas on a future framework for the integration of TV and the Web [ygonno9806] [ygonno9808] [fnishio9808] [ygonno9909].

On the other hand, the use of XML-based content markup languages has become popular on the Web, not only for conventional Web pages written in HTML but also for sophisticated multimedia content written in SMIL. SMIL (Synchronized Multimedia Integration Language) was developed by the W3C (World Wide Web Consortium) in 1998 as SMIL 1.0[SMIL1.0]. Following improvements in the XML architecture, it was reworked as SMIL 2.0[SMIL2.0] in 2001, introducing sophisticated new functionality. SMIL 2.0 is based on a modularized design, which is also the basis of the next generation of XML applications, beginning with XHTML 1.1[XHTML1.1].

In this document we present a preliminary investigation of how such Web technologies can enhance professional production systems and conventional TV broadcasting services. In the following sections, we describe the background of the professional content production workflow. We then itemize requirements for multimedia authoring functionality in the professional content production environment, in terms of material identification and content description capabilities. Finally, we propose several possible extensions to SMIL 2.0 function modules, assessing how far SMIL 2.0 can satisfy the requirements and clarifying the limitations of the current SMIL 2.0.

2. Background of Professional Production Workflow

2.1 Destructive Authoring: Mixture of Media Materials and Content Descriptions

While the content authoring environment is becoming distributed and networked, multimedia authoring systems still take a conventional destructive approach to creating new materials from source materials. In this approach, synchronization, transition, composition and other professional special effects are applied directly to the media materials, which means that all editorial metadata, such as the time coordinates of synchronization and the types of special effects, is lost in the resulting materials or, at best, saved in proprietary formats. We call this approach destructive because the resulting materials cannot preserve the information of the original source materials. Such an approach is neither convenient nor efficient when modifying or re-using effects independently of the media materials themselves.

2.2 Non-Destructive Authoring: Separation of Media Materials and Content Descriptions

Considering the above, we can take a non-destructive approach that separates content descriptions from media materials, which is the conventional way for markup languages on the Web, e.g. HTML and SMIL. By adopting this approach in the professional content authoring environment, content production processes are expected to change drastically. During production, media materials need not be touched and deformed directly, but only referenced through networks or local file systems. The actual work of authoring content then amounts to marking up the relationships and structure among media materials, describing synchronization, transition, composition and other special effects. This markup of content structure can be saved and re-used separately from the media materials. In this investigation we explore the possibility of SMIL 2.0 and its extension as a content markup language for this purpose.

3. Requirements for Professional Multimedia Authoring

Requirements for professional multimedia authoring can be discussed from two different aspects: one is how to identify media materials, the other is how to describe multimedia content. In this section we itemize the requirements that must be satisfied for professional multimedia authoring.

3.1 Material Identification

3.1.1 Unique Material Identifier (UMID)

UMID[SMPTE-UMID] is a prospective globally unique material identifier dedicated to professional multimedia material identification, standardized by SMPTE. Although an original basic UMID is 32 bytes of binary data, a UMID can be transformed into a textual format in the form of a URI so that it can be handled as a conventional identifier in XML documents.
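As a hypothetical illustration (no textual UMID scheme is specified in this document, and the URN form shown here is only an assumption), a basic UMID rendered as a URI might be used to reference a material directly:

Description Example
<ref src="urn:smpte:umid:060A2B34.01010105.01010D20.13000000.D2C9036C.8F195343.AB7014D2.D718BFDA"/>

Here the 32 bytes of the basic UMID are rendered as hexadecimal groups; any equivalent textual encoding would serve the same purpose.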

3.1.2 Media Track Identification

A single media material may have multiple audio and video tracks sharing a single timeline. Every track must be identifiable independently, as if each track were an individual media material.

3.1.3 Media Time Identification and Destination

Precision time codes, such as SMPTE time codes[SMPTE-Time], may be used for identifying the temporal coordinates not only of source materials but also of resulting materials. In order to allow editors to specify exact time addresses within a media material, it must be possible to embed precision time codes in any destination material.

3.1.4 Temporal Duration Media Clipping

Visual and audio materials may be clipped into temporal segments with specific durations, and such clipped materials may then be re-used as new source materials during the authoring process. Every clipped temporal segment must be identifiable, as if each segment were an individual media material.

3.1.5 Spatial Region Media Clipping

Parts of visual materials may be clipped into spatial segments covering specific areas, and such clipped materials may then be re-used as new source materials during the authoring process. Every clipped spatial segment must be identifiable, as if each segment were an individual media material.

3.1.6 Intermediate Media Identification

Content authoring processes may consist of a large number of elementary editing processes combined one after another over the distributed workflow. Every intermediate media component must be identifiable, as if each component were an individual media material.

3.2 Content Description

3.2.1 Metadata Description

There is a need to deal with metadata about content based on various types of metadata schemas, e.g. MPEG-7, Dublin Core and TV-Anytime, many of which are industry-dependent and expressed not in RDF but only in XML. Such metadata must be properly embeddable in content descriptions.

3.2.2 Metadata Synchronization

Some metadata need not only to be associated with media objects but also to be synchronized along the timeline of those media objects. Such metadata should be considered a kind of media object itself.

3.2.3 Media Transition

This is one of the most important capabilities for professional content authoring processes: it allows media materials to be connected to one another with special effects. A conventional set of visual transition effects is defined in SMPTE 258M[SMPTE-EDL]. At a minimum, these transition effects must be supported. Furthermore, several conventional visual transition effects omitted from that standard, such as fading and dissolving, must also be supported. In addition to visual transition effects, the behavior of audio transition effects must be clearly defined.

3.2.4 Media Composition and Transformation

Although this is another important capability for professional content authoring, there is no standardized definition of media composition and transformation effects. Below is a brief list of conventional composition and transformation effects.

Visual Transparency (Alpha-Blending)

Visual materials may be overlapped with some transparency specified by alpha parameters. Alpha-blending effects must be described on any visual materials.

Visual Deformation

Visual materials may be deformed by filtering original visual information. There are several conventional visual deformation effects, such as mosaic or frosting. Visual deformation effects must be described on specific areas of any visual materials.

Color Effect

Visual materials may be modified within color space. There are several conventional color effects, such as monotone, sepia or negative. Color effects must be described on any visual materials.

Chroma-Key Effect

Visual materials may be modified only on specific key colors while the rest of the colors remain untouched. In many cases this effect is applied to blue-screen shots, making all blue areas transparent. Chroma-key effects must be describable on any visual material.

Audio Track Manipulation

Professional audio materials usually have multiple track resources within a single media material. Such resources may be manipulated independently during the authoring process, e.g. swapping the left and right channels, or downmixing 5.1 surround to 2-channel stereo. Such track manipulations must be describable on any audio material.

Sound Level Control

Sound levels of audio materials may be manipulated individually, so they must be described on any audio materials.

Sound Effect

Audio materials may be reverberated with specific parameters not only to make simple reverberations but also to simulate special sound environments such as concert hall or theater. Sound effects must be described on any audio materials.

3.2.5 Animated Text

This is also one of the most conventional functions in professional media authoring processes. It must be possible to overlay and animate any textual data on visual materials with sophisticated layout functionality.

3.3 Profiles and Conformance

There remain several aspects that concern system implementation rather than the functionality of the language. Each professional authoring product may have different minimum requirements depending on its purpose, e.g. video effector, audio mixer or cut editor. To allow efficient implementation of a wide range of products, the individual functions mentioned in the previous sections should be properly modularized. To keep conformance points between such products in reasonable order, the minimum language profile should be made as lightweight as possible.

4. Extension of SMIL 2.0

In this section we clarify the limitations of SMIL 2.0 and how it should be extended in order to satisfy the requirements. Each extension is not a final proposal; rather, it is intended to raise discussion.

4.1 Media Object Modules

4.1.1 Professional Media Clipping

Although the SMIL 2.0 MediaClipping Module allows the clipBegin/clipEnd attributes to be specified with SMPTE time codes, there is no clear description of how clipping points are measured in media with independently embedded time codes. We define that clipBegin/clipEnd attributes with SMPTE time codes shall measure clipping points by the time codes embedded in the media material, even if they are inconsistent with the normal media playback time.

Description Example
<ref src="media1" clipBegin="smpte=01:02:00:00" clipEnd="smpte=01:06:00:00"/>

  media1 |=========+++++++++++++++++++++=========|
        01:00     01:02     01:04     01:06     01:08
  ref              |+++++++++++++++++++|
                  00:00     00:02     00:04
        (Only + regions are active.)

Given that media1 is an 8-minute-long material whose embedded time codes
start from 01:00:00:00 rather than 00:00:00:00,
this ref element represents a 4-minute-long media
clipped from the 2-minute position to the 6-minute position of media1.

<ref src="media2" clipBegin="smpte=00:02:00:00" clipEnd="smpte=00:06:00:00"/>

  media2               //    |=========|
                       //   01:00     01:08
  ref    |=========|
        00:00     00:04

Given that media2 is an 8-minute-long material with the same embedded time codes as media1,
media2 has no actual media
for the duration specified by the clipBegin/clipEnd attributes.
The ref element therefore represents a 4-minute-long empty time container.

4.1.2 Media Type Casting

The SMIL 2.0 Media Object Modules state that the MIME content type should determine how the media is to be handled by the system. However, it is not reasonable to assume that the media types of all intermediate media materials can be identified during the content authoring process. We therefore introduce a rule that a media object shall be projected into the media type indicated by the media object element itself. According to this rule, for example, an audio element shall project the media exclusively into the audio media type, while a video element shall project the media exclusively into the video media type, even if the media itself consists of both audio and video.
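As a sketch of this rule, suppose media3 is a hypothetical intermediate material containing both audio and video:

Description Example
<audio src="media3"/>
<video src="media3"/>

The audio element projects media3 exclusively into the audio media type, while the video element projects the same material exclusively into the video media type.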

4.1.3 Media Track Identification and Destination

The media type is not sufficient to identify and specify individual media track resources. We introduce an XPointer-like track identifier, track(), as an extension of conventional URIs to specify media track resources. We also introduce a cast attribute to specify a destination track. The names of media tracks are implementation-dependent parameters, so they must be defined by the system.

Description Example
<audio src="media1#track('A1')" cast="right"/>

This audio element represents an audio media resource
projected into the right track from the A1 track of media1.

As a possible alternative description, we should also consider specifying track resources as attribute values.

Description Example
<audio src="media1" track="A1" .../>

4.1.4 Visual Area Clipping

SMIL 2.0 does not provide any means to identify and specify spatial regions within the whole view area. We introduce a coords attribute to specify a partial region of the original visual material. The coordinates shall be specified as a pair of diagonal relative x-y coordinates measured from the top-left corner of the original visual area.

Description Example
<video src="media2" coords="0%,0%,50%,100%"/>

This video element represents a video media resource
clipped to the left half of the original visual area of media2.

4.1.5 Media Transparency Modification

While the SMIL 2.0 BasicLayout Module allows the background color of region elements to be transparent by default, it does not provide any means to control the transparency of the media itself. In order to support alpha-blending effects, we introduce the alpha attribute not only on the media object elements but also on the time container elements, so this extension will affect the Timing and Synchronization Modules. The alpha attribute should have a value between 0.0 and 1.0 corresponding to the media transparency.
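A minimal sketch of this extension (the media names are illustrative):

Description Example
<par>
  <video src="media1"/>
  <video src="media2" alpha="0.5"/>
</par>

The second video element is rendered half-transparent, blended over the first, producing a simple alpha-blending composition.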

4.1.6 Audio Level Control

While the audio level can be set by the soundLevel attribute of the region element, it is not possible to specify individual sound levels for multiple audio materials laid on the same region. We introduce this attribute not only on the media object elements but also on the time container elements, so this extension will also affect the Timing and Synchronization Modules. The soundLevel attribute should have the value and function defined in the SMIL 2.0 AudioLayout Module.
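A minimal sketch of this extension (the media names are illustrative; the percentage values follow the SMIL 2.0 AudioLayout Module):

Description Example
<par>
  <audio src="media1" soundLevel="100%"/>
  <audio src="media2" soundLevel="40%"/>
</par>

The two audio materials laid on the same region are mixed with individual sound levels.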

4.2 Timing and Synchronization Modules

4.2.1 Media Time Destination

SMIL 2.0 does not provide any means to embed SMPTE time codes into media object elements or time container elements. In order to identify the exact time addresses of composed media elements, we introduce two extensions: first, time container elements may carry the time attributes begin and end; second, SMPTE time code values may be applied to these time attributes. With these extensions, the resulting elements can be regarded as actual media materials in which SMPTE time codes are embedded.

Description Example
<ref src="media" begin="smpte=01:00:00:00" end="smpte=01:05:00:00"/>

  media  |+++++++++++++++++++|
        03:00:00:00         03:05:00:00 (embedded time codes)
        01:00:00:00         01:05:00:00 (applied time codes)

This ref element represents a media resource
which has exact SMPTE time codes from 01:00:00:00 to 01:05:00:00.
These attributes override original SMPTE time codes
which may be embedded in the source media.

<seq begin="smpte=01:00:00:00">
  <ref src="media1"/>
  <ref src="media2"/>
</seq>

  media1 |+++++++++++++++++++|
        03:00:00:00         03:04:00:00 (embedded time codes)
  media2                     |+++++++++++++++++++|
                            02:00:00:00         02:04:00:00 (embedded time codes)
  seq    |+++++++++++++++++++++++++++++++++++++++|
        01:00:00:00         01:04:00:00         01:08:00:00 (applied time codes)

This seq element represents a media resource
which has exact SMPTE time codes starting from 01:00:00:00.
These attributes override original SMPTE time codes
which may be embedded in the source media, media1 and media2.

4.2.2 Time Container Clipping

The SMIL 2.0 MediaClipping Module allows the media clipping attributes, clipBegin and clipEnd, to be applied only to media object elements. This limitation prevents time container elements from being clipped into temporal media segments. We extend this module to allow the media clipping attributes to be applied to the time container elements, par and seq. With this extension, time container elements can be clipped into temporal segments as if they were media object elements.

Description Example
<par clipBegin="smpte=01:02:00:00" clipEnd="smpte=01:06:00:00">
  <ref src="media1" begin="smpte=01:00:00:00" end="smpte=01:04:00:00"/>
  <ref src="media2" begin="smpte=01:04:00:00" end="smpte=01:08:00:00"/>
</par>

  media1 |=========++++++++++|
  media2                     |++++++++++=========|
        01:00     01:02     01:04     01:06     01:08
  par              |+++++++++++++++++++|
                  00:00     00:02     00:04
        (Only + regions are active.)

This par element represents a media resource
which consists of the last half of media1 and the first half of media2.

4.3 Metainformation Module

The SMIL 2.0 Metainformation Module allows the metadata element to include RDF descriptions. We extend this module so that the element can include any type of XML description, introducing a type attribute that specifies the MIME content type. The behavior for unknown metadata may be left implementation-dependent.

Description Example
<head>
  <metadata type="text/xml">
    <mp7:mpeg7 xmlns:mp7="urn:mpeg:mpeg7:...">
      ....
    </mp7:mpeg7>
  </metadata>
  <metadata type="application/rdf+xml">
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      ....
    </rdf:RDF>
  </metadata>
</head>

While we recognize that there is already a proper way to embed non-RDF XML descriptions within RDF descriptions, by using rdf:parseType="Literal", the extension introduced above still seems useful because it allows us to avoid redundant RDF wrapping.

4.4 Transition Effects Modules

4.4.1 Audio Transition

SMIL 2.0 does not define the behavior of audio during a transition. A possible definition is that the audio level follows the progress of the transition.
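Under this definition, an audio cross-fade might be sketched with the existing SMIL 2.0 transition syntax (the applicability of transIn/transOut to audio elements is an assumption of this proposal):

Description Example
<transition id="xfade" type="fade" subtype="crossfade" dur="2s"/>
...
<audio src="media1" transOut="xfade"/>
<audio src="media2" transIn="xfade"/>

During the 2-second transition, the sound level of media1 decreases while that of media2 increases in step with the transition progress.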

4.4.2 Effects on Time Container Elements

SMIL 2.0 allows transition effects to be applied only to media object elements. This limitation makes descriptions redundant in many situations, since attributes must be appended to every media object included in a time container element. We extend this module to allow the transition attributes, transIn and transOut, to be applied to the time container elements, par and seq, so that transition effects can be applied to time container elements directly.
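A sketch of this extension (the transition definition follows the conventional SMIL 2.0 syntax; the media names are illustrative):

Description Example
<transition id="wipe1" type="barWipe" dur="1s"/>
...
<par transIn="wipe1">
  <video src="media1"/>
  <video src="media2"/>
</par>

The wipe is applied once to the composed result of the par element instead of being repeated on every media object it contains.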

4.5 Transformation Effect Modules

We define some typical functionality as an additional module group, since it is naturally not included in SMIL 2.0 and does not seem suitable as an extension of any of the pre-defined module groups.

The Transformation Effect Modules define one of the most typical sets of functionality required for professional multimedia authoring: the description of audio and visual effects on media presentations. The syntax and structure of these modules can be designed similarly to the Transition Effects Modules. We define a BasicTransformations Module and an InlineTransformations Module, which provide the transformation element and the transformationFilter element, respectively. Both elements may have largely the same attributes identifying the nature of the effects, e.g. type, subtype, dur, begin, end and coords, whose functionality is not explained here in detail because it is almost the same as in the other parts of SMIL 2.0.

4.5.1 BasicTransformations Module

transformation element
This element may appear as a child of the head element.
transform attribute
This attribute is added to all media object elements listed in the Media Object Modules.
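A possible description based on this module (the id and parameter values are illustrative):

Description Example
<head>
  <transformation id="sepia1" type="colorEffect" subtype="sepia"/>
</head>
...
<video src="media1" transform="sepia1"/>

The transform attribute refers to a transformation element defined in the head, in the same indirect style as SMIL 2.0 transitions.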

4.5.2 InlineTransformations Module

transformationFilter element
This element may appear as a child of all media object elements, e.g. audio and video, and all time container elements, par and seq.
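A possible description based on this module (the parameter values are illustrative):

Description Example
<video src="media2">
  <transformationFilter type="visualDeform" subtype="mosaic"
                        coords="25%,25%,75%,75%" begin="2s" dur="5s"/>
</video>

A mosaic effect is applied to the central region of media2, beginning 2 seconds in and lasting 5 seconds.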

4.5.3 Types of Transformation

We introduce the following types of transformation based on the requirements. Each type value may be accompanied by a subtype attribute and other parameter values specifying the details of the effect.

type value list

visualDeform

This type value represents visual deformation effects, which include mosaic, frosting and crystal effects as conventional subtypes.

colorEffect

This type value represents color modification effects, which include monotone, sepia and negative as conventional subtypes.

chromaKey

This type value represents chroma-key effects, which are accompanied by a key color value as a parameter of the effect.

soundEffect

This type value represents special sound effects, which include hall, theater and stadium as conventional subtypes.
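For example, a chroma-key effect might be described as follows (the color attribute name for the key color parameter is an assumption of this sketch):

Description Example
<video src="media1">
  <transformationFilter type="chromaKey" color="#0000FF"/>
</video>

All blue areas of media1 are made transparent.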

4.6 Conformance to SMIL 2.0 and SMIL Professional Language Profiles

All the extensions and additions introduced in this investigation should be made within the XML modularization framework. We also have to define additional language profiles that keep conformance with SMIL 2.0, so that implementers can find the best conformance points for their products.

SMIL 2.0 requires that any SMIL 2.0 conformant language profile include the 11 functional modules listed in SMIL 2.0 Host Language Conformance. From the professional content authoring point of view, this requirement includes supplemental functionality. For example, the BasicLinking Module, which provides one of the essential functions of Web content, is not essential for video editing systems. Requiring proper implementation of such supplemental functionality could raise the cost of low-end products. In that sense, a minimum SMIL profile needs to be reconsidered and made as lightweight as possible.

5. Summary

In this document we have made a preliminary investigation of the possibilities of SMIL 2.0 for professional multimedia authoring and revealed its capabilities and limitations. SMIL 2.0 is one of the most promising markup languages for multimedia content on the Web, and it appears useful for professional content authoring description as well. Its limitations are expected to be overcome by extending and adding function modules to the current SMIL 2.0. Those modifications should be designed within the XML modularization framework so that the extensions conform to the original SMIL 2.0 and other XML applications. We will make further investigations to bring real innovation to the professional production industry.

References

Acknowledgment

During this investigation we received many important suggestions from engineers in the broadcasting and professional division of Sony Corporation. Those suggestions, based on their experience and expertise in professional media system development, made this investigation more realistic and gave me a clear awareness of the issues. I deeply appreciate their cooperation.