An Overview of SMIL Timing and Media Attributes for Audio and Video Objects

W3C Working Group Note 20 November 2007

Dick Bulterman, CWI


This document provides an overview of the basic timing, media control and temporal linking concepts that are used within a SMIL presentation. It should enable maximum re-use of existing Web technology in other XML languages.

Status of this document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This is only an editor's copy.

This document is a Public Working Draft of a future Working Group Note. It has been produced by the SYMM Working Group as part of the W3C Synchronized Multimedia Activity. The goals of the SYMM Working Group are discussed in the SYMM Working Group Charter. The authors of this document are the SYMM Working Group members. Different parts of the document have different editors.

Feedback can be directed to the public mailing list www-smil@w3.org (public archives). Please include the prefix '[SMIL AudioVideoControlConcepts - Note]' in the subject line.

Publication as a Working Group Note does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.


An important topic at this year's W3C Technical Plenary meeting was the elevation of video and audio objects to first-class objects in an HTML presentation. In the current HTML-5 specification, a new set of elements and attributes is proposed to facilitate this promotion. Since the W3C's Synchronized Multimedia working group (SYMM) has been looking into related issues for the past 11 years in several editions of the SMIL language, we felt that it might be useful to summarize the SMIL timing and media control concepts already in place, so that maximum re-use of existing Web technology can be facilitated.

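In this Note, we consider the following topics:

  1. SMIL timing concepts (time containers and durations),
  2. Controlling a single media instance,
  3. Controlling peer media activation and termination,
  4. SMIL temporal linking concepts, elements and attributes.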

We conclude with a comparison table of HTML-5 control attributes and their SMIL equivalents.

Note: at first glance, the direct integration of audio and video objects into a host-level language seems like a straightforward task of activation and runtime control. In reality, many complexities arise because being a first-class citizen of the Web means that audio and video become peer-level citizens with other content, such as text, images and (animated) graphics. This means that synchronization and coordination of peer-level components should not be ignored.

1. Introduction to SMIL Timing Concepts

This section provides an overview of the basic timing concepts that are used within a SMIL presentation. This overview is intended to be a brief introduction, which means that not all of the aspects of special-case timing will be considered.

A SMIL presentation distinguishes two types of timing control: control that determines how much of a particular media object is rendered (and looped) during a single instance of its presentation, and control that determines when a particular media object (or sub-structure of objects) gets activated and terminated relative to other objects in the presentation.

1.1 SMIL Time Containers

A SMIL presentation does not consist of one fixed timeline, but of a nested collection of timelines -- some with pre-defined scheduled durations, some with interactive durations. The hierarchy of timelines is specified using three time containers: the parallel container (<par>), the sequential container (<seq>) and the exclusive container (<excl>). Of these, the <par> container is the most general: it defines a generic timeline on which its children can be scheduled. The <seq> provides a convenience container in which each of the children is scheduled by default to start at the conclusion of its lexical predecessor. The <excl> container allows a number of peer-level candidates to be specified, of which only one will be active at any given point -- starting one of the other peers typically replaces the currently-active peer.

Each continuous media object also defines a pseudo time container: it defines a time base that can be used to bring various pieces of supplemental information (such as link anchors) into scope.

The set of SMIL time containers provides a basis for inter-media synchronization. The following fragment illustrates this:

    <par>
        <video ... />
        <audio ... />
    </par>
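
As a further illustration, the sketch below nests all three containers. It is not taken from any existing presentation: the file names and the ids enButton and nlButton are hypothetical, and layout information is omitted.

    <seq>
        <!-- Step 1: the intro video and its soundtrack play together. -->
        <par>
            <video src="intro.ogg" />
            <audio src="intro-music.ogg" />
        </par>
        <!-- Step 2: starts when the par above has ended. -->
        <par>
            <img id="enButton" src="en.png" dur="indefinite" />
            <img id="nlButton" src="nl.png" dur="indefinite" />
            <!-- Only one child of the excl can be active at a time. -->
            <excl>
                <video src="talk-en.ogg" begin="enButton.activateEvent" />
                <video src="talk-nl.ogg" begin="nlButton.activateEvent" />
            </excl>
        </par>
    </seq>

With the default <excl> behavior, activating the second button while the first video is playing stops that video and starts the other one.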

In an HTML reuse of SMIL, there are three options for supporting time container concepts:

  1. Support a full integration of SMIL and HTML features in a manner similar to XHTML+SMIL, as is done in Microsoft's Internet Explorer 5.5 and later,
  2. Support a partial integration of SMIL by integrating the current proposal on SMIL Timesheets 1.0,
  3. Implement pseudo-containers locally in HTML-5. The potential pitfall here is that a quick integration of video and audio as autonomous objects will lead to problems in future versions when more coordination or selection among multiple items is required.

Whatever the choice, it seems useful to maintain a time-container model for the media objects themselves.

1.2 Relevant SMIL Duration Concepts

There are several duration-related concepts in SMIL that are important for reuse in HTML, since these concepts highlight various timing control granularities that will need to be supported. These are:

| Concept | Definition and Use |
|---|---|
| inherent duration | The inherent duration is the 'natural' duration of a media object. The inherent duration of an audio or video object is the duration of the encoding itself. The inherent duration of a piece of (untimed) text or an image is usually thought of as zero seconds, although it actually is the duration of the shortest time increment measured by the scheduler. (There is a difference in persistence behavior between an object that never gets rendered and an object that gets rendered very briefly.) |
| simple duration | The simple duration is the inherent duration of a media object, possibly modified by the specification of a temporal subset using clipBegin and clipEnd attributes. The simple duration can be overridden using the dur attribute. |
| active duration | The active duration is the simple duration of a media object, possibly extended by specifying a loop count or a loop duration. |

An object's inherent duration is not always easy to determine. Some media formats include duration information as part of the object's header, but this is not always the case. Live media feeds (which are globally continuous and have no set beginning or end) have no defined inherent duration. Also, many media encodings do not define the inherent duration of their object; in these cases, the only way to determine the inherent duration is to scan the entire media file.
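
The relationship between the three duration concepts can be sketched with a small worked example. The file name and the 60-second inherent duration are assumptions made purely for illustration.

    <!-- Assumption: interview.ogg has an inherent duration of 60s. -->

    <!-- Simple duration: 15s - 5s = 10s (the sub-clip).
         Active duration: 3 x 10s = 30s. -->
    <video src="interview.ogg" clipBegin="5s" clipEnd="15s" repeatCount="3" />

    <!-- The dur attribute overrides the simple duration: 8s.
         Active duration: 3 x 8s = 24s. -->
    <video src="interview.ogg" clipBegin="5s" clipEnd="15s" dur="8s" repeatCount="3" />

In both cases the scheduler only needs the inherent duration when neither clipEnd nor dur supplies an explicit end for a single pass through the media.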

2. Controlling a Single Media Instance

This section reviews the attributes used to define a single instance of a media object's activation. The attributes are discussed in the following table.

| Attribute Name | Attribute Values | Description | Example |
|---|---|---|---|
| clipBegin | time value | Defines the temporal offset within the media object that serves as the start of a clip. Defaults to '0s'. | <video ... clipBegin="3s" ... /> |
| clipEnd | time value | Defines the temporal offset within the media object that serves as the end of a clip. Defaults to the temporal end of the media object. | <video ... clipEnd="12s" ... /> |
| repeatCount | integer | Defines the number of times the media object (sub-clip) repeats. | <video ... repeatCount="3" ... /> |
| repeatDur | time value | Defines the duration of the looped media object. (The scheduler must determine the actual number of loops required to satisfy the behavior.) | <video ... repeatDur="28s" ... /> |
| fill | "remove" or "freeze" (plus others) | Determines the visual persistence of a media object after the end of its active duration has been reached. The value 'freeze' keeps the last frame in view; the value 'remove' clears to the background color of the rendering space. | <video ... fill="freeze" ... /> |

Although the active duration defines the active lifetime of a media object, most visual objects can also have persistence behavior. This is especially useful within a time container in which not all objects end at the same time.
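
A minimal sketch of such a container follows; the file names are hypothetical. The video's active duration is 9 seconds, while the looped audio keeps the container active for 28 seconds.

    <par>
        <!-- Plays the 9s sub-clip once, then holds its last frame
             until the enclosing par ends. -->
        <video src="clip.ogg" clipBegin="3s" clipEnd="12s" fill="freeze" />
        <!-- Loops the audio for a total of 28s. -->
        <audio src="narration.ogg" repeatDur="28s" />
    </par>

With fill="remove" instead, the video would disappear after 9 seconds and the rendering space would show its background color for the remaining 19 seconds.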

SMIL provides a rich time value syntax, ranging from simple time in seconds to full SMPTE support. Interested readers should consult the SMIL Timing and Synchronization module for details.

3. Controlling Peer Media Activation and Termination

In a SMIL presentation, the start and end of a particular media object do not occur in a vacuum. They must be synchronized with other objects that are also defined in the presentation (or, in HTML terms, on the page). SMIL distinguishes two types of activation behavior: scheduled activation/termination and event-based activation/termination. It is possible to mix both scheduled and event-based activation/termination: the first timing control value that gets resolved will be used to control the object.

The following table reviews the attributes used to define the timing of a media object within the greater context of a presentation. In SMIL terms, these attributes define how a media object behaves relative to its parent time container.

| Attribute Name | Attribute Values | Description | Example |
|---|---|---|---|
| begin | time value or event definition | Defines the begin time of the media object relative to the parent time container either as a time value relative to the media object (or time container), or as an event name. Multiple begin times may be specified; the object will begin when the first of these is resolved and matches the current presentation time. Special cases exist for negative begin times. | <video ... begin="3s" ... /> |
| end | time value or event definition | Defines the explicit end of the active duration, either as a time value relative to the media object (or time container), or as an event name. | <video ... end="12s" ... /> |
| dur | time value | Defines the simple duration of a media object or time container. The simple duration may be extended by specifying a repeat count/duration. | <video ... dur="10s" ... /> |
| endsync | "first", "last" or id-value (plus others) | Defines when a (parent) time container ends: when the first child ends, when the last child ends, or when a particular child with the given id ends. | <par ... endsync="mainVideo" ... /> |
| restart | "always", "whenNotActive" or "never" (plus others) | Determines whether an element that has already started can be restarted based on an event or a scheduled time. | <video ... restart="always" ... /> |

Many of these attributes determine the behavior of an object relative to other objects in the presentation.

At the start of each time container, all of the timing of its constituents may be known to the SMIL scheduler; in this case, all timing is resolved. Interactive timing is typically not known at the start of the time container; this timing is unresolved. The difference between resolved and unresolved timing is important when defining temporal hyperlinks with a media object.
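
The difference between resolved and unresolved timing can be sketched as follows. The ids and file names are hypothetical, and layout is omitted.

    <par endsync="mainVideo">
        <!-- Scheduled: resolved as soon as the par starts. -->
        <video id="mainVideo" src="interview.ogg" begin="3s" dur="10s" />
        <!-- Scheduled relative to a peer: starts 2s after mainVideo begins. -->
        <img id="logo" src="logo.png" begin="mainVideo.begin+2s" dur="5s" />
        <!-- Event-based: unresolved until the user activates the logo. -->
        <audio src="comment.ogg" begin="logo.activateEvent" restart="never" />
    </par>

Here the container ends when mainVideo ends, 13 seconds after the container starts. The audio object plays only if the logo is activated while it is visible, and restart="never" prevents a second activation from starting it again.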

4. SMIL Temporal Linking Concepts, Elements and Attributes

One of the most powerful interaction features of SMIL is the ability to specify time-variant anchors that allow temporal navigation across a presentation or presentation component. The key to this feature is that SMIL does not place anchors in the media content, but defines anchors as peer-level content that is activated along with the media object. Each of the anchors has a visual component (defined by an area attribute), a scheduled component (defined by begin/end/dur attributes) and a link target component (defined by an href attribute).

The following fragment introduces the basics of temporal linking:

    <video src="video.ogg" title="Interview" >
        <area begin="3s" dur="10s" title="first question" href="#question" />
        <area begin="20s" dur="20s" title="first answer" href="..." />
    </video>

In this example, the video object has two anchors defined: one begins 3 seconds into the video and is active for 10 seconds, while the second begins 20 seconds into the video and remains active for 20 seconds. Each anchor covers the entire visual area of the video object. A specific shape and placement relative to the object (using the shape and coords attributes) could also have been defined.

If the first anchor is activated, then a temporal seek is done within the current presentation to the node with the id question. SMIL defines a rich seek semantic to determine which peer objects are activated at this point; most of these facilities are not needed if only a single object is active.

If the second anchor is activated, an external presentation is started. SMIL allows the play state of the source and destination object to be set to play/stop/pause, so that a temporal context can be restored if the source is reactivated.

Another use of temporal anchors is the segmentation of a video element. The following example illustrates this process:

    <video src="video.ogg" title="Interview" >
        <anchor id="video_scene1" begin="0s" dur="10s"/>
        <anchor id="video_scene2" begin="12s" end="28s"/>
    </video>

This would enable the use of the standard URL fragment identifier scheme to address a timed position inside an embedded video, e.g. by appending a fragment such as #video_scene2 to the URL of the presentation.


5. Comparing Existing HTML-5 and SMIL Timing Markup

This section provides a comparison table between the proposed HTML-5 media markup and equivalent facilities in SMIL.

| SMIL Concept | HTML-5 Concept | Notes |
|---|---|---|
| clipBegin | start and loopstart | clipBegin in SMIL applies to all loops of a media object; in HTML-5, start seems to define the initial index into the object, while loopstart defines the index for the second and subsequent loops. |
| clipEnd | end and loopend | The relationship between HTML-5 end and loopend is not clear. SMIL distinguishes between the end of the simple duration and the end of the sub-clip. |
| repeatCount | playcount | These are identical concepts. |
| repeatDur | <not defined> | No explicit setting of the active duration is available. How does HTML-5 specify "repeat forever"? |
| fill | ~poster and implied | The equivalent fill behavior is freeze. Interaction with the poster attribute is not specified. |
| begin | implied | Objects become active when they are playable or based on control-events. No begin delay is provided in HTML-5. |
| end | <not defined> | No explicit setting of the duration. Note: the HTML-5 end attribute functions as clipEnd. |
| dur | <not defined> | No explicit override of the simple duration. |
| endsync | <not defined> | Not used, since no synchronization across objects is provided. |
| restart | implied | The implied value is always. |
| implied | height | The effective height of the object in SMIL is determined by layout properties plus the fit attribute. |
| implied | width | The effective width of the object in SMIL is determined by layout properties plus the fit attribute. |
| implied | controls | SMIL leaves runtime control to the renderer interface. |
| implied | poster | SMIL layout defines a background image on the rendering region to provide this functionality. |

Other issues:

  1. Several SMIL layout attributes allow an author to determine if a video is scaled, and whether or not it is clipped to the rendering region. This behavior is hard-coded in the HTML-5 specification: the video is scaled to the rendering space, preserving the aspect ratio. Note that this can have substantial performance and quality impact.
  2. Audio and video elements have separate element names, but only minor differences in processing behavior. By replacing both with a <media> element, it would seem that more flexible processing could be allowed.
  3. Persistence behavior of a video object is implied in HTML-5: at the end of the video, the last frame is shown. SMIL allows an author to specify this behavior. Since HTML-5 includes a poster attribute, it seems likely that the poster could be a candidate for display instead of the final frame, which is often black.
  4. Video and audio objects may have text content in HTML-5; these could be used if a particular encoding is not supported. SMIL handles this behavior using a <switch> wrapper, so that not only text but an alternative presentation -- such as a slideshow -- could be presented. A <switch> also allows selection among different encodings of the video.
  5. In HTML-5, the src attribute points to the audio/video element. It should be possible to have this point to other objects as well, such as a SMIL presentation that would be embedded into the HTML page. The SMIL rendering agent could be expected to support the player controls defined by HTML-5.
  6. The autoplay attribute specifies whether an object should begin playing 'as soon as it can do so without stopping'. The only way to be sure of this is to prefetch the object, which may not be realistic or possible. SMIL includes a <prefetch> element to control media availability.
  7. HTML-5 makes a distinction between the start and loopstart offset. Our assumption is that the start attribute specifies the initial offset into the video, while loopstart determines the beginning offset of the second and subsequent loops.
  8. The end attribute in HTML-5 has a default value of infinity. Shouldn't this be the end of the media resource?
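
The <switch> behavior mentioned in item 4 can be sketched as follows. The file names and bitrate thresholds are hypothetical; a SMIL player selects the first child whose test attributes evaluate to true, and a child without test attributes acts as the fallback.

    <switch>
        <!-- Chosen when at least 1,000,000 bits/s are available. -->
        <video src="interview-high.ogg" systemBitrate="1000000" />
        <!-- Chosen when at least 56,000 bits/s are available. -->
        <video src="interview-low.ogg" systemBitrate="56000" />
        <!-- Fallback: a short slideshow instead of the video. -->
        <seq>
            <img src="slide1.png" dur="5s" />
            <img src="slide2.png" dur="5s" />
        </seq>
    </switch>

The order of the children matters: the most demanding alternative is listed first, so that the fallback is only reached when none of the preceding alternatives is acceptable.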