State of the Art

From Media Fragments Working Group Wiki
Revision as of 04:55, 19 April 2009 by Pjgenste (Talk | contribs)



  • HTTP / URI Fragment: RFC 3986 (January 2005)
  • RTSP Fragment: RFC 2326 specifies a protocol mechanism for addressing a time offset, but leaves the URI fragment syntax to the server implementation (search for 'fragment' in the RFC)
  • See also the Technology Survey

Web Architecture

Temporal Fragment

SMIL (Jack)

Playing temporal fragments out-of-context

SMIL allows you to play only a fragment of the video by using the clipBegin and clipEnd attributes. How this is implemented, though, is out of scope for the SMIL spec (for http-based URLs, implementations may well fetch the whole media item and cut it up locally):

<video xml:id="toc1" src=""
       clipBegin="12.3s" clipEnd="21.16s" />

It is possible to use different time schemes, which give frame-accurate clipping when used correctly:

<video xml:id="toc2" src=""
       clipBegin="npt=12.3s" clipEnd="npt=21.16s" />
<video xml:id="toc3" src=""
       clipBegin="smpte=00:00:12:09" clipEnd="smpte=00:00:21:05" />
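As a rough check that the notations above describe the same clip, the following sketch converts clip values to seconds. It is only an illustration: the function name clip_to_seconds is made up here, it handles just the three forms shown above, and it assumes a nominal 30 frames per second for the smpte form while ignoring drop-frame correction.

```python
def clip_to_seconds(value, fps=30):
    """Convert a clipBegin/clipEnd value to seconds.

    Handles only the forms shown above: "12.3s", "npt=12.3s" and
    "smpte=HH:MM:SS:FF". fps=30 is an assumption; drop-frame is ignored.
    """
    scheme, sep, rest = value.partition("=")
    if not sep:
        # No explicit scheme given: treat the value as normal play time.
        scheme, rest = "npt", value
    if scheme == "npt":
        return float(rest.rstrip("s"))
    if scheme == "smpte":
        hours, minutes, seconds, frames = (int(p) for p in rest.split(":"))
        return hours * 3600 + minutes * 60 + seconds + frames / fps
    raise ValueError("unsupported time scheme: " + scheme)


begin = clip_to_seconds("smpte=00:00:12:09")
end = clip_to_seconds("smpte=00:00:21:05")
```

Under the 30 fps assumption, frame 09 is 0.3 s into its second and frame 05 is about 0.167 s, so the smpte values land on 12.3 s and roughly 21.17 s, close to the npt values used above.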

Adding metadata to such a fragment is supported since SMIL 3.0:

<video xml:id="toc4" src=""
       clipBegin="12.3s" clipEnd="21.16s">
        <rdf:.... xmlns:rdf="....">

Referring to temporal fragments in-context

The following piece of code will play back the whole video, and during the interesting section of the video allow clicking on it to follow a link:

<video xml:id="tic1" src="">
    <area begin="12.3s" end="21.16s" href=""/>
</video>

It is also possible to link to the relevant section of the video. Jumping to #tic2area will start the video at the beginning of the interesting section. The presentation will not stop at the end of that section, however; it will continue.

<video xml:id="tic2" src="">
    <area xml:id="tic2area" begin="12.3s" end="21.16s"/>
</video>

Playing spatial fragments out-of-context

SMIL 3.0 allows playing back only a specific rectangle of the media. The following construct will play back the center quarter of the video:

<video xml:id="soc1" src=""
       panZoom="25%, 25%, 50%, 50%"/>

Assuming the source video is 640x480, the following line plays back the same:

<video xml:id="soc2" src=""
       panZoom="160, 120, 320, 240" />
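The equivalence of the two panZoom forms can be checked with a small sketch. It assumes that percentages are resolved against the source width (first and third value) and the source height (second and fourth value); the helper name panzoom_to_pixels is made up for this illustration.

```python
def panzoom_to_pixels(pan_zoom, width, height):
    """Resolve a panZoom value (left, top, width, height) to pixels.

    Percentages are taken relative to the source width (1st and 3rd value)
    or the source height (2nd and 4th value); plain numbers are pixels.
    """
    parts = [p.strip() for p in pan_zoom.split(",")]
    if len(parts) != 4:
        raise ValueError("panZoom needs four values")
    bases = (width, height, width, height)
    resolved = []
    for part, base in zip(parts, bases):
        if part.endswith("%"):
            resolved.append(float(part[:-1]) * base / 100)
        else:
            resolved.append(float(part))
    return tuple(resolved)


center_quarter = panzoom_to_pixels("25%, 25%, 50%, 50%", 640, 480)
```

For a 640x480 source this gives the rectangle 160, 120, 320, 240, i.e. the pixel form used in the second example.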

This construct can be combined with the temporal clipping.

It is possible to change the panZoom rectangle over time. The following code fragment will show the full video for 10 seconds, then zoom in on the center quarter over 5 seconds, then show that for the rest of the duration. (The video may be scaled up or centered, or something else, depending on SMIL layout, but this is out of scope for the purpose of this investigation).

<video xml:id="soc3" src=""
       panZoom="0, 0, 640, 480">
    <animate begin="10s" dur="5s" fill="freeze" attributeName="panZoom"
             to="160, 120, 320, 240" />
</video>
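To make the timing concrete, the sketch below computes the panZoom rectangle that would be in effect at a given time for this example. It assumes plain linear interpolation between the start rectangle and the target; the function name panzoom_at is made up for this illustration, and a real SMIL player may behave differently in detail.

```python
def panzoom_at(t, start=(0, 0, 640, 480), to=(160, 120, 320, 240),
               begin=10.0, dur=5.0):
    """Return the (left, top, width, height) rectangle at time t (seconds).

    Before `begin` the start rectangle applies; after `begin + dur` the
    target rectangle is held (fill="freeze"); in between, each component
    is interpolated linearly (an assumption of this sketch).
    """
    if t <= begin:
        return start
    if t >= begin + dur:
        return to
    fraction = (t - begin) / dur
    return tuple(s + (e - s) * fraction for s, e in zip(start, to))


halfway = panzoom_at(12.5)
```

Halfway through the animation (t = 12.5 s) the rectangle is 80, 60, 480, 360, midway between the full frame and the center quarter.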

Referring to spatial fragments in-context

The following bit of code will enable the top-right quarter of the video to be clicked to follow a link. Note the difference in the way the rectangle is specified (left, top, right, bottom) when compared to panZoom (left, top, width, height). This is an unfortunate side-effect of this attribute being compatible with HTML and panZoom being compatible with SVG.

<video xml:id="tic1" src="">
    <area shape="rect" coords="50%, 0%, 100%, 50%" href=""/>
</video>
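Because the two conventions are easy to mix up, here is a minimal sketch of the conversion between them, using pixel values for a 640x480 source. The helper names are made up for this illustration.

```python
def coords_to_panzoom(coords):
    """(left, top, right, bottom) as in area coords ->
    (left, top, width, height) as in panZoom."""
    left, top, right, bottom = coords
    return (left, top, right - left, bottom - top)


def panzoom_to_coords(pan_zoom):
    """(left, top, width, height) as in panZoom ->
    (left, top, right, bottom) as in area coords."""
    left, top, width, height = pan_zoom
    return (left, top, left + width, top + height)


# Top-right quarter of a 640x480 video, written as area coords.
top_right = coords_to_panzoom((320, 0, 640, 240))
# Center quarter, written as panZoom.
center = panzoom_to_coords((160, 120, 320, 240))
```

The top-right quarter is 320, 0, 640, 240 as coords but 320, 0, 320, 240 as panZoom; the center quarter is 160, 120, 320, 240 as panZoom but 160, 120, 480, 360 as coords.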

Other shapes are possible, as in HTML. The spatial and temporal constructs can be combined. The spatial coordinates can be animated, as for panZoom.

Selecting tracks

SMIL has no way to selectively enable or disable tracks in the video. It only provides a general parameter mechanism which could conceivably be used to communicate this information to a renderer, but this would make the document non-portable. Moreover, I know of no such implementations.

<video xml:id="st1" src="">
    <param name="jacks-remove-track" value="audio" />
</video>

Named fragments

SMIL has no way to show named fragments in the base material out-of-context. It has no support for referring to named fragments in-context either, but it does support referring to "media markers" (named points in time in the media) if the underlying media format supports them:

<video xml:id="nf1" src="">
    <area begin="nf1.marker(jack-frag-begin)" end="nf1.marker(jack-frag-end)"/>
</video>

CMML (Silvia)

CMML is a markup language for timed media, i.e. it describes temporally non-overlapping clips that relate to a video or audio file. It is part of a whole architecture for client-server video communication built around Ogg (RFC 3533). The Ogg files are required to have a skeleton, which makes it easy to identify the MIME types of the tracks that make up an Ogg file. A temporal URI is used to specify subparts of videos.

Playing/Referring to temporal fragments out-of-context

In Annodex, a temporal URI is used to play back temporal fragments. The clip's begin and end are specified directly in the URI. With the URI fragment identifier ("#"), the media fragment is expected to be played after the complete resource has been downloaded; with URI query parameters ("?"), the media fragment is expected to be extracted on the server and downloaded to the client as a new resource. Linking to such a resource looks as follows:

<a href="" />
<a href="" />
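The distinction between the two forms can be sketched with Python's standard URL parser. Everything concrete here is an assumption made for illustration: the parameter name t, the example.com URLs and their time values, and the function name temporal_request_kind are not taken from this page.

```python
from urllib.parse import urlsplit, parse_qs


def temporal_request_kind(uri, param="t"):
    """Classify where a temporal URI expects the clipping to happen.

    `param` is the (assumed) name of the time parameter.
    """
    parts = urlsplit(uri)
    if param in parse_qs(parts.query):
        # "?": the server extracts the fragment and serves a new resource.
        return "server-side"
    if param in parse_qs(parts.fragment):
        # "#": the client fetches the resource and plays only the fragment.
        return "client-side"
    return "whole resource"


# Hypothetical URIs; only the "?" versus "#" distinction matters here.
query_form = temporal_request_kind("http://example.com/video.ogv?t=12.3/21.16")
fragment_form = temporal_request_kind("http://example.com/video.ogv#t=12.3/21.16")
```

Note that the fragment part is never sent to the server in an HTTP request, which is why the "#" form implies client-side handling.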

It is possible to use different time schemes, which give frame-accurate clipping when used correctly:

<a href="" />
<a href="" />
<a href="" />

This can also be used, for example, in an HTML5 video or audio tag:

<video src=""/>
<audio src=""/>

Creating metadata for such a clip is done simply by referring to the clip in the annotation. For example, using RDF, it could look like this:

@prefix event: <>.
@prefix ps: <>.
@prefix rdfs: <>.
:chorus1 a ps:Chorus;
    rdfs:label "First chorus";
    event:time [
       url "";

Referring to temporal fragments in-context

To include outgoing hyperlinks in a video, you have to define time-aligned markup for your video (or audio) stream. For this purpose, Annodex uses CMML. Here is an example CMML file that can be used to include outgoing hyperlinks next to or in Ogg streams. ("Next to" means that the CMML file is kept separate from the Ogg file, but the client-side player knows how to synchronise the two; "in" means that CMML is multiplexed as a timed text codec into the Ogg physical bitstream, so that only one file has to be exchanged.) The following defines a CMML clip that has an outgoing hyperlink (this is a partial document extracted from a CMML file):

<clip id="tic1" start="npt:12.3" end="npt:21.16" title="Introduction">
 <a href="">Watch another fish video.</a>
 <meta name="author" content="Frank"/>
 <img src="fish.png"/>
  This is the introduction to the film Joe made about fish.

Note how there is also the possibility of naming a thumbnail, providing metadata, and giving a full description of the clip in the body tag.

Interestingly, you can also address into temporal fragments of a CMML file, since it is a representation of a time-continuous data resource:

<a href="" />

And you can address into named temporal regions of a CMML file:

<a href=""tic1" />

Playing/Referring to spatial fragments out-of-context and in-context

CMML/Annodex/temporal URI has no means of specifying and addressing spatial fragments.

Selecting tracks

Tracks are an orthogonal concept to time-aligned annotations. Therefore, Xiph/Annodex have invented another way of describing and annotating them. It is fairly new (since January 2008) and is called ROE (for Rich Open multitrack media Encapsulation). With ROE you describe the composition of your media resource on the server. This file can also be downloaded to a client to find out about the "capabilities" of the file. It is, however, mainly used for authoring on the fly: depending on what a client requires, the ROE file can be used to find the different tracks and multiplex them together. Here is an example file:

   <link rel="alternate" type="text/html" href="" />
   <track id="v" provides="video">
     <mediaSource id="v0" src="" content-type="video/ogg" />
     <mediaSource id="v1" src="" content-type="video/theora" />
   <track id="a" provides="audio">
     <mediaSource id="a1" src="" content-type="audio/vorbis" />
   <track id="c1" provides="caption">
     <mediaSource src="" content-type="text/cmml" />
   <track id="c2" provides="ticker">
     <mediaSource src="" content-type="text/cmml" />
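The "find the different tracks" step can be sketched as follows. This is an illustration only: the ROE root element, the closing tags, the src file names and the function name select_sources are assumptions added to make the sample parseable; only the track, provides and mediaSource names come from the example above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified ROE document (root element and src values assumed).
ROE_SAMPLE = """
<ROE>
  <track id="v" provides="video">
    <mediaSource id="v0" src="video.ogv" content-type="video/ogg" />
  </track>
  <track id="a" provides="audio">
    <mediaSource id="a1" src="audio.oga" content-type="audio/vorbis" />
  </track>
  <track id="c1" provides="caption">
    <mediaSource src="captions.cmml" content-type="text/cmml" />
  </track>
</ROE>
"""


def select_sources(roe_xml, wanted):
    """Return (track id, src) pairs for tracks whose `provides` value
    is in `wanted`, in document order."""
    root = ET.fromstring(roe_xml.strip())
    selected = []
    for track in root.iter("track"):
        if track.get("provides") in wanted:
            for source in track.iter("mediaSource"):
                selected.append((track.get("id"), source.get("src")))
    return selected


# A client that wants video plus captions, but no audio.
sources = select_sources(ROE_SAMPLE, {"video", "caption"})
```

A server could then multiplex just the selected sources into the stream it returns to that client.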

This has not been completely worked through and implemented, but Metavid is using ROE as an export format to describe the different resources available as subparts of one media resource. We actually tried creating a SMIL profile here, but it didn't quite work out with the required elements and attributes and would have resulted in a more verbose specification. Instead, we borrowed from SMIL what was appropriate.

ROE is also used to create the Skeleton in a final multiplexed file. Thus, the information inherent in ROE goes into the file (at least virtually) and can be used to extract tracks through a URI:

<video src=""/>

Named fragments

With CMML and ROE, all tracks and clips that are specified can be addressed through a URI - both in-context and out-of-context.

<video src="" />

MPEG-21 Part 17: Fragment Identification of MPEG Resources (Davy / Silvia)

Four different schemes are specified in MPEG-21 Part 17 to address parts of media resources:

  • ffp()
  • offset()
    • applicable to any digital resource.
    • identifies a range of bytes in a data stream.
    • similar in functionality to the HTTP byte range mechanism
    • example:
  • mp()
    • applicable for media resources whose Internet media type (or MIME type) is equal to audio/mpeg, video/mpeg, video/mp4, audio/mp4, or application/mp4.
    • provides two complementary mechanisms for identifying fragments in a multimedia resource:
    • a set of so-called axes (i.e., temporal, spatial or spatiotemporal) which are independent of the coding/container format.
    • a hierarchical logical model of the resource. Such a logical model is dependent on the underlying container format (e.g., audio CD contains a list of tracks). The structures defined in these logical models are accessed with a syntax based on XPath.
    • for the temporal axis, the following time schemes are supported: NPT, SMPTE, MPEG-7, and UTC.
    • for the spatial axis, the following shapes are supported: ellipse, polygon, and rectangle
    • support for moving regions is also present
    • examples:
  • mask()
    • applicable for media resources whose Internet media type (or MIME type) is equal to video/mp4 or video/mpeg.
    • addresses a binary (1-bit deep) mask defined in a resource. Note that this mask is meant to be applied to a video resource and that the video resource may itself be the resource that contains the mask.
    • example:

Hierarchical combinations of addressing schemes are also possible. The '*' operator is used for this purpose. When two consecutive pointer parts are separated by the '*' operator, the fragments located by the first pointer part (to the left of the '*' operator) are used as a context for evaluating the second pointer part (to the right of the '*' operator). The following example addresses the first 50 seconds in a bitstream which is first located using its item_ID:*mp(~time('npt','0','50'))
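The left-to-right evaluation order can be illustrated by splitting a fragment into its pointer parts at each top-level '*'. This is a sketch, not an MPEG-21 parser: the function name split_pointer_parts is made up, and the ffp(x) part in the usage line is a placeholder, since the argument of the first pointer part is not given above.

```python
def split_pointer_parts(fragment):
    """Split a fragment into pointer parts at each '*' that is outside
    parentheses and outside quoted strings. Parts are returned in
    evaluation order: each part is the context for the next one."""
    parts, current = [], []
    depth = 0       # current parenthesis nesting level
    quote = None    # the quote character we are inside, if any
    for ch in fragment:
        if quote:
            if ch == quote:
                quote = None
        elif ch in "'\"":
            quote = ch
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "*" and depth == 0:
            parts.append("".join(current))
            current = []
            continue
        current.append(ch)
    parts.append("".join(current))
    return parts


# "ffp(x)" is a placeholder for the first pointer part.
parts = split_pointer_parts("ffp(x)*mp(~time('npt','0','50'))")
```

Here the first part would locate the bitstream, and the mp() part would then be evaluated within that context.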

Spatial Fragment (Raphael)

Image Maps

  • Definitions:
  • Client-side image maps:
    • The MAP element specifies a client-side image map. An image map is associated with an element via the element's usemap attribute. The content model of the MAP element then includes either AREA elements or A elements that specify the geometric regions and the links associated with them
    • Possible shapes are: rectangle (rect), circle (circle) or arbitrary polygon (poly)
    • Example:
<img src="image.gif" usemap="#my_map"/>
<!-- Variant 1: regions given as A elements inside the MAP -->
<map name="my_map">
  <a href="guide.html" shape="rect" coords="0,0,118,28">Access Guide</a> |
  <a href="shortcut.html" shape="rect" coords="118,0,184,28">Go</a> |
  <a href="search.html" shape="circle" coords="184,200,60">Search</a> |
  <a href="top10.html" shape="poly" coords="276,0,276,28,100,200,50,50,276,0">Top Ten</a>
</map>

<!-- Variant 2: regions given as AREA elements -->
<map name="my_map">
  <area href="guide.html" alt="Access Guide" shape="rect" coords="0,0,118,28">
  <area href="search.html" alt="Search" shape="rect" coords="184,0,276,28">
  <area href="shortcut.html" alt="Go" shape="circle" coords="184,200,60">
  <area href="top10.html" alt="Top Ten" shape="poly" coords="276,0,276,28,100,200,50,50,276,0">
</map>
  • Server-side image maps:
    • When the user activates the link by clicking on the image, the screen coordinates are sent directly to the server where the document resides. Screen coordinates are expressed as screen pixel values relative to the image. The user agent derives a new URI from the URI specified by the href attribute of the A element, by appending ? followed by the x and y coordinates, separated by a comma.
    • Example:
<a href="">
  <img src="image.gif" ismap alt="target"/>
</a>

For instance, if the user clicks at the location x=10, y=27, then the derived URI is the value of the href attribute followed by ?10,27.
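The derivation rule for server-side image maps is simple enough to write down directly. In this sketch the function name ismap_uri and the example.com address are assumptions made for illustration; the rule itself (append ?, then x and y separated by a comma) is the one described above.

```python
def ismap_uri(href, x, y):
    """Derive the URI a user agent requests for a click at pixel (x, y),
    relative to the image, on a server-side image map."""
    if x < 0 or y < 0:
        raise ValueError("coordinates are relative to the image and cannot be negative")
    return f"{href}?{int(x)},{int(y)}"


# Hypothetical href; a click at x=10, y=27.
derived = ismap_uri("http://example.com/cgi-bin/map", 10, 27)
```

The server-side script then has to map the received coordinates to a region itself, since no shape information is present in the document.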

  • Discussion:
    • It is recommended to indicate which regions will have which behavior using a rollover effect
    • When using an editor capable of layering images such as Photoshop or GIMP, sections of the image may be cut and pasted in place over a copy of the image which has reduced brightness. These highlighted areas will stand out to the user.
    • The major issue that needs to be remembered is that image maps can't be indexed by most search engines.


MPEG-7

  <Image id="image_yalta">           <!-- whole image -->
      <StillRegion id="SR1">          <!-- still region -->
            <Box>14.64 15.73 161.62 163.21</Box>

Scalable Vector Graphics

  <svg xmlns:svg="">
    <g id="layer1">
      <image id="image_yalta" x="-0.34" y="0.20" width="400" height="167" />
      <rect id="SR1" x="14.64" y="15.73" width="146.98" height="147.48" />
    </g>
  </svg>

Note: would this require the image to be available as SVG?


  • The MPEG-7 approach requires an indirection!
    • An MPEG-7 description is an XML document identified by a URL
    • An RDF annotation will then be about a fragment of this XML document
    • The XML code needs to be processed to dereference the region
  • SVG allows direct access to a region in a URI ... but it requires an SVG container

Research Papers