Existing Technologies Survey

From Media Fragments Working Group Wiki

Existing URI fragment schemes

Existing applications using proprietary fragmenting schemes

Google Video


See also the YouTubeTime specification



Metacafe / Dailymotion

Microsoft IIS 7.0

Fragment specification approaches

Temporal Fragment


Playing temporal fragments out-of-context

SMIL allows you to play only a fragment of the video by using the clipBegin and clipEnd attributes. How this is implemented, though, is out of scope for the SMIL spec; for HTTP-based URLs, implementations may well fetch the whole media item and cut it up locally:

<video xml:id="toc1" src="http://homepages.cwi.nl/~jack/fragsmil/fragf2f.mp4"
       clipBegin="12.3s" clipEnd="21.16s" />

It is possible to use different time schemes, which give frame-accurate clipping when used correctly:

<video xml:id="toc2" src="http://homepages.cwi.nl/~jack/fragsmil/fragf2f.mp4"
       clipBegin="npt=12.3s" clipEnd="npt=21.16s" />
<video xml:id="toc3" src="http://homepages.cwi.nl/~jack/fragsmil/fragf2f.mp4"
       clipBegin="smpte=00:00:12:09" clipEnd="smpte=00:00:21:05" />
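The relation between the clock values and the SMPTE values in the examples above can be checked with a small Python sketch. It assumes a rate of 30 frames per second and ignores drop-frame corrections; both are assumptions of this illustration, and the helper name is hypothetical.

```python
def smpte_to_seconds(timecode: str, fps: int = 30) -> float:
    """Convert an hh:mm:ss:ff timecode to seconds (non-drop, assumed fps)."""
    hours, minutes, seconds, frames = (int(part) for part in timecode.split(":"))
    return hours * 3600 + minutes * 60 + seconds + frames / fps


# The two SMPTE values used for clipBegin and clipEnd above.
begin = smpte_to_seconds("00:00:12:09")
end = smpte_to_seconds("00:00:21:05")
print(begin, end)
```

Under these assumptions, frame 09 corresponds to 0.3 seconds, so the first value agrees with clipBegin="12.3s"; the second lands on the frame boundary nearest to 21.16 seconds.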

Adding metadata to such a fragment is supported since SMIL 3.0:

<video xml:id="toc4" src="http://homepages.cwi.nl/~jack/fragsmil/fragf2f.mp4"
       clipBegin="12.3s" clipEnd="21.16s">
        <rdf:.... xmlns:rdf="....">

Referring to temporal fragments in-context

The following piece of code will play back the whole video, and during the interesting section of the video allow clicking on it to follow a link:

<video xml:id="tic1" src="http://homepages.cwi.nl/~jack/fragsmil/fragf2f.mp4">
    <area begin="12.3s" end="21.16s" href="http://www.example.com"/>
</video>

It is also possible to link to the relevant section of the video. Jumping to #tic2area will start the video at the beginning of the interesting section. The presentation will not stop at the end of that section, however; it will continue.

<video xml:id="tic2" src="http://homepages.cwi.nl/~jack/fragsmil/fragf2f.mp4">
    <area xml:id="tic2area" begin="12.3s" end="21.16s"/>
</video>

Playing spatial fragments out-of-context

SMIL 3.0 allows playing back only a specific rectangle of the media. The following construct will play back the center quarter of the video:

<video xml:id="soc1" src="http://homepages.cwi.nl/~jack/fragsmil/fragf2f.mp4"
       panZoom="25%, 25%, 50%, 50%"/>

Assuming the source video is 640x480, the following line plays back the same:

<video xml:id="soc2" src="http://homepages.cwi.nl/~jack/fragsmil/fragf2f.mp4"
       panZoom="160, 120, 320, 240" />

This construct can be combined with the temporal clipping.
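The equivalence of the percentage and pixel forms of panZoom for a 640x480 source can be verified with a short Python sketch. The function name and the string handling are assumptions made for illustration only; they are not part of SMIL.

```python
def pan_zoom_to_pixels(pan_zoom: str, width: int, height: int) -> tuple:
    """Resolve a 'left, top, width, height' panZoom value to pixel numbers."""
    parts = [part.strip() for part in pan_zoom.split(",")]
    # left and width scale with the media width, top and height with its height
    extents = (width, height, width, height)
    resolved = []
    for part, extent in zip(parts, extents):
        if part.endswith("%"):
            resolved.append(float(part[:-1]) / 100 * extent)
        else:
            resolved.append(float(part))
    return tuple(resolved)


print(pan_zoom_to_pixels("25%, 25%, 50%, 50%", 640, 480))
# (160.0, 120.0, 320.0, 240.0)
```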

It is possible to change the panZoom rectangle over time. The following code fragment will show the full video for 10 seconds, then zoom in on the center quarter over 5 seconds, then show that for the rest of the duration. (The video may be scaled up or centered, or something else, depending on SMIL layout, but this is out of scope for the purpose of this investigation).

<video xml:id="soc3" src="http://homepages.cwi.nl/~jack/fragsmil/fragf2f.mp4"
       panZoom="0, 0, 640, 480">
    <animate begin="10s" dur="5s" fill="freeze" attributeName="panZoom"
             to="160, 120, 320, 240" />
</video>

Referring to spatial fragments in-context

The following bit of code will enable the top-right quarter of the video to be clicked to follow a link. Note the difference in the way the rectangle is specified (left, top, right, bottom) when compared to panZoom (left, top, width, height). This is an unfortunate side-effect of this attribute being compatible with HTML and panZoom being compatible with SVG.

<video xml:id="tic1" src="http://homepages.cwi.nl/~jack/fragsmil/fragf2f.mp4">
    <area shape="rect" coords="50%, 0%, 100%, 50%" href="http://www.example.com"/>
</video>

Other shapes are possible, as in HTML. The spatial and temporal constructs can be combined. The spatial coordinates can be animated, as for panZoom.
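Because the two rectangle conventions are easy to confuse, here is a minimal Python sketch that rewrites a coords rectangle (left, top, right, bottom) into the panZoom convention (left, top, width, height). It assumes that all four values are percentages; the function name is hypothetical.

```python
def coords_to_pan_zoom(coords: str) -> str:
    """Turn 'left, top, right, bottom' percentages into 'left, top, width, height'."""
    left, top, right, bottom = (
        float(value.strip().rstrip("%")) for value in coords.split(",")
    )
    rectangle = (left, top, right - left, bottom - top)
    return ", ".join(f"{value:g}%" for value in rectangle)


# The top-right quarter used in the area example above.
print(coords_to_pan_zoom("50%, 0%, 100%, 50%"))
# 50%, 0%, 50%, 50%
```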

Selecting tracks

SMIL has no way to selectively enable or disable tracks in the video. It only provides a general parameter mechanism which could conceivably be used to communicate this information to a renderer, but this would make the document non-portable. Moreover, I know of no such implementations.

<video xml:id="st1" src="http://homepages.cwi.nl/~jack/fragsmil/fragf2f.mp4">
    <param name="jacks-remove-track" value="audio" />
</video>

Named fragments

SMIL has no way to show named fragments in the base material out-of-context. It has no support for referring to named fragments in-context either, but it does have support for referring to "media markers" (named points in time in the media) if the underlying media format supports them:

<video xml:id="nf1" src="http://homepages.cwi.nl/~jack/fragsmil/fragf2f.mp4">
    <area begin="nf1.marker(jack-frag-begin)" end="nf1.marker(jack-frag-end)"


In MPEG-7, a video is divided into VideoSegments. These segments can be described by a timestamp, by a region in the spatial domain, or by a combination of both.

MediaTimes are described using a MediaTimePoint and a MediaDuration, which are the starting time and the shot duration respectively. The MediaTimePoint is built up as follows: YYYY-MM-DDThh:mm:ss:nnnFNNN (Y: year, M: month, D: day, T: separator between date and time, h: hours, m: minutes, s: seconds, n: number of fractions, F: separator between n and N, N: number of fractions in one second). The MediaDuration is built up as follows: PnDTnHnMnSnNnF, with nD the number of days, nH the number of hours, nM the number of minutes, nS the number of seconds, nN the number of fractions and nF the number of fractions per second. Temporal fragments can also be defined in time units, or relative to a defined time.

<VideoSegment id="video">
             <VideoSegment id="shot1">
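To make the PnDTnHnMnSnNnF notation concrete, the following Python sketch converts a MediaDuration string into seconds. It is only an illustration of the lexical form described above; the regular expression and the function name are assumptions, and edge cases of the real MPEG-7 datatype are not covered.

```python
import re
from fractions import Fraction

# (-)PnDTnHnMnSnNnF, where every component is optional in this sketch
_MEDIA_DURATION = re.compile(
    r"^(?P<sign>-)?P(?:(?P<d>\d+)D)?"
    r"(?:T(?:(?P<h>\d+)H)?(?:(?P<m>\d+)M)?(?:(?P<s>\d+)S)?"
    r"(?:(?P<n>\d+)N)?(?:(?P<f>\d+)F)?)?$"
)


def media_duration_seconds(text: str) -> Fraction:
    """Return the duration in seconds as an exact fraction."""
    match = _MEDIA_DURATION.match(text)
    if match is None:
        raise ValueError(f"not a MediaDuration: {text!r}")
    fields = match.groupdict()

    def number(key: str) -> int:
        return int(fields[key] or 0)

    total = Fraction(
        number("d") * 86400 + number("h") * 3600 + number("m") * 60 + number("s")
    )
    if fields["n"] is not None:
        if fields["f"] is None:
            raise ValueError("nN needs nF to give the fractions per second")
        # nN fractions, each 1/nF of a second
        total += Fraction(number("n"), number("f"))
    return -total if fields["sign"] else total


print(media_duration_seconds("PT9S4N25F"))  # 9 seconds plus 4 fractions of 1/25 s
```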

Selecting a spatial fragment of the video is also possible, using a SpatialDecomposition-element.

             <StillRegion id="speaker">
                           <FreeTextAnnotation> Journalist</FreeTextAnnotation>
                    <Mask xsi:type="SpatialMaskType">
                                         <Coords> 40 300, 105 210, …, 320 240</Coords>

The spatial fragment can be combined with temporal information thus creating a SpatialTemporalDecomposition-element.

                           <Mask xsi:type="SpatialMaskType">
                                                <Coords> 40 300, 105 210, …, 320 240</Coords>
      </SpatialTemporalDecomposition>


There is no example of temporal fragmentation, as SVG does not support it out of the box. One can add a video to a scene (as can be seen in example 2). It is also possible to add a foreign object within SVG in which HTML5 video elements can be embedded, but this is (at the moment) not a solution for temporal segmentation, as HTML does not support it either.

  <div xmlns="http://www.w3.org/1999/xhtml">
    <video src="myvideo.ogg"/>

Here's an example of a video that starts at second 5 and has a duration of 20 seconds:

<svg xmlns="http://www.w3.org/2000/svg" version="1.2" xmlns:xlink="http://www.w3.org/1999/xlink" width="320" height="240" viewBox="0 0 320 240">
 <desc>SVG 1.2 video example</desc>
   <video xlink:href="test.avi" volume=".8" type="video/x-msvideo"
        width="320" height="240" x="50" y="50" begin="5s" dur="20.0s" repeatCount="indefinite"/>


http://www.w3.org/TR/2006/CR-ttaf1-dfxp-20061116/#timing

The Timed Text Authoring Format has three basic timing attributes: begin, end and dur. The semantics of these attributes are the same as in SMIL 2.1. Their values are built up like this:

<timeExpression>
 : clock-time
 | offset-time

clock-time
 : hours ":" minutes ":" seconds ( fraction | ":" frames ( "." sub-frames )? )?

offset-time
 : time-count fraction? metric

hours
 : <digit> <digit>
 | <digit> <digit> <digit>+

minutes | seconds
 : <digit> <digit>

frames
 : <digit> <digit>
 | <digit> <digit> <digit>+

sub-frames
 : <digit>+

fraction
 : "." <digit>+

time-count
 : <digit>+

metric
 : "h"                 // hours
 | "m"                 // minutes
 | "s"                 // seconds
 | "ms"                // milliseconds
 | "f"                 // frames
 | "t"                 // ticks
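As an illustration of the offset-time form (a time count, an optional fraction and a metric), the Python sketch below converts such a value to seconds. Frames need a frame rate, which is passed in as a parameter, and ticks are not handled; the function name and these simplifications are assumptions of this sketch.

```python
import re

# time-count fraction? metric; "ms" is listed before "m" for readability
_OFFSET_TIME = re.compile(r"^(\d+)(\.\d+)?(h|ms|m|s|f|t)$")
_SECONDS_PER_UNIT = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 0.001}


def offset_time_seconds(expression: str, frame_rate: float = None) -> float:
    """Convert an offset-time expression such as '1.5m' to seconds."""
    match = _OFFSET_TIME.match(expression)
    if match is None:
        raise ValueError(f"not an offset-time: {expression!r}")
    value = float(match.group(1) + (match.group(2) or ""))
    metric = match.group(3)
    if metric in _SECONDS_PER_UNIT:
        return value * _SECONDS_PER_UNIT[metric]
    if metric == "f" and frame_rate:
        return value / frame_rate
    raise ValueError(f"metric {metric!r} needs a frame or tick rate")


print(offset_time_seconds("1.5m"))  # 90.0
```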

Showing text1 from second 0 to 9 and text2 from second 5 to 9.

 <tt xml:lang="en" xmlns="http://www.w3.org/2006/04/ttaf1"  xmlns:tts="http://www.w3.org/2006/04/ttaf1#styling">
   <div xml:lang="en">
      <p begin="0s" dur="9s">text1</p>
      <p begin="5s" dur="4s">text2</p>

Hyperlinking is a requirement of the standard but has not been satisfied yet.


CMML is a markup language for timed media, i.e. it describes temporally non-overlapping clips that relate to a video or audio file. It is part of a whole architecture of client-server video communication around [RFC 3533 Ogg]. The Ogg files are required to have a skeleton, through which it is easy to identify the MIME types of the tracks that make up an Ogg file. A temporal URI is used to specify subparts of videos.

Playing/Referring to temporal fragments out-of-context

A temporal URI is used to play back temporal fragments in Annodex. The clip's begin and end are specified directly in the URI. With "#", the URI fragment identifier, the media fragment is expected to be played after downloading the complete resource; with "?", the URI query parameters, the media fragment is expected to be extracted on the server and downloaded to the client as a new resource. Linking to such a resource looks as follows:

<a href="http://example.com/video.ogv#t=12.3/21.16" />
<a href="http://example.com/video.ogv?t=12.3/21.16" />
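The difference between the two forms can be illustrated with a small Python sketch that reports where the time specification sits in a URI. The labels "client" and "server" and the function name are assumptions of this sketch; they simply mirror the expectation described above.

```python
from urllib.parse import urlsplit


def temporal_part(uri: str) -> tuple:
    """Return (where the clip is expected to be cut, time specification)."""
    parts = urlsplit(uri)
    if parts.fragment.startswith("t="):
        # "#t=..." : the whole resource is fetched and clipped locally
        return ("client", parts.fragment[2:])
    for pair in parts.query.split("&"):
        if pair.startswith("t="):
            # "?t=..." : the server extracts the clip as a new resource
            return ("server", pair[2:])
    return (None, None)


print(temporal_part("http://example.com/video.ogv#t=12.3/21.16"))
print(temporal_part("http://example.com/video.ogv?t=12.3/21.16"))
```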

It is possible to use different time schemes, which give frame-accurate clipping when used correctly:

<a href="http://example.com/video.ogv?t=npt:12.3/21.16" />
<a href="http://example.com/video.ogv?t=smpte-25:00:12:33:06/00:21:16:00" />
<a href="http://example.com/audio.ogv?t=clock:20021107T173045.25Z" />

This can also be used, for example, in an HTML5 video or audio tag:

<video src="http://example.com/video.anx?t=npt:12.3/21.16"/>
<audio src="http://example.com/audio.anx?t=npt:12.3/21.16"/>

Creating metadata for such a clip is done simply by referring to the clip in the annotation. For example, using RDF, it could look like this:

@prefix event: <http://purl.org/NET/c4dm/event.owl#>.
@prefix ps: <http://purl.org/ontology/pop-structure/>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
:chorus1 a ps:Chorus;
    rdfs:label "First chorus";
    event:time [
       url "http://example.com/video.ogv#t=0/9";

Referring to temporal fragments in-context

To include outgoing hyperlinks in a video, you have to define the time-aligned markup of your video (or audio) stream. For this purpose, Annodex uses CMML. Here is an example CMML file that can be used to include outgoing hyperlinks next to or into Ogg streams. "Next to" means here that the CMML file is kept separate from the Ogg file, but that the client-side player knows how to synchronise the two; "into" means that CMML is multiplexed as a timed text codec into the Ogg physical bitstream, creating only one file that has to be exchanged. The following defines a CMML clip that has an outgoing hyperlink (this is a partial document extracted from a CMML file):

<clip id="tic1" start="npt:12.3" end="npt:21.16" title="Introduction">
 <a href="http://example.com/fish.ogv?t=5">Watch another fish video.</a>
 <meta name="author" content="Frank"/>
 <img src="fish.png"/>
  This is the introduction to the film Joe made about fish.

Note how there is also the possibility of naming a thumbnail, providing metadata, and giving a full description of the clip in the body tag.

Interestingly, you can also address into temporal fragments of a CMML file, since it is a representation of a time-continuous data resource:

<a href="http://example.com/sample.cmml?t=npt:4" />

And you can address into named temporal regions of a CMML file:

<a href="http://example.com/sample.cmml?id=tic1" />

Playing/Referring to spatial fragments out-of-context and in-context

CMML/Annodex/temporal URI has no means of specifying and addressing spatial fragments.

Selecting tracks

Tracks are an orthogonal concept to time-aligned annotations. Therefore, Xiph/Annodex have invented another way of describing and annotating them. It is new (since January 2008) and is called ROE (Rich Open multitrack media Encapsulation). With ROE you describe the composition of your media resource on the server. This file can also be downloaded to a client to find out about the "capabilities" of the file. It is, however, mainly used for authoring on the fly: depending on what a client requires, the ROE file can be used to find the different tracks and multiplex them together. Here is an example file:

   <link rel="alternate" type="text/html" href="http://example.com/complete_video.html" />
   <track id="v" provides="video">
     <mediaSource id="v0" src="http://example.com/video.ogv" content-type="video/ogg" />
     <mediaSource id="v1" src="http://example.com/theora.ogv?track=v1" content-type="video/theora" />
   <track id="a" provides="audio">
     <mediaSource id="a1" src="http://example.com/theora.ogv?track=a1" content-type="audio/vorbis" />
   <track id="c1" provides="caption">
     <mediaSource src="http://example.com/cmml1.cmml" content-type="text/cmml" />
   <track id="c2" provides="ticker">
     <mediaSource src="http://example.com/cmml2.cmml" content-type="text/cmml" />

This has not been completely worked through and implemented, but Metavid is using ROE as an export format to describe the different resources available as subparts of one media resource. We actually tried creating a SMIL profile here, but it didn't quite work out with the required elements and attributes and would have resulted in a more verbose specification. Instead, we borrowed from SMIL what was appropriate.

ROE is also used to create the Skeleton in the final multiplexed file. Thus, the information inherent in ROE goes into the file (at least virtually) and can be used to extract tracks in a URI:

<video src="http://example.com/video.ogv?track=a/v/c1"/>

Named fragments

With CMML and ROE, all tracks and clips that are specified can be addressed through a URI - both in-context and out-of-context.

<video src="http://example.com/video.ogv?t=15.2/21.45&track=a/v" />
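A server or client receiving such a URI has to separate the time range from the track selection. The Python sketch below shows one way to do that with the standard library; the returned dictionary layout and the function name are assumptions made for illustration.

```python
from urllib.parse import parse_qs, urlsplit


def clip_request(uri: str) -> dict:
    """Split the t and track query parameters of a temporal URI."""
    query = parse_qs(urlsplit(uri).query)
    start, _, end = query.get("t", [""])[0].partition("/")
    tracks = query["track"][0].split("/") if "track" in query else []
    return {"start": start or None, "end": end or None, "tracks": tracks}


print(clip_request("http://example.com/video.ogv?t=15.2/21.45&track=a/v"))
# {'start': '15.2', 'end': '21.45', 'tracks': ['a', 'v']}
```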

There are three different time specifications in CMML: Timestamp, Playbacktime and UTCtime. Timestamp is a name-value pair which defines a time point. The default time scheme is npt (normal play time, as used in the RTSP standard), but smpte and utc (Coordinated Universal Time, written with the clock: prefix) are also possible. Playbacktime is equal to Timestamp but without support for the UTC specification. UTCtime is a data type with support for the UTC specification only.

npt-spec    =  "npt:" npt-time
npt-time    =  npt-sec | npt-hhmmss
npt-sec     =   1*DIGIT [ "." *DIGIT ]
npt-hhmmss  =   npt-hh ":" npt-mm ":" npt-ss [ "." *DIGIT ]
npt-hh      =   1*DIGIT
npt-mm      =   1*2DIGIT
npt-ss      =   1*2DIGIT
smpte-spec  = smpte-type ":" smpte-time
smpte-type  = "smpte-24" | "smpte-24-drop" | "smpte-25" |
             "smpte-30" | "smpte-30-drop" | "smpte-50" |
             "smpte-60" | "smpte-60-drop"
smpte-time  =  1*2DIGIT ":" 1*2DIGIT ":" 1*2DIGIT [ ":" 1*2DIGIT ]
utc-spec    = "clock:" utc-time
utc-time    =   utc-date "T" utc-hhmmss "Z"
utc-date    =   8DIGIT
utc-hhmmss  =   6DIGIT [ "." *DIGIT ]
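A minimal Python sketch of how a time specification following this grammar might be read is given below. It converts npt values to seconds and leaves smpte and clock values as strings. The "/" separator between start and end is taken from the URI examples in this section rather than from the grammar, and the function names are hypothetical.

```python
def npt_seconds(npt_time: str) -> float:
    """Convert npt-sec ('12.3') or npt-hhmmss ('0:01:30.5') to seconds."""
    fields = npt_time.split(":")
    if len(fields) == 1:
        return float(fields[0])
    if len(fields) == 3:
        hours, minutes, seconds = fields
        return int(hours) * 3600 + int(minutes) * 60 + float(seconds)
    raise ValueError(f"not an npt-time: {npt_time!r}")


def parse_interval(value: str) -> tuple:
    """Return (scheme, start, end); npt is assumed when no scheme is given."""
    head, separator, rest = value.partition(":")
    if separator and (head in ("npt", "clock") or head.startswith("smpte-")):
        scheme, times = head, rest
    else:
        scheme, times = "npt", value
    start, _, end = times.partition("/")
    if scheme == "npt":
        return (scheme, npt_seconds(start), npt_seconds(end) if end else None)
    return (scheme, start, end or None)


print(parse_interval("npt:12.3/21.16"))
print(parse_interval("smpte-25:00:12:33:06/00:21:16:00"))
```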


MPEG-21 Part 17: Fragment Identification of MPEG Resources

Four different schemes are specified in MPEG-21 Part 17 to address parts of media resources:

  • ffp()
  • offset()
    • applicable to any digital resource.
    • identifies a range of bytes in a data stream.
    • similar functionality as the HTTP byte range mechanism
    • example:
  • mp()
    • applicable for media resources whose Internet media type (or MIME type) is equal to audio/mpeg, video/mpeg, video/mp4, audio/mp4, or application/mp4.
    • provides two complementary mechanisms for identifying fragments in a multimedia resource:
    • a set of so-called axes (i.e., temporal, spatial or spatiotemporal) which are independent of the coding/container format.
    • a hierarchical logical model of the resource. Such a logical model is dependent on the underlying container format (e.g., audio CD contains a list of tracks). The structures defined in these logical models are accessed with a syntax based on XPath.
    • for the temporal axis, the following time schemes are supported: NPT, SMPTE, MPEG-7, and UTC.
    • for the spatial axis, the following shapes are supported: ellipse, polygon, and rectangle
    • support for moving regions is also present
    • examples:
  • mask()
    • applicable for media resources whose Internet media type (or MIME type) is equal to video/mp4 or video/mpeg.
    • address a binary (1-bit deep) mask defined in a resource. Note that this mask is meant to be applied to a video resource and that the video resource may itself be the resource that contains the mask.
    • example:

Hierarchical combinations of addressing schemes are also possible. The '*' operator is used for this purpose. When two consecutive pointer parts are separated by the '*' operator, the fragments located by the first pointer part (to the left of the '*' operator) are used as a context for evaluating the second pointer part (to the right of the '*' operator). The following example addresses the first 50 seconds in a bitstream which is first located using its item_ID:

http://www.example.com/myfile.mp4#ffp(item_ID=myBitstream)*mp(~time('npt','0','50'))
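The left-to-right evaluation order implied by the '*' operator can be sketched in Python by splitting the fragment identifier into its pointer parts. This only separates the parts at top-level '*' characters; it does not interpret the ffp() or mp() schemes, and the function name is an assumption of this illustration.

```python
from urllib.parse import urlsplit


def pointer_parts(uri: str) -> list:
    """Split the fragment of a URI into pointer parts at top-level '*'."""
    fragment = urlsplit(uri).fragment
    parts, current, depth = [], [], 0
    for char in fragment:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
        if char == "*" and depth == 0:
            # a '*' outside any parentheses separates two pointer parts
            parts.append("".join(current))
            current = []
        else:
            current.append(char)
    parts.append("".join(current))
    return parts


uri = "http://www.example.com/myfile.mp4#ffp(item_ID=myBitstream)*mp(~time('npt','0','50'))"
for part in pointer_parts(uri):
    print(part)
```

Each part printed after the first would be evaluated in the context of the fragments located by the parts before it.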

HTML5 video

The Date and Time Datatypes. A date is represented as YYYY-MM-DD, with Y year, M month and D day. A time is represented as hh:mm(:ss), with h hours, m minutes and s seconds. The seconds can be omitted and can be a floating-point number. Date and time can be combined using a capital T: YYYY-MM-DDThh:mm(:ss). If necessary, a time zone can be added. If UTC is used, adding a Z is sufficient: YYYY-MM-DDThh:mm(:ss)Z. Otherwise the time zone is added as a + or - followed by the difference between UTC and the time zone; if the time difference contains minutes, they are separated from the hours by a ":", e.g. YYYY-MM-DDThh:mm(:ss)+7:30 or YYYY-MM-DDThh:mm(:ss)-6.


Within TV-Anytime, programmes can be divided into segments. Segmentation refers to the ability to define, access and manipulate temporal intervals (i.e. segments) within an AV stream. By associating metadata with segments and segment groups, it is possible to restructure and re-purpose an input AV stream to generate alternative consumption and navigation modes.

Segment - A segment is a continuous fragment of a programme. A particular segment can belong to a single programme only, but it can be a member of multiple segment groups.

Segment Group - denotes a collection of segments that are grouped together, for a particular purpose or due to a shared property. A segment group can contain segments, or other segment groups.

The following element and complex type define a segment within TV-Anytime:

 <complexType name="SegmentInformationType">
     <element name="ProgramRef" type="tva:CRIDRefType" minOccurs="0"/>
     <element name="TimeBaseReference" type="tva:TimeBaseReference" minOccurs="0"/>
     <element name="Description" type="tva:BasicSegmentDescriptionType" minOccurs="0"/>
     <element name="SegmentLocator" type="tva:TVAMediaTimeType" minOccurs="0"/>
     <element name="KeyFrameLocator" type="tva:TVAMediaTimeType" minOccurs="0" maxOccurs="unbounded"/>
     <element name="OtherIdentifier" type="mpeg7:UniqueIDType" minOccurs="0" maxOccurs="unbounded"/>
   <attribute name="segmentId" type="tva:TVAIDType" use="required"/>
   <attributeGroup ref="tva:fragmentIdentification"/>
   <attribute name="metadataOriginIDRef" type="tva:TVAIDRefType" use="optional"/>
   <attribute ref="xml:lang" use="optional"/>

TimeBaseReference is an optional element which, when present, signals the use of a different timebase from the one signalled within the segment group of which this segment is a member. It is of the type TimeBaseReferenceType:

 <complexType name="TimeBaseReferenceType">
       <element name="MediaTimePoint" type="mpeg7:mediaTimePointType"/>
       <element name="MediaRelIncrTimePoint" type="mpeg7:MediaRelIncrTimePointType"/>
   <attribute name="timebaseId" type="string"/>
  • MPEG-7's mediaTimePointType describes a media time point using Gregorian date and day time without Time Zone Difference (TZD), as YYYY-MM-DDThh:mm:ss:nnnFNNN (Y: year, which can be a variable number of digits; M: month; D: day; h: hour; m: minute; s: second; n: number of fractions, any number between 0 and NNN-1; N: number of fractions of one second, which are counted by nnn and can have an arbitrary number of digits, not limited to three). Delimiters are used for the time specification (T) and for the number of fractions of one second (F).
  • MPEG-7's MediaRelIncrTimePointType describes a media time point relative to a time base, counting time units as specified for the MediaIncrDuration datatype hereafter. If, for instance, a frame needs to be addressed by counting frames, the MediaRelIncrTimePoint datatype can be used, referring to the starting time of the shot or of the whole video encoded within the media as a time base.

SegmentLocator locates the segment within a programme (instance) in terms of start time and duration (optional). If the duration is not specified, the segment ends at the end of the programme. It is of type TVAMediaTimeType.

 <complexType name="TVAMediaTimeType">
       <element name="MediaRelTimePoint" type="mpeg7:MediaRelTimePointType"/>
       <element name="MediaRelIncrTimePoint" type="tva:TVAMediaRelIncrTimePointType"/>
     <choice minOccurs="0">
       <element name="MediaDuration" type="mpeg7:mediaDurationType"/>
       <element name="MediaIncrDuration" type="mpeg7:MediaIncrDurationType"/>
 <complexType name="TVAMediaRelIncrTimePointType">
     <restriction base="mpeg7:MediaRelIncrTimePointType" >
       <attribute name="mediaTimeUnit" type="mpeg7:mediaDurationType" use="optional" default="PT1N1000F"/>
  • MPEG-7's MediaRelTimePointType describes a media time point relative to a time base using a number of days and day time without specifying a difference of the TZD. It consists of a mediaTimeBase (a time base of the relative time specification, given by referring to an element of the above MediaTimePoint datatype) and a MediaTimeOffsetType (a time offset with respect to a time base, derived from the basicDuration datatype as defined in ISO/IEC 15938-2 without specifying a difference of the TZD and by restricting the fraction (f) to integer numbers instead of decimals; the lexical representation equals mediaDurationType hereafter).
  • MPEG-7's mediaDurationType describes the duration of a time interval according to days and day time of a notion of time encoded in the media without specifying a difference in the TZD. The time interval is defined as a half open time interval with the closed end being at the beginning. A simpleType representing a duration/period (P) in time using a lexical representation of days (nD), time duration and a fraction specification (TnHnMnSnN) including the specification of the number of fractions of one second (nF): (-)PnDTnHnMnSnNnF
  • MPEG-7's MediaIncrDurationType describes the duration of a media time period counting time units. Such a time unit can for example be the time increment of the timestamps of successive frames in a video stream. The duration is then specified by the number of these time units. The lexical representation equals above mediaDurationType.

Spatial Fragment

Image Maps

  • Definitions:
  • Client-side image maps:
    • The MAP element specifies a client-side image map. An image map is associated with an element via the element's usemap attribute. The MAP element content model includes then either AREA elements or A elements for specifying the geometric regions and the link associated with them
    • Possible shapes are: rectangle (rect), circle (circle) or arbitrary polygon (poly)
    • Example:
<img src="image.gif" usemap="#my_map"/>
<map name="my_map">
  <a href="guide.html" shape="rect" coords="0,0,118,28">Access Guide</a> |
  <a href="shortcut.html" shape="rect" coords="118,0,184,28">Go</a> |
  <a href="search.html" shape="circle" coords="184,200,60">Search</a> |
  <a href="top10.html" shape="poly" coords="276,0,276,28,100,200,50,50,276,0">Top Ten</a>
</map>


<map name="my_map">
  <area href="guide.html" alt="Access Guide" shape="rect" coords="0,0,118,28">
  <area href="search.html" alt="Search" shape="rect" coords="184,0,276,28">
  <area href="shortcut.html" alt="Go" shape="circle" coords="184,200,60">
  <area href="top10.html" alt="Top Ten" shape="poly" coords="276,0,276,28,100,200,50,50,276,0">
</map>
  • Server-side image maps:
    • When the user activates the link by clicking on the image, the screen coordinates are sent directly to the server where the document resides. Screen coordinates are expressed as screen pixel values relative to the image. The user agent derives a new URI from the URI specified by the href attribute of the A element, by appending ? followed by the x and y coordinates, separated by a comma.
    • Example:
<a href="http://www.example.com/images">
  <img src="image.gif" ismap alt="target"/>
</a>

For instance, if the user clicks at the location x=10, y=27 then the derived URI is: http://www.example.com/images?10,27
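The derivation rule is simple enough to state as a one-line Python sketch; the function name is hypothetical, and the coordinates are assumed to be pixel values relative to the image, as described above.

```python
def ismap_uri(href: str, x: int, y: int) -> str:
    """Append '?x,y' to the link target, as a user agent does for ismap images."""
    return f"{href}?{x},{y}"


print(ismap_uri("http://www.example.com/images", 10, 27))
# http://www.example.com/images?10,27
```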

  • Discussion:
    • It is recommended to indicate which regions will have which behavior using a rollover effect
    • When using an editor capable of layering images such as Photoshop or GIMP, sections of the image may be cut and pasted in place over a copy of the image which has reduced brightness. These highlighted areas will stand out to the user.
    • The major issue that needs to be remembered is that image maps can't be indexed by most search engines.


  <Image id="image_yalta">           <!-- whole image -->
      <StillRegion id="SR1">          <!-- still region -->
            <Box>14.64 15.73 161.62 163.21</Box>

Scalable Vector Graphics

  <svg xmlns:svg="http://www.w3.org/2000/svg">
    <g id="layer1">
      <image id="image_yalta" x="-0.34" y="0.20" width="400" height="167" />
      <rect id="SR1" x="14.64" y="15.73" width="146.98" height="147.48" />
    </g>
  </svg>

Note: would this require the image to be available as SVG?