Use Cases Discussion
- 1 Generic UC's (i.e., independent of the media format)
- 2 Image UC's
- 3 Audio UC's
- 4 Video UC's
Generic UC's (i.e., independent of the media format)
Media Anchor Definition UC
Interested people: Silvia
People want to create annotations for media resources on the Web. I will generalise "annotations" to mean any informative text that relates to media resources.
Use Cases for annotations:
- Display of more detailed information for media resources
- Search for specific fragments of interest
- Automation for machine interaction with content
- Accessibility, e.g. captions, audio descriptions
- Internationalisation, e.g. subtitles
Such annotations can be attached:
- either to the full media resource
- or to fragments of the media resource, such as time-aligned text or metadata for image regions.
Annotations that are attached to media fragments are of interest to this working group, since we may want to address such fragments through a URI addressing scheme.
When the publisher of a media resource provides such annotations, they also have to provide a substructure to the resource that defines the boundaries of the fragments to which the annotations are attached.
If the publisher also provides globally unique identifiers for the fragments in this substructure, these identifiers can be used to address the fragments by "name".
This enables the creation of URIs that directly address media fragments and their annotations.
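As a minimal sketch of this idea, a named fragment can simply be appended to the resource URI. Note that the "#id=" syntax below is an assumption for illustration only; the group has not settled on a naming scheme.

```python
# Sketch: composing a URI that addresses a publisher-named media fragment.
# The "#id=" syntax is a hypothetical illustration, not a settled scheme.

def named_fragment_uri(resource_uri: str, fragment_name: str) -> str:
    """Append a named-fragment identifier to a media resource URI."""
    return f"{resource_uri}#id={fragment_name}"

uri = named_fragment_uri("http://example.com/video.ogv", "chapter-2")
print(uri)  # http://example.com/video.ogv#id=chapter-2
```

Annotations attached to the named fragment could then be looked up under this URI.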
Advantages of such a structure: A defined substructure of a media resource enables
- attachment of annotations, such as captions, subtitles, audio descriptions, metadata, descriptions, RDFa etc.
- direct access to points of interest defined by the publisher, e.g. for featuring fragments
- accessibility to the media resource, e.g. by tabbing through the structure and, where annotations are available, providing an API for screen readers or braille devices to access the text directly
- search through a media resource and direct display of and access to the named media fragments
- browsing through a media resource by its defined substructure; if the publisher provides no such structure, a long resource needs to be paginated by slicing it in some other way to turn it into an accessible resource
- bookmarking of media fragment hyperlinks
- sequencing of media fragments, e.g. by image stitching, slide show, playlist, audio sequencing, or video mashups
- looping over given media fragments in a display
- recomposition of defined fragments into a new multimedia experience without copying the media data itself, e.g. using SMIL
Further, if the media resource allows a user agent to request a dynamic choice of annotation tracks from an origin server, media variations can be served in response to a media fragment URI request. This adds a track structure on top of the temporal, spatial and spatio-temporal structures of the media resource, allowing more detailed addressing and selection of relevant content.
This is particularly relevant since annotations may be attached to one track only, e.g. audio descriptions to the audio track, or captions to the video track.
- Accessibility/Video a11y requirements
- New HTML5 features and Silvia's blog post
- Possible fragment links - draft drawing by Silvia
This use case is related to our twin Media Annotation Working Group.
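Since annotations may be attached to a single track, a fragment dimension for track selection would let a URI address exactly the track that carries an annotation. The "track=" dimension below is an assumption for illustration, not a settled syntax.

```python
# Sketch: addressing one track of a resource so that track-specific
# annotations (e.g. captions on the video track) can be reached.
# The "track=" fragment dimension is hypothetical.

annotations = {
    "audio": "audio description text ...",
    "video": "caption text ...",
}

def track_uri(resource: str, track: str) -> str:
    return f"{resource}#track={track}"

uri = track_uri("http://example.com/movie.ogv", "video")
track = uri.split("#track=")[1]
print(annotations[track])  # caption text ...
```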
Media Delivery UC
Interested people: Davy
Once a global identifier system for addressing media fragments exists, a logical next step is a web server that can deliver the correct content for a given fragment identifier. This way, only the requested part of the media resource is sent from the server to the client, which is particularly important on networks where bandwidth is expensive and on devices where computing resources are scarce (e.g. mobile phones), but also where people are not prepared to wait until the requested fragment has finally been buffered. On the other hand, fragment addressing can also be meaningful on the client side. In this scenario, a media resource is (partially) sent to the client, which loads the received media stream. The client can then jump locally within the media resource, avoiding further contact with the web server and thus needless processing on the server side.
This use case is not just a server- or client-side issue; the delivery of media fragments also has an impact on the network between server and client (e.g., proxy servers). The latter implies that a protocol is needed to handle the media delivery task. Existing protocols such as HTTP and RTSP should be capable of serving this purpose. For instance, with HTTP, new headers could be defined for server identification and caching purposes. For further information regarding media delivery issues in general, see caches and proxies and protocols.
Two specifications providing a URI-like syntax to address media fragments currently exist: the Temporal URI specification and MPEG-21 Part 17: Fragment Identification of MPEG Resources. In the next paragraphs, an overview is given of their provisions for media delivery (i.e., client vs. server implementation, defined protocols, support for caches and proxies).
MPEG-21 Part 17 specifies a normative syntax for URI fragment identifiers used to address parts of MPEG resources (see also MPEG-21 Part 17: Fragment Identification of MPEG Resources). However, it does not provide any information about a protocol for the delivery of such media fragments. Further, the '#' character is used to identify fragments, which implies that servers have no notion of these fragments. Hence, in a trivial scenario, the full media resource is delivered, after which the URI fragment is interpreted by the client. More exotic scenarios, specific to a particular delivery format, are also possible, as described by Dave. In that example, HTTP byte ranges are used in combination with MP4 headers to deliver media fragments more efficiently than in the trivial scenario.
Temporal URI specifies a syntax for addressing time intervals within time-based Web resources through URI queries and fragments; both the '?' and '#' characters can thus be used to address a temporal fragment of a media resource. When the '?' character is used, the URI is interpreted on the server; when the '#' character is used, it is interpreted on the client, implying that the full media resource needs to be downloaded first. In the '?' case, protocols such as HTTP and RTSP can be enhanced to handle the delivery of the requested media fragments correctly. The Temporal URI specification provides an implementation with HTTP: three new HTTP headers are specified to support HTTP proxy servers and the identification of enabled HTTP servers. Furthermore, a proposal for an RTSP implementation is also available.
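The '?' vs. '#' distinction can be sketched as follows, using "t=10,20" as an illustrative temporal syntax (the actual Temporal URI notation differs slightly):

```python
from urllib.parse import urlsplit

# Sketch: where is a temporal addressing such as "t=10,20" interpreted?
# Query parts ('?') are resolved by the server; fragment parts ('#')
# by the client, which must first download the full resource.

def interpretation_side(uri: str) -> str:
    parts = urlsplit(uri)
    if "t=" in parts.query:
        return "server"   # server extracts the interval and sends only it
    if "t=" in parts.fragment:
        return "client"   # client seeks within the downloaded resource
    return "none"

print(interpretation_side("http://example.com/a.ogv?t=10,20"))  # server
print(interpretation_side("http://example.com/a.ogv#t=10,20"))  # client
```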
HTTP and RTSP protocols
As discussed by Silvia, making media fragment resources cacheable on the Web through the HTTP protocol requires the use of a four-way handshake:
- The UA contacts the server and requests the desired fragments.
- The server determines which byte ranges correspond to the requested fragments and returns them to the UA.
- The UA requests the desired fragments by making use of HTTP byte ranges (based on the received byte ranges).
- The server sends the requested bytes to the UA.
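The four steps above can be sketched in code. The helper names and the toy time-to-byte index are hypothetical; a real server would consult an index of the container format.

```python
# Sketch of the four-way handshake, with hypothetical helper names.

def server_time_to_bytes(fragment):
    """Step 2: server maps a time range to the corresponding byte range."""
    index = {(10, 20): (4096, 8191)}  # toy time -> byte index
    return index[fragment]

def fetch_fragment(fragment):
    # Step 1: UA asks which bytes correspond to the requested time range.
    first, last = server_time_to_bytes(fragment)
    # Step 3: UA issues a standard HTTP byte-range request for those bytes.
    range_header = f"bytes={first}-{last}"
    # Step 4: the server answers "206 Partial Content" with the bytes,
    # which intermediary caches can store against the byte range.
    return range_header

print(fetch_fragment((10, 20)))  # bytes=4096-8191
```

Because steps 3 and 4 use plain byte ranges, existing HTTP caches and proxies can participate without understanding media fragments.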
The RTSP protocol is used to stream media resources to the UA. Its specification provides a Range header, comparable to the HTTP byte-range mechanism; however, it is not possible to specify a range in terms of bytes. Within RTSP, time is used to specify a range (e.g., the npt, smpte, and clock schemes). Therefore, the four-way handshake as applied to the HTTP protocol cannot be used. This implies that only temporal fragments can be supported by the RTSP protocol (to enable caching); caching of spatial fragments will not be possible.
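For illustration, an RTSP PLAY request carrying a temporal range in npt (normal play time) could be assembled like this; the request builder is a sketch, not a full RTSP client:

```python
# Sketch: an RTSP PLAY request whose Range header expresses time
# (npt), since RTSP ranges cannot be expressed in bytes.

def rtsp_play_request(uri: str, start: float, end: float, cseq: int = 3) -> str:
    return (f"PLAY {uri} RTSP/1.0\r\n"
            f"CSeq: {cseq}\r\n"
            f"Range: npt={start}-{end}\r\n\r\n")

print(rtsp_play_request("rtsp://example.com/movie.ogv", 10, 20))
```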
We wonder whether we should also look at P2P protocols such as TVU, SopCast, TVAnts, PPLive, BitTorrent etc. See the P2PTV article on Wikipedia.
- Note: Xiaojun Hei et al. claim that PPLive is a proprietary protocol. SopCast and BitTorrent seem to be the only open ones (to be confirmed)
Media Linking, Bookmarking and Playlist UC
Interested people: Michael
Media Search UC
Interested people: ?
When it is possible to link to and deliver just a fragment from a media file/stream, search interfaces can become much more context-sensitive about video. For example, for a video that has time-aligned textual annotation (e.g. captions, subtitles, transcripts), text search can retrieve the segment during which a specific search term was being mentioned/discussed.
An example of this is demonstrated in Metavid: http://au.youtube.com/watch?v=aX6sARniTLo - about 2:08min in.
Another example is demonstrated by the NY Times, which put online the debates of the US presidential election together with a full transcript synchronized with the video. One can search and jump to arbitrary fragments of the video, see: http://elections.nytimes.com/2008/president/debates/vice-presidential-debate.html
The Google Audio Indexing beta tool is also relevant as it provides an automatic transcript and full text index of a video.
Metavid also indexes the videos from the US Congress and provides a fully searchable and browsable transcript of the video. The UI is based on an extension of SemanticMediaWiki and the Temporal URI specification for addressing the video fragments.
Media Browsing UC
Interested people: ?
When dealing with particularly long video content, the ability to address and deliver media fragments is necessary to cope better with the duration of the file and to present it as an appropriately paginated video stream. The fragments can be determined either through pre-defined "named" segments, or dynamically by chopping the resource up into time intervals.
Further, media can generally be browsed (watched) more easily and with more focus when there is a semantic segmentation into sensible fragments. Metavid shows such an example: http://au.youtube.com/watch?v=aX6sARniTLo, see 0:33sec into the video.
A fragmentation of a media file enables, e.g., tab-based browsing through a video or audio file - something that will help the accessibility of audio and video for people with disabilities of varying degrees, in particular when textual annotations are included.
Image UC's
In the context of images, spatial regions can be identified. Regions could initially have a rectangular shape, identified either by an (x,y) co-ordinate plus a width and a height, or by two co-ordinates (x1,y1) and (x2,y2). However, it should also be possible to identify non-rectangular shapes.
The HTML image map specification is a good starting point for identifying arbitrary regions in images. Yet, there are currently no URIs that can point to an image map that has been given a name (e.g. through the id attribute), or that fully identify the region within the URI.
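A rectangular region encoded in a fragment string could be parsed as below. The "region=x,y,w,h" syntax is a hypothetical illustration; no image-region syntax has been standardised yet.

```python
# Sketch: parsing a rectangular image region from a fragment string,
# assuming a hypothetical "region=x,y,w,h" syntax.

def parse_region(fragment: str) -> dict:
    assert fragment.startswith("region=")
    x, y, w, h = (int(v) for v in fragment[len("region="):].split(","))
    return {"x": x, "y": y, "w": w, "h": h}

print(parse_region("region=160,120,320,240"))
```

The alternative two-corner form (x1,y1)-(x2,y2) would parse analogously, with w = x2 - x1 and h = y2 - y1.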
Interested people: Erik
A series of photos and/or regions-of-interest within that series of photos (of a particular journey, for instance) can be linked together temporally or certain regions-of-interest of a series of photos can be linked together spatially.
Temporally, this means that an application could be a slide show made up of image fragments from different servers - a kind of image playlist of fragment URIs. There may be an issue with differing image sizes.
Spatially, this means that an application could be the creation of a "stitched" image, where regions from different servers are brought together. There may be an issue with smoothing and/or positioning.
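The temporal case can be sketched as an ordered list of image-fragment URIs drawn from different servers; the host names and the "region=" syntax are hypothetical.

```python
# Sketch: a slide show as an image playlist of fragment URIs from
# different servers (region syntax and hosts are hypothetical).

slideshow = [
    "http://photos-a.example.com/trip/1.jpg#region=0,0,640,480",
    "http://photos-b.example.org/trip/2.jpg#region=100,50,320,240",
]

for uri in slideshow:
    print(uri)  # a player would fetch and display each fragment in turn
```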
Map Application & Multi-Resolution Images UC
Interested people: Raphael
Audio UC's
Addressing fragments in audio streams happens mainly in the temporal dimension. More specifically, a temporal audio segment can be identified by means of a starting timestamp and either an ending timestamp or a duration.
However, one could wonder whether a spatial dimension also exists in the context of audio. For instance, a specific audio channel could be identified. Maybe we need to talk about addressing three different dimensions of media: temporal, spatial, and tracks. And if we want to be more detailed, within tracks we can identify audio channels, or colour spaces (RGB / YUV channels).
Some audio resources have more than one channel (Dolby TrueHD, for example, supports up to 8 (7.1) channels). Suppose that for a karaoke CD/DVD we only need two: one for the music and one for the singing voice. In "karaoke" mode one then only wants to hear the music, thus also needing some "pseudo-spatial" addressing within audio resources.
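The karaoke scenario could combine a temporal segment with a channel selection, as sketched below; both the "t=" and "channel=" dimensions are assumptions for illustration.

```python
# Sketch: addressing a temporal segment plus one channel of an audio
# resource; the "t=" and "channel=" fragment dimensions are hypothetical.

def audio_fragment(uri: str, start: float, end: float, channel: str = None) -> str:
    frag = f"t={start},{end}"
    if channel is not None:
        frag += f"&channel={channel}"
    return f"{uri}#{frag}"

# Karaoke mode: only the music channel of the first three minutes.
print(audio_fragment("http://example.com/song.ogg", 0, 180, channel="music"))
```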
Video UC's
Both spatial and temporal addressing of media fragments is possible in the context of video. One could also imagine a combination of these two dimensions (e.g., addressing a spatial region in a video stream from timestamp x to timestamp y).
Some video streams (e.g. DVD titles) may be viewed in different versions based on the user's preferences: different soundtracks, different subtitles (if any), and sometimes different angles (e.g. live music DVDs). When specifying a fragment of such streams, it should be possible to specify which variation is meant. However, it should also be possible for a fragment to be under-specified with regard to variations: sometimes one would like to address a fragment with a specific soundtrack and angle (e.g. "the lips do not move in sync with the voice"), but sometimes a fragment does not depend on a particular variation (e.g. "the kiss scene"). This kind of consideration is somewhat present in the "Audio UC's" section; however, the problem is even more complicated with video. From http://lists.w3.org/Archives/Public/public-media-fragment/2008Sep/0022.html
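A fragment combining a time interval, a region, and an optional variation could be built as below; all dimension names ("t", "region", "soundtrack") are hypothetical illustrations.

```python
# Sketch: a spatio-temporal video fragment with an optional variation.
# The "t=", "region=" and "soundtrack=" dimensions are hypothetical.

def video_fragment(uri, start, end, region=None, soundtrack=None):
    dims = [f"t={start},{end}"]
    if region:
        dims.append("region=" + ",".join(map(str, region)))
    if soundtrack:
        dims.append(f"soundtrack={soundtrack}")
    return uri + "#" + "&".join(dims)

# Fully specified: the lip-sync complaint needs a concrete soundtrack.
print(video_fragment("http://example.com/film.ogv", 60, 75,
                     region=(0, 0, 320, 240), soundtrack="en"))
# Under-specified: "the kiss scene" is independent of any variation.
print(video_fragment("http://example.com/film.ogv", 60, 75))
```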
Moving Point Of Interest UC
Within a TV production (shot in HD), one should be able to define a region-of-interest (spatial addressing) per frame, making automatic intelligent reframing possible when, for example, an iPod version of that same TV production is produced. Needless to say, in a subsequent frame (temporal addressing) this region-of-interest will have moved and probably also increased or decreased in size. A region should be uniquely identifiable over time and able to shrink and/or grow.
This is particularly important for marketing / product placement. People like to "buy the dress she wears" or "check out the car he's driving". A specification of a moving image map in videos can enable clickable regions that can create outgoing hyperlinks for videos.
Also, it is possible to make these regions addressable through a URI.
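One way to represent such a moving, resizing region is as keyframed rectangles with interpolation in between; the representation below is a sketch under that assumption, not a proposed syntax.

```python
# Sketch: a moving region of interest as keyframed rectangles with
# linear interpolation between key frames (helper names hypothetical).

def lerp(a, b, f):
    """Linear interpolation between a and b at fraction f."""
    return a + (b - a) * f

def region_at(keyframes, t):
    """keyframes: time-sorted list of (time, (x, y, w, h)) pairs."""
    for (t0, r0), (t1, r1) in zip(keyframes, keyframes[1:]):
        if t0 <= t <= t1:
            f = (t - t0) / (t1 - t0)
            return tuple(lerp(a, b, f) for a, b in zip(r0, r1))
    raise ValueError("t outside keyframe range")

keys = [(0, (10, 10, 100, 80)), (10, (30, 20, 120, 90))]
print(region_at(keys, 5))  # (20.0, 15.0, 110.0, 85.0)
```

A clickable "buy the dress" hotspot would test the click position against region_at for the current playback time.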
Video Browser/Player UC
Interested people: Silvia
When dealing with large media resources, special challenges occur. Server-side media fragment URIs are particularly interesting in this case (see the Media Anchor Definition UC above).
Search, browsing, and direct access are of particular interest.
Another point of interest is the dynamic creation of media samples / previews.
For images that may be a low-resolution image as a representation of the high resolution one.
For audio it may be the extraction of the chorus of a music piece or just a highlight out of a speech.
For video, thumbnails are traditionally used as representative displays. Thumbnails can represent media fragments or full videos. The dynamic (server-side) creation of thumbnails through a URI mechanism is of high interest.
Also, the extraction of a video preview, i.e. a short video extract, is a typical way to create a media sample of videos.
- thumbnails can be represented through addressing of a time point through a media fragment URI and is therefore relevant to this WG
- previews of audio and video files by extraction of a segment can be represented through addressing of a media fragment URI and is therefore relevant to this working group
- image previews through different resolutions are outside the scope of this WG because they require creating a transformation of the resource; we are, however, considering whether it makes sense to define an addressing scheme for images that have a preview image encoded inside themselves; similarly, this applies to cover art inside MP3-type files
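The two in-scope cases above (thumbnail at a time point, preview as an interval) can be sketched with the same illustrative "t=" syntax used earlier; the dimension name is an assumption.

```python
# Sketch: a thumbnail addressed as a time point and a preview
# addressed as a time interval (the "t=" syntax is illustrative).

def thumbnail_uri(uri: str, t: float) -> str:
    return f"{uri}#t={t}"

def preview_uri(uri: str, start: float, end: float) -> str:
    return f"{uri}#t={start},{end}"

print(thumbnail_uri("http://example.com/talk.ogv", 12))    # single frame
print(preview_uri("http://example.com/talk.ogv", 30, 45))  # short extract
```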