This Wiki page is edited by participants of the HTML Accessibility Task Force. It does not necessarily represent consensus and it may have incorrect information or information that is not supported by other Task Force participants, WAI, or W3C. It may also have some very useful information.


Note: the discussion here is superseded by http://www.w3.org/WAI/PF/HTML/wiki/Media_Multitrack_Change_Proposals_Summary

Multitrack Media API

Issue: http://www.w3.org/html/wg/tracker/issues/152 (feedback due by 21st February)

Bug: http://www.w3.org/Bugs/Public/show_bug.cgi?id=9452

Use case: Audio and video often have more than one audio and one video track. In particular we often have sign language tracks, audio description tracks, and dubbed audio tracks, but also alternate viewing angles and similar additional or alternative tracks to the main a/v tracks. Sometimes such tracks are an inherent part of the main media resource; in other instances they are separate but synchronised resources. Currently there is no means in HTML5 to use such multitrack media resources. See http://www.longtailvideo.com/support/addons/audio-description/15136/audio-description-reference-guide for an example of audio descriptions in Flash that we want to replicate in HTML5.

Some example uses: http://www.w3.org/WAI/PF/HTML/wiki/Media_Multitrack_Media_Rendering

Requirements: We need a means to make use of such multitrack media content in HTML5.

  1. We need a means to provide multitrack media resources to the Web page where the multiple tracks come in multiple resources.
  2. We need to define a JavaScript API that lets us control the display/playback of individual video/audio tracks, both from in-band multitrack media resources and from multitrack presentations constructed out of multiple external files.

Side conditions:

  • we want to achieve a consistent API between in-band and external audio/video tracks.
  • we want to be able to control the relative volume of an additional audio track and the positioning of video tracks as picture-in-picture or side-by-side viewports.
  • we don't want to interfere with the source selection algorithm, which is already complicated enough as it is.
  • we want to support the most important and most scalable use case natively and encourage that as the main means to author content, namely by providing a main resource and additional tracks to complete that presentation. Other use cases should not get dedicated markup and can be satisfied through special JavaScript or server software.
  • there is no new markup needed for in-band, just a JavaScript API and notes on how to render.
  • we assume tracks are created in such a fashion that they can add to each other, not replace each other. This restricts authoring, but if somebody wants to do replacement, they can always define alternative audio and video elements and activate them through JavaScript.
  • we assume that the alternate audio/video tracks are provided as a single file with approximately the same duration as the main resource and that synchronisation between them implies synchronizing their starting points and playback speed. Content in alternate audio/video tracks that goes beyond the duration of the main resource will be chopped off and never play back.
  • situations where we have small snippets of audio that are synchronized to particular times in the video (as shown below) are not considered here. They can currently be solved by using WebVTT with @kind="metadata", with a hyperlink to the media resource(s) in each synchronized cue, and with JavaScript that interprets this content and plays back the links at the right time (see the script sketch after the timing outline below). This approach also allows textual descriptions to be provided in sync alongside recorded descriptions in a WebVTT resource, as may be in use with a Braille device. It can also provide mixed text and audio descriptions in cases where, e.g., proper names would not be read out correctly by a screen reader.
-- silence from 0s-15s
-- video description #1 from 15s-20s
-- silence from 20s-30s
-- video description #2 from 30s-35s
-- silence from 35s-45s
-- video description #3 from 45s-50s
-- silence from 50s-60s
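
A minimal script sketch of that WebVTT-based approach, assuming a <video id="v1"> whose first text track is the kind="metadata" track and whose cues each carry the URL of one description snippet (the id, track position, and cuechange hook are illustrative, not part of any proposal here):

var video = document.getElementById("v1");   // hypothetical id of the main video
var meta = video.textTracks[0];              // assumes the metadata track is first
meta.mode = meta.HIDDEN;                     // fire cue events without rendering
meta.oncuechange = function () {
  for (var i = 0; i < meta.activeCues.length; i++) {
    // each cue's text is assumed to hold the URL of one audio snippet
    var snippet = new Audio(meta.activeCues[i].text);
    snippet.play();
  }
};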

Possible solutions to the markup challenge:

(1) No markup in HTML - leave to a manifest file

For example synchronizing external audio description and sign language video with main video:

<video id="v1" poster=“video.png” controls>
 <source src=“manifest_webm” type=”video/webm”>
 <source src=“manifest_mpg” type=”video/mp4”>
 <track kind=”captions” srclang=”en” src=”captions.vtt”>
</video>

In this approach we do not distinguish between the markup for a multitrack media resource where the tracks are provided in-band or externally. Instead we expect this information to be available in some kind of manifest file which the browser will parse and expose to the Web page as though all the tracks are available in-band.
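
No manifest format is defined by this option (see the disadvantages below), so the following is purely a hypothetical illustration of the information such a file would have to carry, written as the JavaScript object a parser might produce:

// Hypothetical parsed manifest -- the real format is undefined (disadvantage 2)
var manifest = {
  main:   [ { src: "video.webm",    type: "video/webm" },
            { src: "video.mp4",     type: "video/mp4" } ],
  tracks: [ { kind: "descriptions", language: "en",
              src: "audesc.ogg",    type: "audio/ogg" },
            { kind: "signings",     language: "asl",
              src: "signlang.webm", type: "video/webm" } ]
};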

Advantages:

  1. + There is no need to define any new markup (i.e. elements and attributes) for it, just a JavaScript API.
  2. + This could work well with adaptive streaming which probably also needs a manifest file.
  3. + Since a manifest file is restricted to a certain content type, this also makes it easy to provide the correctly encoded alternative media resources with the correct main resource (the "codec" issue).
  4. + The synchronization is completely handed to the browser and it will make sure that start time and progress line up.
  5. + This approach also allows for the introduction of snippet synchronization rather than fully synchronized audio descriptions, since the basis of adaptive streaming is a collection of snippets.

Disadvantages:

  1. - It makes it non-obvious in the HTML markup whether an audio description track / sign language track / other track is available (though that is the case for in-band tracks, too) (the "discoverability" issue).
  2. - There is a need to define the manifest file format to deal with multiple tracks.
  3. - It is impossible(?) to style the tracks through CSS, e.g. make one track small and overlay it onto the video. For rendering we have to rely on the browser.

JavaScript API: We require a means to expose the list of available tracks for a media resource in JavaScript and a means to activate/deactivate the tracks. For example:

interface MediaTrack {
  readonly attribute DOMString kind;
  readonly attribute DOMString label;
  readonly attribute DOMString language;

  const unsigned short OFF = 0;
  const unsigned short HIDDEN = 1;
  const unsigned short SHOWING = 2;
           attribute unsigned short mode;

}
interface HTMLMediaElement : HTMLElement {
  [...]
  readonly attribute TextTrack[] textTracks;
  readonly attribute MediaTrack[] mediaTracks;
};

With such an interface, we can e.g. use the following to activate/deactivate the first English audio description track:

for (var i = 0; i < video.mediaTracks.length; i++) {
  if (video.mediaTracks[i].kind == "descriptions" && video.mediaTracks[i].language == "en") {
    video.mediaTracks[i].mode = video.mediaTracks[i].SHOWING;
    break;
  }
}


Rendering:

There is only one video element on the page, but potentially several video tracks for this video.

  • we probably need to render all of the tracks into the same video viewport with only one control, e.g. tiled, picture-in-picture, or as a scrollable list on the side of the main video, see http://www.w3.org/WAI/PF/HTML/wiki/Media_Multitrack_Media_Rendering; this could be specified as a CSS style on the video element, video.style.tracks = {tiled, pip, list} with a default of "tiled".
  • we need to be able to address the individual video tracks through CSS and be able to change some CSS styles such as background, border, width, height, opacity, transitions, transforms and animations. The use of a pseudo-element to address the different video tracks (and audio tracks for that matter) is probably necessary: something like ::mediaTrack(id) with id being the index in the MediaTrack[] list.

Then we can do the following:

video {
  tracks: pip;
}
video::mediaTrack(2) {
  width: 200px;
  opacity: 0.7;
}

We probably also need to add the list of available tracks to the menu for track selection, and probably make it possible to close individual tracks (e.g. through an "X" in a corner).

(2) Overload <track> inside <video>

For example synchronizing external audio description and sign language video with main video:

<video id="v1" poster=“video.png” controls>
 <!-- primary content -->
 <source src=“video.webm” type=”video/webm”>
 <source src=“video.mp4” type=”video/mp4”>
 <track kind=”captions” srclang=”en” src=”captions.vtt”>

 <!-- pre-recorded audio descriptions -->
 <track src="audesc.ogg" kind="descriptions" type="audio/ogg" srclang="en" label="English Audio Description">
 <track src="audesc.mp3" kind="descriptions" type="audio/mp3" srclang="en" label="English Audio Description">

 <!-- sign language overlay -->
 <track id="signwebm" src="signlang.webm" kind="signings" type="video/webm" srclang="asl" label="American Sign Language">
 <track id="signmp4" src="signlang.mp4" kind="signings" type="video/mp4" srclang="asl" label="American Sign Language">
</video>

In this approach we add a @type attribute to the <track> element, allowing it to also be used with external audio and video and not just text tracks.

Advantages:

  1. + The HTML markup clearly exposes what tracks are available (the "discovery" issue).
  2. + All types of external tracks are perceived to be handled in the same way, no matter if text, video or audio.

Disadvantages:

  1. - The given example uses replication of <track> elements for alternative codec files (the "codec" issue). It would also be possible to introduce <source> elements under <track> to cover this need. Neither of these options seems particularly elegant.
  2. - It conflates media and text tracks in the same interface, which makes the markup harder to read, author, parse, and style through CSS, e.g. if we would like to style all text tracks.
  3. - We lose all the functionality that is available to audio and video resources in the <audio> and <video> elements, such as setting the volume, width, and height.
  4. - Since the <track> element now isn't a full media element, it does not expose the features of a media element such as error states, seeking position, controls, muting, its own volume, etc. This may also be an advantage...
  5. - It is necessary to define a default rendering means for the child a/v tracks. This may be overridden by CSS.

JavaScript API: We would reuse the TextTrack API for these types of tracks, too, and just introduce some further @kind values such as "signings" or "descriptions". However, a part of the TextTrack API - the elements and attributes dealing with cues - will be irrelevant and we only need these parts:

interface TextTrack {
  readonly attribute DOMString kind;
  readonly attribute DOMString label;
  readonly attribute DOMString language;

  const unsigned short OFF = 0;
  const unsigned short HIDDEN = 1;
  const unsigned short SHOWING = 2;
           attribute unsigned short mode;

  const unsigned short NONE = 0;
  const unsigned short LOADING = 1;
  const unsigned short LOADED = 2;
  const unsigned short ERROR = 3;
  readonly attribute unsigned short readyState;
  readonly attribute Function onload;
  readonly attribute Function onerror;
}

With such an interface, we can e.g. use the following to activate/deactivate the first English audio description track:

for (var i = 0; i < video.textTracks.length; i++) {
  if (video.textTracks[i].kind == "descriptions" && video.textTracks[i].language == "en") {
    video.textTracks[i].mode = video.textTracks[i].SHOWING;
    break;
  }
}
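
The loading state and error members retained from the TextTrack API could be used to fall back gracefully when an external description track does not load. A sketch, assuming the handlers are assignable (the excerpt above marks them readonly, presumably an oversight) and that the description track is the second entry in video.textTracks:

var desc = video.textTracks[1];   // assumed position of the description track
desc.onerror = function () {
  // hypothetical fallback, e.g. enable a text description track instead
  console.log("audio description track failed to load");
};
desc.onload = function () {
  desc.mode = desc.SHOWING;       // enable only once the track has loaded
};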


Rendering:

Again, there is only one video element on the page, so we probably need to render all of the tracks into the same video viewport with only one control.

Again, there is the question of layout, which could be done as, e.g. tiled, picture-in-picture, or as a list on the side of the main video, see http://www.w3.org/WAI/PF/HTML/wiki/Media_Multitrack_Media_Rendering; this could be specified as a CSS style on the video element, video.style.tracks = {tiled, pip, list} with a default of "tiled".

Since the individual tracks are explicitly marked-up, there is probably no need for a pseudo-selector (although... what about in-band tracks...).

We can do the following:

video {
  tracks: pip;
}
track#signwebm, track#signmp4 {
  width: 200px;
  opacity: 0.7;
}

(3) Introduce <audiotrack> and <videotrack>

Instead of overloading <track>, one could consider creating new track elements for audio and video, such as <audiotrack> and <videotrack>.

This allows keeping different attributes on these elements and having audio / video / text track lists separate in JavaScript.

Also, it allows for <source> elements inside media tracks, e.g.:

<video id="v1" poster=“video.png” controls>
 <source src=“video.webm” type=”video/webm”> <!-- primary content -->
 <source src=“video.mp4” type=”video/mp4”> <!-- primary content -->
 <track kind=”captions” srclang=”en” src=”captions.vtt”>
 <audiotrack kind=”descriptions” srclang=”en”> <!-- pre-recorded audio descriptions -->
   <source src=”description.ogg” type=”audio/ogg” label="English Audio Description">
   <source src=”description.mp3” type=”audio/mp3”>
 </audiotrack>
 <videotrack kind="signings" srclang="asl" label="American Sign Language">  <!-- sign language overlay -->
   <source src="signing.webm" type="video/webm">
   <source src="signing.mp4" type="video/mp4">
 </videotrack>
</video>

It is possible to put the @src directly on the media track element instead of using a <source> element if there is only one resource. This turns the media track into its own media element. We thus get a replication of some of the audio / video functionality from the <audio> and <video> elements, which raises the further question of how much recursion we allow.

Advantages:

  1. + The HTML markup clearly exposes what tracks are available (the "discovery" issue).
  2. + It keeps a clear separation between audio, video and text tracks, which makes it easier to read and author and parse and style through CSS, e.g. if we would like to style all text tracks.

Disadvantages:

  1. - Every media track has the full functionality of a media resource, but we don't want some of that functionality, such as separate seeking and separate controls.
  2. - It is necessary to define a default rendering means for the child a/v tracks. This may be overridden by CSS.

JavaScript API:

interface VideoTrack : HTMLVideoElement {
  readonly attribute DOMString kind;
  readonly attribute DOMString label;
  readonly attribute DOMString language;

  const unsigned short OFF = 0;
  const unsigned short HIDDEN = 1;
  const unsigned short SHOWING = 2;
           attribute unsigned short mode;

}
interface AudioTrack : HTMLAudioElement {
  readonly attribute DOMString kind;
  readonly attribute DOMString label;
  readonly attribute DOMString language;

  const unsigned short OFF = 0;
  const unsigned short HIDDEN = 1;
  const unsigned short SHOWING = 2;
           attribute unsigned short mode;

}
interface HTMLMediaElement : HTMLElement {
  [...]
  readonly attribute TextTrack[] textTracks;
  readonly attribute VideoTrack[] videoTracks;
  readonly attribute AudioTrack[] audioTracks;
};

With such an interface, we can e.g. use the following to activate/deactivate the first English audio description track:

for (var i = 0; i < video.audioTracks.length; i++) {
  if (video.audioTracks[i].kind == "descriptions" && video.audioTracks[i].language == "en") {
    video.audioTracks[i].mode = video.audioTracks[i].SHOWING;
    break;
  }
}
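
Because AudioTrack in this option inherits from HTMLAudioElement, the relative-volume side condition can be met directly; a sketch, assuming the description track is the first entry in video.audioTracks:

var desc = video.audioTracks[0];  // assumed position of the description track
desc.mode = desc.SHOWING;
desc.volume = 1.0;                // description at full volume ...
video.volume = 0.6;               // ... over a ducked main soundtrack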

Rendering:

Again, there is only one video element on the page, so we probably need to render all of the tracks into the same video viewport with only one control.

Again, there is the question of layout, which could be done as, e.g. tiled, picture-in-picture, or as a list on the side of the main video, see http://www.w3.org/WAI/PF/HTML/wiki/Media_Multitrack_Media_Rendering; this could be specified as a CSS style on the video element, video.style.tracks = {tiled, pip, list} with a default of "tiled".

Since the individual tracks are explicitly marked-up, there is probably no need for a pseudo-selector (although... what about in-band tracks...).

We can do the following:

video {
  tracks: pip;
}
videotrack {
  width: 200px;
  opacity: 0.7;
}

(4) Introduce a <par>-like element

The fundamental challenge that we are facing is to find a way to synchronise multiple audio-visual media resources, be it in-band, where the overall timeline is clear, or separate external resources, where the overall timeline has to be defined. Then we are suddenly no longer talking about a master resource and auxiliary resources, but about audio-visual resources that are equals. This is more along the SMIL way of thinking, which is why we called this section the "<par>-like element".

An example markup for synchronizing external audio description and sign language video with a main video could be something like:

<par>
 <!-- primary content -->
 <video id="v1" poster=“video.png” controls kind="main">
   <source src=“video.webm” type=”video/webm”>
   <source src=“video.mp4” type=”video/mp4”>
   <track kind=”captions” srclang=”en” src=”captions.vtt”>
 </video>
 <!-- pre-recorded audio descriptions -->
 <audio controls kind="description" srclang="en">
   <source src="audesc.ogg" type="audio/ogg">
   <source src="audesc.mp3" type="audio/mp3">
 </audio>
 <!-- sign language overlay -->
 <video id="signtrack" controls kind="signing" srclang="asl">
   <source src="signing.webm" type="video/webm">
   <source src="signing.mp4" type="video/mp4">
 </video>
</par>

This synchronisation element could of course be called something else: <mastertime>, <coordinator>, <sync>, <timeline>, <container>, <timemaster> etc. The synchronisation element needs to provide the main timeline. It would make sure that the elements play and seek in parallel.


Advantages:

  1. + Audio and video elements can be styled individually as their own CSS block elements and deactivated with "display: none".

Disadvantages:

  1. - It is unclear what will happen when one element stalls. Will all stall? Will they only stall if the stalling comes from the main a/v resource? Will all other stalling be ignored in this case? Which one is the main element?
  2. - What should happen with the @controls attribute? Should there be a controls display on the first/master element if any of them has a @controls attribute? Should the slave elements not have controls displayed?
  3. - There are new attributes on the audio and video elements: @srclang, @kind and @label.
  4. - Every media track has the full functionality of a media resource, but we don't want some of that functionality, such as separate seeking and separate controls.
  5. - It is difficult to put up a user menu for these tracks and generally to style them. E.g. if a sign language video is not playing, should it be displayed?
  6. - How to reconcile in-band tracks into this model?

JavaScript API:

interface video : HTMLMediaElement {
  [...]
  readonly attribute DOMString kind;
  readonly attribute DOMString label;
  readonly attribute DOMString language;

  const unsigned short OFF = 0;
  const unsigned short HIDDEN = 1;
  const unsigned short SHOWING = 2;
           attribute unsigned short mode;

}
interface audio : HTMLMediaElement {
  [...]
  readonly attribute DOMString kind;
  readonly attribute DOMString label;
  readonly attribute DOMString language;

  const unsigned short OFF = 0;
  const unsigned short HIDDEN = 1;
  const unsigned short SHOWING = 2;
           attribute unsigned short mode;

}
interface par {
  readonly attribute Video[] videoTracks;
  readonly attribute Audio[] audioTracks;
};

With such an interface, we can e.g. use the following to activate/deactivate the first English audio description track:

for (var i = 0; i < par.audioTracks.length; i++) {
  if (par.audioTracks[i].kind == "description" && par.audioTracks[i].language == "en") {
    par.audioTracks[i].mode = par.audioTracks[i].SHOWING;
    break;
  }
}
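
Since every track in this option is an ordinary media element, it can be deactivated individually, per advantage 1 above. A sketch using the id from the markup example:

var sign = document.getElementById("signtrack");
sign.style.display = "none";   // advantage 1: hide the sign language video via CSS
// or, using the mode attribute sketched above:
sign.mode = sign.OFF;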


Rendering:

There are now multiple audio and video elements on the page, so this would probably be rendered into multiple viewports. The container would, in fact, represent a common viewport. So, <par> is probably similar to <div> but with the clear notion of being a time container as well as a CSS box.

Now, there is not really a problem with layout any more, because we get the CSS box model for free in the <par>. However, since we know that it consists only of video and audio boxes, it may still make sense to introduce a CSS style on <par> that organises these, e.g. tiled, picture-in-picture, or as a list on the side of the main video, see http://www.w3.org/WAI/PF/HTML/wiki/Media_Multitrack_Media_Rendering; par.style.tracks = {tiled, pip, list} with a default of "tiled".

Since the individual tracks are explicitly marked-up, there is probably no need for a pseudo-selector.

The main timeline provider is the track with kind="main". It may make sense to turn off controls for all other tracks automatically.

We can now do the following:

par {
  tracks: pip;
}
par > video#signtrack {
  width: 200px;
  opacity: 0.7;
}

(5) Nest media elements

An alternative means of re-using <audio> and <video> elements for synchronisation is to put the "slave" elements inside the "master" element like so:

<video id="v1" poster=“video.png” controls> <!-- primary content -->
 <source src=“video.webm” type=”video/webm”>
 <source src=“video.mp4” type=”video/mp4”>
 <track kind=”captions” srclang=”en” src=”captions.vtt”>
 <par>
   <audio controls> <!-- pre-recorded audio descriptions -->
     <source src="audesc.ogg" type="audio/ogg">
     <source src="audesc.mp3" type="audio/mp3">
   </audio>
   <video controls> <!-- sign language overlay -->
     <source src="signing.webm" type="video/webm">
     <source src="signing.mp4" type="video/mp4">
   </video>
 </par>
</video>

This makes clear whose timeline the element is following. But it certainly looks recursive, and we would have to define that elements inside a <par> cannot themselves contain another <par> to stop that. So, this is actually almost identical to option (3).


(6) Synchronize separate media elements through attributes

An alternative to marked-up synchronization through an element (see options 2 to 5) is the introduction of attributes that link two or more media elements with each other, designating one as the synchronization master.

This is the way in which the Timesheets implementations at http://labs.kompozer.net/timesheets/audio.html#htmlMarkup and http://labs.kompozer.net/timesheets/video.html#htmlMarkup synchronize multiple media resources (see also http://www.w3.org/TR/SMIL3/smil-timing.html#Timing-ControllingRuntimeSync):

<!-- primary content -->
<video id="v1" controls>
  <source src="video.webm" type="video/webm">
  <source src="video.mp4" type="video/mp4">
  <track kind="captions" srclang="en" src="captions.vtt">
</video>
<!-- pre-recorded audio descriptions -->
<audio id="a1" controls syncMaster="v1" kind="descriptions" srclang="en">
  <source src="audesc.ogg" type="audio/ogg">
  <source src="audesc.mp3" type="audio/mp3">
</audio>
<!-- sign language overlay -->
<video id="v2" controls syncMaster="v1" kind="signing" srclang="asl">
  <source src="signing.webm" type="video/webm">
  <source src="signing.mp4" type="video/mp4">
</video>

The "mediaSync/syncMaster" attribute would modify the playback and seeking behavior of the media elements to which it is applied. Consequently, the "controls" attribute would probably be overridden (to false) and the JavaScript control API would be disabled so that the playback of the supplemental audio and video would depend entirely on the "master" media object.

Advantages:

  1. + Audio and video elements can be styled individually as their own CSS block elements and deactivated with "display: none".
  2. + Audio and video elements retain their full functionality, but we may need to block the use of @controls and seeking, or make sure that using them affects the other related resources as well.
  3. + Doesn't require any new elements, just attributes.

Disadvantages:

  1. - How do you activate the subordinate tracks? Would the element have a disabled state?
  2. - It is unclear what will happen when one element stalls. Will all stall? Will they only stall if the stalling comes from the main a/v resource? Will all other stalling be ignored in this case? Which one is the main element?
  3. - What should happen with the @controls attribute? Should there be a controls display on the first/master element if any of them has a @controls attribute? Should the slave elements not have controls displayed?
  4. - There are new attributes on the audio and video elements: @srclang, @kind and possibly @label.
  5. - Every media track has the full functionality of a media resource, but we don't want some of that functionality, such as separate seeking and separate controls.

JavaScript API:

interface video : HTMLMediaElement {
  [...]
           attribute DOMString syncMaster;
  readonly attribute DOMString kind;
  readonly attribute DOMString label;
  readonly attribute DOMString language;

  const unsigned short OFF = 0;
  const unsigned short HIDDEN = 1;
  const unsigned short SHOWING = 2;
           attribute unsigned short mode;
}
interface audio : HTMLMediaElement {
  [...]
           attribute DOMString syncMaster;
  readonly attribute DOMString kind;
  readonly attribute DOMString label;
  readonly attribute DOMString language;

  const unsigned short OFF = 0;
  const unsigned short HIDDEN = 1;
  const unsigned short SHOWING = 2;
           attribute unsigned short mode;
}

With such an interface, we can e.g. use the following to activate/deactivate the first English audio description track of video v1:

// get all audio elements that depend on v1
var audioTracks = [];
var audioElements = document.getElementsByTagName("audio");
for (var i = 0; i < audioElements.length; i++) {
  if (audioElements[i].syncMaster == "v1") {
    audioTracks.push(audioElements[i]);
  }
}
for (var i = 0; i < audioTracks.length; i++) {
  if (audioTracks[i].kind == "descriptions" && audioTracks[i].language == "en") {
    audioTracks[i].mode = audioTracks[i].SHOWING;
    break;
  }
}

Rendering:

There are now multiple video elements on the page, probably each with its own controls. This makes them very loosely coupled. There is probably no opportunity to decide on tiling, pip or list display of the synchronised elements, but they have the full power of CSS available to themselves.


We can do the following:

video {
  position: relative;
}
video#v1 {
  width: 600px;
  height: 400px;
}
video#v2 {
  width: 200px;
  opacity: 0.7;
  left: -240px;
  top: 200px;
}

This will place v2 on top of v1.

(7) Overload <track> to link to a/v elements elsewhere on the page

Instead of using attributes on the audio or video elements to link in an external resource, we can also use <track> to do that linking.

For example synchronizing external audio description and sign language video with main video:

<video id="v1" poster=“video.png” controls>
 <!-- primary content -->
 <source src=“video.webm” type=”video/webm”>
 <source src=“video.mp4” type=”video/mp4”>
 <track kind=”captions” srclang=”en” src=”captions.vtt”>
 <!-- pre-recorded audio descriptions -->
 <track kind="descriptions" srclang="en" label="English Audio Description" idref="ad">
 <!-- sign language overlay -->
 <track kind="signings" srclang="asl" label="American Sign Language" idref="sign">
</video>
<!-- pre-recorded audio descriptions -->
<audio id="ad">
  <source src="audesc.ogg" type="audio/ogg">
  <source src="audesc.mp3" type="audio/mp3">
</audio>
<!-- sign language overlay -->
<video id="sign>
  <source src="signing.webm" type="video/webm">
  <source src="signing.mp4" type="video/mp4">
</video>

In this approach we add an @idref attribute to the <track> element, allowing it to also link to other audio and video elements.

Advantages:

  1. + The HTML markup clearly exposes what tracks are available (the "discovery" issue).
  2. + All types of external tracks are perceived to be handled in the same way, no matter if text, video or audio.
  3. + Existing audio/video and source elements are used to provide a solution for alternative codec files (the "codec" issue).

Disadvantages:

  1. - It is not immediately obvious that the separate audio and video elements are dependents of the first video element. Are they allowed @controls? What about separate seeking etc.?
  2. - How do we activate/deactivate the external audio/video elements?

JavaScript API: We would reuse the TextTrack API for these types of tracks, too, and just introduce some further @kind such as signlanguage or audiodescription. A part of the TextTrack API - the elements and attributes dealing with cues - will be irrelevant:

interface TextTrack {
  [..]
  readonly attribute DOMString idref;
}
interface HTMLMediaElement {
  [..]
  const unsigned short OFF = 0;
  const unsigned short HIDDEN = 1;
  const unsigned short SHOWING = 2;
           attribute unsigned short mode;
}

With such an interface, we can e.g. use the following to activate/deactivate the first English audio description track:

for (var i = 0; i < video.textTracks.length; i++) {
  if (video.textTracks[i].kind == "descriptions" && video.textTracks[i].language == "en") {
    var el = document.getElementById(video.textTracks[i].idref);
    el.mode = el.SHOWING;
    break;
  }
}

Rendering:

Again, there are now multiple video elements on the page, probably each with its own controls. This makes them very loosely coupled. There is probably no opportunity to decide on tiling, pip or list display of the synchronised elements, but they have the full power of CSS available to themselves.


We can do the following:

video {
  position: relative;
}
video#v1 {
  width: 600px;
  height: 400px;
}
video#sign {
  width: 200px;
  opacity: 0.7;
  left: -240px;
  top: 200px;
}

This will place the sign language video on top of v1.

(8) Overload <track>, with <source>, inside <video>

Given the following observations:

  • there are already two formats for external caption files (TTML and WebVTT) -- three if you count SRT
  • sign language is naturally coded as an optional video-formatted track

We suggest that we: a) generalize the <track> element to allow any media type (text, video, audio); and b) generalize the <track> element to allow either a @src attribute (as today) or child <source> elements.

The current <track> API allows this for in-band data that "the user agent recognises and supports as being equivalent to a text track", so we think we should extend <track> to support other media types instead of creating a new mechanism or element type. This can be done with a combination of options 2 and 3 - generalizing <track> to allow the inclusion of external audio and video, and accommodating multiple media formats and configurations with <source> elements as we do for <audio> and <video>.

Here is the example from the multi-track wiki page with multiple formats for the audio description and sign language tracks; allowing <source> inside <track> also makes it possible to include alternate caption formats:

  <video id="v1" poster=“video.png” controls>
      <!-- primary content -->
      <source src=“video.webm” type=”video/webm”>
      <source src=“video.mp4” type=”video/mp4”>

      <!-- pre-recorded audio descriptions -->
      <track id="a1" kind="descriptions" srclang="en" label="English Audio Description">
          <source src="audesc.ogg" type="audio/ogg">
          <source src="audesc.mp3" type="audio/mpeg">
      </track>

      <!-- sign language overlay -->
      <track id="v2" kind="signings" srclang="asl" label="American Sign Language">
          <source src="signlang.webm" type="video/webm">
          <source src="signlang.mp4" type="video/mp4">
      </track>

      <!-- captions -->
      <track id="c1" kind="captions"  srclang="en" label="Captions">
          <source src="captions.vtt" type="text/vtt">
          <source src="captions.xml" type="application/ttml+xml">
      </track>
  </video>

Unlike option 3, this does not require new interfaces, but it will require a new IDL attribute on <track> so it is possible to determine the media type from JavaScript (e.g. a top-level MIME type such as video, audio, text, application).

Advantages:

We can permit experimentation with accessibility @kind values (that is, with which accessibility need an option satisfies). We can also permit development in the coding formats supported for accessibility media types and @kind values.


Disadvantages:

The question of how several visual pieces are laid out remains. Given that CSS is used to lay out the rest of the page, it seems natural to use it here too, but that means CSS needs to be able to enable/disable tracks, position/size them, and be aware of user needs.


JavaScript API:

 Change TextTrack to MediaTrack and add an attribute for the type of media:

interface MediaTrack {
 readonly attribute DOMString mediaType;
 readonly attribute DOMString kind;
 readonly attribute DOMString label;
 readonly attribute DOMString language;

 const unsigned short OFF = 0;
 const unsigned short HIDDEN = 1;
 const unsigned short SHOWING = 2;
          attribute unsigned short mode;

 const unsigned short NONE = 0;
 const unsigned short LOADING = 1;
 const unsigned short LOADED = 2;
 const unsigned short ERROR = 3;
 readonly attribute unsigned short readyState;
 readonly attribute Function onload;
 readonly attribute Function onerror;
}

With this interface we can use the following to activate/deactivate the first English audio description track:

for (var i = 0; i < video.tracks.length; i++) {
   var track = video.tracks[i];
   if (track.mediaType == "audio" && track.kind == "descriptions" && track.language == "en") {
       track.mode = track.SHOWING;
       break;
   }
}

Rendering:

There is only one video element on the page, so we probably need to render all of the tracks into the same video viewport with only one control.

Layout can again be done as, e.g. tiled, picture-in-picture, or as a list on the side of the main video, see http://www.w3.org/WAI/PF/HTML/wiki/Media_Multitrack_Media_Rendering; this could be specified as a CSS style on the video element, video.style.tracks = {tiled, pip, list} with a default of "tiled".

Since the individual tracks are explicitly marked-up, there is probably no need for a pseudo-selector (although... what about in-band tracks...).

We can do the following:

video {
  tracks: pip;
}
track#v2 {
  width: 200px;
  opacity: 0.7;
}

(9) Audio Track Selection for Media Element / In-band only

For completeness, here is also the essence of the change proposal by Microsoft, as per http://lists.w3.org/Archives/Public/public-html/2011Feb/0363.html.

It is a minimal extension to the existing HTMLMediaElement API in that it does not provide detailed access to the media tracks themselves, but merely provides a means of indicating their presence and a means of selecting between the presentation modes.

There is no new HTML markup.

A concern is voiced that extending the text track mechanism is problematic for external media tracks, since the required level of synchronisation between the external track and the internal track would require a sophisticated media engine.

This proposal therefore proposes to only support tracks internal to the media resource, where synchronisation is readily achievable using existing media frameworks. It does so by introducing an extension to the existing HTMLMediaElement API for determining whether there are alternate audio tracks contained inside a media resource, based on the intended mode of presentation of those tracks (e.g. audio descriptions, alternative language tracks); to allow the page author to publish the corresponding media content and provide the means to select between the audio tracks; and to encourage browser developers to expose UI in their default players to select alternate tracks in multitrack media resources.


Advantages:

  1. + There is no need to define any new markup (i.e. elements and attributes) for it, just a JavaScript API.
  2. + Existing media frameworks already deal with such multitrack resources, so implementation is simpler.
  3. + The actual media data can be stored in segments (depending on the media format on the server end). It is up to the UA to stitch them together under the hood, so no additional bandwidth would be required.

Disadvantages:

  1. - No means to provide accessibility data as additional data out-of-band - thus the main media becomes heavier and substantially more bandwidth is used per video.
  2. - This proposal is only focused on additional audio tracks - what about video?

JavaScript API: We require a means to expose the list of available tracks for a media resource in JavaScript and a means to activate/deactivate the tracks:

interface HTMLMediaElement : HTMLElement {
  [...]

  // audio tracks
  readonly attribute unsigned long audioTrackCount;
  readonly attribute DOMString audioTrackLanguage[];
           attribute unsigned long currentAudioTrack;
};

The audioTrackCount attribute represents the number of audio tracks embedded in the media resource assigned to the media element. The audioTrackLanguage attribute represents the language [BCP47] of each of the audio tracks based on a zero-based index. The currentAudioTrack attribute represents the index of the audio track currently selected for the media element.

When the currentAudioTrack selection is changed, the user agent must queue a task to fire a simple event named "audiotrackchange".

If the author selects an index out of the range allowed by the audioTrackCount attribute, the UA throws an INDEX_SIZE_ERR exception.
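
For example, a page could keep its own track menu in sync with the selection by listening for that event; a minimal sketch, assuming the media element is videoElement as in the example below:

videoElement.addEventListener("audiotrackchange", function () {
    // reflect the new selection, e.g. update a custom track menu
    console.log("audio track " + videoElement.currentAudioTrack + " (" +
                videoElement.audioTrackLanguage[videoElement.currentAudioTrack] +
                ") selected");
});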


With such an interface, we can e.g. use the following to activate/deactivate the first English audio description track:

  // select the audio track for English as spoken in the United Kingdom
  for(var i = 0; i < videoElement.audioTrackCount; i++){
      if(videoElement.audioTrackLanguage[i] == "en-GB"){
          videoElement.currentAudioTrack = i;
          break;
      }
  }

The following issues with this concrete proposal have been voiced:

  • Only allows one audio track to be enabled at a time, making it unsuitable for an audio description voice-over, which should be played in sync with the original audio track. I'm not sure how common this is in practice, but the alternative is to make a complete new audio mix with both the original audio and the voice-over.
  • audioTrackCount/audioTrackLanguage is inconsistent with TextTrack[] tracks where language information is in TextTrack.language.
  • audioTrackCount is redundant with audioTrackLanguage.length
  • Enables/disables tracks using currentAudioTrack rather than TextTrack.mode
  • If audio tracks are added or removed during playback, will currentAudioTrack unsigned long implicitly change with it or will the current track actually change?

Note that audio tracks don't need extra rendering. A more generic approach is detailed in option 1.

(10) HTML Accessibility Task Force proposal - "The San Diego Thought Experiment"

Evolving working proposal from the HTML accessibility task force's San Diego face-to-face.

Summary of the use case discussion:

  1. we need to support in-band multitrack media, e.g. sign language, audio description, language dubs provided in the same resource in-band
  2. we need to support out-of-band multitrack media which are tightly coupled, in that they require a single control, where the main resource is a master and the markup needs to be simple, without requiring extra CSS to display them as though they were in-band
  3. we may want to support out-of-band multitrack media which are loosely coupled, in that they are separate elements on the page each with their own controls, but interacting with one means the other(s) follow

Guidance for developing the synchronization solution:

  • video should be synchronized for sign language to 0.5sec accuracy
  • audio should be synchronized for audio descriptions to 100ms resolution
  • we want to achieve a consistent API between in-band and external audio/video tracks
  • we want to be able to control the relative volume of an additional audio track, by both user and author
  • we want to be able to control positioning of video tracks as picture-in-picture or side-by-side viewports in script by author
  • we don't want to inflict more complexity on the <source> selection, which is based on choice of encoding formats, but we can replicate that somewhere else
  • the same source selection algorithm needs to be applied to all choices of media encoding formats, so it is predictable to the author
  • we want to satisfy the 80% case - e.g. synchronizing an extended audio description with an un-extended original video is not what we want to do here
  • The main resource defines the timeline of all the tracks. They are always in sync.
  • External resources that contain multiple tracks would be parsed into their individual tracks and added as TextTrack, VideoTrack, or AudioTrack.

JavaScript API:

interface HTMLMediaTrack {
  readonly attribute DOMString kind;
  readonly attribute DOMString label;
  readonly attribute DOMString language;

  const unsigned short HAVE_NOTHING = 0;
  const unsigned short HAVE_METADATA = 1;
  const unsigned short HAVE_CURRENT_DATA = 2;
  const unsigned short HAVE_FUTURE_DATA = 3;
  const unsigned short HAVE_ENOUGH_DATA = 4;
  readonly attribute unsigned short readyState;

  readonly attribute MediaError error;
          attribute Function onerror;

  const unsigned short OFF = 0;
  const unsigned short INACTIVE = 1;
  const unsigned short ACTIVE = 2;
          attribute unsigned short mode;
          attribute Function onmodechange;

          attribute DOMString src;
  readonly attribute DOMString currentSrc;
};

interface TextTrack : HTMLMediaTrack {
  readonly attribute TextTrackCueList cues;
  readonly attribute TextTrackCueList activeCues;
           // event raised if a cue becomes active/inactive
           // with target being the activated/deactivated TextTrackCue
           attribute Function oncueenter;
           attribute Function oncueexit;
};

interface VideoTrack : HTMLMediaTrack {
           attribute unsigned long width;
           attribute unsigned long height;
  readonly attribute unsigned long videoWidth;
  readonly attribute unsigned long videoHeight;
};

interface AudioTrack : HTMLMediaTrack {
           attribute boolean muted;
           attribute double volume;
};

interface MediaTracksCollection {
  readonly attribute unsigned long length;
  getter HTMLMediaTrack (in unsigned long index);
  HTMLMediaTrack getTrackById(in DOMString id);
};

interface HTMLMediaElement {
  [...]
  // use e.g. (track[i] instanceof AudioTrack) to identify the media type
  // use track.length to identify the number of tracks
  readonly attribute MediaTracksCollection track;
  // returns for a MIME type (incl. text types) whether it can be used in a track: ''/maybe/probably
  DOMString canPlayTrackType(in DOMString type);
 };
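
The earlier options each close their API section with a usage example; in that same style, a sketch that enables the first English audio description track, assuming the collection and constants above (note this option uses OFF/INACTIVE/ACTIVE rather than SHOWING):

for (var i = 0; i < video.track.length; i++) {
  var t = video.track[i];
  // AudioTrack instances identify audio media (see the comment in the IDL)
  if (t instanceof AudioTrack && t.kind == "descriptions" && t.language == "en") {
    t.mode = t.ACTIVE;
    break;
  }
}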


Markup:

We discussed a few examples of markup for external resources. The most consistent means of marking up for the JS API above is to extend the functionality of the <track> element.

To make <track> work for audio and video, we introduced the <source> tag also for <track>. The source selection algorithm will continue to work as for the media elements. This can now also be applied to text tracks, e.g. to select between alternative text encodings of the same content:

<video id="v1" poster="video.png" controls>
    <!-- primary content -->
    <source src="video.webm" type="video/webm">
    <source src="video.mp4" type="video/mp4">

    <!-- captions -->
    <track id="c1" kind="captions"  srclang="en" label="Captions">
        <source src="captions.vtt" type="text/vtt">
        <source src="captions.xml" type="application/ttml+xml">
    </track>
</video>

The @name attribute is used to define mutually exclusive tracks, in the same way it is with <input> for radio buttons. @mode='enabled' is used to mark the track in a group that is to be enabled in the absence of a user preference:

<video id="v1" poster="lecture.png" controls>
    <!-- audio -->
    <source src="lecture_audio.mp4" type="video/mp4">

    <!-- video, with and without burned-in sign language, only one will be used -->
    <track name="Lecture video" src="lecture_video.mp4"  mode="enabled"></track>
    <track name="Lecture video" src="lecture_video_with_signing.mp4" ></track>
</video>


<video id="v1" poster="lecture.png" controls>
    <!-- "timeline" track, does not provide audio or video, used to define the timeline -->
    <source src="timeline.mp4" type="video/mp4">

    <!-- video, with and without burned-in sign language, only one will be used -->
    <track name="Lecture video" src="lecture_video.mp4" mode="enabled" label="Lecture video"></track>
    <track name="Lecture video" kind="signing" src="lecture_video_with_signing.mp4" label="Lecture video"></track>

    <!-- alternate original audio and clear-audio, only one will be used -->
    <track name="Lecture audio" src="lecture_audio.mp4" mode="enabled" label="Lecture audio"></track>
    <track name="Lecture audio" kind="clearaudio" src="lecture_clear_audio.mp4" label="Lecture clear-audio"></track>
</video>
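
A sketch of switching within such a named group from script, under the assumptions of this proposal (radio-button semantics for tracks sharing @name, and the API above):

// enable the signed alternative; because both video tracks share
// name="Lecture video", activating this one deactivates the other
var tracks = document.getElementById("v1").track;
for (var i = 0; i < tracks.length; i++) {
  if (tracks[i].kind == "signing") {
    tracks[i].mode = tracks[i].ACTIVE;
    break;
  }
}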