This Wiki page is edited by participants of the HTML Accessibility Task Force. It does not necessarily represent consensus and it may have incorrect information or information that is not supported by other Task Force participants, WAI, or W3C. It may also have some very useful information.

Media Navigation


Navigation of audio and video content by semantic structure

Related Bugs and Issues

http://www.w3.org/Bugs/Public/show_bug.cgi?id=12662

http://www.w3.org/Bugs/Public/show_bug.cgi?id=10693

http://www.w3.org/Bugs/Public/show_bug.cgi?id=13184

http://www.w3.org/html/wg/tracker/issues/163


Proposals due: 1st July

http://lists.w3.org/Archives/Public/public-html/2011May/0428.html


Use Case

See also: http://www.w3.org/WAI/PF/HTML/wiki/Media_Accessibility_Requirements#Content_navigation_by_content_structure

Media resources are typically large, time-based objects that do not easily expose direct access to their semantic content. Seeking on the transport bar is often the only means of navigating to a location in the timeline that a user believes contains the information they are looking for. This is a very inexact means of navigation, based on guessing rather than on semantic knowledge. It is particularly useless for blind users, who cannot even gain the small insights that the timeline exposes.

We know from DVDs that direct access to chapters and subchapters in videos is a more successful and accurate means of navigation.

The DAISY standard provides a similarly accurate and useful means of navigation to blind users. It allows them to gain an overview of the media resource's content through section markers that allow a type of "speed reading".

Just as the structures introduced particularly by nonfiction titles make books more usable, media is more usable when its inherent semantic structure is exposed. Direct access to semantic structure is critical for persons with disabilities who cannot infer structure from purely presentational cues.


Requirements

HTML5 has introduced the notion of "chapter tracks" to satisfy the navigational needs of users on media resources. As the specification stands right now, chapter tracks satisfy the DVD use case: a timeline is broken into linearly successive sections without any further subdivision.

However, to replicate the flexibility of the DAISY standard, we need to introduce several levels of hierarchical navigation. While DAISY supports an unlimited number of hierarchically organised navigation levels, a maximum of 6 levels has been seen in the wild and 3-4 levels are typical.

An example with multiple levels may be a reading of the bible with the following levels:

  • h1: testaments (old/new)
  • h2: books inside the testaments
  • h3: chapters inside the books
  • h4: verses inside chapters
  • h5: phrases inside verses
  • h6: words inside phrases

DAISY devices provide the following keyboard controls to support the navigation:

  • up arrow: move up a hierarchical level
  • down arrow: drill down into a hierarchical level
  • left/right arrow: move between entities of a single hierarchical level
  • enter: select to execute the navigation
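The key bindings above can be sketched as a cursor over a chapter tree. This is only an illustration under stated assumptions: makeNode, NavCursor and the start times (in seconds) are hypothetical names and values, not part of DAISY or of any HTML5 API.

```javascript
// Hypothetical helper: builds a tree node and links children to their parent.
function makeNode(label, start, children = []) {
  const node = { label, start, children, parent: null };
  for (const child of children) child.parent = node;
  return node;
}

// Hypothetical cursor that mimics the DAISY arrow-key model.
class NavCursor {
  constructor(root) {
    this.root = root;
    this.current = root.children[0]; // start on the first top-level entry
  }
  // up arrow: move up one level (never onto the invisible root)
  up() {
    if (this.current.parent !== this.root) this.current = this.current.parent;
    return this.current;
  }
  // down arrow: drill down into the first child, if there is one
  down() {
    if (this.current.children.length > 0) this.current = this.current.children[0];
    return this.current;
  }
  // left/right arrow: move between siblings of the same level
  sibling(offset) {
    const siblings = this.current.parent.children;
    const index = siblings.indexOf(this.current) + offset;
    if (index >= 0 && index < siblings.length) this.current = siblings[index];
    return this.current;
  }
  left() { return this.sibling(-1); }
  right() { return this.sibling(1); }
  // enter: return the time to seek to; a page script would assign this
  // value to the media element's currentTime
  enter() { return this.current.start; }
}

// Made-up excerpt of the bible example; the times are illustrative only.
const root = makeNode("root", 0, [
  makeNode("Old Testament", 0, [
    makeNode("Genesis", 0, [makeNode("Chapter 1", 0), makeNode("Chapter 2", 240)]),
    makeNode("Exodus", 5400),
  ]),
  makeNode("New Testament", 90000),
]);
const cursor = new NavCursor(root);
cursor.down();  // Genesis
cursor.right(); // Exodus
console.log(cursor.current.label, cursor.enter()); // prints: Exodus 5400
```

Keeping a parent pointer on every node is what makes "up" cheap; any markup solution has to preserve that parent/child relationship for these controls to be implementable.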

Examples of hierarchical chapters from DAISY can be found at http://www.daisy.org/sample-content - see in particular those that say "demonstrating DAISY navigation".


Related Markup examples as background information

Related navigation constructs in HTML

1. Headers (h1, h2, ..., h6)

Screen readers navigate HTML headers in a similar manner to how DAISY navigates: it is possible to jump between headers of the same level and drill down into lower levels.

2. Lists (ol, ul)

Screen readers also navigate ordered or unordered HTML lists in a similar manner.

3. Navigation (nav)

Typically ol/ul is used inside a nav to provide for navigation structure. nav provides additional semantics for screen readers.

4. Section / Article (section, article)

These are new HTML5 elements and not supported in screen readers yet, so tend to just be mapped to div.


Note that all screen readers that navigate through hierarchical constructs do so in a depth-first manner. Some also allow the user to ignore the depth and continue on the same level. This seems best implemented using headers. (is this true?)
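The two traversal orders mentioned in the note can be illustrated with a small sketch. The outline object and the function names depthFirst and atLevel are assumptions made for this example; they do not describe how any particular screen reader is implemented.

```javascript
// Illustrative outline: two top-level headers with nested sub-headers.
const outline = {
  label: "document",
  children: [
    { label: "1", children: [
      { label: "1.1", children: [] },
      { label: "1.2", children: [] },
    ] },
    { label: "2", children: [
      { label: "2.1", children: [] },
    ] },
  ],
};

// Depth-first order: visit an entry, then everything below it,
// before moving on to its next sibling.
function depthFirst(node, out = []) {
  for (const child of node.children) {
    out.push(child.label);
    depthFirst(child, out);
  }
  return out;
}

// Same-level order: collect only the entries at the requested depth
// (1 = top level), skipping over deeper and shallower entries.
function atLevel(node, level, out = []) {
  for (const child of node.children) {
    if (level === 1) out.push(child.label);
    else atLevel(child, level - 1, out);
  }
  return out;
}

console.log(depthFirst(outline)); // [ '1', '1.1', '1.2', '2', '2.1' ]
console.log(atLevel(outline, 2)); // [ '1.1', '1.2', '2.1' ]
```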


The DAISY navigation approach

DAISY/DTB/EPUB use a special XML file format called NCX (navigation control file for XML applications) to create a navigation structure over the HTML files provided as part of a document package. Its development was motivated by the need to provide quick access to the main structural elements of the DAISY document without the need to parse the entire marked-up text files. It introduces "navMap", "navPoint" and "navList" elements. Here is an example:

  <navMap>
    <navPoint id="ncx1" class="h1" playOrder="1">
      <navLabel><text>Valentin Haüy The father of the education for the blind</text></navLabel>
      <content src="valentinhauy11.html#ops1" />
      <navPoint id="ncx2" class="h2" playOrder="2">
        <navLabel><text>List of contents</text></navLabel>
        <content src="valentinhauy11.html#rgn_cnt_0026" />
      </navPoint>
      <navPoint id="ncx3" class="h2" playOrder="4">
        <navLabel><text>Preface</text></navLabel>
        <content src="valentinhauy11.html#rgn_cnt_0095" />
      </navPoint>
      <navPoint id="ncx4" class="h2" playOrder="6">
        <navLabel><text>1. Research questions</text></navLabel>
        <content src="valentinhauy11.html#rgn_cnt_0103" />
      </navPoint>
      [...]
    </navPoint>
    [...]
  </navMap>

This example shows two navigation levels.


For multimedia files, xhtml files with links to smil files are used by DAISY, for example:

  <?xml version="1.0" encoding="utf-8"?>
  <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "xhtml1-transitional.dtd">
  <html xmlns="http://www.w3.org/1999/xhtml">
	<head>
		<title>ごんぎつね</title>
		<meta name="dc:format" content="Daisy 2.02"/>
	</head>
	<body>
		<h1 class="title" id="hotl_0001"><a href="hotl0001.smil#hotl_0001">ごん狐</a></h1>
		<h1 id="hotl_0003"><a href="hotl0002.smil#dol_1_2_hotl_0003">一</a></h1>
		<span class="page-normal" id="xhot_0005"><a href="hotl0002.smil#dol_1_2_hotl_0004">1</a></span>
		<h2 id="hotl_0005"><a href="hotl0003.smil#dol_1_3_hotl_0005">1-2</a></h2>
		<h2 id="hotl_0006"><a href="hotl0004.smil#dol_1_4_hotl_0006">1-3</a></h2>
		<h2 id="hotl_0007"><a href="hotl0005.smil#dol_1_5_hotl_0007">1-4</a></h2>
		<h2 id="hotl_0008"><a href="hotl0006.smil#dol_1_6_hotl_0008">1-5</a></h2>
		<h2 id="hotl_0009"><a href="hotl0007.smil#dol_1_7_hotl_0009">1-6</a></h2>
		<h2 id="hotl_000a"><a href="hotl0008.smil#dol_1_8_hotl_000a">1-7</a></h2>
		<h2 id="hotl_000b"><a href="hotl0009.smil#dol_1_9_hotl_000b">1-8</a></h2>
		<h2 id="hotl_000c"><a href="hotl000a.smil#dol_1_a_hotl_000c">1-9</a></h2>
		<h1 id="hotl_000e"><a href="hotl000c.smil#dol_1_c_hotl_000e">二</a></h1>
		<span class="page-normal" id="xhot_003b"><a href="hotl000c.smil#dol_1_12_hotl_000f">2</a></span>
		<h2 id="hotl_0010"><a href="hotl000d.smil#dol_1_13_hotl_0010">2-1</a></h2>
		<h2 id="hotl_0011"><a href="hotl000e.smil#dol_1_14_hotl_0011">2-2</a></h2>
		<h2 id="hotl_0012"><a href="hotl000f.smil#dol_1_15_hotl_0012">2-3</a></h2>
		<h2 id="hotl_0013"><a href="hotl0010.smil#dol_1_16_hotl_0013">2-4</a></h2>
	</body>
  </html>

This example shows two navigation levels plus a content level.


Possible Markup for TTML

See: http://www.w3.org/WAI/PF/HTML/wiki/TextFormat_Mapping_to_Requirements#cn1

<?xml version="1.0" encoding="utf-8"?>
<tt xml:lang="en" ttp:timebase="clock"
xmlns="http://www.w3.org/ns/ttml"
xmlns:ttp="http://www.w3.org/ns/ttml#parameter">
  <body  role="x-nav-work" timeContainer='seq'>
    <div role="x-nav-section" timeContainer='seq'>
      <p role="x-nav-section" timeContainer='seq'>
        <span role="x-nav-section" dur="11.300s">Index point 1.1.1 </span>
        <span role="x-nav-section" dur="20.100s">Index point 1.1.2 </span>
        <span role="x-nav-section" dur="12.900s">Index point 1.1.3 </span>
        <span role="x-nav-section" dur="13.700s">Index point 1.1.4 </span>
      </p>
      <p role="x-nav-section" timeContainer='seq'>
        <span role="x-nav-section" dur="7.200s">Index point 1.2.1 </span>
        <span role="x-nav-section" dur="28.500s">Index point 1.2.2 </span>
        <span role="x-nav-section" dur="31.090s">Index point 1.2.3 </span>
        <span role="x-nav-section" dur="41.000s">Index point 1.2.4 </span>
      </p>
     </div>
    <div role="x-nav-section" timeContainer='seq'>
      <p role="x-nav-section" timeContainer='seq'>
        <span role="x-nav-section" dur="11.300s">Index point 2.1.1 </span>
        <span role="x-nav-section" dur="20.100s">Index point 2.1.2 </span>
        <span role="x-nav-section" dur="12.900s">Index point 2.1.3 </span>
        <span role="x-nav-section" dur="13.700s">Index point 2.1.4 </span>
      </p>
      <p role="x-nav-section" timeContainer='seq'>
        <span role="x-nav-section" dur="1.300s">Index point 2.2.1 </span>
        <span role="x-nav-section" dur="2.100s">Index point 2.2.2 </span>
        <span role="x-nav-section" dur="2.900s">Index point 2.2.3 </span>
        <span role="x-nav-section" dur="3.700s">Index point 2.2.4 </span>
      </p>
    </div>
  </body>
</tt>

This example shows two different navigation levels.
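To navigate such a document, the sequential durations have to be turned into absolute time ranges. The following is a rough sketch of that resolution for containers with timeContainer='seq', where each child begins when its previous sibling ends. It assumes plain objects in place of parsed TTML, handles only dur (no begin or end attributes), and uses integer milliseconds in a hypothetical durMs field to avoid floating-point drift; resolveSeq is an illustrative name.

```javascript
// Resolve begin/end (in ms) for a tree of sequential time containers.
// Leaves carry durMs; a container lasts as long as the sum of its children.
function resolveSeq(node, begin = 0) {
  if (!node.children) {
    return { label: node.label, begin, end: begin + node.durMs };
  }
  let cursor = begin;
  const children = node.children.map((child) => {
    const resolved = resolveSeq(child, cursor);
    cursor = resolved.end; // the next sibling starts where this one ends
    return resolved;
  });
  return { label: node.label, begin, end: cursor, children };
}

// First <div> of the TTML example above, with durations in milliseconds.
const section1 = {
  label: "1",
  children: [
    { label: "1.1", children: [
      { label: "1.1.1", durMs: 11300 },
      { label: "1.1.2", durMs: 20100 },
      { label: "1.1.3", durMs: 12900 },
      { label: "1.1.4", durMs: 13700 },
    ] },
    { label: "1.2", children: [
      { label: "1.2.1", durMs: 7200 },
      { label: "1.2.2", durMs: 28500 },
      { label: "1.2.3", durMs: 31090 },
      { label: "1.2.4", durMs: 41000 },
    ] },
  ],
};

const resolved = resolveSeq(section1);
// Index point 1.2.2 begins after all of 1.1 and after 1.2.1.
console.log(resolved.children[1].children[1].begin); // 65200
```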


Possible Markup for WebVTT

By analogy with the other examples, WebVTT could also provide nested navigation within cues. This is not currently specified, but it is a possible extension for chapter tracks. It might look something like the following:

WEBVTT

00:00:00.000 --> 00:00:10.700
Title Slide

00:00:10.700 --> 00:00:47.600
Introduction by Naomi Black

00:00:47.600 --> 00:07:37.900
Talk on WebVTT
<nav>
  <00:00:47.600>Impact of Captions on the Web
  <00:01:50.100>Requirements of a Video text format
  <00:03:33.000>Simple WebVTT file
  <00:04:57.766>Styled WebVTT file
  <00:06:16.666>Internationalized WebVTT file
</nav>

This example shows two navigation levels.
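Since the nested syntax is only a proposal, no parser for it exists. The sketch below shows one way a script might turn such cue text into a tree, assuming one entry per line, a timestamp tag at the start of each entry, and <nav> and </nav> on lines of their own. The names parseTimestamp and parseNavCue are hypothetical.

```javascript
// Convert "hh:mm:ss.ttt" or "mm:ss.ttt" into seconds.
function parseTimestamp(text) {
  const m = /^(?:(\d{2,}):)?(\d{2}):(\d{2})\.(\d{3})$/.exec(text);
  if (!m) throw new Error("bad timestamp: " + text);
  const [, h = "0", min, s, ms] = m;
  const totalMs = Number(h) * 3600000 + Number(min) * 60000 + Number(s) * 1000 + Number(ms);
  return totalMs / 1000;
}

// Parse the text of one chapter cue written in the proposed nested syntax.
function parseNavCue(cueText) {
  const cue = { title: "", children: [] };
  const stack = [cue]; // owners of the currently open <nav> lists
  let last = cue;      // most recent entry; a following <nav> nests under it
  for (const raw of cueText.split("\n")) {
    const line = raw.trim();
    if (line === "") continue;
    if (line === "<nav>") { stack.push(last); continue; }
    if (line === "</nav>") {
      if (stack.length > 1) last = stack.pop();
      continue;
    }
    const m = /^<([\d:.]+)>(.*)$/.exec(line);
    if (m) {
      last = { title: m[2].trim(), start: parseTimestamp(m[1]), children: [] };
      stack[stack.length - 1].children.push(last);
    } else {
      // plain text outside any entry is treated as the cue's own title
      cue.title = cue.title ? cue.title + " " + line : line;
    }
  }
  return cue;
}

const sample = [
  "Talk on WebVTT",
  "<nav>",
  "  <00:00:47.600>Impact of Captions on the Web",
  "  <00:01:50.100>Requirements of a Video text format",
  "  <00:03:33.000>Simple WebVTT file",
  "</nav>",
].join("\n");

const parsed = parseNavCue(sample);
console.log(parsed.title, parsed.children.length); // prints: Talk on WebVTT 3
```

Because a <nav> always nests under the entry that precedes it, the same function also handles deeper nesting without further changes.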


Technical solutions for HTML5

The solutions to this problem need to provide a markup means for hierarchical navigation and a JavaScript API to expose it, while at the same time making sure to maintain the relationship between the hierarchical levels such that keyboard controls as described in the requirements can be implemented.

The specification for navigation of media relies on chapter tracks: http://www.whatwg.org/specs/web-apps/current-work/multipage/the-iframe-element.html#attr-track-kind, and http://dev.w3.org/html5/spec/the-iframe-element.html#attr-track-kind.

Tracks of kind "chapters" provide: "Chapter titles, intended to be used for navigating the media resource. Displayed as an interactive list in the user agent's interface."

So, the key problem to solve is how to expose hierarchical structures to the browser/script.


Possible solutions

1. Within a (chapter) text track

The examples in the previous section show different means of marking up a chapter track in external text track resources such that cues can contain a hierarchical structure.

The thus created nested time ranges can be used for hierarchical navigation.

Part of their mapping rules into HTML5 would contain a mapping to either header tags (h1, h2, h3...) or to nested lists (ul or ol). Probably a mapping to nested ul lists would make the most sense. Then the navigation structure is available to the browser and can be handed on to AT.
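As a rough illustration of the nested-list mapping, the sketch below serializes a tree of chapter entries into nested ul markup. The entry shape ({ title, start, children }), the function names, and the data-start attribute are assumptions made for this example; no mapping rule of this kind is defined in any specification.

```javascript
// Escape text so that chapter titles cannot inject markup.
function escapeHtml(text) {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Serialize entries into nested <ul> lists; sub-entries become a <ul>
// inside the <li> of their parent entry.
function toNestedList(entries) {
  if (!entries || entries.length === 0) return "";
  const items = entries.map((entry) =>
    `<li data-start="${entry.start}">${escapeHtml(entry.title)}${toNestedList(entry.children)}</li>`
  );
  return `<ul>${items.join("")}</ul>`;
}

const chapters = [
  { title: "Subchapter 1", start: 3.45, children: [] },
  { title: "Subchapter 2", start: 4, children: [
    { title: "Paragraph 1", start: 4, children: [] },
    { title: "Paragraph 2", start: 4.5, children: [] },
  ] },
];

console.log(toNestedList(chapters));
```

Placing the inner ul inside the li of its parent keeps the result valid HTML and lets AT announce the nesting depth through its existing list support.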

Discussion:
- chapter cues can now also include lists
+ hierarchical relationship is clear
+ easy to communicate hierarchical notion to AT, since it builds on existing structures
+ no change to the existing API is required


The hierarchical markup within a cue would end up in TextTrackCues that will have some extra HTML markup in their cue text, which can be accessed through getCueAsHTML() and thus exposed to AT.

Here is an example in WebVTT to demonstrate how it works:

webvtt.vtt:

WEBVTT

00:00:00.000 --> 00:00:10.700
Title Slide

00:00:10.700 --> 00:00:47.600
Introduction by Naomi Black

00:00:47.600 --> 00:07:37.900
Talk on WebVTT
<nav>
  <00:00:47.600>Impact of Captions on the Web
  <00:01:50.100>Requirements of a Video text format
  <00:03:33.000>Simple WebVTT file
  <00:04:57.766>Styled WebVTT file
  <00:06:16.666>Internationalized WebVTT file
</nav>


track markup:

  <video src="video.ogv" controls>
    <track src="webvtt.vtt" kind="chapters" label="chapter and subchapter level navigating">
  </video>


And here is roughly how it is represented in the JS API:

TextTrackCueList {
  length : 3,
  TextTrackCue(0) {
    track: <TextTrack>,
    id : '',
    startTime: '00:00:00.000',
    endTime: '00:00:10.700',
    pauseOnExit: false,
    direction: horizontal,
    snapToLines: false,
    linePosition: 100,
    textPosition: 50,
    size: 3,
    alignment: center,
    getCueAsSource(): "Title Slide",
    getCueAsHTML(): "Title Slide"
  },
  TextTrackCue(1) {
    [..]
    startTime: '00:00:10.700',
    endTime: '00:00:47.600',
    getCueAsSource(): "Introduction by Naomi Black",
    getCueAsHTML(): "Introduction by Naomi Black"
  },
  TextTrackCue(2) {
    [..]
    startTime: '00:00:47.600',
    endTime: '00:07:37.900',
    getCueAsSource():
    "Talk on WebVTT
     <nav>
       <00:00:47.600>Impact of Captions on the Web
       <00:01:50.100>Requirements of a Video text format
       <00:03:33.000>Simple WebVTT file
       <00:04:57.766>Styled WebVTT file
       <00:06:16.666>Internationalized WebVTT file
     </nav>",
    getCueAsHTML(): "Talk on WebVTT
      <ul>
        <li><? target='timestamp' data='00:00:47.600'>Impact of Captions on the Web</li>
        <li><? target='timestamp' data='00:01:50.100'>Requirements of a Video text format</li>
        <li><? target='timestamp' data='00:03:33.000'>Simple WebVTT file</li>
        <li><? target='timestamp' data='00:04:57.766'>Styled WebVTT file</li>
        <li><? target='timestamp' data='00:06:16.666'>Internationalized WebVTT file</li>
      </ul>"
  }
}

The getCueAsHTML() accessor will return a structured DocumentFragment that AT can use to provide the navigation.


2. With multiple tracks

It is possible to provide the different navigation approaches through multiple tracks that each contain a flat navigation structure.

Parallel tracks of kind "chapters" could be used for hierarchical navigation. The user then has the chance to switch between chapter tracks to get a finer or coarser navigation resolution.

Discussion:
- hierarchical relationship between tracks and their cues is unclear
- hard to communicate the hierarchical notion to AT
+ chapter tracks continue to work as right now
+ the different tracks don't have to be strictly hierarchically dependent - they can just provide alternative chapter segmentations


To show an example, we need to have two input tracks and combine them together through the <track> markup.

Here are two WebVTT files that replicate what is shown in 1. above.

webvtt1.vtt:

WEBVTT

00:00:00.000 --> 00:00:10.700
Title Slide

00:00:10.700 --> 00:00:47.600
Introduction by Naomi Black

00:00:47.600 --> 00:07:37.900
Talk on WebVTT


webvtt2.vtt:

WEBVTT

00:00:47.600 --> 00:01:50.100
Impact of Captions on the Web

00:01:50.100 --> 00:03:33.000
Requirements of a Video text format

00:03:33.000 --> 00:04:57.766
Simple WebVTT file

00:04:57.766 --> 00:06:16.666
Styled WebVTT file

00:06:16.666 --> 00:07:37.900
Internationalized WebVTT file


track markup:

  <video src="video.ogv" controls>
    <track src="webvtt1.vtt" kind="chapters" label="level 1 navigation">
    <track src="webvtt2.vtt" kind="chapters" label="level 2 navigation">
  </video>


And here is how they would be represented in the JS API:

TextTrackCueList[0] {
  length : 3,
  TextTrackCue(0) {
    [..]
    startTime: '00:00:00.000',
    endTime: '00:00:10.700',
    getCueAsSource(): "Title Slide",
    getCueAsHTML(): "Title Slide"
  },
  TextTrackCue(1) {
    [..]
    startTime: '00:00:10.700',
    endTime: '00:00:47.600',
    getCueAsSource(): "Introduction by Naomi Black",
    getCueAsHTML(): "Introduction by Naomi Black"
  },
  TextTrackCue(2) {
    [..]
    startTime: '00:00:47.600',
    endTime: '00:07:37.900',
    getCueAsSource(): "Talk on WebVTT",
    getCueAsHTML(): "Talk on WebVTT"
  }
}
TextTrackCueList[1] {
  length: 5,
  TextTrackCue(0) {
    [..]
    startTime: '00:00:47.600',
    endTime: '00:01:50.100',
    getCueAsSource(): "Impact of Captions on the Web",
    getCueAsHTML(): "Impact of Captions on the Web"
  },
  TextTrackCue(1) {
    [..]
    startTime: '00:01:50.100',
    endTime: '00:03:33.000',
    getCueAsSource(): "Requirements of a Video text format",
    getCueAsHTML(): "Requirements of a Video text format"
  },
  TextTrackCue(2) {
    [..]
    startTime: '00:03:33.000',
    endTime: '00:04:57.766',
    getCueAsSource(): "Simple WebVTT file",
    getCueAsHTML(): "Simple WebVTT file"
  },
  etc.
}
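With this approach a script has to reconstruct the hierarchy itself. One possible sketch, assuming the tracks are known to be ordered from coarse to fine, nests each finer cue under the coarser cue whose time range contains it. Plain objects with startTime and endTime in seconds stand in for TextTrackCue objects, and nestByContainment is a hypothetical name.

```javascript
// Attach every fine-grained cue to the coarse cue that fully contains it.
function nestByContainment(coarse, fine) {
  return coarse.map((parent) => ({
    ...parent,
    children: fine.filter(
      (cue) => cue.startTime >= parent.startTime && cue.endTime <= parent.endTime
    ),
  }));
}

// Cues of webvtt1.vtt and webvtt2.vtt, with times converted to seconds.
const level1 = [
  { text: "Title Slide", startTime: 0, endTime: 10.7 },
  { text: "Introduction by Naomi Black", startTime: 10.7, endTime: 47.6 },
  { text: "Talk on WebVTT", startTime: 47.6, endTime: 457.9 },
];
const level2 = [
  { text: "Impact of Captions on the Web", startTime: 47.6, endTime: 110.1 },
  { text: "Requirements of a Video text format", startTime: 110.1, endTime: 213 },
  { text: "Simple WebVTT file", startTime: 213, endTime: 297.766 },
  { text: "Styled WebVTT file", startTime: 297.766, endTime: 376.666 },
  { text: "Internationalized WebVTT file", startTime: 376.666, endTime: 457.9 },
];

const nested = nestByContainment(level1, level2);
console.log(nested.map((cue) => cue.children.length)); // [ 0, 0, 5 ]
```

The sketch only works when the finer track really is a strict subdivision of the coarser one, which is exactly the relationship this approach leaves unstated.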


3. Single-track, multiple cues

A third means would be to use the second example, but put all the cues that can be found at different navigation levels into a single track.

As a consequence, it is possible that multiple cues would be active at the same time.

Discussion:
- hierarchical relationship between the cues is unclear
- hierarchical relationship has to be deduced from the timing overlaps, which is very unreliable
- unclear which currently active cue will be chosen for navigation when a certain time is reached
- hierarchical character is lost from the authored file
- hard to communicate the hierarchical notion to AT
- may need to introduce a new attribute to indicate the hierarchical notion
+ easier to deal with hierarchical character in JavaScript

Here is an example WebVTT file for this situation:

WEBVTT

00:00:00.000 --> 00:00:10.700
<h1>Title Slide

00:00:10.700 --> 00:00:47.600
<h1>Introduction by Naomi Black

00:00:47.600 --> 00:07:37.900
<h1>Talk on WebVTT

00:00:47.600 --> 00:01:50.100
<h2>Impact of Captions on the Web

00:01:50.100 --> 00:03:33.000
<h2>Requirements of a Video text format

00:03:33.000 --> 00:04:57.766
<h2>Simple WebVTT file

00:04:57.766 --> 00:06:16.666
<h2>Styled WebVTT file

00:06:16.666 --> 00:07:37.900
<h2>Internationalized WebVTT file


track markup:

  <video src="video.ogv" controls>
    <track src="webvtt.vtt" kind="chapters" label="chapter and subchapter level navigating">
  </video>


And here is how it would be represented in the JS API: [added a possible level attribute]

TextTrackCueList {
  length : 8,
  TextTrackCue(0) {
    [..]
    startTime: '00:00:00.000',
    endTime: '00:00:10.700',
    getCueAsSource(): "Title Slide",
    getCueAsHTML(): "Title Slide",
    level: 1
  },
  TextTrackCue(1) {
    [..]
    startTime: '00:00:10.700',
    endTime: '00:00:47.600',
    getCueAsSource(): "Introduction by Naomi Black",
    getCueAsHTML(): "Introduction by Naomi Black",
    level: 1
  },
  TextTrackCue(2) {
    [..]
    startTime: '00:00:47.600',
    endTime: '00:07:37.900',
    getCueAsSource(): "Talk on WebVTT",
    getCueAsHTML(): "Talk on WebVTT",
    level: 1
  },
  TextTrackCue(3) {
    [..]
    startTime: '00:00:47.600',
    endTime: '00:01:50.100',
    getCueAsSource(): "Impact of Captions on the Web",
    getCueAsHTML(): "Impact of Captions on the Web",
    level: 2
  },
  etc.
}

In this example, we added a level attribute to indicate the hierarchical position.
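If such a level attribute existed, a script could rebuild the tree from the flat cue list. The following is a minimal sketch under that assumption: level is the hypothetical attribute from the listing above, plain objects stand in for TextTrackCue objects, times are in seconds, and buildTree is an illustrative name.

```javascript
// Rebuild a hierarchy from a flat list of cues that carry a level.
function buildTree(cues) {
  // Order by start time; on ties the shallower level comes first so that
  // a parent is seen before the children that start at the same time.
  const sorted = [...cues].sort(
    (a, b) => a.startTime - b.startTime || a.level - b.level
  );
  const root = { level: 0, children: [] };
  const stack = [root]; // chain of currently open ancestors
  for (const cue of sorted) {
    const node = { ...cue, children: [] };
    // Close every open entry that is not shallower than this cue.
    while (stack[stack.length - 1].level >= node.level) stack.pop();
    stack[stack.length - 1].children.push(node);
    stack.push(node);
  }
  return root.children;
}

const flat = [
  { text: "Title Slide", startTime: 0, level: 1 },
  { text: "Introduction by Naomi Black", startTime: 10.7, level: 1 },
  { text: "Talk on WebVTT", startTime: 47.6, level: 1 },
  { text: "Impact of Captions on the Web", startTime: 47.6, level: 2 },
  { text: "Requirements of a Video text format", startTime: 110.1, level: 2 },
];

const tree = buildTree(flat);
console.log(tree[2].children.map((cue) => cue.text));
```

Without the level attribute the tie at 47.6 seconds could not be resolved reliably, which illustrates why timing overlaps alone are a weak basis for the hierarchy.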

Example markup with 3 hierarchical levels

Here we look at an actual example that would be coming from either a TTML or WebVTT document and end up as the same structure in a Web page. The example has 3 hierarchical levels, just to show the principle.

TTML example

TTML markup:

  <?xml version="1.0" encoding="utf-8"?>
  <tt xml:lang="en"
      xmlns="http://www.w3.org/ns/ttml"
      xmlns:ttm="http://www.w3.org/ns/ttml#metadata">
    <head>
      <metadata>
        <ttm:title>Test File</ttm:title>
      </metadata>
    </head>
    <body>
      <div>
        <p begin="0.76s" end="3.45s">
          Chapter 1
        </p>
        <p begin="3.45s" end="8.0s">
          Chapter 2
          <div>
            <p begin="3.45s" end="4.0s">
              Subchapter 1
            </p>
            <p begin="4.0s" end="6.0s">
              Subchapter 2
              <div>
                <p begin="4.0s" end="4.5s">
                  Paragraph 1
                </p>
                <p begin="4.5s" end="5.0s">
                  Paragraph 2
                </p>
                <p begin="5.0s" end="6.0s">
                  Paragraph 3
                </p>
              </div>
            </p>
            <p begin="6.0s" end="8.0s">
              Subchapter 3
            </p>
          </div>
        </p>
        <p begin="8.0s" end="16.0s">
          Chapter 3
        </p>
      </div>
    </body>
  </tt>

WebVTT equivalent example

WebVTT markup:

WEBVTT

00:00:00.760 --> 00:00:03.450
Chapter 1

00:00:03.450 --> 00:00:08.000
Chapter 2
<nav>
  <00:00:03.450>Subchapter 1
  <00:00:04.000>Subchapter 2
  <nav>
    <00:00:04.000>Paragraph 1
    <00:00:04.500>Paragraph 2
    <00:00:05.000>Paragraph 3
  </nav>
  <00:00:06.000>Subchapter 3
</nav>

00:00:08.000 --> 00:00:16.000
Chapter 3

Track markup with both

  <video src="video.ogv" controls>
    <track src="chapters.vtt"  kind="chapters" label="chapter, subchapter and paragraph level navigating">
    <track src="chapters.ttml" kind="chapters" label="chapter, subchapter and paragraph level navigating">
  </video>

Parsed JS representation of either

TextTrackCueList {
  length : 3,
  TextTrackCue(0) {
    [..]
    startTime: '00:00:00.760',
    endTime: '00:00:03.450',
    getCueAsSource(): "Chapter 1",
    getCueAsHTML(): "Chapter 1"
  },
  TextTrackCue(1) {
    [..]
    startTime: '00:00:03.450',
    endTime: '00:00:08.000',
    getCueAsSource():
    "Chapter 2
     <nav>
       <00:00:03.450>Subchapter 1
       <00:00:04.000>Subchapter 2
       <nav>
         <00:00:04.000>Paragraph 1
         <00:00:04.500>Paragraph 2
         <00:00:05.000>Paragraph 3
       </nav>
       <00:00:06.000>Subchapter 3
     </nav>",
    getCueAsHTML(): "Chapter 2
      <ul>
        <li><? target='timestamp' data='00:00:03.450'>Subchapter 1</li>
        <li><? target='timestamp' data='00:00:04.000'>Subchapter 2
          <ul>
            <li><? target='timestamp' data='00:00:04.000'>Paragraph 1</li>
            <li><? target='timestamp' data='00:00:04.500'>Paragraph 2</li>
            <li><? target='timestamp' data='00:00:05.000'>Paragraph 3</li>
          </ul>
        </li>
        <li><? target='timestamp' data='00:00:06.000'>Subchapter 3</li>
      </ul>"
  },
  TextTrackCue(2) {
    [..]
    startTime: '00:00:08.000',
    endTime: '00:00:16.000',
    getCueAsSource(): "Chapter 3",
    getCueAsHTML(): "Chapter 3"
  }
}