Position Statement
"WebVTT in HTML5 for video accessibility and beyond"
Silvia Pfeiffer

The purpose of this statement is to introduce WebVTT [1] as a basis
for discussing the requirements that TV use cases place on
time-synchronized Web applications.

HTML5 offers a generic means to associate time-aligned text or
metadata with audio and video resources through the new <track>
element [2]. In theory, <track> can accept any number of file formats
as input - similar to how <img>, <video> and <audio> can in theory
accept any image, video or audio resource as input. In practice,
however, the choice of file format is restricted by what the browser
vendors will develop support for.

One particular format that has been custom developed for HTML
requirements, and that is being implemented or is planned for
implementation by most modern desktop browsers, is the WebVTT file
format. WebVTT is
short for Web Video Text Tracks. It is a line-based file format that
simply supplies data to the audio or video element by time interval
within the timeline of the media resource. The time intervals are
called "cues". WebVTT thus provides a generic platform for
time-synchronized application use cases around HTML5 audio and video.
The main use cases that motivated the creation of <track> and WebVTT
are accessibility-related: providing text alternatives along the
timeline.
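
To make the line-based cue structure concrete, the following is a
minimal sketch of a parser, not a conforming implementation. It
assumes the commonly published file layout (a "WEBVTT" signature line,
then cue blocks separated by blank lines, each with an optional
identifier, a timing line containing "-->", and payload lines). The
names Cue, parseTimestamp and parseWebVTT are hypothetical and chosen
only for illustration.

```typescript
// One cue: a time interval on the media timeline plus its payload.
interface Cue {
  id: string | null;
  start: number; // seconds
  end: number; // seconds
  settings: string; // raw cue settings, left unparsed here
  text: string;
}

// Convert "hh:mm:ss.ttt" or "mm:ss.ttt" into seconds.
function parseTimestamp(ts: string): number {
  const m = /^(?:(\d{2,}):)?(\d{2}):(\d{2})\.(\d{3})$/.exec(ts);
  if (m === null) throw new Error("bad timestamp: " + ts);
  const hours = Number(m[1] ?? 0);
  return hours * 3600 + Number(m[2]) * 60 + Number(m[3]) + Number(m[4]) / 1000;
}

// Simplified parser (hypothetical helper): reads the signature and the
// cue blocks; blocks without a timing line are skipped.
function parseWebVTT(input: string): Cue[] {
  const blocks = input.replace(/\r\n?/g, "\n").split(/\n{2,}/);
  const header = blocks.shift() ?? "";
  if (header.indexOf("WEBVTT") !== 0) {
    throw new Error("missing WEBVTT signature");
  }
  const cues: Cue[] = [];
  for (const block of blocks) {
    const lines = block.split("\n").filter((l) => l.length > 0);
    let timingIndex = -1;
    for (let i = 0; i < lines.length; i++) {
      if ((lines[i] ?? "").indexOf("-->") >= 0) {
        timingIndex = i;
        break;
      }
    }
    if (timingIndex < 0) continue; // not a cue block
    const timingLine = (lines[timingIndex] ?? "").trim();
    const match = /^(\S+)\s+-->\s+(\S+)(?:\s+(.*))?$/.exec(timingLine);
    if (match === null) continue;
    cues.push({
      id: timingIndex > 0 ? lines[0] ?? null : null,
      start: parseTimestamp(match[1] ?? ""),
      end: parseTimestamp(match[2] ?? ""),
      settings: match[3] ?? "",
      text: lines.slice(timingIndex + 1).join("\n"),
    });
  }
  return cues;
}

// Illustrative file content, invented for this example.
const sample = [
  "WEBVTT",
  "",
  "intro",
  "00:00:01.000 --> 00:00:04.500",
  "Hello, and welcome.",
  "",
  "00:00:05.000 --> 00:00:08.250 align:start",
  "This cue has no identifier",
  "and spans two lines.",
].join("\n");

const sampleCues = parseWebVTT(sample);
console.log(sampleCues.length, sampleCues);
```

In a browser none of this is needed, since the <track> element does
the parsing; the sketch is only meant to show how little structure a
cue carries.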

The use of WebVTT for captions and subtitles has been described in
detail. WebVTT's functionality compares to that of modern TV caption
formats. Positioning and cue size are specified through cue settings.
A subset of CSS has been specified as applicable to WebVTT cue
styling. In this way, WebVTT can also be used outside Web browsers by
applications that do not support a full CSS engine but can implement
support for the small number of styling commands specified.
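
As an illustration of how little machinery cue settings require, here
is a hedged sketch that turns a settings string into a structured
object. The setting names (vertical, line, position, size, align)
follow later published WebVTT drafts and are an assumption relative to
the draft cited in [1]; CueSettings, parsePercentage and
parseCueSettings are hypothetical names. Alignment values and the
"line" value are kept as plain text rather than validated.

```typescript
// Parsed cue settings; every field is optional because a cue may omit it.
interface CueSettings {
  vertical?: "rl" | "lr";
  line?: string; // e.g. "10%" or "-2"; kept as text in this sketch
  position?: number; // percentage, 0..100
  size?: number; // percentage, 0..100
  align?: string; // kept as text, not validated
}

// Return the numeric value of "NN%" if it lies within 0..100.
function parsePercentage(value: string): number | undefined {
  const m = /^(\d+(?:\.\d+)?)%$/.exec(value);
  if (m === null) return undefined;
  const n = Number(m[1]);
  return n >= 0 && n <= 100 ? n : undefined;
}

// Split "name:value" tokens on whitespace; unknown or malformed
// tokens are ignored rather than treated as errors.
function parseCueSettings(settings: string): CueSettings {
  const result: CueSettings = {};
  const tokens = settings.split(/\s+/).filter((t) => t.length > 0);
  for (const token of tokens) {
    const colon = token.indexOf(":");
    if (colon <= 0) continue;
    const name = token.slice(0, colon);
    const value = token.slice(colon + 1);
    if (name === "vertical") {
      if (value === "rl" || value === "lr") result.vertical = value;
    } else if (name === "line") {
      if (value.length > 0) result.line = value;
    } else if (name === "position") {
      const p = parsePercentage(value);
      if (p !== undefined) result.position = p;
    } else if (name === "size") {
      const s = parsePercentage(value);
      if (s !== undefined) result.size = s;
    } else if (name === "align") {
      if (value.length > 0) result.align = value;
    }
  }
  return result;
}

const exampleSettings = parseCueSettings(
  "line:10% position:50% size:80% align:start"
);
console.log(exampleSettings);
```

A renderer without a CSS engine could map these few values directly
onto its own layout primitives, which is the scenario described above.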

The specification of how to use WebVTT for DVD-style chapters has been
detailed recently. It allows for a hierarchy of chapters with
arbitrary depth, which is very useful for navigation purposes. When
made keyboard accessible, this hierarchical access will satisfy the
navigation needs of blind users, and is equally useful to any user.
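
The chapter hierarchy can be sketched as a tree built from a flat
list of cues. The sketch below assumes that nesting is expressed by
time containment, i.e. a chapter cue whose interval lies entirely
within another cue's interval is treated as its sub-chapter; that
rule, and the names ChapterCue, ChapterNode, buildChapterTree and
outline, are assumptions made for illustration.

```typescript
interface ChapterCue {
  start: number; // seconds
  end: number; // seconds
  title: string;
}

interface ChapterNode extends ChapterCue {
  children: ChapterNode[];
}

// Build a chapter tree of arbitrary depth. A cue becomes a child of
// the most recently opened chapter that fully contains it.
function buildChapterTree(cues: ChapterCue[]): ChapterNode[] {
  // Earlier start first; for equal starts, the longer cue first so
  // that a parent always precedes its children.
  const sorted = cues.slice().sort((a, b) => a.start - b.start || b.end - a.end);
  const roots: ChapterNode[] = [];
  const stack: ChapterNode[] = [];
  for (const cue of sorted) {
    const node: ChapterNode = {
      start: cue.start,
      end: cue.end,
      title: cue.title,
      children: [],
    };
    // Close every open chapter that does not fully contain this cue.
    while (stack.length > 0) {
      const top = stack[stack.length - 1];
      if (top !== undefined && top.start <= node.start && node.end <= top.end) {
        break;
      }
      stack.pop();
    }
    const parent = stack.length > 0 ? stack[stack.length - 1] : undefined;
    if (parent !== undefined) {
      parent.children.push(node);
    } else {
      roots.push(node);
    }
    stack.push(node);
  }
  return roots;
}

// Flatten the tree into indented lines, e.g. for a navigation menu.
function outline(nodes: ChapterNode[], depth: number = 0): string[] {
  let lines: string[] = [];
  for (const node of nodes) {
    lines.push(new Array(depth + 1).join("  ") + node.title);
    lines = lines.concat(outline(node.children, depth + 1));
  }
  return lines;
}

// Illustrative chapter cues, invented for this example.
const chapterCues: ChapterCue[] = [
  { start: 0, end: 600, title: "Part 1" },
  { start: 0, end: 120, title: "Introduction" },
  { start: 120, end: 600, title: "Main talk" },
  { start: 300, end: 450, title: "Demo" },
  { start: 600, end: 900, title: "Part 2" },
];

const chapterTree = buildChapterTree(chapterCues);
console.log(outline(chapterTree).join("\n"));
```

A keyboard-accessible navigation menu could be generated from such a
tree, with each entry seeking the media element to the chapter's
start time.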

WebVTT can also be used to provide text descriptions for media
resources. While this has not been specified in depth yet, it is
expected that in the first instance, WebVTT cues for text descriptions
will be purely textual without markup. These can be rendered through
screen readers using browser accessibility APIs in a similar manner to
how ARIA live regions are rendered. Further developments are possible
here, e.g. to add prosody to the voicing, even though that is not yet
a typical feature supported by screen readers.

The <track> element and WebVTT have also been developed with use
cases beyond these mainly accessibility-motivated requirements in
mind. For these use cases there is a catch-all kind of track, which is
called "metadata". It can be used for any type of timed metadata or
timed text use case. Examples of such use cases could be timed and
positioned annotations (similar to how YouTube's annotations work),
timed geo-coordinates, or timed and positioned hyperlinks. The
rendering has to be provided through JavaScript and as such, the way
in which the data is specified will be custom and can take on any
form, including JSON or XML.
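
As a sketch of the "metadata" case, the example below treats each cue
payload as a JSON object carrying timed geo-coordinates and looks up
the position that applies at a given playback time. The payload shape
({"lat": ..., "lon": ...}) and the names MetadataCue, GeoPoint,
activeCuesAt, parseGeoPayload and positionAt are assumptions made for
illustration. In a browser, the script would typically read the
currently active cues from the text track and react to its
"cuechange" event instead of filtering an array by hand.

```typescript
interface MetadataCue {
  start: number; // seconds
  end: number; // seconds
  text: string; // custom payload; JSON in this sketch
}

interface GeoPoint {
  lat: number;
  lon: number;
}

// Cues whose interval covers the given time (end time exclusive).
function activeCuesAt(cues: MetadataCue[], time: number): MetadataCue[] {
  return cues.filter((c) => c.start <= time && time < c.end);
}

// Decode one payload; returns null for anything that is not a JSON
// object with numeric "lat" and "lon" fields.
function parseGeoPayload(text: string): GeoPoint | null {
  let data: unknown;
  try {
    data = JSON.parse(text);
  } catch (err) {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;
  const record = data as { lat?: unknown; lon?: unknown };
  if (typeof record.lat !== "number" || typeof record.lon !== "number") {
    return null;
  }
  return { lat: record.lat, lon: record.lon };
}

// First valid position among the cues active at the given time.
function positionAt(cues: MetadataCue[], time: number): GeoPoint | null {
  for (const cue of activeCuesAt(cues, time)) {
    const point = parseGeoPayload(cue.text);
    if (point !== null) return point;
  }
  return null;
}

// Illustrative metadata track, invented for this example.
const geoTrack: MetadataCue[] = [
  { start: 0, end: 10, text: '{"lat": -33.8568, "lon": 151.2153}' },
  { start: 10, end: 20, text: '{"lat": -33.8688, "lon": 151.2093}' },
  { start: 20, end: 30, text: "not json" },
];

console.log(positionAt(geoTrack, 12));
```

Because the browser does not interpret metadata payloads, validation
of this kind is left entirely to the application.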

A WebVTT working group is currently being formed at the W3C. Given
this understanding of the existing capabilities of HTML5 for
time-synchronized data, this position paper aims to explore what
further standardisation needs can be foreseen.


References:
[1] WebVTT specification:
http://www.whatwg.org/specs/web-apps/current-work/webvtt.html
[2] Track element:
http://www.w3.org/TR/html5/the-iframe-element.html#the-track-element