WebVTT
The Web Video Text Tracks Format

W3C First Public Working Draft
13 November 2014

http://www.w3.org/TR/2014/WD-webvtt1-20141113/

Introduction

WebVTT is a file format for marking up external text tracks.
The primary purpose of WebVTT files is to add subtitles to a <video>.
It is used to display timed text tracks with the HTML5 <track> element.

Other W3C Text Track Formats

The W3C defines other text track formats, which have been finalized as Recommendations (RECs):

TTML1 (DFXP): Timed Text Markup Language / Distribution Format Exchange Profile
A content type that represents timed text media for the purpose of interchange among authoring systems.
TTML content may also be used as a distribution format (broadcasting).
http://www.w3.org/TR/2013/REC-ttml1-20130924/
SMILtext (a text track embedded in SMIL or referenced as an external file)
smilText has been designed as a functional subset of DFXP.
http://www.w3.org/TR/2008/REC-SMIL3-20081201/smil-text.html

Both formats are XML.

Origin and Specification

WebVTT derives from a format originally called WebSRT (Web Subtitle Resource Tracks), which was specified by the WHATWG for the proposed HTML5 <track> element.
The work was picked up by the Web Media Text Tracks Community Group (see the editor's draft specification).
In March 2014 the W3C Timed Text Working Group was chartered to take it onto the Recommendation track.
This document is intended to become a W3C Recommendation.
This document is governed by the 1 August 2014 W3C Process Document.

There was a long discussion about having one or two Timed Text WGs:

The resolution was to merge the two technologies into a single TTWG (specifying two different formats, TTML and WebVTT).
Two co-chairs:
- Nigel Megitt from the BBC (broadcasting)
- David Singer from Apple (browser).
The latter has stepped down after being nominated to the AB, so one chair is currently missing.

Timed text tracks

Using the <track> element from HTML5, we can associate information such as

subtitles for video content
captions for video content
text video descriptions
chapters
and display them synchronized with the media resource.

The type of track is declared with the kind attribute of the <track> element (metadata is another possible kind).

WebVTT Format

WebVTT is a text-based format.
The MIME type of WebVTT is text/vtt.
Filename extension: .vtt
A WebVTT file must be encoded in UTF-8.

WebVTT file structure

WebVTT Body

The file starts with the string "WEBVTT", optionally followed by header text on the same line, and then a blank line:
WEBVTT - This file has cues.
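
The header rule above can be checked in a few lines. This is a minimal sketch, not the normative parser: the function name has_webvtt_header is hypothetical, and it assumes that an optional byte order mark may precede the signature and that header text must be separated from "WEBVTT" by a space or a tab. It does not check for the blank line that follows.

```python
def has_webvtt_header(text: str) -> bool:
    """Return True if the first line is a plausible WebVTT header line."""
    # Assumption: an optional byte order mark may precede the signature.
    if text.startswith("\ufeff"):
        text = text[1:]
    # Normalize line endings, then isolate the first line.
    first_line = text.replace("\r\n", "\n").replace("\r", "\n").split("\n", 1)[0]
    if first_line == "WEBVTT":
        return True
    # Header text is accepted only after a space or a tab.
    return first_line.startswith(("WEBVTT ", "WEBVTT\t"))
```

For example, has_webvtt_header("WEBVTT - This file has cues.\n\n") is accepted, while a file starting with "WEBVTTX" is rejected.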

WebVTT Comment

A comment starts with the string NOTE:
NOTE This is a comment

WebVTT Cues

A cue is a single subtitle block that has a single start time, end time, and textual payload.

A cue consists of five components:

An optional cue identifier, followed by a newline
Cue timings indicating when the cue is shown
(a start and an end time, represented by timestamps)
Optional cue settings to position the cue at an explicit position in the video viewport
The cue payload text, which contains the text or subtitles to be displayed
Optional cue components to add a CSS class, a voice label <v>, italic, bold, etc.

[idstring]
[hh:]mm:ss.ttt --> [hh:]mm:ss.ttt [cue settings]
Text string

Example: a file consisting of the header, a blank line, and then 3 cues (and one comment) separated by blank lines.

WEBVTT

1
00:00:01.000 --> 00:00:10.000
This is the first caption, displaying from 1-10 seconds

2
00:00:12.739 --> 00:00:24.074
This is the second caption.

NOTE This line may not translate well.

3 - Title Cue with settings
00:00:34.159 --> 00:00:35.743 line:0 position:20% size:60% align:start
Third caption
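
The structure of the example above (header, blank lines, comments, cues) can be taken apart with a short routine. This is a rough sketch under simplifying assumptions, not the parsing algorithm of the specification: the name split_cues and the dictionary keys are hypothetical, the first block is assumed to be the header, and malformed blocks are silently skipped.

```python
import re


def split_cues(vtt_text: str) -> list:
    """Split WebVTT text into dicts with 'id', 'timing' and 'payload' keys."""
    normalized = vtt_text.replace("\r\n", "\n").replace("\r", "\n")
    # Blocks are separated by one or more blank (empty) lines.
    blocks = [b for b in re.split(r"\n{2,}", normalized.strip("\n")) if b]
    cues = []
    for block in blocks[1:]:  # blocks[0] is assumed to be the WEBVTT header
        lines = block.split("\n")
        # Comment blocks start with NOTE followed by a space, a tab or nothing.
        if lines[0] == "NOTE" or lines[0].startswith(("NOTE ", "NOTE\t")):
            continue
        cue_id = None
        if "-->" not in lines[0]:
            # A first line without "-->" is the optional cue identifier.
            cue_id = lines[0]
            lines = lines[1:]
        if not lines or "-->" not in lines[0]:
            continue  # not a cue; a real parser would handle this differently
        cues.append(
            {"id": cue_id, "timing": lines[0], "payload": "\n".join(lines[1:])}
        )
    return cues
```

Applied to the example file, this yields three cues; the NOTE block is dropped and the third cue keeps "3 - Title Cue with settings" as its identifier.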

Cue Identifier

The identifier is a name that identifies the cue.
It can be used to reference the cue from a script. It must not contain a newline and cannot contain the string "-->". It must end with a single newline. The specification requires identifiers to be unique within a file; it is common to number them (e.g. 1, 2, 3, ...).

Cue Timings

A cue timing indicates when the cue is shown.
It has a start and end time which are represented by timestamps.
The end time must be greater than the start time, and the start time must be greater than or equal to all previous start times. Cues may have overlapping timings.

If the WebVTT file is being used for chapters (<track> kind is chapters) then the file cannot have overlapping timings.

Each cue timing contains five components:

Timestamp for start time
At least one space
The string "-->"
At least one space
Timestamp for end time
- Which must be greater than the start time

The timestamps must be in one of two formats:

mm:ss.ttt
hh:mm:ss.ttt
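
Both timestamp formats can be handled by one regular expression. The following is an illustrative sketch, not the normative algorithm: the name parse_timestamp_ms is hypothetical, and it assumes that hours use two or more digits, that minutes and seconds stay within 00-59, and that ttt is exactly three digits of milliseconds. It returns integer milliseconds to avoid floating-point rounding.

```python
import re

# Optional hours, then mm:ss.ttt (assumed ranges: mm and ss within 00-59).
_TIMESTAMP = re.compile(r"(?:(\d{2,}):)?([0-5]\d):([0-5]\d)\.(\d{3})")


def parse_timestamp_ms(stamp: str) -> int:
    """Convert 'mm:ss.ttt' or 'hh:mm:ss.ttt' to integer milliseconds."""
    match = _TIMESTAMP.fullmatch(stamp)
    if match is None:
        raise ValueError(f"not a valid timestamp: {stamp!r}")
    hours, minutes, seconds, millis = match.groups()
    total_seconds = int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds)
    return total_seconds * 1000 + int(millis)
```

With this helper, the rule that the end time must be greater than the start time becomes a plain integer comparison.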

Example : Overlapping cue timing examples

00:00:00.000 --> 00:00:10.000
00:00:05.000 --> 00:01:00.000
00:00:30.000 --> 00:00:50.000

Example : Non-overlapping cue timing examples

00:00:00.000 --> 00:00:10.000
00:00:10.000 --> 00:01:00.581
00:01:00.581 --> 00:02:00.100
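
The difference between the two examples can be verified mechanically, which is useful for chapter files where overlaps are not allowed. This sketch assumes the timings have already been converted to (start, end) pairs in milliseconds; the name find_overlaps is hypothetical.

```python
def find_overlaps(timings):
    """Return index pairs (i, j), with i < j, of cues whose time ranges overlap."""
    overlaps = []
    for i, (start_i, end_i) in enumerate(timings):
        for j in range(i + 1, len(timings)):
            start_j, end_j = timings[j]
            # Two ranges overlap when each one starts before the other ends.
            # A cue that starts exactly when another ends does not overlap it.
            if start_i < end_j and start_j < end_i:
                overlaps.append((i, j))
    return overlaps


# The two examples above, in milliseconds.
overlapping = [(0, 10000), (5000, 60000), (30000, 50000)]
non_overlapping = [(0, 10000), (10000, 60581), (60581, 120100)]
```

For the first list the function reports the pairs (0, 1) and (1, 2); for the second it returns an empty list.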

Cue Settings

Cue settings are optional components used to position the cue payload text at explicit positions in the video viewport. This includes whether the text is displayed horizontally or vertically.
There can be zero or more of them, and they can be used in any order so long as each setting is used no more than once.
The cue settings are added to the right of the cue timings. There must be one or more spaces between the cue timing and the first setting and between each setting. A setting's name and value are separated by a colon. The settings are case sensitive, so use lower case as shown.
There are five cue settings:

vertical

Indicates that the text will be displayed vertically rather than horizontally, such as in some Asian languages.

Table 1 - vertical values
`vertical:rl`	writing direction is right to left
`vertical:lr`	writing direction is left to right

line

Specifies where text appears vertically. If vertical is set, line specifies where text appears horizontally.
Value can be a line number
- The line height is the height of the first line of the cue as it appears on the video
- Positive numbers indicate top down
- Negative numbers indicate bottom up
Or value can be a percentage
- Must be an integer (i.e. no decimals) between 0 and 100 inclusive
- Must be followed by a percent sign (%)

Table 2 - line examples
	`vertical` omitted	`vertical:rl`	`vertical:lr`
`line:0`	top	right	left
`line:-1`	bottom	left	right
`line:0%`	top	right	left
`line:100%`	bottom	left	right

position

Specifies where the text will appear horizontally. If vertical is set, position specifies where the text will appear vertically.
Value is a percentage
- Must be an integer (i.e. no decimals) between 0 and 100 inclusive
- Must be followed by a percent sign (%)

Table 3 - position examples
	`vertical` omitted	`vertical:rl`	`vertical:lr`
`position:0%`	left	top	top
`position:100%`	right	bottom	bottom

size

Specifies the width of the text area. If vertical is set, size specifies the height of the text area.
Value is a percentage
- Must be an integer (i.e. no decimals) between 0 and 100 inclusive
- Must be followed by a percent sign (%)

Table 4 - size examples
	`vertical` omitted	`vertical:rl`	`vertical:lr`
`size:100%`	full width	full height	full height
`size:50%`	half width	half height	half height

align

Specifies the alignment of the text. Text is aligned within the space given by the size cue setting if it is set.

Table 5 - align values
	`vertical` omitted	`vertical:rl`	`vertical:lr`
`align:start`	left	top	top
`align:middle`	centred horizontally	centred vertically	centred vertically
`align:end`	right	bottom	bottom

In summary, these cue settings allow you to specify the position and alignment of the cue text. The following options are available:

Setting	Value(s)	Function
vertical	rl | lr	Displays text vertically, with a writing direction of right to left (`rl`) or left to right (`lr`) (e.g. for Japanese subtitles)
line	[-]N (N is 0 or more)	References a particular line number that the cue is to be displayed on. Line numbers are based on the size of the first line of the cue. A negative number counts from the bottom of the frame, positive numbers from the top
	[0-100]%	Percentage value indicating the position relative to the top of the frame
position	[0-100]%	Percentage value indicating the position relative to the edge of the frame where the text begins (e.g. the left edge in English)
size	[0-100]%	Percentage value indicating the size of the cue box. The value is given as a percentage of the width of the frame
align	start | middle | end	Specifies the alignment of the text within the cue. The keywords are relative to the text direction

Note: if no cue settings are set, the position defaults to the middle, at the bottom of the frame.
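
The syntax rules for cue settings (space-separated, name:value, lower case, each used at most once) can be sketched as follows. The name parse_cue_settings is hypothetical, only the five settings described above are accepted, and values are returned as unvalidated strings. Raising an error on bad input is a choice made for this illustration; it is not a statement about how browsers react.

```python
KNOWN_SETTINGS = {"vertical", "line", "position", "size", "align"}


def parse_cue_settings(timing_line: str) -> dict:
    """Extract the cue settings from a cue timing line as a name -> value dict."""
    # Everything after "-->" is the end timestamp followed by the settings.
    _, _, after_arrow = timing_line.partition("-->")
    tokens = after_arrow.split()
    settings = {}
    for token in tokens[1:]:  # tokens[0] is the end timestamp
        name, colon, value = token.partition(":")
        if not colon or not value or name not in KNOWN_SETTINGS:
            raise ValueError(f"invalid cue setting: {token!r}")
        if name in settings:
            raise ValueError(f"cue setting used more than once: {name!r}")
        settings[name] = value
    return settings
```

For the third cue of the earlier example, the result maps line to "0", position to "20%", size to "60%" and align to "start".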

Cue Payload

The payload is where the main information or content is located.
For subtitles, the payload contains the text to be displayed.
The payload text may contain newlines but it cannot contain a blank line, which is equivalent to two consecutive newlines. A blank line signifies the end of a cue.

WebVTT Cue Components

In addition to all this, you can use “WebVTT cue components” to add further information to the actual cue text itself.
These components are similar to HTML elements, and can be used to add semantics and styling to the actual text strings.

A list of the different components available is given below:

Value	Meaning
c	Specifies a CSS class, which follows the `c`, e.g. `<c.className>Cue text</c>`
i	Specifies italic text
b	Specifies bold text
u	Specifies underlined text
ruby	Specifies something similar to HTML5’s `<ruby>` element. Within this component, one or more occurrences of a `<rt>` element are allowed. (See the article "The HTML5 `<ruby>` element in words of one syllable or less".)
v	Specifies a voice label (if provided) that the cue text is being “spoken in”, e.g. `<v Ian>This is useful for adding subtitles</v>`. Note that the voice label won’t be displayed. It’s just there as a styling hook.
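
Because the components in the table above are written like tags, a script that only needs the plain text or the speaker names can treat them with regular expressions. This is a simplified sketch, not a full cue text parser: the names plain_text and voices are hypothetical, character escapes in the payload are not decoded, and anything between "<" and ">" is treated as a tag.

```python
import re

# Any start or end tag, e.g. <i>, </c>, <c.className>, <v Ian>.
_TAG = re.compile(r"</?[^>]*>")
# A voice start tag with an optional class list, e.g. <v Ian> or <v.loud Ian>.
_VOICE = re.compile(r"<v(?:\.[^\s>]*)?[ \t]+([^>]+)>")


def plain_text(payload: str) -> str:
    """Remove all cue component tags and keep only the text."""
    return _TAG.sub("", payload)


def voices(payload: str) -> list:
    """List the voice labels used in the payload, in order of appearance."""
    return [name.strip() for name in _VOICE.findall(payload)]
```

For the payload `<v Ian>This is useful for adding subtitles</v>`, plain_text returns the sentence without tags and voices returns a list containing "Ian".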

Voice Label

Using Voice label

The caption may display the voice (Emo) in addition to the caption text.
The name of the voice can be read by a screen reader, possibly even using a different voice for male or female names.
It offers a hook for styling so that, for example, all captions for Emo could be in blue.

This example uses a voice label for the cue text, Emo.
In addition, a CSS class of question is specified, which can then be used for styling purposes.
A class such as this can be styled in the usual way via CSS attached or defined in the calling HTML page.

Cue-CSS and Voice
00:00:52.000 --> 00:00:54.000 align:start size:15%
<v Emo>I don’t <i>think</i> so. <c.question>You?</c></v>

Note that to style cue text with CSS, you need to use a special pseudo-element selector, for example:

::cue(v[voice="Emo"]) {color:blue}
::cue(i) { font-style: italic }
::cue(.question) { font-size: 2em }

The following properties apply to the '::cue' pseudo-element with no argument; other properties set on the pseudo-element must be ignored:

'color', 'opacity', 'visibility', 'text-decoration', 'text-shadow', the properties corresponding to the 'background' shorthand, the properties corresponding to the 'outline' shorthand, the properties corresponding to the 'font' shorthand (including 'line-height'), and 'white-space'.

Timestamps in cue text

It is also possible to add timestamps to cue text, indicating that different parts occur at different times.
An example of this is shown below:

Cue-paint-on captions

00:00:52.000 --> 00:00:54.000
<c>I don’t think so.</c>
<00:00:53.500><c>You?</c>

This causes all the text to be displayed at the same time, but in supporting browsers you can use the :past and :future pseudo-classes to style text differently depending on whether it is in the past or in the future.
For example:

::cue(c:past) {color:yellow} 
::cue(c:future) {text-shadow: black 0 0 1px;}
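
To see which part of a payload becomes active at which time, the cue text can be split at its inline timestamps. The sketch below is only an illustration: the name split_on_timestamps is hypothetical, and the text before the first inline timestamp is attributed to the start time of the cue, which the caller passes in.

```python
import re

# An inline timestamp such as <00:00:53.500> or <00:20.000>.
_INLINE_TIMESTAMP = re.compile(r"<((?:\d{2,}:)?[0-5]\d:[0-5]\d\.\d{3})>")


def split_on_timestamps(payload: str, cue_start: str) -> list:
    """Return (timestamp, text) pairs, one per timed segment of the payload."""
    # With one capture group, re.split yields: text, stamp, text, stamp, ...
    parts = _INLINE_TIMESTAMP.split(payload)
    segments = [(cue_start, parts[0])]
    for index in range(1, len(parts), 2):
        segments.append((parts[index], parts[index + 1]))
    return segments
```

For a paint-on cue like the one above, starting at 00:00:52.000, this gives one segment from the cue start and a second segment beginning at 00:00:53.500.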

Ruby in cue text

00:00:15.042 --> 00:00:18.042 vertical:rl align:start
<ruby>左<rt>ひだり</rt></ruby>に<ruby>見<rt>み</rt></ruby>えるのは…

00:00:20.417 --> 00:00:21.917 vertical:rl align:start
..…首刈り機

Example of WebVTT file

Example using a CSS style, defined in the header of the HTML page with the ::cue pseudo-element.

::cue(c.dream) {color:#ffff}

WEBVTT


00:00.000 --> 00:14.999
Elephant's <c.dream>Dream</c>

NOTE CSS class, styled with ::cue pseudo-element

00:15.000 --> 00:18.000 align:end line:10%
At the <i>left</i> we can <b>see</b>...

NOTE Relative and percentage based positioning

00:18.167 --> 00:22.000
At the right <00:20.000>we can see the...

NOTE Karaoke style split line

Example using region within the viewport


WEBVTT
Region: id=fred width=50% lines=3 regionanchor=0%,100% viewportanchor=10%,90% scroll=up
Region: id=bill width=50% lines=3 regionanchor=100%,100% viewportanchor=90%,90% scroll=up

00:00:00.000 --> 00:00:20.000 region:fred align:left
Hi, my name is Fred

00:00:02.500 --> 00:00:22.500 region:bill align:right
Hi, I'm Bill

Implementation / Support

Support for the new format is limited but growing.

Google's Chrome and Microsoft's Internet Explorer 10 already support <track> elements with .vtt files for HTML5 video.
Firefox implemented WebVTT in its nightly builds (Firefox 24), but initially it was not enabled by default. As of 24 July 2014, Mozilla has enabled WebVTT in Firefox by default.
YouTube began supporting WebVTT in April 2013.

http://www.w3.org/community/texttracks/

Example of WebVTT references in HTML5

<video controls>
  <source src="elephants-dream.mp4" type="video/mp4">
  <source src="elephants-dream.webm" type="video/webm">
  <track label="English subtitles" kind="subtitles" srclang="en"
         src="elephants-dream-subtitles-en.vtt" default>
  <track label="Deutsche Untertitel" kind="subtitles" srclang="de"
         src="elephants-dream-subtitles-de.vtt">
  <track label="English chapters" kind="chapters" srclang="en"
         src="elephants-dream-chapters-en.vtt">
  <track label="English captions" kind="captions" srclang="en"
         src="elephants-dream-captions-en.vtt">
  <track label="English descriptions" kind="descriptions" srclang="en"
         src="elephants-dream-descriptions-en.vtt">
</video>

Demo

A nice demo that uses class-based styling and timestamps in cue text (karaoke style):

http://www.leanbackplayer.com/test/webvtt.html

Feedback

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy.
Send comments regarding this document to public-tt@w3.org

Resources

https://developer.mozilla.org/en-US/docs/Web/API/Web_Video_Text_Tracks_Format
http://en.wikipedia.org/wiki/SubRip#WebVTT
https://dev.opera.com/articles/an-introduction-to-webvtt-and-track/
http://html5videoguide.net/presentations/WebVTT/#title-slide
http://www.leanbackplayer.com/other/webvtt.html
