From HTML WG Wiki

(This is perhaps in the wrong place in the Wiki. Chaals apologises for that)

This is based on work done by the Media Subgroup of the HTML Accessibility Task Force, Silvia Pfeiffer (Google), John Foliot, Janina Sajka and Charles McCathie Nevile, and was further worked on by the HTML Accessibility Task Force directly.

Note that this version (2015-04-15) has been revised by Chaals, Cynthia Shelly, Léonie Watson and John Foliot, and is yet to be formally reviewed by the Task Force.


This is a proposal to address the need for discovery of video and audio transcripts on HTML5 media elements. It is based on an analysis of use cases for video transcripts and proposes a machine-discoverable, unified approach to realizing them.

The use cases

UC1: A full text transcript for the media asset is provided with the media resource in a separate but linked resource. (also T1 in the Media Accessibility User Requirements)

Publishers that don't have captions often replace them with links to non-timed transcripts because WCAG 2 Success Criterion 1.2.1 explicitly mentions this as a solution. These linked documents are not necessarily in HTML; they may, e.g., use DAISY for navigability, or reproduce WebVTT / TTML captions as a usable transcript.


We sometimes see pages publish the transcript of an event without actual video or audio recordings.


UC2: A full text transcript for the media asset is provided as text on the same page of the media resource. (also T2 in the Media Accessibility User Requirements)

Examples of non-timed transcripts published underneath the video on-page:

UC3: A full text interactive transcript for the media asset is provided as text on the same page with the media resource and scrolls along in sync to the media resource. (also T2 in the Media Accessibility User Requirements)

Publishers that have a timed transcript (e.g. captions) provide an interactive transcript next to/underneath their videos.


Video solution providers:

This is a paradigm that Web developers often implement and often get wrong, so having the Web browser provide it would be exciting new functionality.

Interactive transcripts are particularly useful to blind and vision-impaired users: they can scan through the text in the transcript with a screen-reader and click to activate video playback at a point in time that is of interest to watch the video/audio. This is similar to chapter markers, except that the full text transcript is being used to scan through the video rather than some (typically scant) chapter markers.

Interactive transcripts can be visually distracting. Therefore, browsers may provide an interactive means to allow users to hide interactive transcripts, e.g. by rendering a "minimize" control on the rendered transcript.

UC4: A search engine wants to discover transcripts in order to enhance video search, e.g. by using a snippet of the dialogue as a key for identifying matching videos.

The Requirements

R1: Discoverability - the end user (sighted or otherwise) can discover that there is a transcript available; machines (AT, search engines, syndication) can discover that there is a transcript available.

R2: Choice to consume - the option to consume or not consume the transcript remains in the control of the user.

R3: Rich text transcripts - transcripts should be able to support richer content than flat text, including HTML, WebVTT, TTML, RTF, DAISY or other formats.

This is important to meet some effective sub-requirements:

  • the transcript should be stylable for design aesthetic, including the possibility to include it in the video controls, render it fullscreen, etc.
  • the transcript needs to be embeddable in the video player, rendered full-text on-page, or consumed as a separate page e.g. to avoid downloading a video that isn't going to be watched.

R4: Stand alone transcripts - transcripts need to be available even in browsers that do not support or do not render audio or video elements. In fact, it should be possible to render transcripts without requiring a media element be present on the same page.

R5: Multiple transcripts - transcripts may be available in multiple languages and/or formats so making multiple links available must be possible.
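As an illustration of why R5 matters to a consumer, the following is a minimal sketch of how a user agent might pick one transcript from several linked ones according to the user's language preferences. The function name `choose_transcript`, the dictionary shape, and the simple prefix-matching rule are all assumptions made for this example; nothing in this proposal specifies a selection algorithm.

```python
def choose_transcript(links, preferred_languages):
    """Pick a transcript link for the user's preferred languages.

    `links` is a list of dicts with "href" and "hreflang" keys
    (a hypothetical shape chosen for this sketch).
    """
    for wanted in preferred_languages:
        wanted = wanted.lower()
        for link in links:
            lang = (link.get("hreflang") or "").lower()
            # "fr" matches "fr" and regional variants such as "fr-ca"
            if lang == wanted or lang.startswith(wanted + "-"):
                return link
    # Nothing matched: fall back to the first link, if there is one
    return links[0] if links else None


if __name__ == "__main__":
    available = [
        {"href": "#theText", "hreflang": None},
        {"href": "transcript-fr.html", "hreflang": "fr"},  # placeholder URL
    ]
    print(choose_transcript(available, ["fr", "en"]))
```

Falling back to the first link keeps the transcript reachable even when no language matches, which is in line with R2: the user still gets to decide whether to consume it.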

Nice things to provide

N1: Retrofitting - it should be easy for authors who are already publishing content with transcripts to retrofit their existing pages.

N2: No link duplication - transcript link duplication should be avoided or minimised.

Proposed Solutions

A mechanism to link from a video or audio element to a transcript is necessary and sufficient to meet the requirements.

It would be nice to have an identifiable semantic element that can contain a transcript, as discussed below.

We have analysed 5 possible approaches to discovery of the transcript:

  • A transcript attribute on the video element.
    This is simple to implement but does not easily allow linking to multiple transcripts.
  • Using a link element with a defined value for the rel attribute to link to a transcript.
    This seems the preferred solution, as discussed below.
  • Defining a new value for the kind attribute to use with the track element.
    This seems equivalent to the previous approach, but has attracted pushback, in particular because transcripts are not typically time-based, which is a characteristic of other kinds of track.
  • Using a transcript element with a src attribute to point to a transcript.
    This is semantically equivalent to the previous two solutions. However, because we want to define a transcript container element separately, overloading its semantics with a link seems unnecessarily complex. Minting this element just as a link precludes having a container element later.
  • Using a transcript element with content to embed a full transcript, either as a child of a media element or linking back to the video source.
    Putting the content inside the video element introduces complexities for rendering and backward compatibility, and doesn't allow for linking to externally hosted transcripts.
    Linking back to the video source assumes there is a single video using the same transcript, which is not necessarily the case.

Preferred solution...

Our preferred solution using link rel="transcript" meets the requirements identified, and we believe it is easy to implement across the ecosystem.

A transcript element as discussed below may be used as a container to identify the transcript, although it is not required.

In a page the markup could look like

<video controls>
 <source src="video.rm">
 <!-- A link to a transcript within the same document -->
 <link rel="transcript"
   title="English transcript" href="#theText">
 <!-- A link to an external transcript in French uses hreflang;
      the href value here is a placeholder -->
 <link rel="transcript" hreflang="fr" href="transcript-fr.html"
   lang="fr" title="Transcription en français">
 <track kind="captions" src="YouGetTheIdea?Right" srclang="ru">
</video>
<transcript id="theText">This is the English language ...</transcript>

Note that the HTML spec currently says that "if the rel attribute is used, the element is restricted to the head element", so the link element could not be used within a media element unless this restriction is amended.
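To illustrate the machine discoverability that R1 and UC4 call for, here is a minimal sketch of how a crawler or other tool might find link rel="transcript" elements nested in a media element. It uses Python's standard `html.parser` module; the class and function names, the sample markup, and the `transcript-fr.html` URL are assumptions made for this example, and a real user agent would work on its own DOM rather than re-parsing markup.

```python
from html.parser import HTMLParser


class TranscriptLinkFinder(HTMLParser):
    """Collect <link rel="transcript"> elements nested in <video>/<audio>."""

    MEDIA_TAGS = {"video", "audio"}

    def __init__(self):
        super().__init__()
        self.media_depth = 0  # greater than 0 while inside a media element
        self.links = []       # one dict per transcript link found

    def handle_starttag(self, tag, attrs):
        if tag in self.MEDIA_TAGS:
            self.media_depth += 1
        elif tag == "link" and self.media_depth > 0:
            attributes = dict(attrs)
            # rel holds space-separated tokens; compare case-insensitively
            rel_tokens = (attributes.get("rel") or "").lower().split()
            if "transcript" in rel_tokens:
                self.links.append({
                    "href": attributes.get("href"),
                    "hreflang": attributes.get("hreflang"),
                    "title": attributes.get("title"),
                })

    def handle_endtag(self, tag):
        if tag in self.MEDIA_TAGS and self.media_depth > 0:
            self.media_depth -= 1


def find_transcript_links(markup):
    """Return the transcript links found inside media elements."""
    finder = TranscriptLinkFinder()
    finder.feed(markup)
    finder.close()
    return finder.links


# Hypothetical page fragment; transcript-fr.html is a placeholder URL.
SAMPLE = """
<video controls>
 <source src="video.rm">
 <link rel="transcript" title="English transcript" href="#theText">
 <link rel="transcript" hreflang="fr" title="Transcription en francais"
       href="transcript-fr.html">
</video>
<link rel="stylesheet" href="style.css">
"""

if __name__ == "__main__":
    for link in find_transcript_links(SAMPLE):
        print(link)
```

The sketch only counts links that are children of a media element, which mirrors the proposal's intent that the link associates a transcript with a specific video or audio element rather than with the page as a whole.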

A transcript element

While we do not consider this strictly necessary, we think it would be a useful addition to HTML as a container for a transcript.

  • It provides readily identifiable semantics for code readability
  • It enables easy styling and interaction management of transcripts (e.g. for browsers or extensions, using it as a landmark role, etc)
  • It simplifies having multiple transcripts within a single page such as for a collection of short videos, enabling user agents to know where the end of one transcript is, and only render the relevant content.
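The last point can be sketched in code: with a dedicated container, a tool can resolve a same-document link such as href="#theText" to exactly the text of one transcript, even when several transcripts share a page. This is a hedged illustration using Python's standard `html.parser`; the names `TranscriptExtractor` and `transcript_text` and the sample page are assumptions, and the transcript element itself is only proposed here, not part of HTML.

```python
from html.parser import HTMLParser


class TranscriptExtractor(HTMLParser):
    """Collect the text inside <transcript id="..."> for one target id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # nesting depth of <transcript> while capturing
        self.found = False  # True once the target container has been seen
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "transcript":
            return
        if self.depth > 0:
            self.depth += 1  # a nested transcript inside the target
        elif not self.found and dict(attrs).get("id") == self.target_id:
            self.depth = 1
            self.found = True

    def handle_endtag(self, tag):
        if tag == "transcript" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def transcript_text(markup, href):
    """Resolve a same-document href such as '#theText' to transcript text."""
    if not href.startswith("#"):
        return None  # external transcripts are out of scope for this sketch
    extractor = TranscriptExtractor(href[1:])
    extractor.feed(markup)
    extractor.close()
    if not extractor.found:
        return None
    # Collapse runs of whitespace so the result is a single line of text
    return " ".join("".join(extractor.chunks).split())


# Hypothetical page with two short videos, each with its own transcript.
PAGE = """
<video src="one.webm"><link rel="transcript" href="#t1"></video>
<transcript id="t1">First video: <b>hello</b> world.</transcript>
<video src="two.webm"><link rel="transcript" href="#t2"></video>
<transcript id="t2">Second video text.</transcript>
"""

if __name__ == "__main__":
    print(transcript_text(PAGE, "#t1"))
    print(transcript_text(PAGE, "#t2"))
```

Without a container element, the tool would have to guess where each transcript ends, which is the ambiguity the bullet above describes.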

Feedback from Media TF Face-to-Face (04/15/2015)

  • Despite earlier push-back, many at the meeting thought that the <track> element *would* be a good fit for this (alternatively <source>)
    ddavis: the spec say you can have a list of 0 or more cues
    ... you could use track element now
  • P. Cotton suggests involving others:
    paulc: I would suggest you take another run while you are here. And then send it out to a candidate list of places
    ... public-html, web-and-tv-ig
  • We need to clarify the Best Practices stuff, and how this would work (when it works).

From an accessibility perspective, the name of the child element inside of the video element that directly links to the transcript file is secondary to the need for and final functionality of the element - it can become a bike-shedding exercise.

Using the code sample above, replacing <link...> with <track...> or <source...> would not directly impact assistive technology. Mark Vickers (Comcast) noted that <source> might be a more natural choice, as the transcript is a first-class alternative to the video (rather than a support piece).

Feedback around the introduction of a new landmark-like container element (<transcript>) was generally neutral - no one seemed to have a strong opinion one way or the other.

Other ideas - discussion, why not

<track kind="transcript" src="#theTranscript">

This has attracted pushback because track has otherwise been used to point to resources that synchronise, which is generally not a requirement for transcripts.

<video transcript="#someURL">

While this is simple to implement, it fails to cater nicely for multiple transcripts (R5). Since video often has content already, using an attribute is unnecessary, and the potential problem of copy/paste leading to bad links being transferred can be avoided.

 <transcript src="#aTranscript">

This conflicts with having a transcript container, unless we overload the semantics, which seems complex.

<video src="video" controls>
 <transcript lang="en">
  A transcript embedded in the element
 </transcript>
 <transcript lang="es">
  Una transcripcion del video
 </transcript>
</video>

This doesn't allow for linking to external transcripts easily.

<video id="theVideo" src="video" controls></video>
<transcript for="#theVideo">
 Some transcript
</transcript>

This doesn't work well for external transcripts, where the source video may be moved or shared. There is no nice way to provide multiple links when several copies of the video refer to the same transcript, e.g. where a video is hosted on several services.

Historical notes

Issue 194 asks for a mechanism for associating a full transcript with an audio or video element. It does so by stating some requirements and a single use case, namely a link to an off-page transcript resource.

In the arguments of the different suggested Change Proposals, many different use cases appear, not just the off-page link use case.

The following Change Proposals were made in relation to this issue...

The relevant bug on HTML is: Bug 12964 - <video>: Declarative linking of full-text transcripts to video and audio elements