This wiki has been archived and is now read-only.


From HTML WG Wiki

Multimedia Accessibility <Audio> <Video>


HTML 5 lacks a mechanism to allow people with disabilities to access multimedia content. In particular the new audio and video elements need accessibility evaluation and review.

Multimedia presentations (rich media) usually involve image, sound and motion. This can present accessibility barriers to some people with disabilities, for instance visual impairments, hearing loss, photosensitive epilepsy, cognitive and learning disabilities, attention deficit disorder, and dyslexia.

Visually impaired users can't directly access the visual components of a multimedia presentation. Likewise, users who are deaf or hard of hearing will not be able to directly access auditory information. Motion stimuli can adversely affect people with epilepsy, attention deficit disorder, and dyslexia.

To address the multiple accessibility issues of multimedia, mapping and controlling media assets with some kind of machine-recognizable mechanism would allow users to use the format that best suits their needs. People have different ways of processing information and allowing multimedia users to shift at will among audio, video, graphic, and written media may help make these technologies more accessible.

In an ideal world, the accessibility features would be in the video. In the real world, often they aren't. The page creator may not be able to modify the audio or video. Sometimes this is a matter of not having the video (embedded 3rd party videos) or not having legal authority; sometimes it is just a matter of not knowing how. By all means encourage authors to put the accessibility information within the video. But there needs to be a fallback for cases where that doesn't happen.



How accessibility works for <video> is ISSUE-9 in the Tracker. It has been discussed on public-html and at the September 11, 2008 teleconference.

Regarding fallback, the audio and video sections of the spec currently say:

"Content may be provided inside the video element. User agents should not show this content to the user; it is intended for older Web browsers which do not support video, so that legacy video plugins can be tried, or to show text to the users of these older browser informing them of how to access the video contents. In particular, this content is not fallback content intended to address accessibility concerns. To make video content accessible to the blind, deaf, and those with other physical or cognitive disabilities, authors are expected to provide alternative media streams and/or to embed accessibility aids (such as caption or subtitle tracks) into their media streams."

The editor's rationale is, "...a fundamental principle of how this feature was designed is that any accessibility features and metadata features must be within the video or audio resource, and not in the HTML markup. The hypothesis is that this results in the optimal experience for all users."

A serious concern may be whether the audio/video resource supports accessibility features and, if it does, whether client-side software can extract them. This approach also requires the audio/video file to be retrieved over the network, unless the textual alternatives are at the start (prior to the audio/video data). Retrieving the whole file is problematic for those with slow connections or strict bandwidth quotas.

Advice Request to PFWG

Change Proposals


Related Bugs

Proposed Solutions

Media MultitrackAPI

A Media MultitrackAPI draft proposal from the HTML Accessibility Task Force introduces a JavaScript API for extracting basic information about the tracks contained inside a media resource (audio or video), which may include audio descriptions, sign language video, closed captions or subtitles.

Media TextAssociations

A Media TextAssociations draft proposal from the HTML Accessibility Task Force introduces declarative markup into the audio and video elements of HTML5 to link to external resources that provide text alternatives for different roles, such as captions, subtitles and textual audio descriptions. This includes styling defaults and a resource selection algorithm for when there are alternative resources available.

Explicit Association with Separation of Media Assets and a Preference-Style Selection Mechanism

A clean, semantic, explicit association to transcripts, text descriptions, captions, audio descriptions and/or streams that could be toggled on or off by the end user would be very beneficial, as these items will get lost if they are outside the parent element. We need a way for authors to link items together, within the element. This model would ensure that the linkage is there, and if the author chooses to also provide an in-the-clear linkage to one or more of these support pieces, then this is a win.

Multi-media = multi-modal/multi-sensory so there are many permutations where one or more "modes" may not be available, so we need to try and address each mode as a separate entity as well as the default "combined" or multi-media asset. The most suitable file available should be made available to the user, in accordance with user and user agent preferences and capabilities.

Separation of these media assets is key as they are different from one another. They include but are not limited to:

Some kind of mechanism is needed. Separate attributes for media assets on <video> <audio> could do it; e.g. longdesc, transcript, etc.

In any event ideally all the support pieces should be direct descendants (children) of the parent <video> or <audio>, but unique and separate. In a perfect world all would be supplied, but even when less than perfect a method to provide one or more support pieces should exist.

The client side should always have the right to have "what are my choices (i.e. options)" answered, rather than having to answer, "what are your choices (i.e. preferences)" to the server. The server can still track which option gets exercised. But in the end if the user wants to take the time to browse the versions rather than having one picked in the feed forward processing, they should have that capability.


The following examples hint at some best practices and suggest a way in which not only native players but HTML 5 as a whole could approach multimedia content. Keeping assets separate allows finer-grained control over those pieces, whereas bundling everything together as a single file may mean larger file sizes, more post-processing on the user's end to achieve full access regardless of the AT used, etc.

In an example at Stanford, the source code has params for the video and the caption. JW FLV (an open-source Flash player) works by keeping all "elements" of the total on-screen presentation separate: the .flv (or recently H.264 .mov, if the user has Flash 9) is referenced via a parameter setting, as are a static JPEG used as the opening screenshot, the semi-transparent "logo" (watermark) and the time-stamped transcript (currently .srt, but apparently also DFXP XML, although this seems somewhat tricky at this time).

Because the text transcript remains external to the media (as opposed to "embedded") it becomes easier to re-purpose and share that piece of the "Multi" multi-media asset. It would probably require less user-end processing to output the transcript as Braille (for example) than to require a program or user agent to re-process embedded content in the media asset to "extract" this information and provide the alternative output.

Further benefits could be provided through some relatively simple manipulations. For instance, in a second Stanford example, providing the subtitle is as straightforward as getting a translation. Some simple scripting provides multi-language support. The full transcript is available via the transcript link.

  • Search the transcript for the term 'repackage'.
  • Select that word.
  • The video jumps to that place in the clip.

It is uncertain how this could be as easily achieved with embedded transcripts.
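The search-and-jump behaviour described above is straightforward once the transcript exists as external, time-stamped cues. The following is a minimal sketch and not the Stanford implementation; the `Cue` shape, the `findSeekTime` name and the sample data are assumptions made purely for illustration.

```typescript
// Hypothetical cue shape: one time-stamped transcript entry (times in seconds).
interface Cue {
  start: number;
  end: number;
  text: string;
}

// Return the start time of the first cue whose text contains the term
// (case-insensitive), or null when the term is blank or does not occur.
function findSeekTime(cues: Cue[], term: string): number | null {
  const needle = term.trim().toLowerCase();
  if (needle === "") return null;
  for (const cue of cues) {
    if (cue.text.toLowerCase().indexOf(needle) !== -1) {
      return cue.start;
    }
  }
  return null;
}

// Illustrative data only; not taken from the Stanford transcript.
const sampleCues: Cue[] = [
  { start: 0, end: 4.5, text: "Welcome to the lecture." },
  { start: 4.5, end: 9, text: "We can repackage the content for other uses." },
];

const seekTo = findSeekTime(sampleCues, "repackage");
console.log(seekTo); // 4.5
```

In a browser, a script could assign the returned value to the `currentTime` property of the media element to jump to that point in the clip.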

Provide Reserved rel Attribute Values

Perhaps we can provide some attribute values for the rel attribute along these lines:

  • "longdesc" or "textdesc"


Using the <figure> element:

<figure id="figVideo1">
  <video [...]>[...]</video>
  <a href="..." rel="transcript">Obtain the transcript.</a><br>
  <a href="..." rel="textdesc">Obtain the text description.</a><br>
  <a href="..." rel="download">Download Quicktime version.</a><br>
  <a href="..." rel="download">Download Ogg Theora version.</a>
</figure>

User agents could then present these links as they like, while legacy UAs would just provide the fallback for the video plus a series of links. Plus, if UAs choose not to provide any special handling of the content, at the very least it's still accessible to everyone. The editor's comment on this proposal: "I encourage people to register these rel="" values in the wiki and to try them. I am very interested in what experience with this teaches us. If it turns out to be a good idea, it's definitely something we could add to the spec."
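As a rough sketch of how a user agent or script might pick out the support pieces once such rel values exist, the function below filters a list of links by rel token. The `MediaLink` shape, the `linksWithRel` name and the example.com URLs are hypothetical, and "transcript", "textdesc" and "download" are only the values proposed in this section, not registered link types.

```typescript
// Hypothetical description of one link found inside the figure.
interface MediaLink {
  href: string;
  rel: string; // space-separated tokens, as in the HTML rel attribute
}

// Return the targets of all links whose rel attribute contains the given
// token. Tokens are compared case-insensitively.
function linksWithRel(links: MediaLink[], rel: string): string[] {
  const wanted = rel.toLowerCase();
  return links
    .filter((link) => link.rel.toLowerCase().split(/\s+/).indexOf(wanted) !== -1)
    .map((link) => link.href);
}

// Illustrative data mirroring the figure example; the URLs are placeholders.
const figureLinks: MediaLink[] = [
  { href: "https://example.com/transcript.html", rel: "transcript" },
  { href: "https://example.com/description.html", rel: "textdesc" },
  { href: "https://example.com/video.mov", rel: "download" },
  { href: "https://example.com/video.ogv", rel: "download" },
];

console.log(linksWithRel(figureLinks, "download").length); // 2
```

A user agent that does nothing special with these values still renders ordinary links, which is the fallback behaviour the proposal relies on.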

A Unified Approach to HTML5's Media Specific Elements

Towards A Unified Approach to HTML5's Media Specific Elements, version 0.1 - Gregory Rosmaita.

Automatic Selection of Media Files Based on User Preference

Accessibility for the Media Elements in HTML5 Proposal - DW Singer et al. Summary of email discussion on this proposal:

  • Media queries, though similar, are probably not right for the user-needs matching.
  • We don't need to handle the 'what fallback is shown if no source matches' problem since it's easy to write the HTML so that the case doesn't arise (at least, because of accessibility filtering).
  • Transcripts and other non-temporal annotative information might be wanted both (a) by users also viewing the content (i.e. perhaps non-accessibility related) and (b) accessibility users, so they should not be expressed as an alternative to the media.

Single File With Separate Source Tracks

Employ flexible authoring of video and audio content that does not rely exclusively on the capabilities of the various container formats for alternate tracks.


<source media='<a media query>' >
  <track src='avideofile' >
  <track src='anaudiofile' >
  <track src='acaptionfile' languages='<language metadata>' >
  <track src='asubtitlefile' languages='<language metadata>' >
  <track src='anothersubtitlefile' languages='<language metadata>' >
</source>
<source src='afile2' media='<a media query>' ></source>
<source src='afile3' media='<a media query>' ></source>

Variation using text and category:

<video src="http://example.com/video.ogv" controls>
  <text category="CC" lang="en" type="text/x-srt"></text>
  <text category="SUB" lang="de" type="application/ttaf+xml"></text>
  <text category="SUB" lang="ja" type="application/smil"></text>
  <text category="SUB" lang="fr" type="text/x-srt"></text>
</video>

Variation using text and the role attribute:

<video src="http://example.com/video.ogv" controls>
  <text role="CC" lang="en" type="text/x-srt"></text>
  <text role="SUB" lang="de" type="application/ttaf+xml"></text>
  <text role="SUB" lang="ja" type="application/smil"></text>
  <text role="SUB" lang="fr" type="text/x-srt"></text>
</video>

This solution, like SMIL, turns a user agent (Web browser) into a media parsing and compositing unit. Another approach may be to have such a composition language available on a Web server that knows which tracks it has available for a user agent; the user agent can then communicate with the server to determine which composition it would like. The server then composes the right media file and streams it. This kind of compositing is also required if we want to be able to deliver just fragments of a media file.
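Whether it runs in the user agent or on the server, the selection step implied by the role and lang markup above could look roughly like the following. This is only a sketch under stated assumptions: the `TextAlternative` and `UserPreference` shapes and the `chooseTextAlternative` function are hypothetical, and the matching rule (preferred language first, preferred role within that language) is one plausible choice, not an algorithm taken from any proposal.

```typescript
// Hypothetical in-memory form of the proposed <text> children.
interface TextAlternative {
  role: "CC" | "SUB";
  lang: string;
  type: string;
}

// Hypothetical user preference record.
interface UserPreference {
  wantsCaptions: boolean; // prefers closed captions over plain subtitles
  languages: string[];    // preferred languages, most preferred first
}

// Walk the preferred languages in order. Within a language, return the
// alternative with the preferred role if there is one, otherwise the
// first alternative in that language. Return null when nothing matches.
function chooseTextAlternative(
  available: TextAlternative[],
  pref: UserPreference
): TextAlternative | null {
  const preferredRole = pref.wantsCaptions ? "CC" : "SUB";
  for (const lang of pref.languages) {
    let fallback: TextAlternative | null = null;
    for (const candidate of available) {
      if (candidate.lang.toLowerCase() !== lang.toLowerCase()) continue;
      if (candidate.role === preferredRole) return candidate;
      if (fallback === null) fallback = candidate;
    }
    if (fallback !== null) return fallback;
  }
  return null;
}

// Illustrative data loosely based on the markup above.
const availableText: TextAlternative[] = [
  { role: "CC", lang: "en", type: "text/x-srt" },
  { role: "SUB", lang: "de", type: "application/ttaf+xml" },
  { role: "SUB", lang: "fr", type: "text/x-srt" },
];

const chosen = chooseTextAlternative(availableText, {
  wantsCaptions: false,
  languages: ["fr", "en"],
});
console.log(chosen); // the French subtitle entry
```

Returning null rather than a default leaves it to the caller to decide whether to show nothing or to offer the user the full list of choices.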


Add SMIL as yet another extension format supported within HTML5 or referenced for embedding by the src attribute (or finally bite the bullet and add an IE8 / 'XML namespaces' compatible namespace mechanism to HTML5). SMIL has switches to provide alternative formats, including text. One can switch between anything (video, audio, text, image, animation, graphics), whatever might be appropriate to replace the inaccessible format. Maybe HTML should at least have a meta element for any element as a container for structured meta information, or alternatively elements to structure the meta information in the head and the ability to point to fragments to identify the target of the structured meta information. This could already help authors to provide useful descriptions for problematic content, for whatever reason the content is considered problematic to understand. Using the SMIL Meta information module and RDF, SMIL provides similar possibilities too.



  • This would provide the needed machine-recognizable association.
  • IE currently doesn't support ARIA.
  • Part of the intention of ARIA is to be replaced by host-language functionality once the @aria-* attributes are no longer necessary. It may be better to have a more general global equivalent attribute that allows a list of IDREFs to label an arbitrary element.
  <video ... aria-describedby="transcript"> ... </video>
  <legend>Video Title (<a href="#transcript">view transcript</a>)</legend>
  <div id="transcript"> ... </div>

Object Element

It has been suggested that the video and audio elements are not really even needed, and that the big advancement would be to bring the attributes and DOM interfaces for those elements to the already supported object element. The source element might also be useful. Instead of using HTML:img, HTML5:audio, HTML5:video, HTML5:canvas etc., authors can simply always use HTML:object referencing a document in one of the formats SVG, SMIL, DocBook etc. to get access to an advanced method to provide alternatives and meta information, or they can use a compound document, but this is not really the intended concept of HTML5? Obviously such an approach increases the requirements for a user agent to provide the information at all, because more formats are involved than necessary for this purpose. (It has also been suggested that the desirability of these elements is not the subject of this issue; the accessibility of multimedia is, and that the impact of the use of the object element is not clear.)

Introducing DOM and UA UI to access media metadata

Allowing DOM and UI access to media-immanent metadata (UANormAndDOMForMediaPropeties) for audio, video, still images and other non-text media would provide some access to text equivalents, even when the HTML author fails to provide them.

In the Clear Hyperlink with No Preference-Style Selection Mechanism

An ordinary link to a transcript or full text description can be provided with the <a> element, possibly by including the link within the video caption. This solution does not provide a synchronized equivalent alternative or an on/off toggle. However, placing the link within the video caption provides an association with the video itself. It is easy for authors and some end users. But forcing people to put all the content directly into a page with big visible links simply won't fly. The zillions of dollars put into techniques for hiding stuff in the misguided hope that it still appears for the people who need it (image replacements, stuff positioned off-screen, and so on) show that designers would rather expend considerable effort and money than actually make the data visible. However, transcripts are useful for more people, including non-disabled people, and incorporating such a feature into the design of a page is not difficult and is common practice.


  <video src="/videos/diary-2008-09-11.mkv"></video>
  <legend>My school holiday trip to Cairns.
    <a href="/transcripts/diary-2008-09-11.html">Read transcript</a>.</legend>

Configuring the Chosen Media Resource

Some media systems have the capability of configuring the media resource. For example, in track-based resources such as MP4 files or QuickTime movies, tracks can be enabled or disabled. In other systems such as 3GPP DIMS, MPEG4 LASeR or Adobe Flash, the media resource may have embedded scripts etc. which could react to user preferences, as well as presenting affordances to configure the resource. Since it could be tedious for a user with an accessibility need to configure every resource manually, it may be desirable for resources to adapt, when possible, based (at least initially) on user preferences. Conversely, it is frustrating for end users of all stripes to be locked in to a specific default. The toggling of captioning from the player controls would be beneficial. For instance, a hearing user may want to toggle captions on or off, perhaps even mid-stream in a media presentation, so ease of user choice should be paramount.
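The two behaviours described here, adapting initially to stored preferences and still letting the user toggle a track mid-stream, can be sketched as two small pure functions. All names below (`TrackKind`, `ResourceTrack`, `AccessPreferences`, `applyPreferences`, `toggleTrack`) are hypothetical and do not correspond to any container format or player API.

```typescript
// Hypothetical track kinds; real container formats name these differently.
type TrackKind = "main-video" | "main-audio" | "captions" | "audio-description";

interface ResourceTrack {
  kind: TrackKind;
  enabled: boolean;
}

// Hypothetical user preferences, e.g. stored once in the user agent.
interface AccessPreferences {
  captions: boolean;
  audioDescription: boolean;
}

// Compute the initial track state from stored preferences so the user
// does not have to configure every resource by hand.
function applyPreferences(tracks: ResourceTrack[], prefs: AccessPreferences): ResourceTrack[] {
  return tracks.map((track): ResourceTrack => {
    if (track.kind === "captions") {
      return { kind: track.kind, enabled: prefs.captions };
    }
    if (track.kind === "audio-description") {
      return { kind: track.kind, enabled: prefs.audioDescription };
    }
    return { kind: track.kind, enabled: track.enabled };
  });
}

// Flip one kind of track on or off, e.g. from a player control mid-stream.
function toggleTrack(tracks: ResourceTrack[], kind: TrackKind): ResourceTrack[] {
  return tracks.map((track): ResourceTrack => ({
    kind: track.kind,
    enabled: track.kind === kind ? !track.enabled : track.enabled,
  }));
}

// Illustrative resource with four tracks.
const initialTracks: ResourceTrack[] = [
  { kind: "main-video", enabled: true },
  { kind: "main-audio", enabled: true },
  { kind: "captions", enabled: false },
  { kind: "audio-description", enabled: false },
];

// Preferences switch captions on at load time...
const adapted = applyPreferences(initialTracks, { captions: true, audioDescription: false });
// ...and the user can still switch them off again later.
const afterToggle = toggleTrack(adapted, "captions");
```

Both functions return new arrays instead of mutating their input, so the default state of the resource remains available if the user wants to reset.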

User Roles and Cases

Visually Impaired Users

Visually impaired users can’t directly access the visual components of a multimedia presentation.

Deaf or Hard of Hearing

Users who are deaf or hard of hearing will not be able to directly access auditory information.

People with Photosensitive Epilepsy

People with photosensitive epilepsy can have seizures triggered by flickering or flashing. Photosensitive epilepsy is a form of epilepsy that is triggered by visual stimuli, such as flickering or high-contrast oscillating patterns, and it is believed that around 3% to 5% of people with epilepsy are susceptible to photosensitive material. Seizures are usually triggered where the flicker rate is between 16 Hz and 25 Hz, although it is not uncommon for seizures to be triggered by flicker rates between 3 Hz and 60 Hz. The condition most commonly affects children, usually develops between the ages of 9 and 15 years, and is most prevalent in females.

People with Attention Deficit Disorder/Dyslexia

Movement/animation may be extremely distracting to people with attention deficit disorder/dyslexia. User control of motion is needed.

People with Cognitive or Learning Disabilities

Some people have good cognitive skills, while others may have more artistic and creative skills. Learning disabilities are problems that affect the brain's ability to receive, process, analyze, or store information. Howard Gardner set out his theory of multiple intelligences in his 1983 book, "Frames of Mind: The Theory of Multiple Intelligences". A toggle switch for different modalities would aid in providing learning style preferences (Visual, Aural, Read/Write, Kinesthetic, multi-modal). For instance, images, sound, motion, captioning, description, etc. all have a place in learning preferences.

Types of Learning Preferences


Visual

This preference includes the depiction of information in images. These people visualize information in their "mind's eye" in order to remember something.

Aural

This perceptual mode describes a preference for information that is "heard". People with this modality report that they learn best from lectures, recordings, etc.

Read/Write

This preference is for information displayed as words.

Kinesthetic

By definition, this modality refers to the perceptual preference related to the use of experience and practice (simulated or real), although such an experience may invoke other modalities.

Multi-modal

A fifth category for those with strong preferences in several modalities: multi-modal.

Others Who Benefit from Accessible Multimedia

People who have:

  • Hardware limitations.
  • Software limitations.
  • Connectivity limitations. Slow dial-up connections (still common in rural areas as well as outside of the U.S.)
  • Other universality use cases.

Definitions of Media Assets


Transcript

Transcript = the verbatim text of the audio track. This transcript can then be time-stamped to become the caption asset and, if/when time-stamped with DFXP (an XML language), it can also be sieved through an XSLT style sheet to generate on-screen HTML. It can also be used for search, both for finding appropriate assets (large-scale search) and, as some ongoing experiments at a more local level show, for searching within a longer media asset for key words, then "clicking" on a key word instance and being taken to that point in the video.
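To make the idea of a time-stamped transcript concrete, the sketch below parses the common .srt layout (an optional counter line, a timing line of the form HH:MM:SS,mmm --> HH:MM:SS,mmm, then one or more text lines) into cues with times in seconds. It is a minimal illustration rather than a complete SubRip parser; the `TimedCue` shape, the function names and the sample text are assumptions.

```typescript
// Hypothetical cue shape produced by the parser (times in seconds).
interface TimedCue {
  start: number;
  end: number;
  text: string;
}

// Matches an SRT timing line such as "00:00:01,500 --> 00:00:04,000".
const TIMING = /^(\d{2}):(\d{2}):(\d{2}),(\d{3})\s+-->\s+(\d{2}):(\d{2}):(\d{2}),(\d{3})/;

// Convert four captured parts (h, m, s, ms) starting at offset to seconds.
function toSeconds(parts: string[], offset: number): number {
  return (
    Number(parts[offset]) * 3600 +
    Number(parts[offset + 1]) * 60 +
    Number(parts[offset + 2]) +
    Number(parts[offset + 3]) / 1000
  );
}

function parseSrt(srt: string): TimedCue[] {
  const cues: TimedCue[] = [];
  // Cues are separated by one or more blank lines.
  const blocks = srt.replace(/\r\n/g, "\n").split(/\n{2,}/);
  for (const block of blocks) {
    let timing: RegExpExecArray | null = null;
    const textLines: string[] = [];
    for (const line of block.split("\n")) {
      if (timing === null) {
        // Lines before the timing line (the counter) are skipped.
        timing = TIMING.exec(line);
      } else if (line.trim() !== "") {
        textLines.push(line.trim());
      }
    }
    if (timing === null) continue; // not a cue block
    cues.push({
      start: toSeconds(timing, 1),
      end: toSeconds(timing, 5),
      text: textLines.join(" "),
    });
  }
  return cues;
}

// Illustrative transcript; the wording is invented for this example.
const sampleSrt = [
  "1",
  "00:00:01,500 --> 00:00:04,000",
  "Welcome to the lecture.",
  "",
  "2",
  "00:01:01,250 --> 00:01:05,000",
  "We can repackage the content",
  "for other uses.",
].join("\n");

const parsed = parseSrt(sampleSrt);
console.log(parsed.length); // 2
```

Once the cues are plain data like this, the same structure can feed captions, on-screen HTML, or key word search within the media asset.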

Text Description

Text Description = the transcript of audio content that includes spoken words and also non-spoken sounds like sound effects. It is akin to the notation used for a play. Text descriptions are often not verbatim accounts of the spoken word, but contain additional descriptions, explanations, or comments that may be useful. They are helpful to the deaf, hard of hearing and many others. They allow anyone who cannot access content from web audio or video to access a text version instead.

Audio Description

Audio Description = narration, spoken out loud, that explains visual details. This allows visual content to be accessible to the blind or those with vision impairments. Audio descriptions of visual content are important if, for example, a video provides content that is relevant to the overall understanding of the video but is not available or recognizable through the default audio already present. For example, an audio description can take a movie and talk you through it: the narrator tells you everything that is happening on the screen that you cannot figure out just from the soundtrack. Information that is presented exclusively visually needs an audio description, and this audio description needs to be synchronized with the presentation. Audio descriptions describe items that take place visually which are vital for the complete understanding of a multimedia presentation. The descriptions are part of the audio track and are inserted during lulls in the audio conversation. A transcript does not provide an equivalent experience, as the presentation's message is dependent upon the simultaneous interaction between its audio and video portions. Including extended audio descriptions, which pause the video, may also be a consideration.


Captioning

Captioning = the process of capturing the spoken word as text and displaying that text at the same time the words are spoken. Captions are needed for prerecorded audio content in synchronized media. For more info visit Understanding SC 1.2.2.

Related Considerations


There is sometimes mention of high-contrast media; they can be useful for people with partial disabilities, for example. In high-contrast video 'important' material is more clearly visible, backgrounds are uncluttered when possible etc. High-contrast audio is similar; it strives to make the semantically important audio more clearly heard while minimizing background music, noises, and so on.

Policies, Guidelines, and Law


Accessibility Task Force Media Meetings

Media Meetings Prior to Task Force Formation
