Media in XR

From Accessible Platform Architectures Working Group

XAUR Alternative Content Technologies


This is a preliminary draft and has not yet been reviewed by any W3C WAI task force or working group.

These requirements have been lifted from the MAUR and edited, with the intent of making them relevant for XR applications. All requirements with the [XR] suffix have been reviewed and edited.

ACTION: Bring these to APA for gap analysis review.


This document is part of a modular approach to accessibility user requirements in XR and is focused here on media. The approach is presented with the view that it can also be extended to other aspects of XR such as content, environment, objects, movement, and interaction.

Described Video in XR

Described video contains descriptive narration of key visual elements designed to make visual media accessible to people who are blind or visually impaired. The descriptions include actions, costumes, gestures, scene changes or any other important visual information that someone who cannot see the screen might ordinarily miss. Descriptions are traditionally audio recordings timed and recorded to fit into natural pauses in the program, although they may also briefly obscure the main audio track (see the section on extended descriptions for an alternative approach).

  • [DV-XR-1] Provide an indication that descriptions for video are available, and details on their status (active/non-active).
  • [DV-XR-2] Render descriptions in a time-synchronized manner, using the primary media resource as the timebase master, and ensure that tracking between descriptions is maintained.
  • [DV-XR-3] Support multiple description tracks (e.g., discrete tracks containing different levels of detail).
  • [DV-XR-4] Support recordings of high-quality speech, or video audio, as a track of the media resource or as an external file.
  • [DV-XR-5] Allow the author to independently adjust the volumes and panning of the audio description and original soundtracks where these are available as separate audio channel resources.
  • [DV-XR-6] Allow the user to independently adjust the volumes and panning of the audio description and original soundtracks (where these are available as separate audio channel resources), based on different preferences for volume related to each audio.
  • [DV-XR-7] Permit smooth changes in volume rather than stepped changes.
  • [DV-XR-8] Allow the author to provide fade and pan controls to be accurately synchronized with the original soundtrack and any alternate tracks.
  • [DV-XR-9] Allow the author to use a codec which is optimized for voice only, rather than requiring the same codec as the original soundtrack.
  • [DV-XR-10] Allow the user to select from among different languages of descriptions, if available, even if they are different from the language of the main soundtrack.
  • [DV-XR-11] Support the simultaneous playback of both the video description track and primary audio resource tracks so that either may be directed to separate outputs. Where a screen reader or other assistive technology is present, support separation, panning and volume control of audio output from both the described video track and the primary audio resource tracks.
  • [DV-XR-12] Allow the user to relocate the pan location of the various audio tracks within the audio field, with the user setting overriding the author setting. The setting should be re-adjustable as the media plays.
  • [DV-XR-13] Support metadata, such as copyright information, usage rights, language, etc.
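The volume and panning requirements above ([DV-XR-5] to [DV-XR-7] and [DV-XR-12]) can be sketched in a few lines. The function names and the equal-power curve below are illustrative assumptions, not anything the requirements mandate; in a web runtime this would typically be delegated to the Web Audio API (e.g., StereoPannerNode and GainNode).

```python
import math

def equal_power_gains(pan):
    """Map a pan position in [-1.0, 1.0] (full left .. full right) to
    (left_gain, right_gain) on an equal-power curve, so perceived loudness
    stays roughly constant as a track is relocated in the audio field.
    Illustrative sketch only."""
    angle = (pan + 1.0) * math.pi / 4.0  # 0 .. pi/2
    return math.cos(angle), math.sin(angle)

def volume_ramp(start, end, steps):
    """Return a linear sequence of gain values from start to end, so a
    volume change is applied gradually rather than as one stepped jump
    ([DV-XR-7])."""
    return [start + (end - start) * i / (steps - 1) for i in range(steps)]
```

A player could keep one `equal_power_gains` pair per track so the description and primary soundtrack are panned independently, with the user's setting overriding the author's, and apply `volume_ramp` across successive audio blocks for smooth fades.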

Text video description in XR

Text video descriptions (TVDs) are delivered to the client as text and rendered locally by assistive technology such as a screen reader or a Braille device. This can have advantages for screen-reader users who want full control of the preferred voice and speaking rate, or other options to control the speech synthesis.

  • [TVD-XR-1] Support presentation of text video descriptions through a screen reader, Braille device and/or modified print with playback speed control, voice control and synchronization points within the video.
  • [TVD-XR-2] TVDs need to be provided in a format that contains the following information:

      • start time and text per description cue (the duration is determined dynamically, though an end time could provide a cut point);
      • optionally, speech-synthesis markup to improve the quality of the description (existing speech-synthesis markups include SSML and the CSS 3 Speech Module);
      • accompanying metadata providing labeling for speakers, language, etc., and visual style markup (see the section on Captioning).

  • [TVD-XR-3] Where possible, provide a text or separate audio track privately to those who need it in a mixed-viewing situation, e.g., through headphones or other output.
  • [TVD-XR-4] Where possible, provide options for authors and users to deal with the overflow case: continue reading, stop reading, and pause the video. User preference should override authored option.
  • [TVD-XR-5] Support the control over speech-synthesis playback speed, volume and voice, and provide synchronization points with the video. Ensure tracking is maintained between multiple audio outputs.
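As an illustration only (the requirements above do not mandate any particular format), a WebVTT cue can carry most of the information listed under [TVD-XR-2]: a start time, an optional end time as a cut point, cue text, and a voice tag labeling the speaker. Track-level metadata such as language is carried by the embedding document.

```
WEBVTT

NOTE Illustrative text video description track; WebVTT is one candidate format.

1
00:00:12.000 --> 00:00:15.000
<v Narrator>A narrow corridor opens into a sunlit atrium.
```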

ACTION: Add a section on real-time text (RTT).


Extended Video Descriptions

  • [EVD-XR-1] Support detailed user control as specified in [TVD-XR-4] for extended video descriptions.
  • [EVD-XR-2] Support automatically pausing the video and main audio tracks in order to play or control a lengthy description found in an alternate track.
  • [EVD-XR-3] Support resuming playback of the video and main audio tracks when the description or other content found in an alternate track is finished.

Clean Audio

  • [CA-XR-1] Support clean audio as a separate, alternative audio track from other audio-based alternative media resources, including the primary audio resource.
  • [CA-XR-2] Support the synchronization of multitrack audio either within the same file or from separate files (preferably both).
  • [CA-XR-3] Support separate volume control and panning of the different audio tracks.
  • [CA-XR-4] Support pre-emphasis filters, pitch-shifting, and other audio-processing algorithms.

Content Navigation in XR

Granularity Levels

Navigating ancillary content

  • [CN-XR-1] Provide a means to structure media resources so that users can navigate them by semantic content structure. Support keeping all media alternatives synchronized when users navigate.
  • [CN-2] The navigation track should provide for hierarchical structures with titles for the sections.
  • [CN-3] Support both global navigation by the larger structural elements of a media work, and also the most localized atomic structures of that work, even though authors may not have marked-up all levels of navigational granularity.
  • [CN-4] Support third-party provided structural navigation markup.
  • [CN-5] Keep all content representations in sync, so that moving to any particular structural element in media content also moves to the corresponding point in all provided alternative media representations (captions, described video, transcripts, etc) associated with that work.
  • [CN-6] Support direct access to any structural element, possibly through URIs.
  • [CN-7] Support pausing primary content traversal to provide access to such ancillary content in line.
  • [CN-8] Support skipping of ancillary content in order to not interrupt content flow.
  • [CN-9] Support access to each ancillary content item, including with "next" and "previous" controls, apart from accessing the primary content of the title.
  • [CN-10] In bilingual texts, support displaying both the original and translated texts on screen, with both highlighted, line by line, in sync with the audio narration.
  • [CN-SR-11] Support navigation and interaction modalities of assistive technology within XR applications, such as content browsing and navigation.
  • [CN-SR-12] Avoid breaking or introducing new interaction models unless absolutely necessary, and/or inform the user of the shift.
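A minimal sketch of [CN-XR-1], [CN-2] and [CN-5], assuming a hypothetical `NavNode` structure and per-track seek callbacks (both names invented here for illustration): navigating to a titled structural element seeks every registered alternative representation to the same point, keeping all media alternatives synchronized.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class NavNode:
    """A titled section in a hierarchical navigation track ([CN-2]):
    a start time in seconds plus optional child sections."""
    title: str
    start: float
    children: list["NavNode"] = field(default_factory=list)

def seek_all(node: NavNode, tracks: dict[str, Callable[[float], None]]) -> float:
    """Seek every registered representation (captions, described video,
    transcript, ...) to the node's start time, so all alternatives stay
    in sync when the user navigates ([CN-5])."""
    for seek in tracks.values():
        seek(node.start)
    return node.start
```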


Captioning

For people who are deaf or hard-of-hearing, captioning is a prime alternative representation of audio. Captions are in the same language as the main audio track and, in contrast to foreign-language subtitles, render a transcription of dialog or narration as well as important non-speech information, such as sound effects, music, and laughter. Historically, captions have been either closed or open. Closed captions have been transmitted as data along with the video but were not visible until the user elected to turn them on, usually by invoking an on-screen control or menu selection. Open captions have always been visible; they had been merged with the video track and could not be turned off.

  • [CC-1] Render text in a time-synchronized manner, using the media resource as the timebase master.
  • [CC-2] Allow the author to specify erasures, i.e., times when no text is displayed on the screen (no text cues are active).
  • [CC-3] Allow the author to assign timestamps so that one caption/subtitle follows another, with no perceivable gap in between.
  • [CC-4] Be available in a text encoding.
  • [CC-5] Support positioning in all parts of the screen, either inside the media viewport or in a determined space next to it. This is particularly important when multiple captions are on screen at the same time and relate to different speakers, or when obscuring in-picture text is to be avoided.
  • [CC-6] Support the display of multiple regions of text simultaneously.
  • [CC-7] Display multiple rows of text when rendered as text in a right-to-left or left-to-right language.
  • [CC-8] Allow the author to specify line breaks.
  • [CC-9] Permit a range of font faces and sizes.
  • [CC-10] Render a background in a range of colors, supporting a full range of opacity levels.
  • [CC-11] Render text in a range of colors. The user should have final control over rendering styles like color and fonts; e.g., through user preferences.
  • [CC-12] Enable rendering of text with a thicker outline or a drop shadow to allow for better contrast with the background.
  • [CC-13] Where a background is used, it should be possible to keep the caption background visible even in times where no text is displayed, such that it minimizes distraction. However, where captions are infrequent the background should be allowed to disappear to enable the user to see as much of the underlying video as possible.
  • [CC-14] Allow the use of mixed display styles (e.g., mixing paint-on captions with pop-on captions) within a single caption cue or in the caption stream as a whole.
  • [CC-15] Support positioning such that the edge of the captions is a sufficient distance from the nearest screen edge to permit readability (e.g., at least 1/12 of the total screen height above the bottom of the screen, when rendered as text in a right-to-left or left-to-right language).
  • [CC-16] Use conventions that include inserting left-to-right and right-to-left segments within a vertical run (e.g. Tate-chu-yoko in Japanese), when rendered as text in a top-to-bottom oriented language.
  • [CC-17] Represent content of different natural languages. In some cases the inclusion of a few foreign words forms part of the original soundtrack, and thus needs to be so indicated in the caption. Also allow for separate caption files for different languages and on-the-fly switching between them. This is also a requirement for subtitles. See also [CC-20].
  • [CC-18] Represent content of at least those specific natural languages that may be represented with [Unicode 3.2], including common typographical conventions of that language (e.g., through the use of furigana and other forms of ruby text).
  • [CC-19] Present the full range of typographical glyphs, layout and punctuation marks normally associated with the natural language's print-writing system.
  • [CC-20] Permit in-line mark-up for foreign words or phrases.
  • [CC-21] Permit the distinction between different speakers.

Further, systems that support captions must:

  • [CC-22] Support captions that are provided inside media resources as tracks, or in external files.
  • [CC-23] Ascertain that captions are displayed in sync with the media resource.
  • [CC-24] Support user activation/deactivation of caption tracks.
  • [CC-25] Support both edited and verbatim captions when available.
  • [CC-26] Support multiple tracks of foreign-language subtitles including multiple subtitle tracks in the same foreign language.

NOTE: These different-language "tracks" can be provided in different resources.

  • [CC-27] Support live-captioning functionality.
  • [CC-28] Enable the bounding box of the background area to be extended by a preset distance relative to the foreground text contained within that background area.
  • [CC-XR-29] Allow the user to place captions according to their preference in XR environments.
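The erasure and no-gap timing requirements ([CC-2], [CC-3]) can be checked mechanically. The sketch below is illustrative only; the 40 ms threshold for "no perceivable gap" is an assumption for the example, not a value taken from any specification.

```python
def classify_junctions(cues, threshold=0.04):
    """Label the junction between each pair of consecutive cues, given as
    (start, end) times in seconds: 'seamless' when one caption follows the
    previous with no perceivable gap ([CC-3]), 'erasure' when the author
    has left a span with no active text cue ([CC-2]).
    threshold (seconds) is an assumed value for illustration."""
    labels = []
    for (_, prev_end), (next_start, _) in zip(cues, cues[1:]):
        gap = next_start - prev_end
        labels.append("seamless" if gap < threshold else "erasure")
    return labels
```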

Enhanced captions/subtitles

  • [ECC-1] Support metadata markup for (sections of) timed text cues.
  • [ECC-2] Support hyperlinks and other activation mechanisms for supplementary data for (sections of) caption text.
  • [ECC-3] Support text cues that may be longer than the time available until the next text cue, thus providing overlapping text cues.
  • [ECC-4] Support timed text cues that are allowed to overlap with each other in time and be present on screen at the same time (e.g., those that come from the speech of different individuals). Also support timed text cues that are not allowed to overlap, so that playback will be paused in order to allow users to catch up with their reading.
  • [ECC-5] Allow users to define the reading speed and thus define how long each text cue requires, and whether media playback needs to pause sometimes to let them catch up on their reading.
  • [ECC-XR-7] Allow the user to export captioned/subtitled tracks for later viewing outside of an XR session.
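A user-defined reading speed ([ECC-5]) can be turned into a pause decision with simple arithmetic. This is a sketch under assumed parameters; the words-per-minute unit and the function name are illustrative choices, not part of the requirement.

```python
def cue_schedule(text, available_seconds, wpm=180.0):
    """Estimate how long a text cue needs at the user's reading speed, and
    how long playback must pause so the user can catch up on their reading
    ([ECC-5]). Returns (needed_seconds, pause_seconds). wpm is the
    user-configured reading speed in words per minute (assumed unit)."""
    words = len(text.split())
    needed = words / wpm * 60.0
    pause = max(0.0, needed - available_seconds)
    return needed, pause
```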

Sign Translation

  • [SL-1] Support sign-language video either as a track as part of a media resource or as an external file.
  • [SL-2] Support the synchronized playback of the sign-language video with the media resource.
  • [SL-3] Support the display of sign-language video either as picture-in-picture or alpha-blended overlay, as parallel video, or as the main video with the original video as picture-in-picture or alpha-blended overlay. Parallel video here means two discrete videos playing in sync with each other. It is preferable to have one discrete <video> element contain all pieces for sync purposes rather than specifying multiple <video> elements intended to work in sync.
  • [SL-4] Support multiple sign-language tracks in several sign languages.
  • [SL-5] Support the interactive activation/deactivation of a sign-language track by the user.
  • [SL-XR-6] Allow the user to export a Sign Translation track for later viewing outside of an XR session.


Transcripts

  • [T-1] Support the provision of a full text transcript for the media asset in a separate but linked resource, where the linkage is programmatically accessible to AT.
  • [T-2] Support the provision of both scrolling and static display of a full text transcript with the media resource, e.g., in an area next to the video or underneath it, which is also AT accessible.
  • [T-3] Allow the user to customize the visual rendering of the full text transcript, e.g., font, font size, foreground and background color, line, letter, and word spacing.
  • [T-XR-4] Allow the user to export a text transcript for later viewing outside of an XR session.