in Making Audio and Video Media Accessible


Who: Captions (also called “intralingual subtitles”) provide content to people who are Deaf and others who cannot hear the audio. They are also used by people who process written information better than audio.

What: Captions are a text version of the speech and non-speech audio information needed to understand the content. They are displayed within the media player and are synchronized with the audio.

Most are “closed captions” that can be hidden or shown by people watching the video. They can also be “open captions” that are always displayed and cannot be turned off.

Captions and Subtitles

The terms “captions” and “subtitles” are used for the same thing in different regions of the world. This resource uses:

Some regions use subtitles for both the same language as the audio and for the translation. Sometimes they are distinguished as intralingual subtitles (same language) and interlingual subtitles (different language).

Subtitles are implemented the same way as captions. Subtitles/interlingual subtitles are usually only the spoken audio (for people who can hear the audio but do not know the spoken language). They can be a translation of the caption content, including non-speech audio information.

Captions are needed for accessibility, whereas subtitles in other languages are not directly an accessibility accommodation.

Live Captions

Live captions are usually done by professional real-time captioners or Communication Access Realtime Translation (CART) providers. Live captions can be done in-person or remotely. That is, the person doing the captioning/CART does not have to be at the same location as the live action; they can be doing the live captions by listening to the audio over a phone or Internet connection.

If you have live captions and you post a recording, you will probably need to do minor editing for accuracy.

The rest of this page addresses developing captions for pre-recorded media.

Interactive Transcripts from Captions

Caption files are used by some media players to provide interactive transcripts. Interactive transcripts highlight text phrases as they are spoken. Users can select text in the transcript and go to that point in the video.
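The lookup behind an interactive transcript can be sketched in a few lines. This is a minimal illustration, not how any particular media player works: it assumes the caption file has already been parsed into a sorted list of (start, end, text) tuples in seconds, and the names `active_cue_index` and `seek_time_for` are hypothetical.

```python
from bisect import bisect_right

# Hypothetical parsed cues: (start_seconds, end_seconds, text), sorted by start.
cues = [
    (0.0, 2.5, "[upbeat music]"),
    (2.5, 6.0, "Welcome to the workshop."),
    (7.0, 9.0, "Let's get started."),
]


def active_cue_index(cues, playback_time):
    """Return the index of the cue to highlight, or None during a gap."""
    starts = [start for start, _, _ in cues]
    # Last cue whose start time is at or before the playback position.
    i = bisect_right(starts, playback_time) - 1
    if i >= 0 and cues[i][0] <= playback_time < cues[i][1]:
        return i
    return None


def seek_time_for(cues, index):
    """Return the time to jump to when a user selects a transcript phrase."""
    return cues[index][0]


print(active_cue_index(cues, 3.0))  # index of the phrase to highlight
print(seek_time_for(cues, 2))       # where to move playback for the third phrase
```

A player would call the first function as playback advances and the second when the user selects text in the transcript.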


For optimum accessibility, provide a separate caption file of the description of visual information (called audio description, video description, or described video).

Captions and transcripts include the same text, so one can be used to develop the other.
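As a rough illustration of developing a transcript from captions, the sketch below pulls the caption text out of a WebVTT string and joins it into plain text. It is deliberately simplified: it assumes cues are separated by blank lines, and it does not strip inline markup such as voice tags. The function name and sample text are made up for this example.

```python
sample_vtt = """WEBVTT

00:00:00.000 --> 00:00:02.500
[upbeat music]

00:00:02.500 --> 00:00:06.000
Welcome to the workshop.
"""


def webvtt_to_transcript(vtt_text: str) -> str:
    """Collect the text lines of each cue, dropping headers and timings."""
    text_lines = []
    blocks = vtt_text.replace("\r\n", "\n").strip().split("\n\n")
    for block in blocks:
        lines = block.split("\n")
        for i, line in enumerate(lines):
            if "-->" in line:
                # Everything after the timing line is caption text.
                text_lines.extend(l.strip() for l in lines[i + 1:] if l.strip())
                break
        # Blocks without a timing line (the header, notes) are skipped.
    return " ".join(text_lines)


print(webvtt_to_transcript(sample_vtt))
```

The result still needs editing to read well as a transcript, for example adding paragraph breaks and speaker names.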

Does My Media Need Captions?

This section tells you:

WCAG excerpts with links to more information in “Understanding WCAG”:

Skills and Tools

Creating captions requires typing up the audio (“transcribing”) and formatting it in a file with timestamps. Transcribing an audio file is fairly difficult and takes quite a bit of time for people who don’t have the software and skill for it. The file format for captions is simple, yet it’s tedious to add timestamps, especially without software or a service for developing caption files.

Creating high-quality captions requires knowledge of which non-speech audio information should be included in the captions. It’s more art than science — for example, it’s not always clear which non-speech audio information to include and how to communicate it in text.

Even correcting an automatic caption file takes quite a bit of time for people who don’t do it regularly.

However, people who have the software, skills, and experience in developing captions can develop them much faster.

For these reasons, many organizations choose to outsource their captions.

Automatic Captions are Not Sufficient

Automatically-generated captions do not meet user needs or accessibility requirements, unless they are confirmed to be fully accurate. Usually they need significant editing.

There are tools that use speech recognition technology to turn a soundtrack into a timed caption file. For example, some common video websites provide automatic captions. However, often the automatic caption text is wrong and does not match the spoken audio — sometimes in ways that change the meaning (or are embarrassing). For example, missing just one word such as “not” can make the captions contradict the actual audio content.
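One way to see how small such errors can be is to compare automatic caption text against a human-corrected version word by word. The sketch below uses Python's standard `difflib` module; the function name and the sample sentences are invented for illustration, and a comparison like this only helps once a corrected text exists.

```python
import difflib


def word_differences(automatic: str, corrected: str):
    """List word-level differences between automatic and corrected text."""
    auto_words = automatic.lower().split()
    good_words = corrected.lower().split()
    matcher = difflib.SequenceMatcher(a=auto_words, b=good_words, autojunk=False)
    diffs = []
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append(
                (tag, " ".join(auto_words[a1:a2]), " ".join(good_words[b1:b2]))
            )
    return diffs


automatic = "this medication is safe for children"
corrected = "this medication is not safe for children"
print(word_differences(automatic, corrected))
# A single missing word ("not") reverses the meaning of the caption.
```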

Automatic captions can be used as a starting point for developing accurate captions and transcripts.

Creating Captions

Caption File Format

The most common format for captions on the web is WebVTT: The Web Video Text Tracks Format.
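A WebVTT file is plain text: a `WEBVTT` header line, then cues separated by blank lines, each with a timing line (`start --> end`) followed by the caption text. The following sketch builds such a file from timed text. The function names and sample cues are hypothetical, and the sketch covers only the basic structure of the format.

```python
def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as a WebVTT timestamp (hh:mm:ss.ttt)."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def build_webvtt(cues) -> str:
    """Build WebVTT text from (start_seconds, end_seconds, text) tuples."""
    lines = ["WEBVTT", ""]
    for start, end, text in cues:
        lines.append(f"{format_timestamp(start)} --> {format_timestamp(end)}")
        lines.append(text)
        lines.append("")  # a blank line ends each cue
    return "\n".join(lines)


sample_cues = [
    (0.0, 2.5, "[upbeat music]"),
    (2.5, 6.0, "Welcome to the workshop."),
]
vtt_text = build_webvtt(sample_cues)
print(vtt_text)
```

Writing `vtt_text` to a file with a `.vtt` extension, encoded as UTF-8, gives a caption file that media players supporting WebVTT can load.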

Other caption formats are: SRT and Timed Text Markup Language (TTML).
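SRT and WebVTT are structurally similar, so a basic conversion is short. SRT numbers each cue and writes timestamps with a comma before the milliseconds, while WebVTT starts with a `WEBVTT` header and uses a period. The sketch below handles only that common case. The function name is hypothetical, and real files may contain formatting that needs more care.

```python
import re

# Matches an SRT timing line such as "00:00:02,500 --> 00:00:06,000".
TIMING_LINE = re.compile(
    r"(\d{2}:\d{2}:\d{2}),(\d{3}) --> (\d{2}:\d{2}:\d{2}),(\d{3})"
)


def srt_to_webvtt(srt_text: str) -> str:
    """Convert simple SRT text to WebVTT text."""
    out = ["WEBVTT", ""]
    normalized = srt_text.replace("\r\n", "\n").strip()
    for block in re.split(r"\n\s*\n", normalized):
        lines = block.splitlines()
        # Drop the numeric cue counter (WebVTT does not require one).
        if lines and lines[0].strip().isdigit():
            lines = lines[1:]
        if not lines:
            continue
        # Swap the comma for a period in the timing line.
        lines[0] = TIMING_LINE.sub(r"\1.\2 --> \3.\4", lines[0])
        out.extend(lines)
        out.append("")
    return "\n".join(out)


sample_srt = """1
00:00:00,000 --> 00:00:02,500
[upbeat music]

2
00:00:02,500 --> 00:00:06,000
Welcome to the workshop.
"""
print(srt_to_webvtt(sample_srt))
```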

Caption Tools

Most people use software or services to help develop captions. There are several free captioning software programs and online services available.

Several free and fee-based tools create automatic captions that you can use as a starting point. For example, a common video website includes automatic captions and tools for you to edit the captions. You will need to edit automatic captions for accuracy.

If you already have a transcription of the audio as text, there are free tools that will generate a caption file with timestamps. You will need to edit it for line breaks as described in another page of this resource, Transcribing Audio to Text: More on Captions.

Most caption-editing tools can export a plain text transcript.

The screen capture shows one tool for editing captions, in the area underneath the video.

Transcribing Audio to Text

For specific guidance on what to type up, see another page in this resource: Transcribing Audio to Text.

Positioning and Styling Captions

There are options for authors to position and style captions. Support in browsers and other media players is inconsistent and sometimes unreliable. Most web videos just use the player’s default presentation style, which is usually white characters in a black box.
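In WebVTT, for instance, positioning is expressed as cue settings appended to a cue's timing line, such as `line`, `position`, and `align`. The sketch below shows how such a timing line could be assembled. The helper name is hypothetical, and as noted above, whether a given player honors these settings varies.

```python
def timing_line(start: str, end: str, **settings: str) -> str:
    """Build a WebVTT timing line with optional cue settings."""
    parts = [f"{start} --> {end}"]
    parts += [f"{name}:{value}" for name, value in settings.items()]
    return " ".join(parts)


# Ask for this cue to appear on the first line (top) of the video, centered,
# for example to avoid covering on-screen text at the bottom.
top_cue = timing_line("00:00:02.500", "00:00:06.000", line="0", align="center")
print(top_cue)
```

Because support is inconsistent, it is worth checking the result in the players your audience actually uses.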

Some media players enable users to set preferences for how and where captions are displayed, including text style, text size, colors, and position of the captions.
