Recommendations for accessible captions in 360 degree video

Final Community Group Report

Frances Baum
Meryl K. Evans
Christopher Patnoe (Google LLC)
Howard A. Rosenblum (National Association of the Deaf)
GitHub: w3c/immersive-captions-cg


This document presents findings of the Immersive Captions Community Group after initial research into the requirements for captioned content in virtual and augmented reality. The group intends to produce design guidance starting from these findings.

Status of This Document

This specification was published by the Immersive Captions Community Group. It is not a W3C Standard nor is it on the W3C Standards Track. Please note that under the W3C Community Final Specification Agreement (FSA) other conditions apply. Learn more about W3C Community and Business Groups.

This is a Community Group Final Report from the Immersive Captions Community Group.

GitHub Issues are preferred for discussion of this specification.

1. Introduction

1.1 The Need

Technology moves at an increasingly fast pace, yet accessibility has often been an afterthought, if it is considered at all. Immersive video experiences, still in their infancy, give us an opportunity to ensure accessibility is built in from the start. It is our hope that this shift brings innovation to captioning in all relevant areas of technology. After all, the World Health Organization (WHO) projects that by 2050 nearly 2.5 billion people will have some degree of hearing loss and at least 700 million will require hearing rehabilitation. Over 1 billion young adults are at risk of permanent, avoidable hearing loss due to unsafe listening practices.

XR experiences are moving toward the mainstream, and there’s a growing realization within the tech industry of the importance of accessibility in all emergent technology. “XR” is the abbreviation for extended reality, an umbrella term covering both virtual reality (starting with the Sensorama in 1956) and augmented reality (first demonstrated at the USAF Armstrong Laboratory by Louis Rosenberg in 1992). XR is powering the widespread reach of immersive technologies, and with the support of the tech industry, accessibility is finally being discussed before the technology has reached real adoption.

Though there is a need for innovative captioning across all immersive experiences, this paper will focus on immersive and 360 video. In virtual reality (VR), interesting questions arise about how to provide captioning effectively because, for example, the source of a sound in the virtual experience may not be in the viewer’s field of view. In that scenario, a deaf participant would not understand what is happening if there are no captions or if the captions have significant limitations.

1.2 Who We Are

In July 2019, Cornell Tech and Verizon Media sponsored the inaugural XR Access Symposium held in New York City. A collection of tech companies, educators, creators, and advocates attended breakout sessions and presentations on accessibility in XR. Several of the participants banded together in an effort to create immersive captioning.

Christopher Patnoe of Google volunteered to found and chair the community group and connected with Judy Brewer of the World Wide Web Consortium (W3C). Together they formed the Immersive Captions Community Group (ICCG) under the Web Accessibility Initiative (WAI). The group is a broad coalition of people who identify as d/Deaf, hard of hearing, and hearing. We are made up of industry (Google, Meta Platforms, formerly known as Facebook, etc.), advocates (National Association of the Deaf), educators (Brandeis University, Gallaudet University, Rochester Institute of Technology, University of Salford), and technologists.

We have spent the past two years exploring immersive technology and captioning opportunities. Our goal has been to break the boundaries of the little black box inherited from TV and find new ways to make virtual reality more inclusive. This paper represents our narrow but critical effort to document our insights and ideas for accessible 360 video, and the seeds of this effort will impact future research.

1.3 Aligned efforts

No important work is ever done in a vacuum. Other organizations are also engaged in this discussion, including the XR Access initiative that started this effort. There is also the EU-funded ImAc (Immersive Accessibility) project, whose research was core to our work.

At this time in the W3C, there are other initiatives working in the caption space such as:

1.4 Disclaimers

This group does not intend to prescribe how captions should look; we have taken great care to avoid doing so. Instead, we have tried to focus on what functionality will solve problems of situational awareness and to provide recommendations on functionality that will make the experience of deaf and hard of hearing viewers more equal through technology. It is important to remember that there is no single solution that will work for everyone, so it is ideal to offer flexibility and tools that enable users to customize the functionality to meet their own needs.

3. Physical Considerations

3.1 Vestibular Disorders

Immersive experiences can be challenging for those with vestibular disorders, and the additional movement of captions, including headlocked captions, can be difficult for some users. One study found that an estimated 70 percent of deaf and hard of hearing children with sensorineural hearing loss have a vestibular disorder. Vestibular disorders affect many people, not just those who are deaf or hard of hearing: the Vestibular Disorders Association reports that more than 35 percent of US adults aged 40 and older experience vestibular dysfunction at some point in their lives. It is always a better user experience to let viewers have control over any motion.

3.2 Mental Processing

Meaningful understanding of audiovisual information requires mental processing of both aural and visual information, as well as of the connections between them. Deaf and hard of hearing viewers often have to choose between watching the visuals or their aural-to-visual translation, the captions. Whichever they choose, some information may be lost; they rarely have enough time to perceive and understand both. Consequently, information needs to be presented in a way that gives sufficient time to process both effectively. In addition, viewers often encounter visual noise, such as line-of-sight interference, obstruction, or poor lighting. Visual noise tends to be a mere annoyance for hearing users, but it can significantly interfere with visual access for deaf and hard of hearing users.

Reading captions when there are multiple speakers can be very challenging for caption readers because speakers can rapidly take turns in any order. This is especially true in immersive video, where viewers have full control of their field of view (FOV) but may not know where to focus their attention. The caption reader is usually focused on reading the captions and cannot anticipate who will speak next. As a result, the reader usually looks back and forth between the captions and the speakers and becomes tired or distracted. Additionally, readers can feel left out of the conversation.

While the traditional method of inserting the speaker’s name in captions for speaker identification has been shown to be useful for viewers who are hard of hearing, studies have also found that viewers who were deaf did not view it as useful. One reason may be that captions that stay in one location do not show the location of the speaker. Furthermore, if the speaker moves around, this creates distance between the visual information and the captions, which forces viewers to move their eyes or head constantly. As a result, viewers will focus only on the captions or only on the visuals, to their detriment.

Viewers who listen to speech often use captions to confirm or correct what they heard, because captions that remain on screen for several seconds let them review the spoken information. Generally, these viewers prefer to have several lines of captions. On the other hand, viewers who do not listen to speech prefer to have 2-3 lines of text to minimize scanning time for words as they read.

4. Where We Are Going

The features discussed in this paper are a small portion of the ideas that were discussed to address the many issues that arise from adding immersive captions to 360 media. Before we can find solutions to unexplored issues, we need to share our findings with other groups and get input from the W3C groups that were involved in producing this report.

As we continue our research, there are questions that we were unable to answer. This technology is new and we will continue to learn. Here are some of the questions that need to be answered before wide adoption is possible. There is a logical order of operations for the work to be done.

Regarding the technology, there are questions about which base specification to build on. If the goal is consistent interoperability in VR, we would seek a unified version of the spec so that we avoid the myriad of captioning formats we have to deal with today in 2D video across the internet. Some interesting candidates are file formats such as SubRip (.srt), Timed Text Markup Language (.ttml), and WebVTT (.vtt).
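For reference, a minimal WebVTT file already carries timing and 2D placement per cue. The cue settings shown here (position and line) are part of the WebVTT specification; the dialogue itself is invented for illustration:

```
WEBVTT

00:00:01.000 --> 00:00:04.000 position:50% line:90%
[Narrator] Welcome to the observation deck.
```

What no existing format carries natively is a position on the full viewing sphere, which is precisely the gap a unified immersive caption spec would need to close.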

Once we understand the specification, we should settle on the changes themselves. How do we describe caption placement and other changes when working in a 360 environment? Our tool uses latitude/longitude, and this works well, but it may not integrate nicely into all of the formats. And of course, as features are developed, further changes to the spec will be needed.
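As an illustration of the latitude/longitude approach, the sketch below maps a caption anchor given in degrees to a direction vector on the viewing sphere. The coordinate convention (y up, negative z forward, as in WebXR) and the function name are assumptions made for this example; the group's tool may use a different mapping.

```python
import math

def latlong_to_direction(lat_deg: float, long_deg: float) -> tuple:
    """Map a caption anchor given as latitude/longitude (degrees) to a
    unit direction vector, assuming a y-up, -z-forward coordinate system.

    lat_deg:  elevation above the horizon (+90 = straight up)
    long_deg: rotation from the viewer's initial forward direction
              (positive values rotate to the viewer's right)
    """
    lat = math.radians(lat_deg)
    lon = math.radians(long_deg)
    return (
        math.cos(lat) * math.sin(lon),   # x: to the viewer's right
        math.sin(lat),                   # y: up
        -math.cos(lat) * math.cos(lon),  # z: forward is -z
    )
```

A renderer would place the caption quad a fixed distance along this vector; an anchor of (0, 0) maps to (0.0, 0.0, -1.0), directly ahead of the viewer's initial orientation.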

Once we have agreed on the spec and how to adapt it, we need to have an authoring environment that supports it. Thanks to the prototyping tool created by Chris Hughes, we have an excellent place to start. But as anyone who builds prototypes understands, it takes a great deal of effort to go from proof of concept to shipping product. This Community Group would like to see the tool open sourced, and we welcome anyone who is willing and able to participate.

Similarly, we would like to share our findings with scholarly and academic publishing platforms, as well as technology magazines and publishers to get feedback and create opportunities for research collaborations. We would like to collaborate with our industry partners to explore integrating our features into current VR platforms, devices, and operating systems. This will allow us to fine-tune proposed solutions as well as explore new ideas.

Another interesting way of getting user feedback is planning hackathons based around our findings. During these sprint-like events, experts can explore ways of improving our features and demonstrate new ways of using these findings to make 360 media accessible to all. The opportunity to innovate on existing captioning formats shines brightly!

Surely, the more feedback we get, the more refined our solutions will be. Discussing our findings with partners and other passionate minds will invigorate this opportunity to innovate and close the gap between existing captions and the needs of emerging display technologies.

A. Acknowledgements

A.1 Credits

We would like to thank Facebook Technologies, LLC for donating Oculus Go devices to the ICCG participants who did not have a head-mounted VR display. This gave the ICCG a common platform for prototyping and experimentation.

We would also like to thank Chris Hughes, University of Salford, for building an immersive captions prototyping tool to support the ICCG’s exploration of immersive captioning approaches, formats and styles.

Finally, we would like to thank Deaf Services of Palo Alto (DSPA) for providing the ASL interpreters, and Google for paying for their services at every meeting. We also want to thank the talented interpreters themselves, who have been critical to our work.

A.2 ICCG contributors

Chair: Christopher Patnoe, Google LLC

Contributing participants: