Understanding WebXR and Accessibility Architecture Issues

It is a challenge to understand the complex technical architecture and processes behind how XR (Virtual, Augmented and Mixed Reality) environments are currently rendered.

To make these environments accessible and provide a quality user experience it is important to first understand the nuances and complexity of accessible user interface design and development for the 2D web. Any attempt to make XR accessible also needs to be based on meeting the practical user needs and requirements of people with disabilities.

This document aims to bridge these worlds by highlighting some fundamental issues and indicating some potential solutions.

The DOM and Accessibility Tree

Most accessibility professionals and developers are familiar with the concept of the DOM, or Document Object Model, and the Accessibility Tree that is generated from it. The DOM is the tree representation of the semantic content contained within the HTML page, or injected via .jsx templates or web components.

This DOM then generates a separate accessibility tree, via the platform accessibility API used by the browser (such as IAccessible2, UIAutomation, NSAccessibility, UIAccessibility, etc.), in order to expose any role, state and property values to Assistive Technologies. Successfully making 2D documents accessible is the process of ensuring this accessibility tree is populated with useful and accurate information for the Assistive Technology to consume and provide a quality user experience. Accessibility APIs are implemented at the operating system level, and each is specific to its own platform.
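As a rough illustration of this mapping, the following TypeScript sketch derives a simplified accessibility tree from a mock DOM. The types MockElement and AXNode, the IMPLICIT_ROLES table and the function buildAXTree are hypothetical names invented for this example; they are not part of any browser or platform API, and real browsers perform a far richer computation.

```typescript
// Hypothetical, simplified stand-ins for DOM elements and accessibility nodes.
interface MockElement {
  tag: string;
  attrs?: Record<string, string>;
  text?: string;
  children?: MockElement[];
}

interface AXNode {
  role: string;
  name: string;
  states: string[];
  children: AXNode[];
}

// A tiny subset of implicit HTML roles (assumption: real mappings are much larger).
const IMPLICIT_ROLES: Record<string, string> = {
  button: "button",
  h1: "heading",
  input: "textbox",
  nav: "navigation",
};

function buildAXTree(el: MockElement): AXNode {
  const attrs = el.attrs ?? {};
  // An explicit ARIA role overrides the implicit one; unknown tags become "generic".
  const role = attrs["role"] ?? IMPLICIT_ROLES[el.tag] ?? "generic";
  // In this sketch aria-label wins over text content when computing the name.
  const name = attrs["aria-label"] ?? el.text ?? "";
  const states: string[] = [];
  if (attrs["aria-pressed"] === "true") states.push("pressed");
  if ("disabled" in attrs) states.push("disabled");
  return { role, name, states, children: (el.children ?? []).map(buildAXTree) };
}

const page: MockElement = {
  tag: "div",
  children: [
    { tag: "h1", text: "Settings" },
    { tag: "button", text: "Mute", attrs: { "aria-pressed": "true" } },
    { tag: "canvas" }, // drawn content: no role and no name for AT to use
  ],
};

const axTree = buildAXTree(page);
```

Note how the canvas entry ends up with a generic role and an empty name, which is the gap the rest of this document is concerned with.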

XR presents interesting challenges for accessibility, as the content of these dynamically rendered environments is often rapidly changing and therefore needs semantically rich descriptions that go well beyond current accessibility requirements for 2D documents or media.

XR also pushes the limits of the current DOM to Accessibility Tree model, as new mechanisms may be needed to expose new semantic models to Assistive Technology, with much faster API call dynamics and interaction. We may find that the current DOM to Accessibility Tree model is a rather limited mechanism for conveying XR semantics to various AT, so this model may need to be expanded to include other methods of delivering semantics to Assistive Technology.

If different semantic delivery models are developed, their outputs may need to be combined and 'routed' to multiple devices, such as braille or haptic devices, to aid comprehension and support the user in the XR environment.
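One way to picture this combining and routing is sketched below in TypeScript. Everything here is an assumption made for illustration: SemanticMessage, OutputChannel and route are invented names, and the rendering strings merely stand in for calls to real speech, braille or haptic devices.

```typescript
// Hypothetical types: neither SemanticMessage nor OutputChannel is a real API.
type Modality = "speech" | "braille" | "haptic";

interface SemanticMessage {
  source: "accessibilityTree" | "sceneGraph";
  label: string;
  distanceMetres?: number;
}

interface OutputChannel {
  modality: Modality;
  render(msg: SemanticMessage): string;
}

const channels: OutputChannel[] = [
  { modality: "speech", render: (m) => `say: ${m.label}` },
  { modality: "braille", render: (m) => `braille: ${m.label.slice(0, 40)}` },
  {
    modality: "haptic",
    // Stronger pulse for nearer objects; no pulse if the distance is unknown.
    render: (m) =>
      m.distanceMetres === undefined
        ? "haptic: none"
        : `haptic: pulse ${Math.max(0, 1 - m.distanceMetres / 10).toFixed(1)}`,
  },
];

// Combine messages from several semantic sources and fan them out to the
// modalities the user has enabled.
function route(messages: SemanticMessage[], enabled: Modality[]): string[] {
  const out: string[] = [];
  for (const msg of messages) {
    for (const ch of channels) {
      if (enabled.indexOf(ch.modality) !== -1) out.push(ch.render(msg));
    }
  }
  return out;
}

const routed = route(
  [{ source: "sceneGraph", label: "Table, 2 metres ahead", distanceMetres: 2 }],
  ["braille", "haptic"],
);
```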

One Accessibility API to rule them all?

It is important to clarify that not all accessibility considerations are addressed by accessibility APIs alone. These APIs feed assistive technologies such as screen readers and speech input, for example, but there are other accessibility issues that need to be addressed in other ways.

There is an architectural issue regarding to what extent existing assistive technologies and their underlying APIs should be extended to include XR environments, and to what extent the necessary accessibility features should be implemented directly in the libraries and components used to build XR applications.

Different Rendering Types

When creating a web page using HTML the author doesn't have to say to the browser 'draw this pixel at this point, at that point and join the dots'. The author will tell the browser that a certain component is needed in a particular place, and by declaring they want an input box, checkbox, or heading the browser will render it. This is declarative authoring, where the author is not imperatively asking the graphics processor (GPU) to draw pixels in a particular location.

There are two types of rendering: declarative rendering, such as HTML markup for a button or radio button, and imperative rendering, where content is drawn 'pixel-by-pixel' depending on many factors, such as where content needs to be drawn in a certain view or where the user's field of vision is currently resting.

There are different APIs in use that perform these functions, such as Vulkan and Apple's Metal, but one of the main JavaScript APIs for 3D rendering is WebGL. WebGL currently has the ability to load 'assets' such as semantic scene graphs (more below), which are potentially very useful for accessibility as a delivery mechanism for semantics.

Declarative rendering adds content to the DOM, and this is displayed in the browser depending on what the user is doing or how they are interacting. Asynchronous updates, for example, or data-driven changes, such as with React type applications, will redraw the browser view and update the content in the DOM. This content 'data snapshot' then feeds and regenerates the Accessibility Tree, via the browser's accessibility API, and this is what talks to Assistive Technology.

For declarative rendering, the style and positioning of these HTML elements is determined by the user agent, such as the browser, via the rules declared in CSS.

Imperative rendering is the opposite of declarative rendering: the content of various 'buffers' is sent to the GPU, along with commands that draw to the screen pixel by pixel. This presents challenges from an accessibility perspective as these pixels have no semantic layer, or presence.

The 2D Canvas API is an example of this imperative rendering, and has similar challenges. Some of these were partially addressed by hit testing for accessibility, and the Shadow DOM. Both of these, while partially addressing some accessibility requirements, were not full or adequate engineering solutions.

NOTE: WebGL has many buffers, for example the drawing buffer (also called the back buffer) and the display buffer (also called the front buffer). The term 'buffer' refers to memory used for drawing, storing data and more, but essentially buffers provide the rendering 'context' that is then drawn in the 2D canvas. These 3D imperative drawing APIs are standardized not through W3C but through Khronos (WebGL).
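The contrast can be made concrete with a small TypeScript sketch. It is not how a browser or GPU actually works: the Widget type, the fillRect helper and the one-byte-per-pixel framebuffer are simplifications invented for this example. The point is only that a declarative description keeps its role and label, while a pixel buffer keeps nothing but colour values.

```typescript
// Declarative: the author states WHAT is wanted; semantics travel with it.
interface Widget {
  kind: "button" | "checkbox";
  label: string;
  x: number;
  y: number;
  width: number;
  height: number;
}

// Imperative: the author states HOW to colour pixels; only colour survives.
function fillRect(
  pixels: Uint8Array, stride: number,
  x: number, y: number, w: number, h: number, value: number,
): void {
  for (let row = y; row < y + h; row++) {
    for (let col = x; col < x + w; col++) {
      pixels[row * stride + col] = value;
    }
  }
}

const WIDTH = 8;
const HEIGHT = 4;
const framebuffer = new Uint8Array(WIDTH * HEIGHT); // one byte per pixel

const okButton: Widget = { kind: "button", label: "OK", x: 1, y: 1, width: 3, height: 2 };

// A hypothetical "user agent" can draw the widget AND still knows its role and label.
fillRect(framebuffer, WIDTH, okButton.x, okButton.y, okButton.width, okButton.height, 255);
const exposedToAT = { role: okButton.kind, name: okButton.label };

// Given only the framebuffer, all that can be recovered is which pixels are lit.
const litPixels = framebuffer.reduce((n, p) => n + (p === 255 ? 1 : 0), 0);
```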

How to make imperatively rendered content accessible?

Imperatively rendered content is currently placed within a canvas element. This presents a substantial accessibility challenge for 2D or 3D content in the browser, as the canvas element does not provide any semantic information about its drawn objects in terms of their role, state or properties. As there are no declarative semantics, canvas content is simply not exposed to Assistive Technology or accessibility APIs 'out of the box'.

This way of imperatively rendering content has serious limitations for the delivery of suitable semantics to Assistive Technology. Therefore an approach outlined below was taken to provide a method of declaratively providing semantics such as role, state or property information. This approach is called 'hit testing'.

There are also tricks such as adding focus rings for keyboard only users, with various useful methods, and the addition of ARIA roles, states and properties. All these approaches aim to address the lack of inherent support for the requirements of people with disabilities or users of AT, within imperatively drawn canvas type content.

When the hit testing method fails authors are advised to populate the canvas with fallback content that can be parsed by AT. A major problem is that there is no effective binding between what has focus on screen within the drawn region and the fallback content. So often this fallback is just a static representation of some aspects of the canvas region, but it is not dynamic and provides no sense of dynamic context for the AT user.
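The binding problem described above can be illustrated with a minimal TypeScript sketch in which each drawn region carries its own role and label, so that a pointer or focus position can be resolved to a description. The HitRegion type and the hitTest and describeAt functions are hypothetical; they do not correspond to any current canvas API.

```typescript
// Hypothetical structure: each drawn region carries the semantics that
// describe it, so focus and description can stay in sync.
interface HitRegion {
  id: string;
  x: number;
  y: number;
  width: number;
  height: number;
  role: string;
  label: string;
}

const regions: HitRegion[] = [
  { id: "play", x: 10, y: 10, width: 40, height: 20, role: "button", label: "Play" },
  { id: "volume", x: 60, y: 10, width: 80, height: 20, role: "slider", label: "Volume" },
];

// Hit test: which region, if any, contains the pointer position?
// Later regions are treated as drawn on top, so search from the end.
function hitTest(px: number, py: number): HitRegion | undefined {
  for (let i = regions.length - 1; i >= 0; i--) {
    const r = regions[i];
    if (r && px >= r.x && px < r.x + r.width && py >= r.y && py < r.y + r.height) {
      return r;
    }
  }
  return undefined;
}

// What a fallback layer could announce for a pointer position.
function describeAt(px: number, py: number): string {
  const hit = hitTest(px, py);
  return hit ? `${hit.label}, ${hit.role}` : "no interactive content";
}
```

For instance, describeAt(15, 15) resolves to the Play button, while a position outside every region yields the "no interactive content" string.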

NOTE: Hit testing also has a parallel place in XR, and it will be interesting to further discuss the intersection between the approach outlined in the WebXR Hit Testing Explainer and XR accessibility requirements. Will the older hit testing method be sufficient? How can this approach be iterated to be more successful, or are there other, better approaches?

What is the WebXR Device API?

The WebXR API supports rendering 3D or immersive content to hardware equipment that supports VR/AR. These may include 3D headsets and mobile phones that support augmented reality using their onboard graphics, gyroscope, accelerometer and the device cameras.

WebGL and WebXR API - how do they relate?

How does the rendering engine, such as WebGL, relate to the WebXR Device API? WebGL is a graphics API that takes imperative data and turns it into pixels on a screen. WebXR manages the context and provides the information to WebGL on 'where' and 'how' to draw those pixels. WebXR uses WebGL as its rendering mechanism, but alternative mechanisms may be supported in the future (i.e., this is an intentional point of extensibility).

WebXR and the Frustum or FOV

A frustum is the part of a solid, such as a cone or pyramid, between two parallel planes cutting the solid, especially the section between the base and a plane parallel to the base. In the context of XR this frustum is a cropped pyramid with its top at your eyes, or the Field of View (FOV). The FOV is the main area of visual focus at any point in time in the immersive environment. The frustum and FOV are similar concepts, but the frustum also includes the near plane and far plane.

WebXR describes the shape of this view frustum for the view camera. Items that are not needed in this view can be 'culled' as they are not required.

The WebXR API provides the information that WebGL needs to know about 'where' to draw pixels. It needs to know both where to draw at any point in time, and how far the user may have moved from the origin of the space.

When drawing for a headset, the scene may have to be drawn at least stereoscopically, and sometimes more, as some headsets have multiple screens per eye. When a session is initiated you get a frustum for each panel in a headset. It is possible to create combined frustums. Motion controllers are then drawn in the correct place. Once the render loop is complete, the pixels can be displayed by the User Agent.
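Culling against a frustum can be sketched in a few lines of TypeScript. This is a simplified illustration under stated assumptions: the frustum is symmetric, the camera sits at the origin looking down the negative Z axis, and objects are treated as points. The Frustum type and the isInsideFrustum and cull functions are names invented for this example.

```typescript
interface Frustum {
  fovYRadians: number; // full vertical field of view
  aspect: number;      // width divided by height
  near: number;        // distance to the near plane
  far: number;         // distance to the far plane
}

type Vec3 = [number, number, number];

// Camera at the origin looking down -Z (the usual WebGL view-space convention).
function isInsideFrustum(p: Vec3, f: Frustum): boolean {
  const [x, y, z] = p;
  const depth = -z; // distance in front of the camera
  if (depth < f.near || depth > f.far) return false;
  const halfHeight = depth * Math.tan(f.fovYRadians / 2);
  const halfWidth = halfHeight * f.aspect;
  return Math.abs(x) <= halfWidth && Math.abs(y) <= halfHeight;
}

// Keep only the objects worth drawing (or describing) for this view.
function cull<T extends { position: Vec3 }>(objects: T[], f: Frustum): T[] {
  return objects.filter((o) => isInsideFrustum(o.position, f));
}

const view: Frustum = { fovYRadians: Math.PI / 2, aspect: 1, near: 0.1, far: 100 };
const visible = cull(
  [
    { name: "table", position: [0, 0, -2] as Vec3 },
    { name: "door behind user", position: [0, 0, 3] as Vec3 },
    { name: "lamp far to the side", position: [10, 0, -2] as Vec3 },
  ],
  view,
);
```

From an accessibility point of view it is worth noting that an object culled from the visual view may still matter to a non-sighted user, so a semantic layer would not necessarily apply the same test.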

WebXR rendering on a headset or monitor

In simplest terms the WebXR API describes the 'view frustum' from each panel and the location of each panel. Once WebGL has drawn those pixels they don't go to the monitor directly but to the display in the headset (which runs at a different frame rate to your monitor). It is important to understand that the refresh rate on a monitor will often differ from that of the headset. Lower refresh rates can cause sickness for the user.

The API then describes how to send the images to the various screens. The hardware may need to slightly move the images to account for some minor head motion. This is known as re-projection and is important as it helps to stop or reduce simulation sickness.

How screen and headset rendering works via Request Animation Frame (RAF) loops

The requestAnimationFrame() method is a JavaScript method that tells the browser that you wish to perform an animation and requests that the browser call a specified function to update an animation before the next repaint.

These callbacks happen in a loop, often called a 'RAF loop'. To draw an XR scene:

Use the device’s requestAnimationFrame() callback.
Request the current pose, orientation, and eye information from the device.
Divide the WebGL context into two halves, one for each eye, and draw at a suitable rate for the headset or device.

As a part of the RAF loop on a monitor at 60fps, the screen is wiped and redrawn. You can hook into this loop with requestAnimationFrame() to move objects before the next draw. A headset may have a refresh rate of 120Hz, and will require frames to be drawn at that rate, which may be different from the host machine.

Other methods are used to determine where the user's head is positioned, rotated etc., and there are also Projection Matrix methods that handle perspective views.
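The loop described above can be sketched as follows. So that the example runs outside a browser, it uses a fake frame source rather than a real XR session: FrameSource, FakeHeadset, eyeViewports and startLoop are all hypothetical names, and the 'drawing' is reduced to writing a log entry per frame.

```typescript
// Minimal hypothetical stand-ins so the sketch runs outside a browser.
// In a real page the frame source would be an XR session.
interface FrameSource {
  requestAnimationFrame(callback: (timeMs: number) => void): number;
}

interface Viewport { x: number; y: number; width: number; height: number; }

// Split one drawing surface into side-by-side halves, one per eye.
function eyeViewports(width: number, height: number): { left: Viewport; right: Viewport } {
  const half = Math.floor(width / 2);
  return {
    left: { x: 0, y: 0, width: half, height },
    right: { x: half, y: 0, width: width - half, height },
  };
}

// A fake device that fires callbacks at a fixed refresh rate for a few frames.
class FakeHeadset implements FrameSource {
  private queue: Array<(timeMs: number) => void> = [];
  private readonly refreshHz: number;
  constructor(refreshHz: number) {
    this.refreshHz = refreshHz;
  }
  requestAnimationFrame(callback: (timeMs: number) => void): number {
    this.queue.push(callback);
    return this.queue.length;
  }
  run(frames: number): void {
    for (let i = 0; i < frames; i++) {
      const pending = this.queue;
      this.queue = [];
      pending.forEach((cb) => cb((i * 1000) / this.refreshHz));
    }
  }
}

const drawLog: string[] = [];

function startLoop(source: FrameSource, width: number, height: number): void {
  const { left, right } = eyeViewports(width, height);
  const onFrame = (timeMs: number): void => {
    // A real renderer would fetch the pose here, then draw once per eye.
    drawLog.push(`t=${timeMs.toFixed(1)} L@${left.x} R@${right.x}`);
    source.requestAnimationFrame(onFrame); // re-register for the next frame
  };
  source.requestAnimationFrame(onFrame);
}

const headset = new FakeHeadset(120);
startLoop(headset, 2000, 1000);
headset.run(3);
```

The callback re-registers itself on every frame, which is what turns a single requestAnimationFrame() call into a loop paced by the device rather than by the page.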
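For reference, a perspective projection matrix of the kind mentioned above can be built as in the TypeScript sketch below. It uses the common column-major layout with normalised depth in the range -1 to 1; in practice WebXR supplies a projection matrix for each view, so the perspective and project functions here are illustrative helpers rather than anything an application would normally need to write.

```typescript
// Column-major 4x4 perspective matrix in the layout WebGL expects.
function perspective(fovYRadians: number, aspect: number, near: number, far: number): Float32Array {
  const f = 1 / Math.tan(fovYRadians / 2);
  const rangeInv = 1 / (near - far);
  const m = new Float32Array(16); // zero-initialised
  m[0] = f / aspect;
  m[5] = f;
  m[10] = (far + near) * rangeInv;
  m[11] = -1;
  m[14] = 2 * far * near * rangeInv;
  return m;
}

// Multiply a view-space point (w = 1) by the matrix, then divide by w to get
// normalised device coordinates.
function project(m: Float32Array, p: [number, number, number]): [number, number, number] {
  const [x, y, z] = p;
  const e = (i: number): number => m[i] ?? 0;
  const clipX = e(0) * x + e(4) * y + e(8) * z + e(12);
  const clipY = e(1) * x + e(5) * y + e(9) * z + e(13);
  const clipZ = e(2) * x + e(6) * y + e(10) * z + e(14);
  const clipW = e(3) * x + e(7) * y + e(11) * z + e(15);
  return [clipX / clipW, clipY / clipW, clipZ / clipW];
}

const proj = perspective(Math.PI / 2, 1, 0.1, 100);
const onNearPlane = project(proj, [0, 0, -0.1]); // depth maps to about -1
const onFarPlane = project(proj, [0, 0, -100]);  // depth maps to about +1
```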

RAF loops on window objects and WebXR device API

There are requestAnimationFrame() methods on both the window object and the WebXR Device API. After initialisation of the XR session, the RAF loop runs on the XR session object.

The developer will ask if there is AR/XR hardware in use, using the isSessionSupported() method, in order to, say, draw a button on screen. In the button handler the developer will call navigator.xr.requestSession(), and that is where the session begins. The call is asynchronous: it returns a promise that resolves with a new session, with everything needed to start rendering.
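A minimal sketch of this start-up flow is shown below. The method names isSessionSupported(), requestSession() and requestAnimationFrame() follow the WebXR Device API, but the interfaces XRSystemLike and SessionLike and the helper functions canEnter and enter are local, hypothetical stand-ins so that the example type-checks without browser typings. In a browser the first argument would be navigator.xr.

```typescript
// Local structural types so the sketch type-checks without DOM/WebXR typings.
type SessionMode = "inline" | "immersive-vr" | "immersive-ar";

interface SessionLike {
  requestAnimationFrame(callback: (timeMs: number, frame: unknown) => void): number;
}

interface XRSystemLike {
  isSessionSupported(mode: SessionMode): Promise<boolean>;
  requestSession(mode: SessionMode): Promise<SessionLike>;
}

// Step 1: feature-detect, so the page only offers an "Enter" button when it can work.
async function canEnter(xr: XRSystemLike | undefined, mode: SessionMode): Promise<boolean> {
  if (!xr) return false;
  try {
    return await xr.isSessionSupported(mode);
  } catch {
    return false;
  }
}

// Step 2: called from the button's click handler, since browsers require a
// user gesture before starting an immersive session.
async function enter(
  xr: XRSystemLike,
  mode: SessionMode,
  onFrame: (timeMs: number) => void,
): Promise<SessionLike> {
  const session = await xr.requestSession(mode);
  // From here on, frames are driven by the session, not by window.
  session.requestAnimationFrame((t) => onFrame(t));
  return session;
}
```

A page would typically call canEnter() on load to decide whether to show the button, and enter() only in response to the user activating it.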

Starting the XR WebGL Layer

This creates 2D buffers and will draw 2D content into the boundary of the displays on the headset. It is important to render directly to the buffers, because copying pixels between buffers slows things down, results in reduced refresh rates, and may make the user sick.

Understanding WebXR context

The WebXR Device API provides a crucial context that WebGL renders into. This may be content wrapped and drawn into an HTML5 canvas element.

The context is what is used to call the commands 'from' to render 'into'. This context comes from the canvas but it may be a WebGL, WebGL2 or WebGPU context. There is a '1 to 1' mapping between the canvas and the context. However, a canvas may have multiple contexts. Passing a canvas will pull out the contexts it needs to render the content to the headset in the various views.

NOTE: Currently data can't be shared between different WebGL contexts for security reasons.

The WebGL context is associated with the XR device and the final buffer drawn into goes directly to the display. This is drawn onto the pixel but may be shifted for the purposes of re-projection.

From an accessibility perspective, a challenge is ensuring that these rendered contexts are described in a way that makes sense to Assistive Technology. A problem is that current AT needs to infer this context from descriptions provided by the author and designer of the XR experience. This is a challenge due to the limitations in the ability to add declarative semantics in the authoring tools used to create XR type experiences.

NOTE: Taking an Object Orientated approach to adding rich semantics may help with the overhead of context delivery and redrawing. A non-sighted user will not need to know the minute details of a scene or have it repeatedly 'redrawn' to a headset, as long as they have a persistent view of the objects in the environment (for example a table with virtual fruit, and a communication device such as a virtual smartphone) and the context or details of the environment. They will need to know any affordances related to any objects or items they can interact with or use, and this will need to be communicated to them clearly.

Through the use of 'modal muting' or turning off the need to redraw and refresh displays constantly - a non-sighted user can enter an auditory abstraction of the environment, get clear information about what it is, and what is in it - and use a combination of auditory cues - such as directional soundings when near or far from a chosen object. The environment itself could have 'modes' where auditory cues and auditory way finding could be used, in conjunction with output to a screen reader or braille device which is fed semantics.
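As one possible illustration of a directional auditory cue, the TypeScript sketch below turns the relative position of a chosen object into a stereo pan and a volume. The cueFor function, the AudioCue type, the linear distance falloff and the 10 metre default range are all assumptions made for this example. A pan value alone cannot distinguish an object in front from one behind, so a real implementation would need more than this.

```typescript
interface AudioCue {
  pan: number;  // -1 = hard left, +1 = hard right
  gain: number; // 0 = silent, 1 = full volume
}

// Listener at (lx, lz) with heading in radians (0 = facing -Z, positive =
// turned to the right); target at (tx, tz). Horizontal plane only, in metres.
function cueFor(
  lx: number, lz: number, headingRadians: number,
  tx: number, tz: number, maxDistance: number = 10,
): AudioCue {
  const dx = tx - lx;
  const dz = tz - lz;
  const distance = Math.hypot(dx, dz);
  // Bearing of the target relative to straight ahead.
  const bearing = Math.atan2(dx, -dz) - headingRadians;
  const pan = Math.sin(bearing);
  // Linear falloff with distance (an assumption made for simplicity).
  const gain = Math.max(0, 1 - distance / maxDistance);
  return { pan, gain };
}

const toTheRight = cueFor(0, 0, 0, 2, 0);     // object 2 m to the user's right
const straightAhead = cueFor(0, 0, 0, 0, -5); // object 5 m in front
```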

Getting environment and object data via Scene Graphs, glTF

An area of great potential for XR accessibility is 'scene graphs'. A scene graph is a general data structure, represented as nodes in a tree or graph, that is used to provide descriptive information about an environment and what it contains. Scene graphs are commonly used in computer games, where they arrange the logical representation of a graphical scene and its content. This may be used to show spatial relationships. Spatial graphs may also be used.

One example is OpenSceneGraph, a 3D graphics toolkit used by application developers in fields such as visual simulation, games, virtual reality, scientific visualisation and modelling. It is used widely in many industries.

The nodes in a scene graph are used to identify entities or objects in the scene. Scene graphs should be thought of as composed of primitives or shapes that can be rendered as needed but have no semantic or declarative aspect to them.
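A plain scene graph of this kind can be sketched in TypeScript as a tree of nodes with positions relative to their parents. The SceneNode and PlacedMesh types and the flatten function are hypothetical and deliberately minimal (translation only, no rotation or scale). Note that nothing in the structure says what a mesh represents.

```typescript
type Vec3 = [number, number, number];

// A plain (non-semantic) scene graph: nodes carry geometry and a position
// relative to their parent, but say nothing about what they represent.
interface SceneNode {
  mesh?: string;     // hypothetical handle to renderable geometry
  translation: Vec3; // offset from the parent node
  children: SceneNode[];
}

interface PlacedMesh {
  mesh: string;
  world: Vec3;
}

// Depth-first traversal accumulating parent offsets into world positions.
function flatten(node: SceneNode, parent: Vec3 = [0, 0, 0]): PlacedMesh[] {
  const world: Vec3 = [
    parent[0] + node.translation[0],
    parent[1] + node.translation[1],
    parent[2] + node.translation[2],
  ];
  const here: PlacedMesh[] = node.mesh ? [{ mesh: node.mesh, world }] : [];
  return node.children.reduce((acc, child) => acc.concat(flatten(child, world)), here);
}

const room: SceneNode = {
  translation: [0, 0, 0],
  children: [
    {
      mesh: "mesh_17",
      translation: [1, 0, -2],
      children: [{ mesh: "mesh_42", translation: [0, 0.8, 0], children: [] }],
    },
  ],
};

const placed = flatten(room);
```

A renderer can draw mesh_17 and mesh_42 in the right places from this data, but Assistive Technology has no way to learn that one might be a table and the other an object resting on it.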

Semantic Scene Graphs

A semantic scene graph is an effective tool for representing physical and contextual relations between objects and scenes. This area is worth exploring from an accessibility perspective, as it may help to provide a proof of concept for a combined imperative and declarative model that will enable multi-modal accessible rendering.

Useful options for all players in this space to discuss are to:

Explore and test if semantic scene graphs can expose useful information to current Assistive Technology.
Consider the need to standardise semantic scene graphs for accessibility.
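To make the idea of a semantic scene graph more tangible, the following TypeScript sketch attaches labels, roles, affordances and relations to nodes and turns them into text that a screen reader or braille display could present. The field names and the describe function are illustrative assumptions only; no such vocabulary is standardised today.

```typescript
// Hypothetical semantic layer on top of a scene graph. The field names
// (label, role, affordances, relations) are illustrative, not standardised.
interface SemanticNode {
  id: string;
  label: string;
  role: string;
  affordances: string[];
  relations: Array<{ predicate: string; target: string }>;
}

const scene: SemanticNode[] = [
  { id: "table", label: "Wooden table", role: "furniture", affordances: [], relations: [] },
  {
    id: "apple", label: "Apple", role: "object", affordances: ["pick up"],
    relations: [{ predicate: "on top of", target: "table" }],
  },
  {
    id: "phone", label: "Smartphone", role: "device", affordances: ["pick up", "call"],
    relations: [{ predicate: "next to", target: "apple" }],
  },
];

// Turn one node into a sentence that AT could present to the user.
function describe(node: SemanticNode, all: SemanticNode[]): string {
  const byId = new Map(all.map((n): [string, SemanticNode] => [n.id, n]));
  const where = node.relations
    .map((r) => `${r.predicate} ${byId.get(r.target)?.label ?? r.target}`)
    .join(", ");
  const actions = node.affordances.length > 0
    ? ` Actions: ${node.affordances.join(", ")}.`
    : "";
  return `${node.label} (${node.role})${where ? ", " + where : ""}.${actions}`;
}

const descriptions = scene.map((n) => describe(n, scene));
```

Because relations and affordances are data rather than pixels, the same structure could in principle feed speech, braille or haptic output without any redrawing.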

glTF - as delivery mechanism for semantics

glTF (GL Transmission Format) is a specification for the efficient transmission and loading of assets such as 3D scenes, models, geometry and textures into a rendering engine like WebGL.

glTF does support scene graphs and semantic scene graphs that can then be inserted into WebGL and used for rendering. For example, for semantic scene understanding there is the glTF 2.0 Scene Description Structure for .gltf (JSON) (PDF file).

In the world of the widely supported WebGL and imperative rendering a current potential delivery mechanism for declarative semantics is glTF.

There is currently no known work on providing accessibility support in glTF. However, glTF provides two main mechanisms to "annotate" glTF models:

a generic hook to put application-specific metadata (using the "extras" property) - this would probably be a poor fit given that these are meant for app-specific metadata, whereas one would expect accessibility annotations to be interoperable across a variety of consumers and producers
extensions to glTF, which themselves come in three levels:
vendor-prefixed extensions - W3C itself could be a vendor in that sense if W3C wanted to develop an extension
multi-vendor extensions (prefixed with "EXT_") that require some level of interoperability demonstration
Khronos-ratified extensions that are expected to have wide interoperability and need, among other things, IP review.

An option that needs to be explored is research to test the viability of bringing an accessibility layer to glTF. This could be via a Khronos-ratified extension, and could be joint work between W3C and Khronos.
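The two annotation mechanisms can be compared side by side in the TypeScript sketch below, which models a trimmed glTF 2.0 document as a plain object. The "extras", "extensions" and "extensionsUsed" properties are part of glTF 2.0, but the extension name EXAMPLE_accessibility and its label and role fields are invented purely for illustration; no such extension exists.

```typescript
// The extension name and its fields are hypothetical, for illustration only.
const HYPOTHETICAL_EXT = "EXAMPLE_accessibility";

interface GltfNode {
  name?: string;
  mesh?: number;
  extras?: Record<string, unknown>;
  extensions?: Record<string, Record<string, unknown>>;
}

interface GltfDocument {
  asset: { version: string };
  extensionsUsed?: string[];
  nodes: GltfNode[];
}

const gltfDoc: GltfDocument = {
  asset: { version: "2.0" },
  extensionsUsed: [HYPOTHETICAL_EXT],
  nodes: [
    {
      name: "node_3",
      mesh: 0,
      // Option 1: app-specific metadata; other consumers will not understand it.
      extras: { a11yLabel: "Red armchair" },
      // Option 2: a named extension that several tools could agree on.
      extensions: { [HYPOTHETICAL_EXT]: { label: "Red armchair", role: "furniture" } },
    },
    { name: "node_4", mesh: 1 }, // no annotation at all
  ],
};

// Collect labels from nodes that carry the hypothetical extension.
function accessibleLabels(d: GltfDocument): string[] {
  const found: string[] = [];
  for (const node of d.nodes) {
    const label = node.extensions?.[HYPOTHETICAL_EXT]?.["label"];
    if (typeof label === "string") found.push(label);
  }
  return found;
}

const gltfLabels = accessibleLabels(gltfDoc);
```

A consumer that knows the extension name can find the annotation reliably, whereas the "extras" entry is only meaningful to the application that wrote it.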

Current work on Declarative Semantics, Model Formats and engines

There are current declarative 3D formats: X3D (from the Web3D Consortium), and XSeen. There may be no dedicated efforts on providing accessibility hooks to these formats; XSeen integrates with HTML and so in theory would benefit from the availability of ARIA attributes or AOM type properties. Bringing accessibility to 3D model formats (such as glTF, but also USDZ and FBX) would help with some use cases (e.g. where the 3D model is a core part of the experience, such as shopping). However, where the semantics of interactions lie in the relationship between objects, or their role in the environment, this may be out of scope for model formats. The 3D engines themselves (Unity, Unreal) also have a role in providing the right hooks.

Input and Interaction

@@ Add piece on Gamepad API and describe the hardware (including input and rendering) would be valuable prior to entering into the details.

Interaction in XR is complicated especially if we consider the potential range of input and outputs that may need to be managed and work in sync. Thinking of the current parallel of relatively simple mouse and touchscreen syncing and its inherent difficulties is a good way to frame the challenges for successful mapping of Assistive Technology interaction with various more traditional input and output mechanisms in XR.

As an XR space is usually larger than the physical space of the user, there are often mechanisms to trigger a 'teleport' function to move around these spaces. There are established mappings of hand-held motion controllers to grab objects, as well as selection at a distance, so that you can aim at an object, select it, pick it up, move it etc. There are now mainstream commercial devices available such as the Microsoft One that can mediate these interactions for people with disabilities.

Currently these interactions are generalised via an XRInputSource object that is made available when a session is initiated (along with target ray and grip location information).

Inputs can however come from any source, and they can be used to call these methods on particular objects. They may be generic select events, or fake/simulated events can be fired. Other interactions can be triggered via polyfills.

ARIA, the Accessibility Object Model and XR

ARIA is an accessibility annotation language that is primarily designed to make web applications accessible via a suite of roles/states and other properties. ARIA implementation in browsers has been very successful and it is broadly supported by many of the major screen readers and other Assistive Technology.

The Accessibility Object Model or AOM project aims to improve certain aspects of the user and developer experiences concerning the interaction between web pages and assistive technology.

In particular, this project is concerned with improving the developer experience around:

building Web Components which are as accessible as a built-in element;
expressing and modifying the semantics of any Element using DOM APIs;
expressing semantic relationships between Elements;
expressing semantics for visual user interfaces which are not composed of Elements, such as canvas-based user interfaces;
understanding and testing the process by which HTML and ARIA contribute to the computation of the accessibility tree.

By reducing the friction experienced by developers in creating accessible web pages, and filling in gaps in what semantics may be expressed via DOM APIs, the APIs proposed in the Accessibility Object Model aim to improve the user experience of users interacting with web pages via assistive technology.

This specification has promise as a bridging set of semantics as they may be suitable for generic descriptions of aspects such as objects, their purpose and so on in Immersive Environments.

The AOM also has 'Virtual Accessibility Nodes' that are not associated directly with any particular DOM element, and may be exposed to assistive technology. As yet there may not be many successful implementations of this, but the idea is promising.

AOM is currently confined to ARIA roles, states, and properties, which may not be sufficiently expressive - even in ARIA 1.2 - to convey the structure and relationships inherent in a 3D XR scene adequately to create a quality interaction for the AT user. Whatever mechanism is used, a generic set of declarative semantics is required, to be used along with imperative rendering - and it needs to be worked out if this can be developed with the traditional DOM to Accessibility API model, or Semantic Scene Graphs, or the AOM - or some combination of the three.

Stochastic Accessibility Rendering and other ideas

An interesting option to explore is stochastic accessibility rendering, to potentially front load semantics that can be used by Assistive Technology depending on the choices the user makes at any point in time. This type of probability modelling could anticipate what a user is likely to do, and may lighten the number of API calls and the rendering load needed to successfully describe the XR environment to Assistive Technology and the user at any point in time.

There may also be a need for an xr-role="none" type of semantic, as a particular challenge is to distinguish between items that are critical to the understanding of a 3D experience and those that are not.

It is also worth exploring how the categorization done in real-world accessibility (e.g. with a guide dog) could help with navigation in virtual world contexts as well.
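One very simple reading of this idea is sketched below in TypeScript: given an estimated probability that the user will interact with each object next, preload descriptions for the most likely ones within a fixed budget. The Candidate type, the choosePreload function, the greedy strategy and the notion of a 'description cost' are all assumptions made for this example. How the probabilities would be estimated is left open.

```typescript
// Hypothetical model: estimated probability that the user interacts with each
// object next, and a relative cost of fetching its description.
interface Candidate {
  id: string;
  probability: number;
  descriptionCost: number;
}

// Greedy choice: preload the most likely objects first until the budget
// (for example, description requests allowed per update) is used up.
function choosePreload(candidates: Candidate[], budget: number): string[] {
  const ranked = candidates.slice().sort((a, b) => b.probability - a.probability);
  const chosen: string[] = [];
  let spent = 0;
  for (const c of ranked) {
    if (spent + c.descriptionCost <= budget) {
      chosen.push(c.id);
      spent += c.descriptionCost;
    }
  }
  return chosen;
}

const preload = choosePreload(
  [
    { id: "door", probability: 0.6, descriptionCost: 2 },
    { id: "poster", probability: 0.1, descriptionCost: 3 },
    { id: "phone", probability: 0.3, descriptionCost: 1 },
  ],
  3,
);
```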

There is also work underway with the Khronos 3D Commerce group exploring metadata that may have accessibility implications.

Acknowledgements

This document is authored/compiled by Joshue O'Connor (W3C/WAI). It came to be as a result of a very useful conversation at TPAC Fukuoka 2019 between the Immersive Web Group, the APA working group, and authors of the AOM spec, Alice Boxhall (Google) and James Craig (Apple). Special thanks are due to Nell Waliczek (Amazon), who gave an overview of the process behind current XR rendering which informed much of the related technical detail in this document, and Ada Rose Cannon (Samsung), who suggested the idea of standardizing semantic scene graphs for accessibility.

Many thanks to other group members who attended and actively contributed, and apologies if you are not listed here. Thanks also to Dominique Hazael-Massieux and for RQTF/APA input from Jason White, as well as Matthew Tylee Atkinson for considered input and editorial advice.
