Note: this is a personal view by Dave Raggett and not an official position of W3C.
This page is the starting point for a series of experiments towards the vision of the Immersive Web, which in turn builds upon an earlier paper from 1994: Extending WWW to support Platform Independent Virtual Reality. That paper set out ideas for a VR markup and scripting language (VRML). Following a birds of a feather session at the first international World Wide Web conference, held at CERN, Switzerland, in 1994, Mark Pesce, Tony Parisi and others enthusiastically worked on realising VRML based upon SGI's Open Inventor file format, rather than a markup language akin to HTML. For more details, see Mark's book VRML: Browsing and Building Cyberspace, pub. New Riders, 1995.
The Virtual Reality Modelling Language (note the change from markup to modelling) was developed further by the Web3D Consortium, evolving into its successor X3D. Whilst VRML benefited from the support of SGI, it was held back by the lack of bandwidth over the dial-up Internet connections typical of the late 1990s. At the same time, computer games were rapidly improving with the development of GPUs for personal computers and games consoles. The parallel processing capabilities of GPUs were subsequently exploited for mining cryptocurrencies and for artificial neural networks, powering modern AI.
Turning the clock forward, W3C is now working on enabling web browsers to benefit from the capabilities of GPUs and related tensor processors through work on WebGPU and WebNN. The Immersive Web Working Group is chartered to help bring high-performance Virtual Reality (VR) and Augmented Reality (AR) (collectively known as XR) to the open Web via APIs to interact with XR devices and sensors in browsers. WebXR is a web-based API that enables the creation of immersive experiences by integrating augmented reality and virtual reality into web applications. It allows web content and applications to interface with mixed reality hardware such as VR headsets and AR glasses, together with games controllers.
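To give a flavour of the WebXR API surface, here is a minimal sketch of requesting an immersive session and starting a frame loop. It assumes a WebXR-capable browser and suitable TypeScript type declarations (e.g. @types/webxr); layer setup, rendering and error handling are omitted.

```typescript
// Minimal WebXR bootstrap: check support, request an immersive VR session,
// obtain a reference space and run a frame loop that reads the viewer pose.
// Rendering into a WebGL/WebGPU layer is left as a stub.
async function startImmersiveSession(): Promise<void> {
  if (!navigator.xr || !(await navigator.xr.isSessionSupported("immersive-vr"))) {
    console.log("Immersive VR is not supported on this device");
    return;
  }
  const session = await navigator.xr.requestSession("immersive-vr");
  const refSpace = await session.requestReferenceSpace("local");

  session.requestAnimationFrame(function onFrame(time, frame) {
    const pose = frame.getViewerPose(refSpace);
    if (pose) {
      // Render one view per eye here, using pose.views and a configured XR layer.
    }
    frame.session.requestAnimationFrame(onFrame);
  });
}
```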
As stated in their explainer, the Immersive Web Working Group is not working on VR/AR browsers, nor on building “The Metaverse”. Meanwhile, other groups are considering how best to enable accessibility for extended reality applications; see the XRAccessibility Project and the W3C's Accessible Platform Architectures Working Group.
Accessibility features prominently in the talk cited in the first paragraph above. That talk introduces the role of declarative and procedural models at a layer above the APIs provided by WebGPU, WebNN and WebXR. Accessibility is enabled through intent-based interaction that describes the what (i.e. the aims) rather than the how. Users can then choose how they interact with applications according to their needs. Some people may be happy using games controllers, whilst others with less dexterity may opt for high-level voice commands. People with reduced vision get audio scene descriptions, whilst people with speech impediments can communicate via the keyboard.
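As a rough sketch of what intent-based interaction means in practice (the names and shapes below are purely illustrative and not part of any of the APIs mentioned above), different input modalities can be normalised to the same declarative intent, leaving the choice of modality to the user:

```typescript
// Illustrative only: several input modalities map onto the same declarative intent,
// so application logic deals with the "what" rather than the "how".
interface Intent {
  action: "open" | "move" | "describe"; // what the user wants to happen
  target: string;                       // the named object in the scene
}

// Hypothetical adapters for different modalities, each producing the same intent.
function fromVoiceCommand(utterance: string): Intent | null {
  const m = utterance.match(/^open the (.+)$/i);
  return m ? { action: "open", target: m[1] } : null;
}

function fromControllerSelect(selectedObject: string): Intent {
  return { action: "open", target: selectedObject };
}

// The application handles intents uniformly, regardless of how they were produced.
function handleIntent(intent: Intent): void {
  console.log(`handling intent: ${intent.action} -> ${intent.target}`);
}
```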
Users may be sitting at a desktop computer or slouched on a sofa with their phone, but their avatar should be rendered as a full body, walking around, gesturing and so forth at its user's behest. For this we can use the device's camera to provide a video stream of the user's face as a basis for real-time facial reenactment with the user's choice of avatar. More prosaically, the avatar should show the same facial gestures as its user. Immersive meetings in extended reality are supported using streaming over WebSockets and WebRTC. For scalability, different servers support different spatial localities, using a spatial query mechanism to direct browsers to the correct servers.
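One way to picture the spatial query step: the browser asks a directory service which server is responsible for the avatar's current locality, then opens a WebSocket to it. The endpoint and message shapes below are hypothetical assumptions for illustration, not a defined protocol.

```typescript
// Hypothetical sketch: resolve the streaming server for a given spatial locality,
// then connect to it over a WebSocket. The /locality endpoint and the response
// shape are illustrative assumptions only.
interface LocalityInfo {
  serverUrl: string; // WebSocket URL of the server handling this region
  regionId: string;  // identifier for the spatial region
}

async function connectToLocality(x: number, y: number, z: number): Promise<WebSocket> {
  const response = await fetch(`/locality?x=${x}&y=${y}&z=${z}`);
  const info: LocalityInfo = await response.json();
  const socket = new WebSocket(info.serverUrl);
  socket.addEventListener("open", () =>
    socket.send(JSON.stringify({ type: "join", region: info.regionId }))
  );
  return socket;
}
```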
The behaviour of avatars for people, agents or digital twins can be modelled using chunks and rules, a brain-inspired cognitive approach to sequential reasoning that is decoupled from real-time control. This can be used to map high-level intents to lower-level behaviours. The approach supports abstract messaging between agents via an addressee name or a topic name, along with tasks, task delegation and synchronisation as an abstraction; see the GIECS 2025 talk Web-based monitoring, orchestration and simulation for more details.
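The messaging abstraction itself is simple to picture. The following toy sketch in TypeScript illustrates delivery by addressee name or by topic name; it is not the chunks and rules system, just an illustration of the abstraction it exposes to agents.

```typescript
// Toy message bus illustrating delivery by addressee name or by topic name.
type Handler = (message: object) => void;

class MessageBus {
  private agents = new Map<string, Handler>();
  private topics = new Map<string, Set<string>>(); // topic -> subscribed agent names

  register(name: string, handler: Handler): void {
    this.agents.set(name, handler);
  }
  subscribe(name: string, topic: string): void {
    if (!this.topics.has(topic)) this.topics.set(topic, new Set());
    this.topics.get(topic)!.add(name);
  }
  // Deliver a message to a single named agent.
  sendTo(addressee: string, message: object): void {
    this.agents.get(addressee)?.(message);
  }
  // Deliver a message to every agent subscribed to a topic.
  publish(topic: string, message: object): void {
    for (const name of this.topics.get(topic) ?? []) this.sendTo(name, message);
  }
}
```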
A new flexible file format (*.imw) has been introduced to allow for experimentation with information beyond that held in existing 3D formats like GLB. IMW files consist of a sequence of objects, each with a set of named properties, whose values are names, numbers or lists thereof. Files can be delivered compressed with gzip. IMW supports materials, prototypes and instances, meshes, links to scripts for dynamically generated models, bones and joints. Further extensions are anticipated, e.g. for level of detail control, shadow textures, behaviours and so forth.
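The data model just described maps naturally onto a simple in-memory representation. The TypeScript types below are one possible reading of it, offered as an illustration rather than a published schema.

```typescript
// One possible in-memory representation of the IMW data model described above:
// a file is a sequence of objects, each with named properties whose values are
// names, numbers or lists thereof. Illustrative only, not a published schema.
type IMWValue = string | number | IMWValue[];

interface IMWObject {
  kind: string;                         // e.g. material, mesh, prototype, instance, bone, joint
  properties: Record<string, IMWValue>; // named properties of the object
}

type IMWFile = IMWObject[];             // a file is a sequence of objects
```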
The full realisation of the Immersive Web vision will be hard, and will need to be approached using a sequence of small incremental steps. WebGPU's compute pipelines can be used with spatial indexing and level of detail control to feed vertex and fragment shaders. Instancing provides a means to apply the same 3D mesh to many instances with their own animations, transforms and textures. Shadow maps can be used for adding shadows to foreground objects, along with baked shadows (shadow textures) for static objects.
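As a small sketch of the instancing idea, each instance can pick up its own transform from a storage buffer via its instance index, so one draw call renders many copies of the same mesh. This is a minimal illustration with the WGSL shader embedded in TypeScript; pipeline, buffer and bind group creation are elided.

```typescript
// Sketch of per-instance transforms with WebGPU instancing: the same mesh is drawn
// many times, with each instance indexing its own model matrix in a storage buffer.
const shaderSource = /* wgsl */ `
  @group(0) @binding(0) var<storage, read> models : array<mat4x4<f32>>;
  @group(0) @binding(1) var<uniform> viewProj : mat4x4<f32>;

  @vertex
  fn vs_main(@location(0) position : vec3<f32>,
             @builtin(instance_index) instance : u32) -> @builtin(position) vec4<f32> {
    return viewProj * models[instance] * vec4<f32>(position, 1.0);
  }
`;

// Later, in the render pass, one draw call renders all instances of the shared mesh:
//   pass.setPipeline(pipeline);
//   pass.setBindGroup(0, bindGroup);
//   pass.setVertexBuffer(0, meshVertexBuffer);
//   pass.draw(vertexCount, instanceCount);
```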
The biggest challenge is to support real-time facial reenactment on everyday devices. The plan is to apply federated learning in the browser for crowd-sourced autoregressive blend shape training. Blend shapes describe vertex displacements for facial expressions, e.g. the movement of the corners of the mouth. The ambition is to allow the models to learn blend shapes without needing explicit definitions, and to combine this with albedo and normal maps for reduced polygon counts. The approach uses a combination of WebGPU and WebNN for acceleration.
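The underlying blend shape arithmetic is simple: each shape contributes a weighted per-vertex displacement to the neutral mesh. A minimal sketch follows in plain TypeScript; in practice this would run in a vertex shader or a WebGPU compute pass.

```typescript
// Apply blend shape weights to a neutral face mesh: for each vertex,
// displaced = base + sum_i(weight_i * delta_i), where delta_i is the
// per-vertex displacement defined by blend shape i.
function applyBlendShapes(
  base: Float32Array,     // packed xyz positions of the neutral mesh
  deltas: Float32Array[], // one packed xyz displacement array per blend shape
  weights: number[]       // one weight per blend shape, typically in [0, 1]
): Float32Array {
  const out = Float32Array.from(base);
  for (let s = 0; s < deltas.length; s++) {
    const w = weights[s];
    if (w === 0) continue;
    const delta = deltas[s];
    for (let i = 0; i < out.length; i++) {
      out[i] += w * delta[i];
    }
  }
  return out;
}
```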
The WebNN API is designed for inference, not training. The solution under development is to start with a simple syntax for neural network models that is processed to generate the JavaScript WebNN code for a) running the model forward, and b) back-propagating the loss function by mapping the model to its inverse, thereby updating the model parameters in a forward pass through the inverse model. This avoids the need to rely on huge libraries like TensorFlow.js and ONNX.
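The following is a rough sketch of the kind of generated code meant for the forward pass, using WebNN's MLGraphBuilder for a single dense layer. The operand descriptor fields and the execution call have varied between WebNN spec revisions, so treat this as a sketch of the shape of the code rather than a drop-in implementation; the function name and placeholder weights are illustrative.

```typescript
// Rough sketch of generated WebNN code for a forward pass through one dense layer:
// y = relu(x · W + b), using standard WebNN operations.
async function buildDenseLayer() {
  const context = await navigator.ml.createContext();
  const builder = new MLGraphBuilder(context);

  // Input tensor: one row of four features.
  const x = builder.input("x", { dataType: "float32", shape: [1, 4] });

  // Model parameters supplied as constants (placeholder values).
  const W = builder.constant({ dataType: "float32", shape: [4, 2] },
                             new Float32Array(8).fill(0.1));
  const b = builder.constant({ dataType: "float32", shape: [1, 2] },
                             new Float32Array(2).fill(0));

  // Wire up the operations relating the named tensors.
  const y = builder.relu(builder.add(builder.matmul(x, W), b));

  // Compile the graph; execution then binds a buffer to "x" and reads back "y"
  // (via compute() in earlier spec drafts, or tensors and dispatch() in later ones).
  return builder.build({ y });
}
```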
The neural network model syntax (*.nnm) needs to a) declare the tensor shapes and datatypes, b) define how the named tensors are related through standardised operations, and c) supply the initial values for the input tensors. We can use the same names as WebNN for the datatypes and operations, and avoid the need for quote marks.
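Whatever the surface syntax ends up looking like, the information it has to carry follows directly from a), b) and c). A parsed model might be represented along these lines; again, this is illustrative rather than a defined schema.

```typescript
// Illustrative representation of the information a parsed .nnm model must carry:
// tensor declarations, operations relating named tensors, and initial values.
// Datatype and operation names follow WebNN (e.g. "float32", "matmul", "relu").
interface TensorDecl {
  name: string;
  dataType: "float32" | "float16" | "int32" | "uint8";
  shape: number[];
}

interface Operation {
  op: string;       // a WebNN operation name, e.g. "matmul", "add", "relu"
  inputs: string[]; // names of the input tensors
  output: string;   // name of the tensor produced
}

interface NNModel {
  tensors: TensorDecl[];                   // a) shapes and datatypes
  operations: Operation[];                 // b) how named tensors are related
  initialValues: Record<string, number[]>; // c) initial values for input tensors
}
```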
As work proceeds further links will be added above.