Media Capture Depth Stream Extension


Abstract

This document discusses use cases and requirements for accessing a depth stream from a depth camera. It is assumed this functionality will be defined in an extension specification layered on top of the existing getUserMedia API.

Overview

Depth cameras have become popular in the consumer market with devices such as the Kinect [1] and Creative Senz3D [2]. There are three main depth sensing technologies:

  1. stereo cameras [3],
  2. time-of-flight cameras [4] and
  3. structured light cameras [5].

Based on these technologies, depth cameras can provide a depth map, which gives the distance between the camera and points on an object's surface. Some cameras are also able to provide an RGB image synchronized with the depth map. According to [6], depth information can enhance RGB-only computer vision in both robustness and performance. As discussed in [7], depth information can also be used for "free viewpoint" video rendering, which can help create immersive conferencing.

Use Cases

An HTML5 "Fruit Ninja" style video game with hand gesture
  • UC1: An HTML5 "Fruit Ninja" style Video Game with Hand Gesture (depth stream processing).

Fruit Ninja is a popular video game in which players cut fruit as it flies across the screen while avoiding the bombs that fly by, which explode and end the game if cut. An HTML5 "Fruit Ninja" style video game can run on different devices with different input methods. On traditional personal computers, users control the blade with a mouse or pointing device to cut the flying fruit. On phones or tablets, users cut the flying fruit with the touch screen. On devices with a depth camera, users could wave their bare hands to cut the flying fruit, making the game more physical and interactive.

Bob launches the HTML5 "Fruit Ninja" style video game in a browser running on a device connected to a depth camera. The browser notifies Bob that this site wants to access the depth camera. Bob grants permission for this access. Then the game's user interface instructs Bob that he can now wave his hand around in front of the depth camera to cut the flying fruit. The game's user interface can also blend a contour of Bob's body into the game's scene to help him locate his position and improve the game play. Bob then waves one of his hands to cut an apple on the start screen in order to start the game. Bob is now able to interact with the game by waving his hands around to cut the flying fruit.

For another example of a video game with hand gesture tracking, see the demonstration video. The Google Glass team has also released a game called Shape Splitter that delivers this type of interaction. While it currently uses image processing, it shows that this interaction model is also very relevant on wearable devices.

Requirements (see below): PD1, LMD1, LMD2 and LMD3.


  • UC2: 3D object scan (RGB and depth fusion).

Alice is an artist who produces handicrafts and sells them via an online consumer-to-consumer shopping site. The shopping site enables Alice to capture and upload 2D images of her products using a standard RGB camera. The site also allows users to create 3D models of their products if a combined RGB and depth camera is available. After Alice clicks the site's "create a 3D model" button, the browser asks for permission to access the combined color and depth camera. Alice grants permission for this access. The site then shows Alice a preview widget that fuses the RGB video and depth map into a 3D scene. The site asks Alice to move the combined color and depth camera around to scan her product. The preview widget shows an indicator to let Alice know if she is moving too fast or too slow. After the scan is complete, the site creates a 3D model of Alice's product with a color texture map. After Alice checks and approves her 3D model, it is uploaded to the shopping site. Potential customers are now able to view a rich 3D representation of Alice's product.

Requirements (see below): PD1, LMD1, LMD2, LMD3, LMD4 and LMD5.


  • UC3: Immersive 3D video conference (depth streaming).

Alice and Bob usually use a video chat site (based on WebRTC technology) to talk to each other. The site has been upgraded to support a "3D chat room" feature, and Alice and Bob decide to try it. Alice first clicks the "Enter the 3D Chat Room" button on the site. Alice's browser asks for permission to access the combined RGB and depth camera attached to her computer. Alice grants permission for this access. The site then fuses the local RGB stream with the depth map stream to project them into a local 3D scene. Alice's browser also streams the RGB stream and the depth stream to Bob's browser, which receives the remote RGB stream and depth map stream from Alice and projects them into a local 3D scene. Bob then does the same in his browser. After that, both Alice and Bob are able to see each other in their respective local 3D scenes. They are also free to change the viewpoint of the 3D scene of the video conference.

Requirements (see below): PD1, LMD1, LMD2, LMD3, LMD4, LMD5 and RMD1.


  • UC4: Accessible 3D video conference (depth streaming and stream processing).

Dave is visually impaired and regularly uses a screen reader to browse the web. Kate is hearing impaired and has a combined RGB and depth camera attached to her computer. Dave and Kate wish to communicate using an accessible 3D video conference. Dave connects to the conference site and enters a new chat room in text-only mode. Kate then connects to the same conference site and enters the same chat room. Kate's browser asks for permission to access her combined RGB and depth camera. Kate grants permission for this access. Kate now sees a local video preview of herself and then configures the site to enter a "translation mode" that translates her American Sign Language gestures into text, which is then transmitted to Dave. Dave's screen reader converts this text to speech, and Dave can reply using a braille-enhanced keyboard or even a speech-to-text converter. They are now both easily able to communicate using this accessible 3D video conference.

Requirements (see below): PD1, LMD1, LMD2, LMD3, LMD4, LMD5 and RMD1.


  • UC5: Mobile Augmented Reality and Spatial Scanning

Mary has a mobile device that includes a built-in depth camera (see Project Tango [9] for one example). Using this device, she can run a web application that uses the depth map of the space around her to place virtual objects and information. The screen on her mobile device shows the background video of the world she sees, with the virtual objects and information overlaid on top. The depth map allows the web application to more accurately detect planes, surfaces and objects, so it can seamlessly blend the virtual content into the background video, making the overall user experience more compelling and believable.

This general use case applies to a whole category of applications ranging from games (e.g. bouncing virtual balls off the ground or wall planes in front of you) through to detailed spatial analysis applications (e.g. measuring rooms and presenting information about hidden infrastructure and services like plumbing and electrical cabling) and more.

Requirements (see below): PD1, LMD1, LMD2, LMD3, LMD4, LMD5 and in some cases RMD1.


  • UC6: "Green screen" style video conferencing

Jane is on a business trip. While waiting to board at the terminal, she is going to have a video conference with her colleagues. Jane has a mobile device with a built-in depth camera. She launches the video conferencing web application on this device. Since Jane is in a public area, she would like to turn on the web application's "background removal" function. The web application requests access to the color and depth cameras, and Jane grants the request. Using the depth-calibrated RGB data and the depth data, the web application is able to show a live video of Jane with the background removed. Jane also replaces the background with a pre-captured image of her office cubicle. Once Jane is satisfied with the settings, she connects to the online meeting room. During the video conference, Jane's colleagues see only Jane against the office cubicle image, which keeps the video conference focused and effective.

Requirements (see below): PD1, LMD1, LMD3, LMD4 and LMD5. This use case also requires media stream pre-processing capabilities, such as Media Capture from DOM Elements.

Requirements

  • Permission for Depth Stream (PD)
  1. The UA must request the user's specific permission before accessing the depth camera.
  • Local Media for Depth Stream (LMD)
  1. The UA must enable a web application to request a depth map stream from a depth camera.
  2. The UA must allow a web application to show the depth map stream on the screen.
  3. The UA must enable a web application to decode the depth value of each pixel in the depth map.
  4. The UA must allow a web application to request an RGB stream aligned with the depth stream.
  5. The UA must keep the RGB stream synchronized and calibrated with the depth stream.
  • Remote Media for Depth Stream (RMD)

  1. The UA must enable a web application to transmit the depth stream to a remote host without any significant loss of accuracy. (deferred to a future version)

Specification

Examples

1) Request a depth stream without an RGB stream (PD1, LMD1):

navigator.getUserMedia({ depth: true }, success, failure);

function success(s) {
  var depthTrack = s.getDepthTracks()[0];
  console.log(depthTrack.kind); // prints "depth"
}
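
The failure callback is not expanded in these examples. A minimal sketch of one is given below; it assumes only the callback-based getUserMedia error model used above, and the logged message is illustrative:

function failure(error) {
  // PD1: the user may have denied access to the depth camera,
  // or no suitable depth camera may be available
  console.log('Unable to access the depth camera: ' + error);
}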

2) Request a depth stream (PD1, LMD1) and a calibrated RGB video stream (LMD4, LMD5) contained within a single MediaStream, display the video stream and the depth map stream locally (LMD2), and send the depth stream to a remote host (RMD1) (RTC support deferred to a future version):

(The streams are calibrated if requested together. The behavior is conceptually similar to how camera and microphone inputs are synchronized.)

navigator.getUserMedia({ video: true, depth: true }, success, failure);

function success(s) {
  // wire the stream into a <video> element for playback
  // ISSUE: should the implementations allow the user to
  // switch between the RGB video and depth stream playback
  // if the stream contains both video and depth tracks?
  var video = document.querySelector('#video');
  video.src = URL.createObjectURL(s);
  video.play();

  // construct a new MediaStream out of the existing depth track(s)
  var depthStream = new MediaStream(s.getDepthTracks());

  // (not supported, for future work)
  // send the newly created depth stream over a RTCPeerConnection
  // var peerConnection = new RTCPeerConnection(config);
  // peerConnection.addStream(depthStream);
 
  // wire the depth stream into another <video> element for playback
  // NOTE: the depth information is visualized as an 8-bit grayscale representation
  var depthVideo = document.querySelector('#depthVideo');
  depthVideo.src = URL.createObjectURL(depthStream);
  depthVideo.play();
}

3) Request a depth stream and an RGB video stream in two separate MediaStreams (the streams are not calibrated, as they can be sourced from different cameras):

navigator.getUserMedia({ depth: true }, successDepth, failureDepth);
navigator.getUserMedia({ video: true }, successVideo, failureVideo);

function successDepth(s) {
  console.log(s.getVideoTracks().length); // prints “0" 
  console.log(s.getDepthTracks().length); // prints “1”
  console.log(s.getDepthTracks()[0].kind); // prints “depth"
}

function successVideo(s) {
  console.log(s.getVideoTracks().length); // prints “1" 
  console.log(s.getDepthTracks().length); // prints “0" 
  console.log(s.getVideoTracks()[0].kind); // prints “video"
}

4) The value of each pixel of a 16-bit depth map in a depth stream frame can be read (LMD3) using the APIs provided by the CanvasRenderingContext2D context and ArrayBuffer. However, since neither the canvas drawing surface used to draw and manipulate 2D graphics on the web platform nor the ImageData interface used to represent image data supports 16 bits per pixel, the depth map values are converted to 8-bit grayscale by the implementation when the MediaStream s containing the depth stream is wired into an HTMLVideoElement video, as discussed in example 2. The width w, height h, and framerate fps of the converted stream are developer-configurable:

// assumes video, context, w, h and fps are set up as discussed above;
// context is the 2D context of a canvas whose size is w x h
setInterval(function() {
  context.drawImage(video, 0, 0, w, h);

  var depthData = context.getImageData(0, 0, w, h);
  // the data is stored as RGBA; the converted depth map is 8-bit
  // grayscale, so read one channel (here R) per pixel
  for (var i = 0; i < depthData.data.length; i += 4) {
    var depth = depthData.data[i];
    // process the 8-bit grayscale depth value
  }
}, 1000 / fps);

5) WebGL examples (work-in-progress):

// add some WebGL examples here
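
As a starting point, the following is a minimal sketch of how the depth stream could be consumed from WebGL, assuming the 8-bit grayscale conversion discussed in example 4 and roughly corresponding to alternative (i) in the Issues section below. The depthVideo element is the one created in example 2; the #glCanvas element and the WebGL boilerplate are illustrative assumptions, not part of the proposal.

// assumes depthVideo is the <video> element playing the depth stream
// (see example 2) and that a <canvas id="glCanvas"> exists in the page
var canvas = document.querySelector('#glCanvas');
var gl = canvas.getContext('webgl');

// minimal shaders: draw a full-screen quad and visualize the red
// channel of the uploaded texture as the 8-bit grayscale depth value
var vsSource =
  'attribute vec2 position;' +
  'varying vec2 texCoord;' +
  'void main() {' +
  '  texCoord = position * 0.5 + 0.5;' +
  '  gl_Position = vec4(position, 0.0, 1.0);' +
  '}';
var fsSource =
  'precision mediump float;' +
  'uniform sampler2D depthMap;' +
  'varying vec2 texCoord;' +
  'void main() {' +
  '  float depth = texture2D(depthMap, texCoord).r;' +
  '  gl_FragColor = vec4(vec3(depth), 1.0);' +
  '}';

function compile(type, source) {
  var shader = gl.createShader(type);
  gl.shaderSource(shader, source);
  gl.compileShader(shader);
  return shader;
}

var program = gl.createProgram();
gl.attachShader(program, compile(gl.VERTEX_SHADER, vsSource));
gl.attachShader(program, compile(gl.FRAGMENT_SHADER, fsSource));
gl.linkProgram(program);
gl.useProgram(program);

// full-screen quad geometry
var buffer = gl.createBuffer();
gl.bindBuffer(gl.ARRAY_BUFFER, buffer);
gl.bufferData(gl.ARRAY_BUFFER,
    new Float32Array([-1, -1, 1, -1, -1, 1, 1, 1]), gl.STATIC_DRAW);
var position = gl.getAttribLocation(program, 'position');
gl.enableVertexAttribArray(position);
gl.vertexAttribPointer(position, 2, gl.FLOAT, false, 0, 0);

// texture that receives the depth video frames
var texture = gl.createTexture();
gl.bindTexture(gl.TEXTURE_2D, texture);
gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MIN_FILTER, gl.NEAREST);
gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MAG_FILTER, gl.NEAREST);
gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_WRAP_S, gl.CLAMP_TO_EDGE);
gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_WRAP_T, gl.CLAMP_TO_EDGE);
gl.pixelStorei(gl.UNPACK_FLIP_Y_WEBGL, true);

function render() {
  // upload the current depth video frame into the texture
  gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA, gl.RGBA,
      gl.UNSIGNED_BYTE, depthVideo);
  gl.drawArrays(gl.TRIANGLE_STRIP, 0, 4);
  requestAnimationFrame(render);
}
requestAnimationFrame(render);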

More Examples

Issues

  • Depth cameras usually produce 16-bit depth values per pixel. However, the 16bpp format is not widely adopted or supported on web platforms. One option is to encode the 16-bit depth value into the three 8-bit RGB channels as proposed in [8] (a simplified packing sketch is given after this list). Alternatively, to make depth data accessible in absolute units, such as linear 16-bit millimeters or floating-point meters, the logical way to implement this would be as the depth component of a GL texture, which can be UNSIGNED_SHORT or FLOAT in GLES 3.0 (see Benjamin's mail for details).
  • The values in the depth map need to be defined, for example their unit, min_value and max_value.
  • We need to agree on how the depth stream is exposed to the WebGL context. Currently two alternatives have been proposed: i) pack the 16-bit depth value in the RGBA format, similarly to how we extend ImageData; ii) extend WebGLRenderingContext and upload the depth map via DEPTH_COMPONENT.
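
For illustration only, the sketch below shows a naive packing of a 16-bit depth value into two 8-bit channels. Note that [8] proposes a more elaborate encoding designed to survive lossy video compression; this sketch merely conveys the general idea of fitting 16 bpp into the 8-bit channels of an RGB(A) image.

// naive packing: high byte in R, low byte in G, B unused
// (not the scheme of [8], which is more robust to compression)
function packDepth(depth16) {
  return {
    r: (depth16 >> 8) & 0xff, // high byte
    g: depth16 & 0xff,        // low byte
    b: 0                      // unused in this naive packing
  };
}

function unpackDepth(r, g) {
  return (r << 8) | g;
}

var packed = packDepth(12345);
console.log(unpackDepth(packed.r, packed.g)); // prints 12345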

References

[1]: Kinect for Windows, http://www.microsoft.com/en-us/kinectforwindows/
[2]: Creative Senz3D, http://us.creative.com/p/web-cameras/creative-senz3d
[3]: http://en.wikipedia.org/wiki/Stereo_camera
[4]: S. Gokturk, H. Yalcin, and C. Bamji, A time-of-flight depth sensor system description, issues and solutions, in Proc. IEEE Conf. Computer Vision Pattern Recognition Workshops, 2004, pp. 35–45.
[5]: Geng, Structured-light 3-D surface imaging: A tutorial, Adv. Optics Photonics, vol. 3, no. 2, pp. 128–160, 2011.
[6]: Jungong Han, Ling Shao, Dong Xu, and Jamie Shotton, Enhanced Computer Vision with Microsoft Kinect Sensor: A Review, IEEE Transactions on Cybernetics, Oct. 2013
[7]: K. Muller, P. Merkle, and T. Wiegand, 3D video representation using depth maps, Proceedings of the IEEE, vol. 99, no. 4, pp. 643–656, Apr.2011.
[8]: Pece, F., Kautz, J., and Weyrich, T., Adapting standard video codecs for depth streaming, in Proceedings of the 17th Eurographics Conference on Virtual Environments & Third Joint Virtual Reality (EGVE - JVRC'11), Eurographics Association, Aire-la-Ville, Switzerland, 2011, pp. 59–66, http://web4.cs.ucl.ac.uk/staff/j.kautz/publications/depth-streaming.pdf
[9]: Project Tango, mobile device with a built-in depth camera, http://www.google.com/atap/projecttango/