Abstract

This specification extends the Media Capture and Streams specification [GETUSERMEDIA] to allow a depth-only stream or combined depth+video stream to be requested from the web platform using APIs familiar to web authors.

Status of This Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

The following substantial changes were made since the W3C Working Draft 29 January 2015:

This document is not complete and is subject to change. Early experimentation is encouraged to allow the Media Capture Task Force to evolve the specification based on technical discussions within the Task Force, implementation experience gained from early implementations, and feedback from other groups and individuals.

This document was published by the Device APIs Working Group and the Web Real-Time Communications Working Group as a Working Draft. This document is intended to become a W3C Recommendation. If you wish to make comments regarding this document, please send them to public-media-capture@w3.org (subscribe, archives). All comments are welcome.

Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document was produced by groups operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures (Device APIs Working Group) and a public list of any patent disclosures (Web Real-Time Communications Working Group) made in connection with the deliverables of each group; these pages also include instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

This document is governed by the 1 September 2015 W3C Process Document.

1. Introduction

Depth cameras are increasingly being integrated into devices such as phones, tablets, and laptops. Depth cameras provide a depth map, which conveys the distance information between points on an object's surface and the camera. With depth information, web content and applications can be enhanced by, for example, the use of hand gestures as an input mechanism, or by creating 3D models of real-world objects that can interact and integrate with the web platform. Concrete applications of this technology include more immersive gaming experiences, more accessible 3D video conferences, and augmented reality, to name a few.

To bring depth capability to the web platform, this specification extends the MediaStream interface [GETUSERMEDIA] to enable it to also contain depth-based MediaStreamTracks. A depth-based MediaStreamTrack, referred to as a depth stream track, represents an abstraction of a stream of frames that can each be converted to objects which contain an array of pixel data, where each pixel represents the distance between the camera and the objects in the scene for that point in the array. A MediaStream object that contains one or more depth stream tracks is referred to as a depth-only stream or depth+video stream.

Depth cameras usually produce 16-bit depth values per pixel. However, neither the canvas drawing surface used to draw and manipulate 2D graphics on the web platform nor the ImageData interface used to represent image data supports 16 bits per pixel. To address this, this specification defines a conversion to an 8-bit grayscale representation of a depth map for consumption by APIs that are limited to 8 bits per pixel.

The Media Capture Stream with Worker specification [MEDIACAPTURE-WORKER], which complements this specification, enables processing of 16-bit depth values per pixel directly in a worker environment, making the <video> and <canvas> indirection and the depth-to-grayscale conversion redundant. This alternative pipeline, which supports greater bit depth and does not incur the performance penalty of the indirection and conversion, enables more advanced use cases.

2. Use cases and requirements

This specification attempts to address the Use Cases and Requirements for accessing a depth stream from a depth camera. See also the Examples section for concrete usage examples.

3. Conformance

As well as sections marked as non-normative, all authoring guidelines, diagrams, examples, and notes in this specification are non-normative. Everything else in this specification is normative.

The key words MUST and MUST NOT are to be interpreted as described in [RFC2119].

This specification defines conformance criteria that apply to a single product: the user agent that implements the interfaces that it contains.

Implementations that use ECMAScript to implement the APIs defined in this specification must implement them in a manner consistent with the ECMAScript Bindings defined in the Web IDL specification [WEBIDL], as this specification uses that specification and terminology.

4. Dependencies

The MediaStreamTrack and MediaStream interfaces this specification extends are defined in [GETUSERMEDIA].

The Constraints, MediaStreamConstraints, MediaTrackSettings, and MediaTrackConstraints dictionaries this specification extends are based upon the Constrainable pattern defined in [GETUSERMEDIA].

The getUserMedia() method and the NavigatorUserMediaSuccessCallback callback are defined in [GETUSERMEDIA].

The CanvasRenderingContext2D and ImageData interfaces, CanvasImageSource typedef, and VideoTrack interface are defined in [HTML].

The ArrayBuffer and Uint16Array types are defined in [ECMASCRIPT].

5. Terminology

The term depth+video stream means a MediaStream object that contains one or more MediaStreamTrack objects of kind "depth" (depth stream track) and one or more MediaStreamTrack objects of kind "video" (video stream track).

The term depth-only stream means a MediaStream object that contains one or more MediaStreamTrack objects of kind "depth" (depth stream track) only.

The term video-only stream means a MediaStream object that contains one or more MediaStreamTrack objects of kind "video" (video stream track) only, and optionally of kind "audio".

The term depth stream track means a MediaStreamTrack object whose kind is "depth". It represents a media stream track whose source is a depth camera.

The term video stream track means a MediaStreamTrack object whose kind is "video". It represents a media stream track whose source is a video camera.

5.1 Depth map

A depth map is an abstract representation of a frame of a depth stream track. A depth map is an image that contains information relating to the distance of the surfaces of scene objects from a viewpoint.

A depth map has an associated focal length which is a double. It represents the focal length of the camera in millimeters.

A depth map has an associated horizontal field of view which is a double. It represents the horizontal angle of view in degrees.

A depth map has an associated vertical field of view which is a double. It represents the vertical angle of view in degrees.

A depth map has an associated unit which is a string. It represents the active depth map unit.

A depth map has an associated near value which is a double. It represents the minimum range in active depth map units.

A depth map has an associated far value which is a double. It represents the maximum range in active depth map units.

6. Extensions

6.1 MediaStreamConstraints dictionary

partial dictionary MediaStreamConstraints {
    (boolean or MediaTrackConstraints) depth = false;
};

If the depth dictionary member has the value true, the MediaStream returned by the getUserMedia() method MUST contain a depth stream track. If the depth dictionary member is set to false, is not provided, or is set to null, the MediaStream MUST NOT contain a depth stream track.

6.2 MediaTrackConstraints dictionary

enum DepthMapUnit {
    "mm",
    "m"
};

The DepthMapUnit enumeration represents the possible depth map units for a depth map. The "mm" value indicates millimeters, the "m" value indicates meters.

partial dictionary MediaTrackConstraints {
    DepthMapUnit unit = "mm";
};

If the unit dictionary member value is one of the possible depth map units, it becomes the active depth map unit for the depth stream track. Otherwise, the active depth map unit is "mm".
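For illustration, a web author might request a depth stream whose active depth map unit is meters with a constraint set like the following. This is a non-normative sketch; getUserMedia() is only available in a browser, so the call itself is shown in a comment.

```javascript
// Constraint set requesting a depth stream track with depth values in meters.
// "unit" is the DepthMapUnit member defined above; "mm" is the default.
const constraints = {
  depth: {
    unit: "m"
  }
};

// In a browser:
// navigator.mediaDevices.getUserMedia(constraints)
//   .then(stream => { /* use the depth stream */ });
```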

6.3 MediaStream interface

partial interface MediaStream {
    sequence<MediaStreamTrack> getDepthTracks();
};

The getDepthTracks() method, when invoked, MUST return a sequence of depth stream tracks in this stream.

The getDepthTracks() method MUST return a sequence that represents a snapshot of all the MediaStreamTrack objects in this stream's track set whose kind is equal to "depth". The conversion from the track set to the sequence is user agent defined and the order does not have to be stable between calls.
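Non-normatively, the behavior of getDepthTracks() can be pictured as filtering the track set by kind. The sketch below uses plain objects as stand-ins for real MediaStream and MediaStreamTrack instances, which exist only in a browser:

```javascript
// Sketch: getDepthTracks() returns the tracks in the stream's track set
// whose kind is "depth".
function getDepthTracks(stream) {
  return stream.getTracks().filter(track => track.kind === "depth");
}

// Illustrative stand-in for a depth+video MediaStream:
const mockStream = {
  getTracks: () => [
    { kind: "video", id: "v1" },
    { kind: "depth", id: "d1" }
  ]
};
```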

6.3.1 Implementation considerations

This section is non-normative.

A video stream track and a depth stream track can be combined into one depth+video stream. The rendering of the two tracks is intended to be synchronized, their resolutions are intended to be the same, and their coordinates are intended to be calibrated. These are not hard requirements, since it might not be possible to synchronize tracks from separate sources.

6.4 MediaStreamTrack interface

The kind attribute MUST, on getting, return the string "depth" if the object represents a depth stream track.

6.5 Media provider object

A media provider object can represent a depth-only stream (and specifically, not a depth+video stream). The user agent MUST support a media element with an assigned media provider object that is a depth-only stream, and in particular, the srcObject IDL attribute that allows the media element to be assigned a media provider object MUST, on setting and getting, behave as specified in [HTML].

6.6 The video element

For a video element whose assigned media provider object is a depth-only stream, the user agent MUST, for each pixel of the media data that is represented by a depth map, convert the depth map value to grayscale prior to when the video element is potentially playing.

For a video element whose assigned media provider object is a depth+video stream, the user agent MUST act as if all the MediaStreamTracks of kind "depth" were removed prior to when the video element is potentially playing.

The algorithm to convert the depth map value to grayscale, given a depth map value d, is as follows:

  1. Let bit depth be the bit depth of the depth map.
  2. Let near be the near value.
  3. Let far be the far value.
  4. If bit depth is greater than 8, then apply the rules to convert using range inverse to d to obtain quantized value d8bit.
  5. Otherwise, apply the rules to convert using range linear to d to obtain quantized value d8bit.
  6. Return d8bit.

The rules to convert using range inverse are as given in the following formulas:

  Range inverse:  dn = (far × (d − near)) / (d × (far − near))
  Quantization:   d8bit = round(dn × 255)

The rules to convert using range linear are as given in the following formulas:

  Range linear:   dn = (d − near) / (far − near)
  Quantization:   d8bit = round(dn × 255)
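The conversion algorithm above can be sketched in JavaScript as follows. This is illustrative only; the normalization formulas are assumed to be the inverses of the reconstruction formulas used by the WebGL fragment shader in the Examples section, with 255 quantization levels.

```javascript
// Sketch of the depth-to-grayscale conversion for a single depth map value d,
// expressed in active depth map units. near and far are the depth map's near
// and far values; bitDepth is the bit depth of the depth map.
function depthToGrayscale(d, near, far, bitDepth) {
  let dn;
  if (bitDepth > 8) {
    // Rules to convert using range inverse.
    dn = (far * (d - near)) / (d * (far - near));
  } else {
    // Rules to convert using range linear.
    dn = (d - near) / (far - near);
  }
  // Quantize the normalized value to 8 bits.
  return Math.round(dn * 255);
}
```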

6.6.1 VideoTrack interface

For each depth stream track in the depth-only stream, the user agent MUST create a corresponding VideoTrack as defined in [HTML].

6.7 MediaTrackSettings dictionary

When the getSettings() method is invoked on a depth stream track, the user agent MUST return the following dictionary that extends the MediaTrackSettings dictionary:

enum RangeFormat {
    "inverse",
    "linear"
};

partial dictionary MediaTrackSettings {
    double        focalLength;
    RangeFormat   format;
    double        horizontalFieldOfView;
    double        verticalFieldOfView;
    DepthMapUnit? unit;
    double        near;
    double        far;
};

The focalLength dictionary member represents the depth map's focal length.

The format dictionary member represents the depth to grayscale conversion method applied to the depth map. If the value is "inverse", the rules to convert using range inverse are applied, and if the value is "linear", the rules to convert using range linear are applied.

The horizontalFieldOfView dictionary member represents the depth map's horizontal field of view.

The verticalFieldOfView dictionary member represents the depth map's vertical field of view.

The unit dictionary member represents the active depth map unit.

The near dictionary member represents the depth map's near value.

The far dictionary member represents the depth map's far value.
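As a non-normative illustration, an application can use the near, far, and format settings above to invert the grayscale conversion and recover approximate depth values in active depth map units. The inversion formulas below mirror the fragment shader in the Examples section.

```javascript
// Sketch: map an 8-bit grayscale sample back to a depth value using the
// near, far, and format members returned by getSettings().
function grayscaleToDepth(d8bit, settings) {
  const dn = d8bit / 255; // normalized value in [0, 1]
  const { near, far, format } = settings;
  if (format === "inverse") {
    return (far * near) / (far - dn * (far - near));
  }
  // format === "linear"
  return dn * (far - near) + near;
}
```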

6.8 WebGLRenderingContext interface

6.8.1 Implementation considerations

This section is non-normative.

A video element whose source is a MediaStream object containing a depth stream track may be uploaded to a WebGL texture of format RGB and type UNSIGNED_BYTE. [WEBGL]

For each pixel of this WebGL texture, the R component represents the lower 8 bits of the 16-bit depth value, the G component represents the upper 8 bits of the 16-bit depth value, and the value of the B component is not defined.
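The packing described above can be sketched as follows. This is illustrative; the actual split is performed by the user agent during texture upload, and the reassembly by the application's fragment shader.

```javascript
// Sketch: split a 16-bit depth value into the lower (R) and upper (G)
// 8-bit components, and reassemble it.
function packDepth(d16) {
  return {
    r: d16 & 0xff,        // lower 8 bits -> R component
    g: (d16 >> 8) & 0xff  // upper 8 bits -> G component
  };
}

function unpackDepth(r, g) {
  return (g << 8) | r;
}
```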

7. Examples

This section is non-normative.

Playback of depth+video stream

Example 1
navigator.mediaDevices.getUserMedia({
  depth: true,
  video: true
}).then(function (stream) {
    // Wire the media stream into a <video> element for playback.
    // The RGB video is rendered.
    var video = document.querySelector('#video');
    video.srcObject = stream;
    video.play();

    // Construct a depth-only stream out of the existing depth stream track.
    var depthOnlyStream = new MediaStream([stream.getDepthTracks()[0]]);

    // Wire the depth-only stream into another <video> element for playback.
    // The depth information is rendered in its grayscale representation.
    var depthVideo = document.querySelector('#depthVideo');
    depthVideo.srcObject = depthOnlyStream;
    depthVideo.play();
  }
);

WebGL Fragment Shader based post-processing

Example 2
// This code sets up a video element from a depth stream, uploads it to a WebGL
// texture, and samples that texture in the fragment shader, reconstructing the
// 16-bit depth values from the red and green channels.
navigator.mediaDevices.getUserMedia({
  depth: true,
}).then(function (stream) {
  // wire the stream into a <video> element for playback
  var depthVideo = document.querySelector('#depthVideo');
  depthVideo.srcObject = stream;
  depthVideo.play();
}).catch(function (reason) {
  // handle gUM error here
});

// ... later, in the rendering loop ...
gl.texImage2D(
   gl.TEXTURE_2D,
   0,
   gl.RGB,
   gl.RGB,
   gl.UNSIGNED_BYTE,
   depthVideo
);

<script id="fragment-shader" type="x-shader/x-fragment">
  varying vec2 v_texCoord;
  // u_tex points to the texture unit containing the depth texture.
  uniform sampler2D u_tex;
  uniform float far;
  uniform float near;
  uniform bool isRangeInverse;
  void main() {
    vec4 floatColor = texture2D(u_tex, v_texCoord);
    // Reconstruct the normalized 16-bit depth value from the lower (R)
    // and upper (G) 8-bit components of the texture.
    float dn = (floatColor.r * 255. + floatColor.g * 255. * 256.) / 65535.;
    float depth = 0.;
    if (isRangeInverse) {
      depth = far * near / (far - dn * (far - near));
    } else {
      // Otherwise, using range linear
      depth = dn * (far - near) + near;
    }
    // ...
  }
</script>

A. Acknowledgements

Thanks to everyone who contributed to the Use Cases and Requirements or sent feedback and comments. Special thanks to Ningxin Hu for experimental implementations, and to the Project Tango team for their experiments.

B. References

B.1 Normative references

[ECMASCRIPT]
ECMAScript Language Specification. URL: https://tc39.github.io/ecma262/
[GETUSERMEDIA]
Daniel Burnett; Adam Bergkvist; Cullen Jennings; Anant Narayanan. Media Capture and Streams. 14 April 2015. W3C Last Call Working Draft. URL: http://www.w3.org/TR/mediacapture-streams/
[HTML]
Ian Hickson. HTML Standard. Living Standard. URL: https://html.spec.whatwg.org/multipage/
[RFC2119]
S. Bradner. Key words for use in RFCs to Indicate Requirement Levels. March 1997. Best Current Practice. URL: https://tools.ietf.org/html/rfc2119
[WEBIDL]
Cameron McCormack; Boris Zbarsky. WebIDL Level 1. 4 August 2015. W3C Working Draft. URL: http://www.w3.org/TR/WebIDL-1/

B.2 Informative references

[MEDIACAPTURE-WORKER]
Chia-hung Tai; Robert O'Callahan; Tzuhao Kuo; Anssi Kostiainen. Media Capture Stream with Worker. W3C Editor's Draft. URL: https://w3c.github.io/mediacapture-worker/
[WEBGL]
Chris Marrin (Apple Inc.). WebGL Specification, Version 1.0. 10 February 2011. URL: https://www.khronos.org/registry/webgl/specs/1.0/