Region Capture

As well as sections marked as non-normative, all authoring guidelines, diagrams, examples, and notes in this specification are non-normative. Everything else in this specification is normative.

The key words MUST and MUST NOT in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

This document uses the definition of the following concepts from [SCREEN-CAPTURE]: display-surface and browser display-surface.

This specification defines self-capture as the capture of a browser display-surface that is the rendered form of the top-level browsing context of the associated Document of the MediaDevices object from which the application initiated the capture session. A self-capture video track is a MediaStreamTrack sourced by self-capture.

Complex applications often comprise multiple documents in distinct iframes, all displayed within the same browsing context. Consider such an application. Assume one of these documents, CAPTURING-DOC, uses getDisplayMedia() or getViewportMedia to capture the entire current browsing context. If this document then wishes to crop the video track to the coordinates of some sub-section CAPTURE-TARGET of a collaborating document CAPTURED-DOC, how can CAPTURING-DOC do so performantly and reliably? Recall especially that changes in layout due to scrolling, zooming or window resizing present additional challenges.

Consider a combo-application consisting of two major parts hosted in different iframes within the same tab - a video-conferencing application and a productivity-suite application. Assume the video-conferencing application uses existing/upcoming APIs such as getDisplayMedia() and/or getViewportMedia and captures the entire tab. Now it needs to crop away everything other than a particular section of the productivity-suite. It needs to crop away its own video-conferencing content, any speaker notes and other private and/or irrelevant content in the productivity-suite, before transmitting the resulting cropped video remotely.

Moreover, consider that it is likely that the two collaborating applications are cross-origin from each other. They can post messages, but all communication is asynchronous, and it's easier and more performant if information is transmitted sparingly between them. That precludes solutions involving posting of entire frames, as well as solutions which are too slow to react to changes in layout (e.g. scrolling, zooming and window-size changes).

It is worthwhile to note that most applications would likely prefer to use getViewportMedia in such scenarios. However, as of this writing, getViewportMedia is still unspecified and unimplemented. It will have non-trivial requirements whose adoption will take some time and effort. As such, many applications will likely use a combination of getDisplayMedia() and Region Capture for some time to come.

The combination of getDisplayMedia() and Region Capture is also useful for applications that allow the users to choose whichever display-surface they wish, but offer distinct functionality depending on whether users choose to self-capture or, conversely, choose to capture a window or monitor. Such applications would only succeed in using Region Capture if the user chose to self-capture; otherwise, the attempt to apply cropping would be a no-op.
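
One way an application might branch on the user's choice is sketched below. It assumes the track reports the displaySurface setting defined in [SCREEN-CAPTURE] and treats the presence of cropTo as the signal that cropping may be attempted. The helper name canAttemptCropping is hypothetical. Note that a captured browser display-surface is not necessarily a self-capture, so a later call to cropTo may still fail.

```javascript
// Hypothetical helper: decides whether it is worth calling cropTo() on a track.
// Assumes `track` is the video track obtained from getDisplayMedia().
function canAttemptCropping(track) {
  // cropTo is only exposed on tracks that capture a browser display-surface.
  if (typeof track.cropTo !== 'function') {
    return false;
  }
  // displaySurface is a setting from [SCREEN-CAPTURE]; it may be absent.
  const settings =
    typeof track.getSettings === 'function' ? track.getSettings() : {};
  return (
    settings.displaySurface === undefined ||
    settings.displaySurface === 'browser'
  );
}
```

An application could use the result to decide whether to show cropping-related controls, and fall back to transmitting the uncropped track otherwise.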

As presently defined, cropTo(cropTarget) returns a rejected Promise if the cropTarget is not associated with an Element within either the current top-level browsing context or any of its descendant browsing contexts. That means that all of the mechanisms introduced by this document are only relevant for self-capture. An immediate corollary is that navigation of the (shared) top-level browsing context breaks off the capture, and therefore also the cropping session.

The region-capture mechanism comprises two parts:

CropTarget production: A mechanism for tagging an Element as a potential target for the cropping mechanism.
Cropping mechanism: A mechanism for instructing the user agent to start cropping a video track to the contours of a previously tagged Element, or to stop such cropping and revert a track to its uncropped state.

We define two crop-states for video tracks - cropped and uncropped. Tracks start out uncropped, and may turn to cropped when cropTo is successfully called on them.

The cropping mechanism presented in this document (cropTo) relies on CropTarget rather than on direct node references. This serves a dual purpose.

It allows cropping by one document to coordinates specified in another document.
Tagging an Element as a potential crop-target allows the user agent to avoid unnecessary work on all other elements, like the calculation of bounding boxes and sending such coordinates cross-process.

CropTarget is an intentionally empty, opaque identifier that exposes nothing. Its sole purpose is to be handed to cropTo as input.

WebIDL

[Exposed=(Window,Worker), Serializable]
interface CropTarget {
  // Intentionally empty; just an opaque identifier.
};

Note

There is no consensus yet on the name for CropTarget. This is under discussion in issue #18.

To create a CropTarget with element as input, run the following steps:

Let cropTarget be a new object of type CropTarget.
Let weakRef be a weak reference to element.

Create cropTarget.[[Element]] initialized to weakRef.

Note

cropTarget keeps a weak reference to the element it represents. In other words, cropTarget will not prevent garbage collection of its element.
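
The weak-reference behavior can be modelled in script. The sketch below only illustrates the steps above and is not how a user agent implements them. CropTargetModel and its resolveElement method are hypothetical names, and a real CropTarget exposes no such accessor.

```javascript
// Illustrative model only: a real CropTarget is opaque and exposes nothing.
class CropTargetModel {
  // Stands in for the [[Element]] internal slot.
  #element;

  constructor(element) {
    // A WeakRef does not keep `element` alive on its own.
    this.#element = new WeakRef(element);
  }

  // Hypothetical accessor used by this model in place of user-agent internals.
  // Returns the element, or undefined once it has been garbage collected.
  resolveElement() {
    return this.#element.deref();
  }
}
```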

CropTarget objects are serializable. The serialization steps, given value, serialized, and a boolean forStorage, are:

If forStorage is true, throw with a new DOMException object whose name attribute has the value "DataCloneError".
Set serialized.[[CropTargetElement]] to value.[[Element]].

The deserialization steps, given serialized and value are:

Set value.[[Element]] to serialized.[[CropTargetElement]].
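
Because CropTarget is serializable, it can cross document boundaries through structured cloning, for example with postMessage(). The following is a minimal sketch. The message shape (a type field set to 'crop-target') and both helper names are assumptions made for illustration.

```javascript
// Hypothetical sender, run in the document that owns the element.
function postCropTarget(targetWindow, cropTarget, targetOrigin) {
  // The CropTarget is structured-cloned as part of the message payload.
  targetWindow.postMessage({ type: 'crop-target', cropTarget }, targetOrigin);
}

// Hypothetical receiver, run in the capturing document.
// Returns the CropTarget, or null if the message should be ignored.
function extractCropTarget(event, expectedOrigin) {
  if (event.origin !== expectedOrigin) {
    return null; // Ignore messages from unexpected origins.
  }
  const data = event.data;
  if (!data || data.type !== 'crop-target' || !data.cropTarget) {
    return null;
  }
  return data.cropTarget;
}
```

In the capturing document, extractCropTarget could be called from a 'message' event listener registered on the window. Since serialization throws a "DataCloneError" when forStorage is true, a CropTarget can be messaged this way but not persisted to storage.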

WebIDL

partial interface MediaDevices {
  Promise<CropTarget> produceCropTarget(Element element);
};

produceCropTarget()

Calling produceCropTarget on an Element of a supported type associates that Element with a CropTarget. This CropTarget may be used as input to cropTo. We define a valid CropTarget as one returned by a previous call to produceCropTarget() in the current top-level browsing context or any of its descendant browsing contexts.

When produceCropTarget is called on a given element, the user agent creates a CropTarget with element as input. The user agent MUST return a Promise p. The user agent MUST resolve p only after it has finished all the necessary internal propagation of state associated with the new CropTarget, at which point the user agent MUST be ready to receive the new CropTarget as a valid parameter to cropTo.

When cloning an Element on which produceCropTarget was previously called, the clone is not associated with any CropTarget. If produceCropTarget is later called on the clone, a new CropTarget will be assigned to it.
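
A sketch of how a page might tag an element while tolerating user agents that lack the method follows. tagElementForCropping is a hypothetical helper. It takes the MediaDevices object as a parameter so that the feature check is explicit.

```javascript
// Hypothetical helper: resolves with a CropTarget, or with null when
// produceCropTarget() is unavailable in this user agent.
async function tagElementForCropping(mediaDevices, element) {
  if (!mediaDevices || typeof mediaDevices.produceCropTarget !== 'function') {
    return null;
  }
  // The promise resolves once the CropTarget is ready to be passed to cropTo().
  return mediaDevices.produceCropTarget(element);
}
```

A page would typically pass navigator.mediaDevices as the first argument. An element copied with cloneNode() would need its own call, since the clone is not associated with the original's CropTarget.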

Note

There is no consensus yet on the following issues:

Whether produceCropTarget() should be exposed on instances of MediaDevices or on instances of Element. This is under discussion in issue #11.
Whether produceCropTarget() should return a CropTarget or a Promise<CropTarget>. This is under discussion in issue #17.

Recall that, as per [SCREEN-CAPTURE], when getDisplayMedia() is called, it returns a Promise<MediaStream>, and that this MediaStream contains exactly one video track, whose type is MediaStreamTrack.

We specify that if the user chooses to capture a browser display-surface, the user agent MUST instantiate the video track as either MediaStreamTrack, or as some sub-class of MediaStreamTrack, and that cropTo MUST be exposed on this track. For simplicity's sake, this document assumes that a subclass called BrowserCaptureMediaStreamTrack is used by the user agent.

The track MUST be initially uncropped.

WebIDL

[Exposed=Window]
interface BrowserCaptureMediaStreamTrack : MediaStreamTrack {
  Promise<undefined> cropTo(CropTarget? cropTarget);
  BrowserCaptureMediaStreamTrack clone();
};

cropTo()

Calls to this method instruct the user agent to start/stop cropping a self-capture video track to the bounding client rectangle of cropTarget.[[Element]]. Since the track is restricted to the visible viewport of the display-surface, the captured area will be the intersection of the visible viewport and the element's bounding client rectangle. Whenever cropTo is invoked, the user agent MUST execute the following algorithm:

If this is not a self-capture video track, the user agent MUST return a new Promise, rejected with a NotSupportedError.
The user agent MUST validate cropTarget according to this track's current crop-state.
- If this track is uncropped, the user agent MUST only accept valid CropTargets.
- If this track is cropped, the user agent MUST accept either valid CropTargets or undefined.
If the user agent does not accept cropTarget, return a Promise rejected with an UnknownError.
Let p be a new Promise.
Run the following steps in parallel:
1. If cropTarget is neither undefined nor a valid CropTarget, reject p with a NotAllowedError and abort these steps.
2. If cropTarget is either undefined or a valid CropTarget, the user agent MUST update this video track's crop-state according to cropTarget:
  - If cropTarget is set to undefined, the user agent MUST stop cropping. This video track reverts to the uncropped state.
  - If cropTarget is a valid CropTarget, the user agent MUST start cropping this video track to the contours of the element referenced by this CropTarget. This means that for each new frame produced on the track, the user agent calculates the bounding box of the pixels belonging to the element, and crops the frame to the coordinates of this bounding box.
3. Call the track's state before this method invocation PRE-STATE, and after this method invocation POST-STATE. The user agent MUST resolve p when it is guaranteed that no more frames cropped (or uncropped) according to PRE-STATE will be delivered to the application, and that any additional frames delivered to the application will therefore be cropped (or uncropped) according to either POST-STATE or a later state.
  
  Note
  
  The timing of the cropTo promise resolution and the timing of the actual cropping of video frames is observable to JavaScript through MediaStreamTrack transforms. It is expected that the first newly cropped video frame will be enqueued on the MediaStreamTrack ReadableStream just after the cropTo promise is resolved.
Return p.
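
The acceptance rules in the validation step depend only on the track's crop-state and on the kind of argument passed. The sketch below restates them as a pure function. The function name and the string values 'cropped' and 'uncropped' are choices made for this illustration, and the caller is assumed to already know whether the CropTarget is valid.

```javascript
// Illustrative restatement of the validation step of cropTo().
// cropState: 'cropped' or 'uncropped'
// cropTarget: an object standing in for a CropTarget, or undefined
// isValidTarget: whether cropTarget is a valid CropTarget (assumed known)
function acceptsCropArgument(cropState, cropTarget, isValidTarget) {
  if (cropTarget === undefined) {
    // undefined (stop cropping) is only accepted on a cropped track.
    return cropState === 'cropped';
  }
  // Any other argument is accepted only if it is a valid CropTarget.
  return isValidTarget === true;
}
```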

clone()

When a BrowserCaptureMediaStreamTrack is cloned, the user agent MUST produce a track which is initially uncropped, regardless of the crop-state of the original track.
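
Since a clone starts out uncropped, an application that wants a cropped clone has to call cropTo on it again. A minimal sketch, with cloneWithCrop as a hypothetical helper name:

```javascript
// Hypothetical helper: clones a track and re-applies the same crop.
async function cloneWithCrop(track, cropTarget) {
  const clone = track.clone(); // The clone is initially uncropped.
  try {
    await clone.cropTo(cropTarget);
  } catch (error) {
    clone.stop(); // Avoid leaking an uncropped clone if cropping fails.
    throw error;
  }
  return clone;
}
```

Stopping the clone on failure is a design choice of this sketch, intended to keep uncropped frames from reaching a consumer that expects cropped ones.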

We define an Element for which a CropTarget was produced (through a call to produceCropTarget) as a potential crop-target.

We define a potential crop-target which is targeted by a successful call to cropTo as the crop-session target.

Consider a frame produced on a cropped video track. The user agent calculates the intersection of (i) the top-level browsing context's viewport and (ii) the bounding box of all pixels belonging to the crop-session target. This intersection is defined as the crop-session target's coordinates for that frame.
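
The intersection can be illustrated with plain rectangle arithmetic. The sketch below assumes axis-aligned rectangles expressed as objects with x, y, width and height in viewport coordinates. The function name is hypothetical, and returning null for an intersection of zero pixels is a convention of this example.

```javascript
// Hypothetical helper: intersects the viewport with an element's bounding box.
// Both arguments are {x, y, width, height}; returns the same shape, or null
// when the two rectangles share no pixels.
function intersectWithViewport(viewport, box) {
  const left = Math.max(viewport.x, box.x);
  const top = Math.max(viewport.y, box.y);
  const right = Math.min(viewport.x + viewport.width, box.x + box.width);
  const bottom = Math.min(viewport.y + viewport.height, box.y + box.height);
  if (right <= left || bottom <= top) {
    return null; // No visible pixels.
  }
  return { x: left, y: top, width: right - left, height: bottom - top };
}
```

For a 1280x720 viewport at the origin and a 400x300 element whose top-left corner is at (1000, 600), the result is a 280x120 rectangle at (1000, 600). If the same element is scrolled so that its top edge is at y = 800, the function returns null.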

Consider a video track VT cropped to a given crop-session target TARGET. We define the behavior of the crop-session of the VT in the face of changes undergone by TARGET.

We define as an empty crop-session target the case where a crop-session target is attached to the DOM, yet consists of zero pixels which are drawn inside of the top-level browsing context's viewport.

Note

Some examples of when this could happen include:

The crop-session target consists of zero pixels.
The browsing context's viewport has been scrolled and the crop-session target now lies outside of the viewport.

The user agent MUST NOT produce new frames on tracks with an empty crop-session target. For such a track, the user agent MUST resume the production of frames if the track becomes uncropped or if its crop-session target stops being empty.

We define as a disconnected crop-session target a crop-session target that has been detached from the DOM.

The difference between an empty crop-session target and a disconnected crop-session target is that a disconnected one may become unreachable, in which case it would not produce any new frames. Nevertheless, the user agent MUST treat a disconnected crop-session target the same way it treats an empty crop-session target. The application may call cropTo on the track with either undefined or a new CropTarget, thereby allowing the production of frames on the track to be resumed.

Code in the capture-target:

const mainContentArea = document.getElementById('mainContentArea');
const cropTarget = await navigator.mediaDevices.produceCropTarget(mainContentArea);
sendCropTarget(cropTarget);

function sendCropTarget(cropTarget) {
  // Can send the crop-target to another document in this tab
  // using postMessage() or using any other means.
  // Possibly there is no other document, and this is just consumed locally.
}

Code in the capturing-document:

async function startCroppedCapture(cropTarget) {
  const stream = await navigator.mediaDevices.getDisplayMedia();
  const [track] = stream.getVideoTracks();
  // cropTo is only exposed when the user chose a browser display-surface.
  if (!track.cropTo) {
    handleError(stream);
    return;
  }
  await track.cropTo(cropTarget);
  transmitVideoRemotely(track);
}

[dom]: DOM Standard. Anne van Kesteren. WHATWG. Living Standard. URL: https://dom.spec.whatwg.org/
[HTML]: HTML Standard. Anne van Kesteren; Domenic Denicola; Ian Hickson; Philip Jägenstedt; Simon Pieters. WHATWG. Living Standard. URL: https://html.spec.whatwg.org/multipage/
[mediacapture-streams]: Media Capture and Streams. Cullen Jennings; Bernard Aboba; Jan-Ivar Bruaroey; Henrik Boström; youenn fablet. W3C. 10 March 2022. W3C Candidate Recommendation. URL: https://www.w3.org/TR/mediacapture-streams/
[RFC2119]: Key words for use in RFCs to Indicate Requirement Levels. S. Bradner. IETF. March 1997. Best Current Practice. URL: https://www.rfc-editor.org/rfc/rfc2119
[RFC8174]: Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words. B. Leiba. IETF. May 2017. Best Current Practice. URL: https://www.rfc-editor.org/rfc/rfc8174
[SCREEN-CAPTURE]: Screen Capture. Martin Thomson; Keith Griffin; Suhas Nandakumar; Henrik Boström; Jan-Ivar Bruaroey; Elad Alon. W3C. 17 March 2022. W3C Working Draft. URL: https://www.w3.org/TR/screen-capture/
[webidl]: Web IDL Standard. Edgar Chen; Timothy Gu. WHATWG. Living Standard. URL: https://webidl.spec.whatwg.org/

Region Capture

Abstract

Status of This Document

1. Conformance

2. Definitions

3. Use Cases

3.1 Generic Use-Case

3.2 Practical Use-Case

4. Scope

5. Solution Overview

6. CropTarget Production

6.1 CropTarget Motivation

6.2 CropTarget Definition

6.3 MediaDevices.produceCropTarget

7. Cropping Mechanism

7.1 BrowserCaptureMediaStreamTrack

7.2 Crop-Session Lifetime

7.2.1 Crop-Session Definitions

7.2.2 Crop-Session Edge Cases

7.2.2.1 Empty Crop-Target

7.2.2.2 Disconnected Crop-Session Target

8. Sample Code

A. References

A.1 Normative references
