MediaStreamTrack Content Hints

Abstract

This specification extends MediaStreamTrack to provide a media-content hint attribute. This optional hint permits MediaStreamTrack consumers such as PeerConnection (defined in [webrtc]) or MediaRecorder (defined in [mediastream-recording]) to encode or process track media with methods more appropriate to the type of content that is being consumed.

Adding a media-content hint provides a way for a web application to help track consumers make more informed decisions about which encoder parameters and processing algorithms to use on the consumed content.

Status of This Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at https://www.w3.org/TR/.

This document was published by the Web Real-Time Communications Working Group as a First Public Working Draft. This document is intended to become a W3C Recommendation. Comments regarding this document are welcome. Please send them to public-media-capture@w3.org (subscribe, archives) with [mst-content-hint] at the start of your email's subject.

Publication as a First Public Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document was produced by a group operating under the W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

This document is governed by the 1 February 2018 W3C Process Document.

1. Introduction

Algorithms used for processing speech and music differ greatly. Echo-cancellation algorithms developed for speech-type content might not work well on music, and noise-suppression algorithms might remove drum snares or other "noisy" content. While such processing makes speech more intelligible, it is less appropriate for music signals.

For video, webcam content often requires denoising and often remains intelligible even when downscaled or encoded with high quantization levels. Screencast content of presentations or webpages with a lot of text, by contrast, is completely unintelligible if the quantization levels are too high or if the content is downscaled or otherwise blurry.

Without automatic detection of media content, a MediaStreamTrack consumer can only make an educated guess. One such assumption is that screencast content, such as that obtained through chrome.desktopCapture, contains text and must therefore use low quantization levels and drop frames extensively to meet bitrate requirements. Another is that regular USB video devices provide webcam video, for which higher quantization levels and downscaling are acceptable.

While usually appropriate, this educated guess leads to sub-optimal settings when it is incorrect. This manifests as heavy frame dropping when high-motion content, such as a movie or a streamed video game, is screencast and treated as text. Treating highly detailed content as regular webcam video, on the other hand, yields content that is too blurry when it is quantized or downscaled beyond readability to meet bitrate requirements. This mismatch may also occur when HDMI video-capture cards are seen as USB webcams but actually screencast webpage text.

Figure 1: Lost text intelligibility when downscaling. While downscaling can be done to preserve motion in low-bitrate scenarios, this example illustrates the loss of text intelligibility when it is incorrectly applied to detailed content. The example shows 100%, 50% and 25% cubic downscaling, corresponding to downscaling from HD to VGA and QVGA resolutions respectively.

In some cases the web application can make a more-educated guess or take user input to inform consumers of what kind of content is being encoded. A web application that streams video-game content would be able to preserve motion from desktop capture at the cost of individual frame detail. A music-studio application would be able to prevent noise suppression from removing snares from a music track.

These settings are not intended to replace encoder-level settings completely but rather complement them with a simpler hint that does not require broad knowledge of video encoders, audio-processing steps or more extensive tuning.

2. Extension to MediaStreamTrack

partial interface MediaStreamTrack {
  attribute DOMString contentHint;
};

This specification extends MediaStreamTrack and makes use of its kind attribute as defined in [GETUSERMEDIA].

Each MediaStreamTrack has an associated application-set content hint, which is initially "", signifying unset. This application-set content hint corresponds to the contentHint attribute of MediaStreamTrack which may be used by the web application to provide a hint of what type of content is contained within the track, to guide how it should be treated by MediaStreamTrack consumers.
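As a rough illustration of how a web application might use the attribute, the sketch below sets a hint only when the track exposes contentHint, then reads the value back to see whether it was kept. The HintableTrack type and the trySetContentHint helper are names invented for this example and are not defined by this specification.

```typescript
// Minimal structural view of a track (an assumption for this sketch).
// A MediaStreamTrack in a browser that implements this extension would fit it.
interface HintableTrack {
  kind: string;
  contentHint?: string;
}

// Hypothetical helper: returns true if the hint was applied to the track.
function trySetContentHint(track: HintableTrack, hint: string): boolean {
  if (!("contentHint" in track)) {
    return false; // The attribute is not available on this track.
  }
  track.contentHint = hint;
  // Reading the attribute back shows whether the value was kept.
  return track.contentHint === hint;
}
```

In a browser, such a helper could be called with a track obtained from a captured stream, for example trySetContentHint(stream.getVideoTracks()[0], "detail"), where stream is assumed to come from an earlier capture call.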

Valid values for the application-set content hint depend on the kind of the MediaStreamTrack. On setting contentHint to value, run the following steps:

1. If this MediaStreamTrack's kind attribute is "audio", and value is not one of "", "speech", or "music", abort these steps.
2. If this MediaStreamTrack's kind attribute is "video", and value is not one of "", "motion", "detail" or "text", abort these steps.
3. Set this MediaStreamTrack's application-set content hint to value.
4. The implementation should adapt its decision on how to handle the content of this MediaStreamTrack according to the new value of its application-set content hint. This adaptation should happen as quickly as reasonable, e.g. within the next couple of captured video frames or audio buffers.

On getting contentHint,

Return this MediaStreamTrack's application-set content hint.
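The setting and getting steps above can be sketched with a small stand-alone model. This is a sketch rather than an implementation: TrackModel, createTrack, setContentHint and getContentHint are hypothetical names used for illustration only, and the adaptation step is reduced to a comment.

```typescript
// Illustrative model only: TrackModel stands in for MediaStreamTrack.
type TrackKind = "audio" | "video";

interface TrackModel {
  readonly kind: TrackKind;
  // The application-set content hint; "" means unset.
  applicationSetContentHint: string;
}

const AUDIO_HINTS = ["", "speech", "music"];
const VIDEO_HINTS = ["", "motion", "detail", "text"];

function createTrack(kind: TrackKind): TrackModel {
  return { kind, applicationSetContentHint: "" };
}

// Setter steps: a value that is not valid for the track's kind is ignored.
function setContentHint(track: TrackModel, value: string): void {
  if (track.kind === "audio" && AUDIO_HINTS.indexOf(value) === -1) return;
  if (track.kind === "video" && VIDEO_HINTS.indexOf(value) === -1) return;
  track.applicationSetContentHint = value;
  // A real implementation would adapt its handling of the track here.
}

// Getter steps: return the stored hint unchanged.
function getContentHint(track: TrackModel): string {
  return track.applicationSetContentHint;
}

const audioTrack = createTrack("audio");
setContentHint(audioTrack, "music");  // accepted
setContentHint(audioTrack, "detail"); // not an audio hint, so ignored
console.log(getContentHint(audioTrack)); // prints "music"
```

Because an invalid value aborts the steps without changing anything, the previously stored hint remains in effect.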

Note that the initial value of the application-set content hint is "", meaning that no hint has been provided. It does not default to the implementation's best guess of the type of content contained.

2.1 Audio Content Hints

Audio content hints are only applicable when the MediaStreamTrack contains an audio track.

Audio content hints
`""`	No hint has been provided; the implementation should make its best-informed guess on how to handle the contained audio data. This may be inferred from how the track was opened or by doing content analysis.
`speech`	The track should be treated as if it contains speech data. When consuming this signal, it may be appropriate to apply noise suppression or to boost the intelligibility of the incoming signal.
`music`	The track should be treated as if it contains music data. Generally, this might imply tuning or turning off audio-processing components that are used to process speech data, to prevent the audio from being distorted.
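One way a track consumer might act on the audio hints is sketched below. The AudioProcessingPlan shape, its field names and the planAudioProcessing function are assumptions made for this example. This specification does not prescribe any particular processing pipeline, and an implementation remains free to choose differently.

```typescript
type AudioHint = "" | "speech" | "music";
type Toggle = "on" | "off" | "auto";

// Hypothetical description of the processing a consumer decides to apply.
interface AudioProcessingPlan {
  noiseSuppression: Toggle;
  speechEnhancement: Toggle;
}

// One possible policy, assumed for illustration.
function planAudioProcessing(hint: AudioHint): AudioProcessingPlan {
  switch (hint) {
    case "speech":
      // Favor intelligibility of the spoken signal.
      return { noiseSuppression: "on", speechEnhancement: "on" };
    case "music":
      // Avoid speech-oriented processing that could distort music.
      return { noiseSuppression: "off", speechEnhancement: "off" };
    default:
      // No hint: fall back to the consumer's own best-informed guess.
      return { noiseSuppression: "auto", speechEnhancement: "auto" };
  }
}
```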

2.2 Video Content Hints

Video content hints are only applicable when the MediaStreamTrack contains a video track.

Video content hints
`""`	No hint has been provided; the implementation should make its best-informed guess on how the contained video content should be treated. This can, for example, be inferred from how the track was opened or by doing content analysis.
`motion`	The track should be treated as if it contains video where motion is important. This is normally webcam video, movies or video games. Quantization artefacts and downscaling are acceptable in order to preserve motion as well as possible while still meeting target bitrates. At low bitrates, when compromises have to be made, more effort is spent on preserving frame rate than on edge quality and details.
`detail`	The track should be treated as if video details are extra important. This is generally applicable to presentations or web pages with text content, paintings or line art. This setting would normally optimize for detail in the resulting individual frames rather than smooth playback. Artefacts from quantization or downscaling that make small text or line art unintelligible should be avoided.
`text`	The track should be treated as if video details are extra important, and that significant sharp edges and areas of consistent color can occur frequently. This is generally applicable to presentations or web pages with text content. This setting would normally optimize for detail in the resulting individual frames rather than smooth playback, and may take advantage of encoder tools that optimize for text rendering. Artefacts from quantization or downscaling that make small text or line art unintelligible should be avoided.
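The trade-offs described for the video hints can be summarized in a small sketch of a possible consumer-side policy. The VideoEncodingPlan shape and the planVideoEncoding function are hypothetical names chosen for this example. How an encoder actually balances frame rate against frame detail is left to the implementation.

```typescript
type VideoHint = "" | "motion" | "detail" | "text";

// Hypothetical summary of what a consumer tries to preserve at low bitrates.
interface VideoEncodingPlan {
  preserve: "frame-rate" | "frame-detail" | "auto";
  preferTextCodingTools: boolean;
}

// One possible policy, assumed for illustration.
function planVideoEncoding(hint: VideoHint): VideoEncodingPlan {
  switch (hint) {
    case "motion":
      // Accept quantization artefacts and downscaling to keep motion smooth.
      return { preserve: "frame-rate", preferTextCodingTools: false };
    case "detail":
      // Keep individual frames sharp, even at the cost of smooth playback.
      return { preserve: "frame-detail", preferTextCodingTools: false };
    case "text":
      // As "detail", and encoder tools aimed at text may additionally be used.
      return { preserve: "frame-detail", preferTextCodingTools: true };
    default:
      // No hint: fall back to the consumer's own best-informed guess.
      return { preserve: "auto", preferTextCodingTools: false };
  }
}
```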

W3C First Public Working Draft 03 July 2018
