WebRTC Next Version Use Cases

W3C Working Draft

This version:
https://www.w3.org/TR/2020/WD-webrtc-nv-use-cases-20201130/
Latest published version:
https://www.w3.org/TR/webrtc-nv-use-cases/
Latest editor's draft:
https://w3c.github.io/webrtc-nv-use-cases/
Previous version:
https://www.w3.org/TR/2018/WD-webrtc-nv-use-cases-20181211/
Editor:
Bernard Aboba (Microsoft Corporation)
Participate:
GitHub w3c/webrtc-nv-use-cases
File a bug
Commit history
Pull requests
Participate:
Mailing list

Abstract

This document describes a set of use cases motivating the development of WebRTC Next Version (WebRTC-NV), as well as the requirements derived from those use cases.

Status of This Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at https://www.w3.org/TR/.

This document was published by the Web Real-Time Communications Working Group as a Working Draft.

GitHub Issues are preferred for discussion of this specification. Alternatively, you can send comments to our mailing list. Please send them to public-webrtc@w3.org (archives).

Publication as a Working Draft does not imply endorsement by the W3C Membership.

This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document was produced by a group operating under the W3C Patent Policy. The group does not expect this document to become a W3C Recommendation. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

This document is governed by the 15 September 2020 W3C Process Document.

1. Scope and Motivation

To motivate the development of WebRTC 1.0, the IETF RTCWEB WG developed [RFC7478]. This document describes use cases motivating the development of "WebRTC Next Version" (WebRTC-NV), and the requirements derived from those use cases. The use cases fall into one of two categories: enhancements to use cases already covered in [RFC7478], and new use cases that are not supported in WebRTC 1.0 [WEBRTC] without extensions.

2. Existing Use Cases

The use cases in this section improve upon use cases described in [RFC7478].

2.1 Multiparty online game with voice communications

[RFC7478] Section 2.3.12 describes a use case involving a multiparty online game with voice communications. In these scenarios, minimizing the time needed to join the game and receive media is important. ICE enhancements, such as the ability to control candidate gathering and pruning, help reduce this time. Allowing a participant to broadcast a configuration to a “room” abstraction (maintained on a server), with other room participants responding back directly, avoids a separate discovery step and further shortens conference establishment. Finally, managing audio quality and latency fairly across multiple connections prevents queue buildup. Supporting this enhancement adds the following requirements:

Requirement ID Description
N01 The user agent can control candidate gathering and pruning, limiting the networks on which candidates are gathered, the types of candidates, etc.
N02 The user agent must be capable of establishing multiple connections to peers without generating a separate configuration ("offer") for each connection prior to establishment.
N03 Congestion control must be able to manage audio quality and latency in a fair manner between multiple connections.

Experience: This use case has been implemented by a gaming service utilizing [ORTC].
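
WebRTC 1.0 already exposes coarse control over candidate gathering; the sketch below (in TypeScript) shows what is possible today, while the finer-grained control contemplated by N01 would require new API surface. The broadcastToRoom() signaling helper is hypothetical.

    // Coarse candidate control available in WebRTC 1.0; N01 asks for more.
    declare function broadcastToRoom(c: RTCIceCandidate | null): void; // hypothetical helper

    const pc = new RTCPeerConnection({
      iceServers: [{ urls: "turn:turn.example.com", username: "u", credential: "s" }],
      iceTransportPolicy: "relay",  // gather relay candidates only
      iceCandidatePoolSize: 4,      // pre-gather candidates to shorten join time
    });

    pc.onicecandidate = ({ candidate }) => {
      // "Pruning" today is limited to filtering the candidates that get signaled.
      if (candidate && candidate.protocol === "tcp") return; // e.g. drop TCP candidates
      broadcastToRoom(candidate);
    };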

References:
  1. ORTC Issue 54
  2. ORTC Issue 603

2.2 Mobile calling service

[RFC7478] Section 2.3.6 describes a simple communications service where the user changes access network during the session. This use case is enhanced by the ability to ring multiple endpoints simultaneously, as well as to re-route media over an alternate path (potentially taking network cost into account) without the need for signaling.

Requirement ID Description
N02 The user agent must be capable of establishing multiple connections to peers without generating a separate configuration ("offer") for each connection prior to establishment.
N04 The ICE agent must be able to maintain multiple candidate pairs and move traffic between them.
N05 The ICE agent must be able to take the network cost into account when considering re-routing.
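
WebRTC 1.0 cannot move traffic between candidate pairs without renegotiation; the closest approximation today is an ICE restart, sketched below. restartIce() and the parameterless setLocalDescription() are standard WebRTC 1.0; sendToPeer() is a hypothetical signaling helper.

    // Today's workaround for re-routing: an ICE restart plus an offer/answer
    // round trip, which is exactly the signaling that N04/N05 aim to avoid.
    declare function sendToPeer(desc: RTCSessionDescription | null): void; // hypothetical

    const pc = new RTCPeerConnection();

    pc.oniceconnectionstatechange = () => {
      if (pc.iceConnectionState === "failed") pc.restartIce(); // fires negotiationneeded
    };

    pc.onnegotiationneeded = async () => {
      await pc.setLocalDescription();  // implicit offer with fresh ICE credentials
      sendToPeer(pc.localDescription);
    };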

References:

  1. Mailing list proposal
  2. Mailing list proposal
  3. ORTC Issue 583

2.3 Video Conferencing with a Central Server

[RFC7478] Section 2.4.3.1 describes a use case involving Multiparty Video Communications with a central conferencing server. In such a use case, clients with disparate capabilities, such as differing bandwidth availability, screen size, and maximum displayable frame rate, may participate in the same conference. Here it is advantageous to support Scalable Video Coding (SVC). Encoding with temporal scalability is supported by several browsers today and is utilized by most centralized conferencing services.

It is expected that spatial scalability (supported by VP9 and AV1) will become more popular with time. In this use case, if the desired video codec is known beforehand and participants are muted by default (as in a very large meeting), it is desirable to allow new participants to start receiving immediately, without negotiation. Supporting this enhancement adds the following requirements:

Requirement ID Description
N06 The user agent must be able to encode and decode video utilizing temporal scalability and (if supported by the chosen codec) spatial scalability.
N07 A user agent can receive audio/video without requiring construction of a corresponding sender object.
N08 It is possible to select the sending and/or receiving codec as well as RTCP parameters and header extensions without negotiation.
N09 The user agent must be able to control robustness (RTX, RED, FEC) applied to individual simulcast and SVC layers.
N24 CSP support for WebRTC.

This use case has been implemented by conferencing services utilizing [ORTC], as well as by proprietary additions to [WEBRTC].
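
The sketch below illustrates these requirements as far as current extensions allow. The scalabilityMode encoding parameter comes from the proposed WebRTC-SVC extension, not WebRTC 1.0, and setCodecPreferences() constrains the codec but still requires one offer/answer round, short of the negotiation-free selection that N08 asks for.

    // Temporal scalability (N06) via the proposed WebRTC-SVC extension.
    const pc = new RTCPeerConnection();
    const stream = await navigator.mediaDevices.getUserMedia({ video: true });

    const transceiver = pc.addTransceiver(stream.getVideoTracks()[0], {
      direction: "sendonly",
      sendEncodings: [{ scalabilityMode: "L1T3" }], // 1 spatial, 3 temporal layers
    });

    // Pin the codec up front; one offer/answer round is still needed (cf. N08).
    const codecs = RTCRtpReceiver.getCapabilities("video")?.codecs ?? [];
    transceiver.setCodecPreferences(codecs.filter((c) => c.mimeType === "video/VP9"));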

3. New Use Cases

Several new use cases relate to scenarios that cannot be supported in [WEBRTC] without extensions.

3.1 File Sharing

Participants in a mesh exchange large files without disruption to audio/video sessions. It is also possible for a participant to send a large file to a user who is not currently online. Supporting this use case adds the following requirements:

Requirement ID Description
N10 It must be possible for the user agent to initiate transfer of a large file with a single API operation.
N11 The application must be able to signal backpressure (flow control) when receiving data. It must also receive a backpressure signal when sending data.
N12 It must be possible for the user agent to transfer data utilizing a congestion control algorithm that does not compete aggressively with audio/video communications.
N13 It must be possible to support data exchange in a web, service, or shared worker.
N24 CSP support for WebRTC.
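
Send-side backpressure (part of N11) is already observable in WebRTC 1.0 via bufferedAmount and the bufferedamountlow event. A minimal sketch of chunked transfer follows; the chunk size and threshold are illustrative, and receive-side backpressure has no WebRTC 1.0 equivalent.

    // Chunked file transfer with send-side flow control on an RTCDataChannel.
    const CHUNK_SIZE = 16 * 1024; // 16 KiB chunks; illustrative, not normative

    async function sendFile(channel: RTCDataChannel, file: File): Promise<void> {
      channel.bufferedAmountLowThreshold = 1024 * 1024; // resume once < 1 MiB queued
      for (let offset = 0; offset < file.size; offset += CHUNK_SIZE) {
        if (channel.bufferedAmount > channel.bufferedAmountLowThreshold) {
          // Wait for the queue to drain rather than overrunning it (N11).
          await new Promise((resolve) =>
            channel.addEventListener("bufferedamountlow", resolve, { once: true }));
        }
        channel.send(await file.slice(offset, offset + CHUNK_SIZE).arrayBuffer());
      }
    }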

References:

  1. Mailing list discussion
  2. Mailing list discussion

3.2 Internet of Things

An IoT sensor maintains a long-term connection and seeks to minimize power consumption. Some of the sensor’s data may need to be sent reliably and in order, while other sensors may provide data that can be sent unreliably and unordered, or in a partially reliable manner. This use case adds the following requirements:

Requirement ID Description
N14 The application must be able to minimize ICE connectivity checks.
N15 The application must be able to control aspects of the data transport (e.g. set the SCTP heartbeat interval or turn it off), RTO values, etc.
N16 It must be possible to send arbitrary data reliably, unreliably, or partially reliably with a specific maximum number of retransmissions or a specific maximum timeout.
N17 It must be possible to send arbitrary data ordered or unordered.
N24 CSP support for WebRTC.
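
Requirements N16 and N17 are already satisfied by WebRTC 1.0 data channels; the sketch below shows the three reliability modes (channel labels are illustrative). Transport-level controls such as the SCTP heartbeat interval (N15) have no API today.

    // The three RTCDataChannel reliability modes available in WebRTC 1.0.
    const pc = new RTCPeerConnection();

    const reliable = pc.createDataChannel("config"); // reliable and ordered (default)

    const unreliable = pc.createDataChannel("samples", {
      ordered: false,    // deliver out of order (N17)
      maxRetransmits: 0, // send once, never retransmit (N16)
    });

    const partiallyReliable = pc.createDataChannel("events", {
      maxPacketLifeTime: 500, // retransmit for at most 500 ms (N16)
    });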

Reference:

Mailing list discussion

3.3 Funny Hats

A communications service that manipulates captured media prior to encoding and after decoding to provide effects including:

  1. Captioning
  2. Transcription
  3. Language translation
  4. Funny hats
  5. Background removal or blurring
  6. In-browser compositing
  7. Voice effects
  8. Stress detection

This use case requires manipulation of raw media from both local and remote sources. Since media processing can be CPU intensive, enabling it to occur off the main thread is important, as is enabling the processing to take advantage of the GPU. This use case adds the following requirements:

Requirement ID Description
N18 The application must be able to obtain raw media from the capture device in desired formats.
N19 The application must be able to insert processed frames into the outgoing media path.
N20 The application must be able to obtain decoded media from the remote party.
N21 It must be possible to efficiently share media between the main thread and worker threads.
N22 It must be possible to do efficient media manipulation in worker threads by utilizing the GPU.
N24 CSP support for WebRTC.
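
One emerging approach to N18 through N21 is the mediacapture-transform API (MediaStreamTrackProcessor and MediaStreamTrackGenerator), implemented in Chromium-based browsers at the time of writing but not part of WebRTC 1.0. A sketch, where "hat-worker.js" is a hypothetical worker script that reads raw frames, renders the effect, and writes processed frames back:

    // Per-frame processing off the main thread, assuming mediacapture-transform.
    const pc = new RTCPeerConnection();
    const stream = await navigator.mediaDevices.getUserMedia({ video: true });
    const [camera] = stream.getVideoTracks();

    const processor = new MediaStreamTrackProcessor({ track: camera }); // raw frames out (N18)
    const generator = new MediaStreamTrackGenerator({ kind: "video" }); // processed frames in (N19)

    // The streams are transferable, so frames never touch the main thread (N21).
    const worker = new Worker("hat-worker.js");
    worker.postMessage(
      { frames: processor.readable, processed: generator.writable },
      [processor.readable, generator.writable],
    );

    pc.addTrack(generator, new MediaStream([generator])); // processed track outbound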

References:

  1. Mailing list discussion
  2. Mailing list discussion
  3. Sharper Image Research

3.4 Machine Learning

In a web game called “NameTheBird.com”, participants use their devices to provide audio and video observations of birds, along with identifications, to the service for training purposes. This allows the service to identify birds from the provided audio and video and to return this information to the users in real time.

The web application has a site-specific classifier, based on federated learning, for contextual object detection, user intent prediction and media manipulation, allowing it to augment the streams it receives and to inject identifying or other supplemental information into the streams sent or received.

The shared classification models are trained on the birds found by the participants and are based on the participants’ feedback. Each client’s updates to the model are up-streamed to a shared model server, which pushes updates of the global model to the clients.

Implementation outline:

  1. Originating (raw) media streams are cloned for inference and training purposes, denoted the “inference stream” and the “training stream”, with the inference stream also being the media stream shared with peer(s). The cloning can occur at any time during a session (see the sketch after this list).
  2. Inference stream: A site-specific classifier acts on the raw inference stream, with the result used to guide a custom encoder on the sending device and to send metadata to the server and peer devices outside the media stream. The encoder adds appropriate augmentation, e.g. a “name this bird” sign hovering over the enlarged bird in the case of video enrichment, or an enhanced bird song in the case of audio.
  3. Training stream: The model in training classifies the raw data and evaluates the classification using user feedback, the feedback loop being site-specific. The evaluation may be “online” or “offline”, where offline means the training is done at a later stage on the recorded encoded media set.
  4. Both the inference and training streams may use payload protection, depending on the trust model for compute resources on the optional intermediate server side of the app.
  5. Both the inference and training streams use a transport object for communicating with peers or servers; in some cases the communication can be a site-specific QUIC-based transport solution, in others RTP-based.
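
A minimal sketch of step 1: track cloning is standard WebRTC 1.0, so the training branch can feed a local model without disturbing the media shared with peers (the model hookup itself is site-specific and omitted).

    // Clone the originating streams into inference and training branches.
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true, video: true });

    const inferenceStream = stream; // shared with peer(s) as usual
    const trainingStream = new MediaStream(
      stream.getTracks().map((track) => track.clone()), // local training copy
    );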

This use case adds the following requirements:

Requirement ID Description
N18 The application must be able to obtain raw media from the capture device in desired formats.
N19 The application must be able to insert processed frames into the outgoing media path.
N20 The application must be able to obtain decoded media from the remote party.
N21 It must be possible to efficiently share media between the main thread and worker threads.
N22 It must be possible to do efficient media manipulation in worker threads by utilizing the GPU.
N24 CSP support for WebRTC.

3.5 Virtual Reality Gaming

A virtual reality gaming service utilizing a centralized conferencing server wants to synchronize data with media, using an existing Selective Forwarding Unit (SFU) to distribute the data. This use case adds the following requirements:

Requirement ID Description
N23 The user agent must be able to send data synchronized with audio and video.
N24 CSP support for WebRTC.
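
WebRTC 1.0 offers no way to bind data channel messages to media timing. A purely hypothetical application-level approximation is sketched below: each message carries a capture timestamp, and the receiver holds it until media playout catches up. The declared channel and helper functions are illustrative only.

    // Hypothetical stand-in for N23; true media-synchronized data would need
    // new API surface or SFU support.
    declare const channel: RTCDataChannel;                        // SFU-distributed channel
    declare function alignToMediaClock(t: number): Promise<void>; // hypothetical
    declare function applyToScene(event: unknown): void;          // hypothetical

    function sendGameEvent(event: object): void {
      channel.send(JSON.stringify({ captureTime: performance.now(), event }));
    }

    channel.onmessage = async ({ data }) => {
      const { captureTime, event } = JSON.parse(data as string);
      await alignToMediaClock(captureTime); // hold the event until playout catches up
      applyToScene(event);
    };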

References:

Mailing list discussion

3.6 Don't Pwn My Video Conferencing

Cloud video conferencing systems have no need to access the cleartext media and text flowing through their servers. Some of these conferencing services want to promote trust by explicitly showing that they do not have access to the contents of their users' calls. They are trusted to connect the right people to the conference and to route the packets, but they are not trusted to access the audio, video or text in the call.

Solutions to this problem fall into two major categories: one where the JavaScript comes from a source trusted to see the media contents, and one where it does not.

3.6.1 Untrusted JavaScript Cloud Conferencing

There are many cases where a system such as WebEx is trusted to connect the members of a conference but has no need to access the contents of the conference. This is true of the majority of conferencing systems on the web today. To highlight the scope of this requirement: conferences in which the servers have no need to access the contents (e.g. where audio is forwarded rather than mixed) account for more minutes of WebRTC audio, by orders of magnitude, than any other use. This is one of the primary use cases for WebRTC audio and accounts for billions of minutes per month of potential use of WebRTC.

In this use case, the JavaScript comes from the operator of the conference bridge. The isolated media features of WebRTC can prevent the JavaScript from accessing the media, and the identity features are used to provide a user interface that allows users to know they are connected to the correct conference. The goal is for the end users to be able to see the contents, while the web service that provides the JS, the media-switching bridges and the Selective Forwarding Units (SFUs) cannot access the contents (audio, video, text). The browser may choose to reveal some metadata, such as the audio power level, to the media server, in order to support functions like speaker switching.

A possible solution to this problem is for the browser to negotiate end-to-end encryption keys that are not revealed to the JavaScript.
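
Emerging encoded-transform extensions let JavaScript insert per-frame encryption between the encoder and the packetizer. The sketch below uses Chromium's createEncodedStreams(), which is not part of WebRTC 1.0; note that, in contrast to the goal stated above, the key here is held by JavaScript, and encryptBody() is a hypothetical SFrame-style encryptor.

    // Per-frame payload encryption via encoded insertable streams (Chromium-only).
    declare function encryptBody(data: ArrayBuffer): Promise<ArrayBuffer>; // hypothetical

    const pc = new RTCPeerConnection({ encodedInsertableStreams: true } as RTCConfiguration);
    const stream = await navigator.mediaDevices.getUserMedia({ video: true });
    const sender = pc.addTrack(stream.getVideoTracks()[0], stream);

    const { readable, writable } = (sender as any).createEncodedStreams();
    readable
      .pipeThrough(new TransformStream({
        async transform(frame: any, controller) {
          frame.data = await encryptBody(frame.data); // encrypt payload, keep headers
          controller.enqueue(frame);
        },
      }))
      .pipeTo(writable);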

Security requirements relating to this use case are discussed in [MLS-ARCH], and include the following:

Requirement ID Description
N25 Only current group members can receive media or text sent to the group.
N26 A group member cannot send media or text that appears to be from another group member.
N27 The conference server must not have access to cleartext media or text or to the identity of group members.
N28 Perfect Forward Secrecy (PFS): access to encrypted traffic, as well as to all current keying material, does not compromise the secrecy of media or text older than the oldest key of a compromised client.
N29 Post-Compromise Security (PCS): protection against past or future device compromise.

4. Requirements Summary

This section summarizes the requirements arising from the use-cases included in this document.

Requirement ID Description
N01 The user agent can control candidate gathering and pruning, limiting the networks on which candidates are gathered, the types of candidates, etc.
N02 The user agent must be capable of establishing multiple connections to peers without generating a separate configuration ("offer") for each connection prior to establishment.
N03 Congestion control must be able to manage audio quality and latency in a fair manner between multiple connections.
N04 The ICE agent must be able to maintain multiple candidate pairs and move traffic between them.
N05 The ICE agent must be able to take the network cost into account when considering re-routing.
N06 The user agent must be able to encode and decode video utilizing temporal scalability and (if supported by the chosen codec) spatial scalability.
N07 A user agent can receive audio/video without requiring construction of a corresponding sender object.
N08 It is possible to select the sending and/or receiving codec as well as RTCP parameters and header extensions without negotiation.
N09 The user agent must be able to control robustness (RTX, RED, FEC) applied to individual simulcast and SVC layers.
N10 It must be possible for the user agent to initiate transfer of a large file with a single API operation.
N11 The application must be able to signal backpressure (flow control) when receiving data. It must also receive a backpressure signal when sending data.
N12 It must be possible for the user agent to transfer data utilizing a congestion control algorithm that does not compete aggressively with audio/video communications.
N13 It must be possible to support data exchange in a web, service, or shared worker.
N14 The application must be able to minimize ICE connectivity checks.
N15 The application must be able to control aspects of the data transport (e.g. set the SCTP heartbeat interval or turn it off), RTO values, etc.
N16 It must be possible to send arbitrary data reliably, unreliably, or partially reliably with a specific maximum number of retransmissions or a specific maximum timeout.
N17 It must be possible to send arbitrary data ordered or unordered.
N18 The application must be able to obtain raw media from the capture device in desired formats.
N19 The application must be able to insert processed frames into the outgoing media path.
N20 The application must be able to obtain decoded media from the remote party.
N21 It must be possible to efficiently share media between the main thread and worker threads.
N22 It must be possible to do efficient media manipulation in worker threads by utilizing the GPU.
N23 The user agent must be able to send data synchronized with audio and video.
N24 CSP support for WebRTC.
N25 Only current group members can receive media or text sent to the group.
N26 A group member cannot send media or text that appears to be from another group member.
N27 The conference server must not have access to cleartext media or text or to the identity of group members.
N28 Perfect Forward Secrecy (PFS): access to encrypted traffic, as well as to all current keying material, does not compromise the secrecy of media or text older than the oldest key of a compromised client.
N29 Post-Compromise Security (PCS): protection against past or future device compromise.

A. References

A.1 Informative references

[MLS-ARCH]
The Messaging Layer Security (MLS) Architecture. E. Omara; B. Beurdouche; E. Rescorla; S. Inguva; A. Kwon; A. Duric. IETF. 11 March 2019. Internet Draft (work in progress). URL: https://tools.ietf.org/html/draft-ietf-mls-architecture
[ORTC]
Object RTC (ORTC) API for WebRTC. Robin Raymond. W3C. 03 October 2018 (work in progress). URL: https://w3c.github.io/ortc/
[RFC7478]
Web Real-Time Communication Use Cases and Requirements. C. Holmberg; S. Hakansson; G. Eriksson. IETF. March 2015. Informational. URL: https://tools.ietf.org/html/rfc7478
[WEBRTC]
WebRTC 1.0: Real-Time Communication Between Browsers. Cullen Jennings; Henrik Boström; Jan-Ivar Bruaroey; Adam Bergkvist; Daniel Burnett; Anant Narayanan; Bernard Aboba; Taylor Brandstetter. W3C. 5 November 2020. W3C Candidate Recommendation. URL: https://www.w3.org/TR/webrtc/