W3C

– DRAFT –
WebRTC / Media / Audio WG joint meeting

26 October 2021

Attendees

Present
AlbrechtSchwarz, BernardAboba, CarineBournez, ChenCheng, ChrisChunningham, ChrisLilley, CullenJennings, dom, EerorHakkinen, EladAlon, Eric, FlorentCastelli, FrancoisDaoust, GregFreedman, HaraldAlvestrand, HiroshiKajihata, hober, HongchanChoi, JamesCraig, JanIvarBruaroey, jcraig, JungkeeSong, Kajihata, lideping, lilin, Mark_Foltz, MarkWatson, MattParaid, MichelBuffa, PaulAdenot, PhilippeMilot, PiersOhanlon, RandellJesup, RobinRaymond, ShinNagata, SongXu, SteveLee, TakioYamaoka, TimPanton, Tove, TuukaToivonen, Varun, XiaohanWang, YanChangQing, Youenn, youenn_fablet, ZhangLei
Regrets
-
Chair
-
Scribe
cpn, tidoust

Meeting minutes

Slideset: https://lists.w3.org/Archives/Public/www-archive/2021Oct/att-0012/MEDIA-WEBRTC-10-26-2021.pdf

Introduction

Bernard: Welcome to the joint meeting
… Slides are at https://docs.google.com/presentation/d/1XKNdYR0JWTtO1EIGu_sQy5rqfXMorrCKpu0UTX6kizQ
… We'll talk about next generation APIs and areas to work on
… We'll have time at the end for wrap-up and next steps
… We're not trying to solve problems in this meeting, just identify problems
… There may be issues we don't realise we have that have not been filed yet

Next Generation Media APIs

Bernard: Streaming and real-time communications evolved in silos
… Real time streaming could only support a modest audience ~ 100s
… The pandemic has been pivotal: transformation and user-driven innovation
… What have you observed?
… One application that summarises some of the trends is "together mode" that superimposes participants in a virtual experience
… With large gatherings restricted, the goal was to include fans virtually
… The video was processed for background removal and composited
… Developers don't want to choose between streaming and real-time communication silos
… Next-gen media APIs provide access through a single set of tools
… Low-level building blocks such as capture, encode/decode, transport, and rendering
… [Shows some of the APIs]
… Capture APIs are in development in the WebRTC WG, encode/decode in the Media WG
… Web Transport and WHATWG Streams, WASM
… The APIs support multi-threading. Support for transferable media stream tracks was added
… Things not on the list, but useful additions, include JS libraries for containerisation / decontainerisation
… Some APIs are modelled on streams, others can be wrapped in a stream-like API
… Allows use of special effects in the pipeline. MediaCaptureTransform API to convert a track to a stream of A/V frames
… The transport packetizes and takes care of FEC
… Encoded chunks are received, decoded, and presented with WebGL or WebGPU
… [Pipeline model example]
… Typically this runs in a Worker. The pipeThrough functions are implemented as TransformStreams
… Can use any transport, WebTransport or WebSocket, can choose reliable and ordered, reliable and unordered, unreliable and unordered
… Does this story hang together? What are we missing?
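For illustration, a minimal sketch (not shown in the meeting) of the pipeline described above, assuming a Worker context, a camera MediaStreamTrack already transferred to the Worker as cameraTrack, and a hypothetical WebTransport endpoint; a real application would also packetize chunks that exceed the datagram size and handle FEC itself:

  // worker.ts: capture -> encode -> transport, per the pipeline above.
  // cameraTrack is assumed to have been transferred to this Worker.
  const wt = new WebTransport('https://example.com:4433/ingest'); // hypothetical endpoint
  await wt.ready;
  const datagramWriter = wt.datagrams.writable.getWriter();

  const encoder = new VideoEncoder({
    output: (chunk) => {
      // App-level packetization/FEC would go here; chunks larger than the
      // datagram MTU cannot be sent as a single datagram.
      const payload = new Uint8Array(chunk.byteLength);
      chunk.copyTo(payload);
      datagramWriter.write(payload); // unreliable, unordered
    },
    error: (e) => console.error(e),
  });
  encoder.configure({ codec: 'vp8', width: 640, height: 480 });

  const processor = new MediaStreamTrackProcessor({ track: cameraTrack });
  await processor.readable.pipeTo(new WritableStream({
    write(frame) {
      encoder.encode(frame);
      frame.close(); // frames must be closed explicitly to release memory
    },
  }));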
… One thing I noticed recently, we spent time talking about workers
… There is no meta-spec describing the overall requirements for Worker support across these APIs
… There's no guarantee that the browser will support all the APIs needed by an app
… No browser supports both MSE v2 and RTCDataChannel in Workers
… So you can't fully make use of Workers
… Could say it's an implementation issue, but we're spending time figuring out the fundamental tools needed to support media apps
… We may not have discovered the full set of tools, and there's no single spec that can do that
… Another issue is testing. WebRTC is difficult to test using WPT, as the tests use multiple endpoints
… In WebRTC and WebTransport WGs, we're extending WPT to add an echo test server
… You can test some aspects of protocol performance with this
… WebCodecs may also benefit from an echo testing framework
… Another issue is performance in WHATWG streams
… We don't currently have performance criteria, data on performance, or processes to bring discussion to closure
… For example, a process to have joint discussion with WHATWG on streams
… Some consider streams to be suitable until we prove otherwise, some think the opposite
… We don't have a single place to discuss this all together, between W3C WGs and WHATWG
… In theory, we have client/server and P2P transport, supporting multiple modes
… But why does it feel like there are gaps when we build an app?
… Apps using RTCDataChannel replace congestion control to control latency
… What if media is sent by the browser, e.g, video ingest in the browser or a video conference?
… How do the transports usable by WebCodecs compare with WebRTC?
… WebRTC was actually modified by YouTube Live to get better quality for video ingestion
… Because it optimises latency over video quality, so the encoder bitrate target is adjusted
… WebRTC probes and can restore dropped layers
… After a loss, TCP additively increases, as does SCTP, but this produces a delay to recover quality
… RTMP/ RTCDataChannel work well for video upload, but not so well for video conferencing
… So it would be beneficial to have access to a low-latency transport such as RTP
… On congestion control, there's an issue with the interaction, average bitrate target overshoot
… You'll lose packets if you can't build a queue, need to re-send a keyframe
… You could lower the avg bitrate target, reduce resolution of keyframe, but you'll still get a bandwidth spike
… Some bigger things (bigger because they involve an ecosystem)...
… Selective Forwarding Units are the basis of real-time streaming. They make it difficult to include end-to-end security
… Overall, we need a next generation of SFUs to go with the next-gen APIs
… APIs aren't enough, we need protocol standards for how audio and video are carried
… On the streaming side, we've struggled to replace RTMP which doesn't support next-gen codecs
… Many contenders: SRT, WHIP, RUSH
… A paper "The QUIC fix for optimal video streaming" looked at the value of differential reliability
… that's eliminating HoL blocking, keyframes vs delta frames, and discardable frames with lower reliability to avoid holding up keyframes
… Another area that could be a big missing piece is the combination of WebCodecs and content protection
… Content protection is associated with containerisation. WebCodecs doesn't use containerised media

MarkFoltz: In the pipeline, I didn't see a step for executing ML models. Have you looked at the Web ML WG, compatible with WHATWG streams?

Bernard: I put that into the effects block. What is the performance like?
… Arguments on main vs worker thread. Great question is whether we get the performance we need in this model
… Some use cases have a lot of users at once. Performance issues often aren't surfaced in the WGs handling media
… Want to make this stuff possible

Piers: On low latency, we need APIs for measuring throughput accurately to allow ABR algorithms to work properly
… The stream APIs don't provide timestamping, so it needs to be done at the user level, so there's a lack of facilities for decent performance measurements

Bernard: In some of the ingestion proposals, implementers are integrating directly with QUIC, so they get that from the QUIC stack; they're avoiding the web APIs and protocols for that reason

Piers: Timestamping of data delivery currently must be done by getting the time of day, but could be done at a lower level
… Especially with chunked transfer delivery, where there are potentially gaps in delivery

Bernard: It's a good question whether you can build that congestion control today; the answer is probably no

WebRTC and WebCodecs

Harald: I'm trying to get a feel for where we are and what's moving
… The send-side and receive-side are approximately equivalent
… When you want to send data, in WebRTC you create a MediaStreamTrack connected to a camera or microphone
… There are multiple feedback paths in the RTCRtpSender
… There's feedback from the transmitter that modifies continually the sending bitrate of the codec
… The whole thing is designed to keep the video rolling, freezing is not acceptable
… We did insert the ability to have a MediaStreamTrack processed in JS, using the breakout box
… You connect a track to a processor and get out a stream of video frames
… That's perfect for feeding to a WebCodecs encoder, then you packetize and send
… But that's not WebRTC. The breakout box is a proposal in the WG, we haven't come to an agreement to accept it
… Should the stream of video frames be visible on the main thread or not?
… More people think that it shouldn't than that it should, which is the opposite of the WebCodecs conclusion
… Insertable Streams were originally designed for inserting a stream into an RTCRtpSender
… You get out a stream of encoded frames, not the same as you get from WebCodecs
… There are a number of creative things people want to do, e.g., encode once, send to many
… Those things don't work. It looks like it works but it doesn't
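For illustration, a hedged sketch of the encoded insertable-streams path referred to above, in its Chromium shape (createEncodedStreams, with the encodedInsertableStreams option); the standardized shape is RTCRtpScriptTransform running in a Worker, and cameraTrack and stream are assumed to be defined elsewhere:

  // Chromium requires the peer connection to opt in to encoded streams.
  const pc = new RTCPeerConnection({ encodedInsertableStreams: true });
  const sender = pc.addTrack(cameraTrack, stream);
  const { readable, writable } = sender.createEncodedStreams();
  readable
    .pipeThrough(new TransformStream({
      transform(encodedFrame, controller) {
        // encodedFrame is an RTCEncodedVideoFrame, not a WebCodecs
        // EncodedVideoChunk; the two pipelines do not interoperate directly.
        controller.enqueue(encodedFrame);
      },
    }))
    .pipeTo(writable);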
… Why does integration with realtime and stored video streams differ? For realtime, need to keep the media flowing
… adjust according to bandwidth. SVC allows you to drop part of the stream, create a less power-hungry stream without having to ask the sender to change it
… The opposite is to deliver stored media, e.g., YouTube. If you get congestion, you might switch to a different source encoding
… Encoding speed doesn't matter too much. When you have bandwidth, you can catch up
… People tolerate stalls in the video
… SVC not so useful in this context
… Some desirable patterns we want to be able to do. Connect the incoming stream to the outgoing encoded stream
… Sending to multiple destinations, or some to WebTransport or RTP transport as needed
… We want to be able to use all the tools with all the other tools, but we can't
… We can only use according to how they were initially designed
… Design choices may not be optimal for all circumstances. I get asked why not let the app control the congestion control
… People want to experiment
… SDP is an old and clunky language for describing media streams across the network
… But it has expressive power. We'd like to make sure that (a) people who don't need to deal with SDP, they don't have to
… and (b) if something is possible in SDP, it's still possible with new interfaces
… Some controls have to be reacted to immediately, where hopping to JS, asking the user what to do, can lead to suboptimal responses
… But in other circumstances, asking the user is what we want to do
… We haven't started the investigation to figure out what we need to do
… In summary, WebCodecs and WebRTC are powerful tools. Some things fit together, and some don't
… So we need to learn more

Cullen: I agree things don't fit together
… Back-propagation of parameters, changing bandwidth etc, we talked about in WebRTC
… Do video coding in the camera before it gets to the browser. Is that something we look at fixing?

Harald: Yes, make components, not systems

Cullen: Some of these things we imagined doing, but didn't in order to ship quickly. So now we need to go back

Harald: I looked at the ORTC effort. The linkage between codec and transmission hadn't been taken apart in that effort
… YAGNI (you aren't gonna need it)
… Use cases drive designs

Cullen: I think your use cases of sending the video you receive is interesting, as well as back-propagating bandwidth into the pipeline

Bernard: WHATWG streams has an idea of backpressure, but that may be different to what we mean by backpressure

Cullen: I mean things that a scalable codec would want: bandwidth, resolutions
… A single scene represented by multiple video flows
… Similar with audio

TimPanton: We shouldn't neglect peer-to-peer applications
… It would be a mistake to focus too much on server-centric APIs
… I want to keep the symmetry of WebRTC and do P2P stuff without processing in the middle. That'll become more necessary

MarkWatson: To add to the idea of backchannel information, I'd like to add high resolution timing information when packets are received
… You need to carefully manage and understand what's happening

Harald: There's the video frame / audio chunk timestamp, which is the presentation intent; the other is timestamps along the way
… The former need to be set by the originator but never change
… Timestamps on the way are an important consideration too

Audio Challenges

Audio Challenges slides

padenot: Going to talk about some of the audio challenges. First, explain the main problems that audio has and not video.
… Then identify problems and proposals.
… Main problem is that audio is hard real-time. It should never fail.
… Video, you have an event, 60 times per second. In audio, event every ms or every 2ms.
… Considering computers are likely under some load, audio data, which again should never glitch, should only touch real-time threads.
… That's for PCM (decoded audio).
… Any other setup will lead to resilience issues, e.g. if some code is delayed due to GC or the like.
… There is a proposal for a push model for the audio.
… All the audio in computers works as a pull model.
… It follows that we need to insert a buffer between the push and pull.
… This is more or less what we have now. What is important is that we should be able to have a media stream, connect it to the real-time thread
… and the real-time thread needs to know that there are missing bytes at the input.
… In the audio worklet today, this is missing.
… A strawman proposal: a "process" method with inputs, outputs, and params as parameters.
… If we say that we can make the size of the buffers known, then we can tell when we run into buffer underruns.
… Useful in different scenarios, not only in WebRTC.
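For illustration, a sketch of that strawman; the underrun signal is hypothetical (it is not in the current Web Audio API) and only stands for the missing-input indication being asked for:

  // Runs in the AudioWorkletGlobalScope (real-time thread).
  class UnderrunAwareProcessor extends AudioWorkletProcessor {
    process(inputs, outputs, parameters /*, underrun: hypothetical signal */) {
      const input = inputs[0];
      const output = outputs[0];
      // Today, a source that fails to deliver in time shows up here as
      // zero-filled buffers, indistinguishable from genuine silence.
      for (let channel = 0; channel < output.length; channel++) {
        if (input[channel]) output[channel].set(input[channel]);
      }
      return true;
    }
  }
  registerProcessor('underrun-aware', UnderrunAwareProcessor);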

[ Slide 48 ]

padenot: Thought experiment with Chris Cunningham recently. Works well with WebCodecs. Low-latency. SharedArrayBuffer being used.
… A system like this is in production today in Gecko.
… Extremely high resiliency for audio. Perceptually, we think it is better than the opposite.
… The question is: what pulls? As the API stands today, you have to pull from a non real-time thread.
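For illustration, a minimal sketch of the pull model described above: decoded PCM is pushed into a SharedArrayBuffer-backed ring buffer from a non real-time thread, and the real-time thread pulls from it; wrap-around and capacity handling are simplified and all names are illustrative:

  const CAPACITY = 48000; // one second of mono float32 audio at 48 kHz
  const samples = new Float32Array(new SharedArrayBuffer(CAPACITY * 4));
  const indices = new Int32Array(new SharedArrayBuffer(8)); // [read, write]

  // Producer, e.g. a WebCodecs AudioDecoder output callback (non real-time thread).
  function push(pcm) {
    const w = Atomics.load(indices, 1);
    for (let i = 0; i < pcm.length; i++) {
      samples[(w + i) % CAPACITY] = pcm[i];
    }
    Atomics.store(indices, 1, (w + pcm.length) % CAPACITY);
  }

  // Consumer, called from AudioWorkletProcessor.process (real-time thread).
  function pull(out) {
    const r = Atomics.load(indices, 0);
    const available = (Atomics.load(indices, 1) - r + CAPACITY) % CAPACITY;
    if (available < out.length) return false; // underrun: the caller can react
    for (let i = 0; i < out.length; i++) {
      out[i] = samples[(r + i) % CAPACITY];
    }
    Atomics.store(indices, 0, (r + out.length) % CAPACITY);
    return true;
  }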

Harald: When we are pre-jitter buffer, if we want the processing, we need something that can access the jitter buffer between the input and the audio buffer.

padenot: I haven't found any advantage of not doing it only on the real-time thread.

Harald: The only problem is security. Some kind of guard against occupying the CPU.

padenot: Something we had to implement for AudioWorklet.
… We do it explicitly in Gecko for security reason.

Youenn: You looked at MediaStreamTrackProcessor and what is available in Web Audio. You found a potential gap, and you think this new API closes the gap.
… Is that correct?

padenot: Yes, I found one gap. Can't distinguish between buffer silence and buffer underrun.

Youenn: There were mentions about timestamps in the GitHub discussion. Any idea?

padenot: Clock domain traversal is the fundamental problem.
… Certain number of frames per second in the clock domain of the sending device.
… When you play out one second of someone that has recorded 1s of audio data on their computer, you need to reconcile for the drift.
… The timestamps allow you to measure how much faster or slower your computer is running.
… Different ways to reconcile, e.g. looking at and adjusting the jitter buffer.
… When the lengths of the data packets don't match, then you can identify precisely where the problem is.
… This logic is needed without all that has been discussed because you can already connect devices with different clock domains today.
… Re. A/V sync, you can match a frame with some time in the audio stream. Now, it's important to understand output latency. It can be significant when you're using e.g. Bluetooth devices.
… For display, it's important that you would delay your video rendering to account for the audio latency.
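For illustration, a small hedged sketch of compensating video presentation for audio output latency as described above, where scheduleVideoFrame is a hypothetical app-defined helper:

  const audioCtx = new AudioContext();
  function presentInSync(videoFrame, mediaTimeSec) {
    // outputLatency can be hundreds of milliseconds on Bluetooth devices.
    const audioDelay = audioCtx.outputLatency || audioCtx.baseLatency || 0;
    scheduleVideoFrame(videoFrame, mediaTimeSec + audioDelay); // hypothetical helper
  }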

fluffy: My point is that you need both. You have to deal with audio and video.

padenot: I agree completely.

Next Generation Audio Codecs

Harald: This is me trying to exercise my imagination as it's not something that I know the most about.

[ Slide 50 ]

Harald: We do have OPUS. Gives reasonable bandwidth.

[ Slide 51 ]

Harald: Other codecs may actually work better.
… You could for instance have meaning-based encodings, using ML, e.g. with shape of the mouth, etc, or use text-to-speech or speech-to-text.

[ Slide 52 ]

Harald: Tons of things you can imagine. How do you get these deployed?
… Obtain the licenses, then you run the experiment, then you integrate it into an open source codebase, then you push that to make it available on all platforms to make sure that you have interoperability on it.
… And then you win.
… It would be great if you could start winning at step 2.

[ Slide 53 ]

Harald: If we can get performant interfaces to raw and encoded data and precise timing guarantees, then the only challenge is to minimize underruns.
… We can imagine that all the rest is just a codec. A typical deployment model for this kind of codec should be that we develop it as WASM, deploy it as part of the page, experiment with it before we can integrate when we have proved the value
… as opposed to integrating it to start with.
… The reason why we want codecs as separable components is that, if we want to deploy a new way of doing audio decoding/encoding, then we need these raw and performant interfaces.
… That's my vision on new codecs
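For illustration, a hedged sketch of that deployment model; the module URL and its exports (alloc, encode, memory) are hypothetical, since any real WASM codec would define its own interface:

  // Load the codec shipped with the page; exports are illustrative only.
  const { instance } = await WebAssembly.instantiateStreaming(
    fetch('/codecs/my-audio-codec.wasm'));
  const { alloc, encode, memory } = instance.exports;

  function encodeFrame(pcm) { // pcm: Float32Array, one frame of samples
    const ptr = alloc(pcm.byteLength);
    new Float32Array(memory.buffer, ptr, pcm.length).set(pcm);
    return encode(ptr, pcm.length); // bytes written to the module's output buffer
  }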

Tim: That seems to me to be a big company vision. Without access to IPR, hard to compete in that environment.
… One of the joys of OPUS is that we compete on common grounds.
… Closed IPR for new codecs.

Harald: It is a challenge.
… It also means that "two men in a garage" can create a codec without having to integrate in Chrome or Mozilla.

Youenn: I was not sure about the scope here. New codecs for WebRTC? Or in general?
… In general, you could use WebTransport.
… One good thing about targeting audio first is that packetization is much easier to solve than with video.
… If we want to open the box for WebRTC, starting with audio seems logical to me.

Harald: When we did the breakout box for WebRTC, we started with audio, and extended it to cover video.
… And now we're cycling back to do video-only.

Youenn: For me, these are orthogonal. WebRTC Encoded Streams being a third one.

WebCodecs Challenges

[ Slide 39 ]

chcunningham: First issue I wanted to raise is containers.

[ Slide 40 ]

chcunningham: Folks come and ask how to do muxing/demuxing.
… The answer is to go and find a JS/WASM lib. I actually like that answer, except that finding a library is a challenge.
… Been using MP4Box for our demos, but there are other container formats.
… I'm going to measure how configurable libavformat is, and how heavy that is.
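For illustration, a minimal demuxing sketch with mp4box.js (the library mentioned above) feeding a WebCodecs VideoDecoder; it assumes mp4box.js is loaded as MP4Box and that decoder has already been configured from the track's codec information (e.g. the avcC box), which is omitted here:

  const file = MP4Box.createFile();
  file.onReady = (info) => {
    const track = info.videoTracks[0];
    file.setExtractionOptions(track.id, null, { nbSamples: 100 });
    file.start();
  };
  file.onSamples = (trackId, user, mp4Samples) => {
    for (const sample of mp4Samples) {
      decoder.decode(new EncodedVideoChunk({
        type: sample.is_sync ? 'key' : 'delta',
        timestamp: (sample.cts * 1e6) / sample.timescale,
        duration: (sample.duration * 1e6) / sample.timescale,
        data: sample.data,
      }));
    }
  };
  // buffer: an ArrayBuffer obtained from fetch or a transport (app-defined).
  // mp4box.js requires a fileStart offset on every appended buffer.
  buffer.fileStart = 0;
  file.appendBuffer(buffer);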

[ Slide 41 ]

chcunningham: Another issue that is relevant is reclamation. The user agent can reclaim a codec from a background app for foreground apps, yielding the codec, as done in Chrome today.
… If we imagine a future where the video element is implemented in JS.
… The challenge with this is identifying which apps are in the foreground.
… Heuristics in Chrome to detect this.
… There are apps like movie production apps. Long encode job, so you might leave that task in the background. Would not be great to come back and realize that zero progress was made.
… We don't have a great solution to these problems right now. You should expect some proposal in the coming quarter. Taking feedback!

[ Slide 42 ]

chcunningham: Finally, we have content protection. E.g. live streaming of sports events.
… Past discussion on whether this should be EME extension, SFrame.
… Not a WebRTC expert, but my understanding is that SFrame was introduced to solve E2E encryption.
… There was some discussion on applying SFrame to JS.
… Should SFrame be part of WebCodecs? My thought is that it shouldn't.
… EME protects the content more rigorously. Even if you use SFrame, you have a bunch of other things to look at, which EME covers.
… It would be crazy to re-invent all of that.
… Especially since most folks will want to depend on the same server-side infrastructure.
… Also, the whole thing about JS not being trusted seems weird to me.
… If you cannot trust your JS, you basically cannot do things such as banking.
… I just wanted to call that out. I'm not a WebRTC expert. To the extent that folks are reasoning on SFrame, we should reconcile our views.

dom: Re. untrusted JS, you don't want or need JS access to the media stream. For example, imagine we're running a conferencing system for a high-level transaction, we don't want the company that provides the conferencing service to be able to access the decoded content.
… Same as for EME where you're protecting against the end user.
… SFrame with browser-managed key exchange system would go some way to addressing some envisioned scenarios.

chcunningham: Does SFrame need additional protection then? That's my fear. If you have access to decoded frames.

dom: The ultimate model for SFrame is that the JS wouldn't be able to do that, because they wouldn't have access to the key that the UA is using.

chcunningham: The concern is that, if we acknowledge that SFrame has this gap. If you want to solve that gap in a reasonable timeline without re-doing the whole exercise, you want to lean on EME.
… It may be worth wondering whether SFrame is useful at all. If you have EME, what else do you need?

Tim: The ability to remove potential crypto footguns. The class of mistakes that devs make. Not allowing errors to be made is valuable, which is what SFrame provides.
… The other interesting bit is integration with SFUs.
… I have the feeling that EME may have side effects, e.g. when a participant leaves.

chcunningham: Some provision for key rotation in EME, I think, but we would get out of my expertise here.
… I just haven't seen anything that couldn't be solved by EME now, so nothing that justifies inventing something else.

jib: SFrame is exposed in Encoded MediaStreamTransform. The goal there is to protect the keys from JS.
… I think we should start from use cases. We should first go to people developing the SFrame protocol if we have specific needs.
… Mozilla has a different proposal in that space by the way.

chcunningham: The idea of protecting the keys is worthwhile. I do think that there is a gap in protecting the media. If you're worried about the keys, I think you should be worried about the decoded media as well.
… It's a big endeavor to protect media. I think we should try to avoid building a second mechanism for that.

jib: Re. keys, it's more to avoid them being in process context.

MarkWatson: I'm sort of confused. EME currently has registries where it refers to different container formats. Nothing in EME for RTP as container format. If you're going to introduce that, then SFrame may be the right mechanism there.
… One of the things with WebCodecs is: what is the container? What is the encryption?

Bernard: To be clear, SFrame is a frame encryption mechanism, independent of RTP

MarkWatson: Then it may be the right frame encryption mechanism to be used in EME.

chcunningham: That could make sense. I just want to avoid re-creating things that already exist.
… There are no containers at the WebCodecs level.

Richard: Same place as Mark. You can view EME as E2E encryption itself. And SFrame could be used for frame encryption.

fluffy: If you look at the use cases, EME is not a good match.
… We looked at it in the broader context.
… Unlike streaming scenarios, there are lots of people and lots of keys.
… Lots of security requirements.
… Doesn't match how keys are distributed. I don't think that EME works very well at all.

Bernard: I think that the use case here is streaming with WebCodecs as decoder.

fluffy: I get that, but then we need a more generalized mechanism.
… The part of the requirements that is similar is not having access to the decoded media once decoded

Bernard: Any feedback on how to move forward?

chcunningham: Cullen, if you can help us out with providing some links on investigation that you may have done. Next steps could be extensions that may be worth considering and how SFrame could be used with EME.

MarkWatson: If the use case is for streaming, then you're going to have to support Common Encryption somehow.

Bernard: Any other problems that people are interested in raising?

Harald: I would definitely want to follow up more on finding out where exactly we interface with real-time audio with protected content.
… I think we need to have a safe interface there.

Youenn: Getting the input from Web audio people is very useful. I think we should make decisions based on what audio folks are doing.

??: Where can group members follow up?

Harald: I don't know. It should belong in either WebCodecs or Web Audio. Probably WebCodecs.

jib: Further input on the approach of creating readable streams of video frames. The lifetime model uses a close method that requires apps to explicitly call close, which triggers headaches as you're not necessarily aware of how many frames have been buffered.
… Please join us in requesting features from WHATWG to make sure that we can have real-time streams.

Bernard: Yes, I think that we should have a mechanism on following up with WHATWG. It's not clear to me that we have a process in place.

jib: I understand that we're going to have a meeting with WHATWG.

cpn: A quick note on container investigation. I'm certainly interested in that. Primarily a Media WG thing.

Minutes manually created (not a transcript), formatted by scribe.perl version 149 (Tue Oct 12 21:11:27 2021 UTC).
