W3C

– DRAFT –
WebRTC WG / Media & Entertainment IG Joint Meeting

16 October 2020

Attendees

Present
Barbara_Hochgesang, Bernard_Aboba, Carine_Bournez, Chris_Cunningham, Chris_Needham, Chris_Wendt, Cyril_Concolato, Dominique_Hazael-Massieux, Dr_Alex, Eero_Hakkinen, Elad_Alon, Eric_Carlson, Florent_Castelli, Franco_Ghilardi, Francois_Daoust, Germain_Souquet, Geun-Hyung_Kim, Guido_Urdaneta, Harald_Alvestrand, Jan-Ivar_Bruaroey, Jeff_Jaffe, Jianjun, Kaz_Ashimura, Lei_Zhai, Li_Lin, Louay_Bassbouss, Nigel_Megitt, Palak, Peng_Liu, Rijubrata_Bhaumik, Sudeep_Divakaran, Takeshi_Homma, Takio_Yamaoka, Tim_Panton, Tomoaki_Mizushima, Tuukka_Toivonen, Yasser_Syed, Youenn_Fablet
Regrets
-
Chair
Chris_Needham
Scribe
tidoust

Meeting minutes

<kaz> Chair: Chris

cpn: Joint meeting between the Media & Entertainment IG and WebRTC WG. Presentation today will be driven mostly by WebRTC people.

<eladalon> Elad Alon

Bernard: The WebRTC WG just rechartered.
… We don't define codecs and network protocols. We do define API functions though.
… [showing the list of core WebRTC deliverables]
… We also have a whole bunch of specifications around media capture and recording.
… Media Capture and Streams, etc.
… State of Capture and Output deliverables: most or all of them have been implemented in at least one browser.
… Several specs have gone to CR. But some of them have remained as Working Drafts for years.
… Privacy is a major concern, we spend a lot of time on this these days.
… Example of the browser picker model for media capture under development, driven by Jan-Ivar.
… Question is whether we should do the work, or whether there are people somewhere (not in the WebRTC WG) who could be motivated to do the work on the media specs.

jib: [shows a MediaStream model diagram and explains sources and sinks]
… Importantly, all of the sources produce MediaStreamTrack, and they can be linked to a sink, which can be an Element, a MediaRecorder, ImageCapture, Web Audio, or a different peer.
… Anything below the dotted line is networking. Everything above the line is what we want to talk about today.
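The source/sink fan-out Jan-Ivar describes can be sketched as follows. `wireSinks` is a hypothetical helper, not part of any spec; the recorder constructor is injected so the wiring can be exercised outside a browser, where it would simply be `MediaRecorder`.

```javascript
// Hypothetical helper illustrating the MediaStream model: one stream
// (carrying MediaStreamTracks from a source) feeding two sinks at once,
// a media element and a recorder. RecorderCtor would be MediaRecorder in
// a real page; it is injected here so the sketch needs no browser globals.
function wireSinks(stream, videoElement, RecorderCtor) {
  videoElement.srcObject = stream;           // sink 1: <video> element
  const recorder = new RecorderCtor(stream); // sink 2: MediaRecorder
  recorder.start();
  return recorder;
}

// Browser usage (assumed surface, per mediacapture-main):
//   const stream = await navigator.mediaDevices.getUserMedia({ video: true });
//   wireSinks(stream, document.querySelector("video"), MediaRecorder);
```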

cyril: No connection with Media Source Extensions?

jib: only if you want to get the output of MSE as a MediaStreamTrack.
… But MSE cannot be a sink.
… MediaStreamTrack is opaque. To do any kind of processing for videos, we get a lot of requests for that, you need access to raw bytes. Harald will talk about that.
… mediacapture-main has the main specs. The spec is in final CR, mostly minor issues remaining. We just want to reduce fingerprinting.
… mediacapture-output has setSinkId and selectAudioOutput. Also in CR. Some implementation plans.
… mediacapture-from-element: "captureStream". WD. Minor issues, works across browsers. No recent activity.
… mediacapture-screen-share: getDisplayMedia.
… mediacapture-image: takePhoto, more camera constraints. Work picked up in 2020 again for pan, tilt & zoom constraint. Implemented in Chrome, some interest in Safari I believe. Not in Firefox.
… mediacapture-record: MediaRecorder. Low activity, but some implementation work ongoing, including support in Safari Tech Preview.
… mediacapture-extension: e.g. advanced audio channel layout
… I left out a couple of other repos, including one to create fake devices to test APIs.
… A lot of the media specs have been dormant, or somewhat dormant.
… Recently, there has been interest in renewing some of the work.
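Two of the specs in Jan-Ivar's list, mediacapture-from-element (`captureStream`) and mediacapture-record (`MediaRecorder`), compose naturally. A rough sketch, with a hypothetical `recordStream` helper whose recorder constructor is injected so the logic can run outside a browser:

```javascript
// Hypothetical helper: record a MediaStream for a fixed duration and
// resolve with the recorded chunks. RecorderCtor would be MediaRecorder
// in a real page; it is injected here so the sketch stays self-contained.
function recordStream(stream, ms, RecorderCtor) {
  return new Promise((resolve) => {
    const recorder = new RecorderCtor(stream);
    const chunks = [];
    recorder.ondataavailable = (e) => chunks.push(e.data);
    recorder.onstop = () => resolve(chunks);
    recorder.start();
    setTimeout(() => recorder.stop(), ms);
  });
}

// Browser usage (assumed surface, per mediacapture-from-element and
// mediacapture-record):
//   const stream = canvas.captureStream(30); // 30 fps from a <canvas>
//   const chunks = await recordStream(stream, 5000, MediaRecorder);
//   const blob = new Blob(chunks, { type: "video/webm" });
```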

Bernard: Machine Learning is increasingly important. One example is background removal. Some experiments on constructed environments.
… There was an ML workshop. How do you get data efficiently to run ML models.
… There is a VideoTrackReader API in WebCodecs. And Harald will be talking about Insertable Streams for WebRTC
… About WebCodecs, incubated in WICG. You can get access to raw media through it. Low-level access. Early stages, some known limitations. Not based on WHATWG Streams. No support for advanced video, content protection.

Harald: Talking about Insertable Streams for Raw Media. "Funny hats" are the buzz word to understand what the key scenario is.
… The goal is to open MediaStreamTrack to this kind of processing, keep it simple, keep it easy.
… [shows RTC media flow]
… We're trying to open up the MediaStreamTrack so you can get access to the raw data, where most of the time JS does not want to go, because JS usually does not have the time.
… It turns out that this is not as true as it used to be. People are using JS/WASM to implement funny hats.
… I call this a breakout block. [example from ancient times]
… [shows a code example to explain the principles to add a Moustache]
… Two steps. Get a processing track from the video track. Then create a transformer to transform video frames, and then plug things together.
… Stage 3 is where we separate the boxes apart completely.
… The interesting thing about MediaStreamTrack is that they have two directions: media data flowing one way, and control data flowing the other way.
… When you break things apart, you want both the media and the feedback channels exposed to JS.
… We've been running some experimentation in Google to see whether that was feasible. Landing in Chrome 88. As of yesterday, there is something that vaguely resembles a specification on a GitHub repo.
… The next step here is to propose this to the right WG. As of this morning, that was the WebRTC WG.
… If adopted, we can move the repo to the WG and file issues there.
… The principles are: keep it simple, keep it fast, keep it useful.
… We're working on demonstrating that it is fast.
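Harald's two steps (get a processing track, then a transformer, then plug things together) can be sketched roughly as below. Only the WHATWG `TransformStream` part is standard; the `MediaStreamTrackProcessor` / `MediaStreamTrackGenerator` names follow the early breakout-box proposal and may differ from what eventually ships, and `drawMoustache` is a placeholder for the "funny hats" processing.

```javascript
// Step 2 of Harald's sketch: a transformer that edits each video frame as
// it flows through. processFrame is the app-supplied per-frame function
// (e.g. the hypothetical drawMoustache).
function makeFrameTransform(processFrame) {
  return new TransformStream({
    transform(frame, controller) {
      controller.enqueue(processFrame(frame));
    },
  });
}

// Step 1 + plumbing, browser-only (shown as comments so the sketch stays
// self-contained; API names are assumptions from the early proposal):
//   const processor = new MediaStreamTrackProcessor({ track });
//   const generator = new MediaStreamTrackGenerator({ kind: "video" });
//   processor.readable
//     .pipeThrough(makeFrameTransform(drawMoustache))
//     .pipeTo(generator.writable);
//   // generator is itself a MediaStreamTrack, usable by any sink.
```

Because the transformer is an ordinary `TransformStream`, it is transferable to a worker, which is the off-main-thread property Harald emphasizes.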

Bernard: I wonder about the role of VideoTrackReader from WebCodecs in here.

Harald: In fact, we discovered that this could be a good fit with minor adjustments. We'll see.

Youenn: [scribe was behind, missed question]

<jib> Question was: does this also cover audio, since you can already use web audio to process audio

Harald: Video, we expect people to use very soon. If you can make that fast enough, that's glorious and necessary. For audio, once we have video running, we'll look at audio and compare with WebAudio to see if there are huge differences.
… WebAudio does not work well with things that want to process buffers with flexibility.

Youenn: It's interesting that WebAudio is "do not buffer". Here, it depends on what sinks you're using. For instance, if you plug into a MediaRecorder, you don't really care about the RTC aspects.

Harald: Yes. One question is where do you put buffering.
… Some folks believe that they could do the network adapter part in JS, others think the browser should do the work.

BarbaraH: The audio/video is a key topic. Quality of service is huge. How do you track and improve quality of service given the interrelationship between audio and video?
… Also, the evolution of speech and text-to-speech. Relation?
… Third question: You're talking about audio/video input. What about stylus input?

Harald: One thing that is critical for quality of service is handling of time. Everything we're dealing with has to be timestamped data.
… Everyone has been dealing with timestamp on data, and they know that things get really complex really fast.
… We do have to do work on actually figuring out where timestamps go, which clock they are on, and how to synchronize things when they are not on the same clock.
… The original JS runtime was not very good at time. Single thread. One of the critical pieces here is the ability to send streams to workers and get them off the main thread.
… Quality is also about measuring. I think we'll have to do a lot of experimentation to understand what to measure.

<dom> https://wicg.github.io/speech-api/

Harald: About text-to-speech and speech-to-text, there is an API that has been implemented.

jib: There's speech recognition and speech synthesis but they are tied to the system. We think they should be redesigned to be tied to MediaStreamTrack.
… For stylus, I think that's out of scope for us. WebRTC WG is dealing with media data only.

<dom> [re speech recognition api, see also https://www.w3.org/2020/06/machine-learning-workshop/talks/wreck_a_nice_beach_in_the_browser_getting_the_browser_to_recognize_speech.html in terms of the input on the limitations of the current API]

Harald: That is one source of input that is also tied to time. The Web has a reasonable events mechanism, and it's important to keep the clocks compatible so that apps can say that "this and that happen at the exact same time". But mostly that's out of scope for this spec.

<Zakim> cpn, you wanted to ask about AudioWorklet

cpn: I've not yet looked at the Insertable Streams draft. I know that, in Web Audio, there is the AudioWorklet which gives you the flexibility to do processing off the main thread. I wonder about similarity.

Harald: The AudioWorklet was part of the inspiration. But TransferableStreams ended up being the main driver.
… That is an extremely powerful abstraction.
… I thought we should build the API around that abstraction.

jib: I think it's fair to say that there is overlap for audio.
… Bridging the gap between buffered media and RTC media is difficult.
… It is possible to do video processing today through a canvas, but it is not very performant, so this is all about improving performance.

kaz: Regarding the Speech API, I'm organizing a workshop on Speech updates and would like to have a breakout session at TPAC on the subject.
… I'll add it to the wiki.

jib: Moving on to "Capture HTML rendering". Web surfaces may be captured today but only if users pick them. However, sharing them carries significant risks that may not be understood by users, including active attacks on the same origin policy, because web surfaces give interactivity.
… Ironically, it is safer to share in a native app than in a Web app today.
… "Record this meeting" or "present a Google doc" are two examples of use cases.
… What if web pages stream themselves into a conference? The page would need to capture itself.
… The document would have to be totally isolated for that to be secure. We have policies today, but not opt-in, which doesn't work for us.
… A new policy would be needed.
… Just an idea for now. No spec or concrete plan.
… This would be more secure but still needs permission as rendering may still contain private info.
… Active attackers could harvest information quickly (through CSS), and it's hard to explain to users.
… Converting HTML to Video is a powerful paradigm, e.g. to browse remotely.
… Lower-level API than screen-sharing. [shows some API suggestions]
… Question is whether this would be of interest to anyone?

<Zakim> jeff, you wanted to ask whether the Cross-Origin-Embedder-Policy use case is being discussed with WebAppSec

jeff: I was wondering whether the set of use cases that led you to the new policy proposal has been discussed with WebAppSec.

jib: It has not. It's just an early idea.
… The requester wanted to add this to getUserMedia and we believe that's lower level.

Harald: There is a proposal that will be discussed in the WebRTC meeting. I've encouraged the proposers that they make sure that they present the use cases that they want to cover.

jib: And this is an area where we fear that this may go beyond the scope of the group.

Bernard: In the WebRTC WG charter, there are provisions that media specs could move out of the group, if they could be in better hands elsewhere and progress faster.
… We'd be happy to hear about proposals.

cpn: I would have guessed that all of the proponents for these technologies are already in the WebRTC WG.

Harald: The problem that we've been having in WebRTC is that some people with media competencies, if they attend on a day when the WebRTC WG is actually discussing RTC plumbing, they don't come back.

cpn: Streaming media for consumption and having some possibility to do real-time processing on media streams suggests to me that looking at this from both perspectives would be useful. In the Media WG and Media & Entertainment IG, we're looking at next generation of MSE and media APIs, we'd be interested to get alignment down the road.

jeff: There may be some value in providing more details as to what is needed. What type of energy is needed?
… One type is people who have deep knowledge about WebRTC. Unlikely to find that out of the WebRTC WG.
… Another is people with media streaming background.
… Third type is test case writing, which may not require particular expertise.
… What kind of energy is needed?

Bernard: Just by addressing issues that people file. Security and privacy review of MediaRecorder. Very basic stuff that isn't making progress. Frustrating for team contacts and chairs who ping editors.

<peng> Is there any ongoing work to provide low latency capture for some special scenarios? e.g., fullscreen 3D games.

jib: I think it's also about focus. We're doing p2p and we're doing media capture and output. That was useful some years ago. WebRTC is founded on p2p connections. Ironically, in meetings such as this one, we're not doing p2p, but going through a server. It begs the question of whether you need to be an expert in RTC to tackle media.

cpn: Doing some work to look at the overall future of the media pipeline seems useful. That's something that the Media & Entertainment IG could help with. We could perhaps schedule another IG call.

Bernard: There will be a breakout session specifically on memory copies, and more generally on the media pipeline.
… There will also be discussions on WebTransport.

cpn: Thank you everyone for joining. Any final comments?

Minutes manually created (not a transcript), formatted by scribe.perl version 123 (Tue Sep 1 21:19:13 2020 UTC).

Diagnostics

Maybe present: BarbaraH, Bernard, cpn, cyril, Harald, jeff, kaz