Minutes of Joint WebRTC WG/SCCG Meeting at TPAC 2023

September 14, 2023

Agenda

Slideset: slides

Dynamic-switching between any display-surface types

Elad: This is under the WebRTC WG patent policy. Several browsers (Chrome, Safari) let users dynamically change which tab is being shared, which is easier than restarting the capture. This is managed by the surfaceSwitching option. It could be improved.
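(For reference, a minimal sketch of the existing opt-in; surfaceSwitching is a hint, so user agents may ignore it:)

  // Hint to the user agent to offer in-session surface switching;
  // "include" and "exclude" are the values defined in Screen Capture.
  const stream = await navigator.mediaDevices.getDisplayMedia({
    video: true,
    surfaceSwitching: "include",
  });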

Audio, method and auto-pause…

Challenge #1: When a user dynamically switches, do they also want to change what audio is captured? The current API assumes “audio from the start or never thereafter”: it is not possible to start audio after capture has started.

cropTo() can crop to different targets; it belongs to the track, and it is only meaningful when capturing a tab.
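(A minimal sketch of the Region Capture flow being referenced; the #main element is illustrative:)

  // Region Capture: cropping is only meaningful when a tab is captured.
  const stream = await navigator.mediaDevices.getDisplayMedia({ video: true });
  const [track] = stream.getVideoTracks();
  const cropTarget = await CropTarget.fromElement(
    document.getElementById("main"));
  await track.cropTo(cropTarget);  // crop to the element's bounds
  await track.cropTo(null);        // later: revert to the full tab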

Challenge #2: An app should not have to check what they are capturing before every invocation.

Challenge #3: You need some time to apply cropping if the user suddenly switches to another tab. The app needs to be able to inspect the change before proceeding to send frames over the network.

Proposal: Define a new event listener. If you register, and the user decides to share something else, the user agent terminates the original track and fires an event carrying a new track [stream]. This solves the audio problem too, because the new stream could have audio. It’s backwards compatible.
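(A sketch of the proposed shape; the event name, its stream field, CaptureController as its home, and the attachToPeerConnection helper are all hypothetical, since the shape was still under discussion:)

  // Hypothetical API shape, for illustration only.
  function attachToPeerConnection(stream) { /* app-specific rewiring */ }

  const controller = new CaptureController();
  const stream = await navigator.mediaDevices.getDisplayMedia(
    { video: true, audio: true, controller });
  attachToPeerConnection(stream);
  controller.addEventListener("surfaceswitch", (event) => {
    // The original tracks have ended; rewire to the new stream,
    // which may now include an audio track.
    attachToPeerConnection(event.stream);
  });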

Jan-Ivar: If we go in this direction we don’t want to attach any meaning to (???). Back to the general problem: switching is useful, and it is a user-driven action, so I support that. I think we do need something similar to this API shape. But if the original call only contains video:true and not audio:true we should not expose audio. As for APIs, we have existing APIs we could use; we could fire onaddtrack on the existing stream API. We might want to consider that.

Elad: I think we’re on the same page, but I think we should consider an event listener. We should decide whether the user agent should do dynamic switching anyway. It should be clear that there are side-effects, but it does not need to be an event listener. We should not give audio if the app doesn’t want audio; but if the app is interested in audio, they likely want it if the user starts sharing it. We would support it [starting new audio tracks after the initial gDM call].

Jan-Ivar: To separate permission and UX problems, there needs to be some signal, like the initial signal [audio:true]. I’m not sure we need a more specific opt-in than this.

Youenn: I think we’re on the same page in terms of allowing web pages to react to dynamic switching. In terms of API shape, initially I was thinking we should keep the same video track, because that is easiest for the pipeline and in the best interest of web developers; their life should be easy. The issue with that is that there are new audio tracks, and in the future there may be multiple tracks, if we allow sharing multiple videos. So I think it is fine to allow new tracks, but we should probably engage with web developers on this. It might be more complex for them, but in terms of flexibility, and in the future, it might be easier. MediaStream is very specific, and I can see websites taking the initial stream and cloning it. I think it’s fine; we can refine the API shape, we should have a discussion and iterate on our options. But we should try to move forward.

Elad: There is an open question. We have flexibility when we define this, so we can specify what happens when you register and when you don’t. If backwards compatibility is a concern, we can say that if the app does not opt in then you get the old behavior, even if that is a footgun; that is then your choice. I think we can enjoy the best of both worlds.

Youenn: That was the plan, right?

Elad: Preferably.

Jan-Ivar: I don’t think we should (???) [proceed with new API shape] without a TAG comment. Another working group had the same problem and avoided it. This is interesting; keeping the old track undermines… (???) I think it was a mistake not to include a muted audio track in the event that they didn’t originally have audio.

Elad: This will be backwards compatible. So we can change.

Youenn: So it’s not really an event.

Elad: enableDynamicSwitching()? We can decide what it’s called later.

Jan-Ivar: But the content of a video track can change unless I opt in to this new mode; that means we’ll have to maintain two modes.

Elad: That’s more flexible though.

Jan-Ivar: I think one mode would be simpler.

Elad: We could try to deprecate the other mode, probably the old one, but I would be open to the possibility of using the old mode.

Harald: When we defined getUserMedia and later getDisplayMedia, it was stated in the spec that once we have a track, the source cannot change. We debated this in the WG at the time and decided not to [allow it]. Half a decade later, the argument for switching became so compelling that switching was inserted - against my advice - as a hack. Now we’re proposing a clean method to say: the source switched, the old track is no longer valid. This is consistent with what the spec had always said until we started disobeying it. So I think we should give browsers the option to go back and say “no, the old workaround is no longer available”. The old workaround will be around for a long time because people are depending on it, but I think the spec should say - and we should probably be able to feature-detect it - that once you switch the source of a track, you get a new track with this API. That’s my opinion. But I’m an old fart.
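(A sketch of what such feature detection might look like, building on the hypothetical event name sketched earlier:)

  // Hypothetical: detect whether the user agent fires a new track
  // on source switches, assuming the "surfaceswitch" naming above.
  const supportsNewTrackSwitching =
    "onsurfaceswitch" in CaptureController.prototype;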

Jan-Ivar: I just want to give some context. The original reason for “the source must not change” is that it would be confusing to the device-enumeration model, but that is not valid for the screen-capture model.

Eric: What defines a source change? A video track going from a tab to a window? Is it just changing pixels? What if the size dramatically changes, is it still the same source?

Harald: In my opinion, if the user thinks of a thing as a source, that’s a source. The user thinks of one camera as one source. The user thinks of this tab as one source, and that tab as another source. If you resize the window, the user will still think of it as the same window. In the case of navigation I am less sure.

Eric: Let’s say I chose to share a window, but then I chose to switch to a different window. I don’t know if the user thinks of those as different sources.

Elad: The user should not actually see any of this. The user doesn’t listen to events or register event listeners. What is important is what the developer thinks a source is. The user doesn’t care. The developer should just run the same logic as they always run, decide what actions to take, and do that. So as long as we don’t fire an excessive number of events, it probably does not matter where we draw the line.

Eric: What if I share a window and the developer sets up a bunch of crop targets, and then the user resizes the window? The contents of the window change; the crop targets may not be valid at that point.

Elad: I think that is a crop-target problem. It would be good to know, as previously discussed [referring to isCapturable], but there was not a lot of interest in this. There are already benefits with the current proposal.

Youenn: The model is interesting, but the most important thing to look at is the impact of the new APIs on the end user. For instance, if we create a new track and close the old track, and the old track is tied to a peer connection, there might be a black frame that is sent. That is an issue. When the track ends in MediaRecorder, recording ends. So we need to decide what is best for the user first; then we can focus on the web developer. But we should consider the impact this decision has on existing APIs.

Elad: But is it interesting?

Youenn: In terms of API shape, I could go with the old one and modify it, or I could go with the current proposal. Or maybe something else.

Jan-Ivar: I prefer if the user is in control. I support the use cases but I think we should go with a model that just switches the source[, not ending the track].

Henrik: It sounds like everyone is in agreement with the direction, but more discussion is needed to get the details of the API shape right.

Elad: We’re in violent agreement and this has been recorded.

CONCLUSION: Dynamic switching is important, including adding audio, but more discussion is needed to decide on API shape (e.g. reuse old tracks or create new tracks).

getViewportMedia (capture-current-tab)

Elad: [The use case is] embedding one app into a video conferencing app. There is a security issue if it is too easy to opt in to this: the user could be tricked. We’ve discussed all sorts of mitigations:

  1. Cross-origin isolation
  2. Opt-in by embedded documents

Challenges: Cross-origin isolation does not yet enjoy widespread adoption. It seems relatively challenging to adopt, and the bigger the application, the harder it is to adopt. It has not been agreed upon, and this needs to be discussed.

Do we really need to opt in to cross-origin isolation? Meet’s preference at the moment is to not require this. We should revisit this discussion as we get more information, but for now we are assuming that we stay with cross-origin isolation and opt-in. In that case we want at least to have the optimal opt-in.

There are two options: content that has not opted in is not loaded, or content that has not opted in is not captured. The content could also be excluded [from the captured frames], but that is full of danger; I only mention it for completeness. The way I prefer is that things get loaded no matter what, but if some of the content has not opted in, frames stop being emitted.

In other words, if you originally try to capture non-opted-in content, an exception is thrown. But if the capture did start successfully, then we should just stop emitting frames.
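(A sketch of that behavior under the proposed getViewportMedia name; the exception and the frame-muting behavior follow this discussion, not a finished spec:)

  // Hypothetical: getViewportMedia is a proposal, not a shipped API.
  try {
    const stream =
      await navigator.mediaDevices.getViewportMedia({ video: true });
    // If content that has not opted in loads later, the capture stays
    // alive but the user agent stops emitting frames.
  } catch (err) {
    // Thrown when non-opted-in content is present at capture start.
  }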

Case study: there are many extensions that are very popular, and there is a strong incentive for these extensions to be embedded. If Google Docs wanted to keep supporting the use case of streaming itself to conferencing, but had to stop capturing whenever an extension had not opted in, then you’re forcing the extensions to opt in. That’s not very democratic. It is better if the third-party content can block capture and still be loaded. That is a much better place to be in, I think.

One problem we had with this previously was feasibility: could this be implemented without race conditions? I think it could. Jordan and Mark being here is very fortunate. My idea is that while you render content, you can record the set of origins of all the extensions that are allowed, and then just check whether the captured content is in that set. There is no race condition. And if this is feasible, then it is probably feasible for others. Audio and error handling can be discussed later, but I want your opinion on the general concept.

Harald: 2 slides back, you meant intersection?

Elad: [Yes]

Jan-Ivar: I have comments, but I also have a question. You mentioned popularity: from chromestatus, adoption looks quite low to me, 0.05% of page loads, but it is going up. Does that match your view?

Elad: That’s not very much.

Jan-Ivar: If I understand correctly, right now the spec says to require cross-origin isolation, but also an additional opt-in policy, I put a link in the chat.

Elad: But you raised internal concerns at Mozilla? Should this document be standardized?

Jan-Ivar: (???) simplifies a lot of things and solves races.

Elad: This discussion is under the assumption that we keep the requirement that nothing is going to load without cross-origin isolation. This is about the second condition, the opt-in requirement. Should failing to opt in stop the loading of the document, or should it stop the capturing? I think we should stop only the capturing, because it is more inclusive. It’s good to allow documents to load even if you can’t capture them, because that is what happens most of the time.

Jan-Ivar: I agree with that. So it would still be a requirement, but it would not be a load requirement, only a capture requirement? I think there is something to that. But it doesn’t stop races… having to deal with races [is bad].

Elad: If we discover that implementing this is not feasible, then we should not do it. But under the assumption that we can do it, is this good?

Jan-Ivar: We’re very open to exploration in that case.

Youenn: Let’s say you have a page and an iframe, and the iframe has opted in to being captured; you call the API, it allows capture, and that is good. Now let’s say you have two iframes, and the second iframe has not opted in. That’s fine as well. But in the case where the second iframe arrives during capture, it makes sense to stop capture before we actually load the second iframe. That seems consistent, if it is possible to implement, and I don’t see why we couldn’t.

Elad: There are at least two places to implement this: one is on the network path, before you receive the document; the other is on a per-frame basis. Both would yield a valid response.

Henrik: Whether you inspect per frame, or you temporarily mute and unmute until you know… this very much sounds like an implementation detail; the user would probably never notice.

Martin: I think Youenn’s approach is the most sensible: you pause at the moment of loading, you go off and stop any captures, and then you can get the result.

Youenn: The spec should be precise enough.

Elad: But in general what direction would you prefer? Block loading or block capture?

<Everyone is in agreement that block capture is preferred>

Elad: Audio might be complicated. But I think we’re close to time. Let’s move on.

Martin: Just one more thing: video elements on pages can change origin mid-playback, so consider that. In that case, marking each frame is needed.

Elad: Let’s revisit if we need to but it sounds like we’re in agreement for now.

CONCLUSION: Blocking capture is much preferred over blocking loading.

Status update on work of mutual interest

Elad: We’re short on time so I will go through these very quickly, sorry for the lack of time for questions/discussions.

1. Element Capture

You can use Region Capture to crop to a target element, both efficiently (in the user agent) and robustly (surviving layout events, window resizes, etc.). We added Region Capture for this, but it has limitations.

Introducing Element Capture: this gives you the specific element only, not the content in front of it (occluding) or behind it (if it is transparent). You use cropTo() with crop targets, and restrictTo() with restriction targets. We decided to do this to avoid a breaking change for apps that already use cropTo(); we might as well have two different token types.

[Slide showing an example of capturing a specific element.] This is already in very late stages of development; you can already use it behind a flag and try it out if you like. Let us know if you find bugs.
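(For illustration, a minimal sketch of the flow, using the RestrictionTarget/restrictTo() names from the proposal; preferCurrentTab is a Chrome-specific hint, and the #player element is illustrative:)

  // Element Capture: like Region Capture, but excludes occluding
  // content and anything visible behind a transparent element.
  const stream = await navigator.mediaDevices.getDisplayMedia(
    { video: true, preferCurrentTab: true });
  const [track] = stream.getVideoTracks();
  const target = await RestrictionTarget.fromElement(
    document.getElementById("player"));
  await track.restrictTo(target);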

We don’t have time to discuss the risks. Sorry for that but we want to give Zoom time.

2. Captured Surface Control

The preview of a captured surface is not interactive. What if you want to scroll without changing tab…

We believe that scrolling and zooming are generally safe enough that, with a permission prompt, we could let this happen. But scrolling is a bit more complicated: which point in the page are you scrolling?

Proposal: add an event handler to the preview, and from it get coordinates. So if the user wants to start scrolling the other tab, there can be a permission prompt; there may be a popup informing the user about this.

The spec should also allow controlling the zoom level. I hope we can define this in a way that allows zooming in a cross-browser-compatible way.
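(A sketch of how this might look; the sendWheel() and setZoomLevel() names follow the Captured Surface Control proposal and may change:)

  // Hypothetical sketch per the proposal; assumes the user grants
  // the associated permission when prompted.
  const controller = new CaptureController();
  const stream = await navigator.mediaDevices.getDisplayMedia(
    { video: true, controller });
  const preview = document.querySelector("video");
  preview.srcObject = stream;
  preview.addEventListener("wheel", async (e) => {
    e.preventDefault();
    // Forward the scroll; coordinates are relative to the preview.
    await controller.sendWheel({
      x: e.offsetX, y: e.offsetY,
      wheelDeltaX: -e.deltaX, wheelDeltaY: -e.deltaY,
    });
  });
  await controller.setZoomLevel(125);  // zoom the captured tab to 125%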

3. Capture Mouse Events

Use case: show mouse, zoom in around mouse, etc.

If you want to highlight where the cursor is, whether it is being pressed, etc., that is currently very difficult to do. But if you could do it, then you could record the screen and apply effects like zooming in on and following the mouse cursor.

We’ve also discussed the possibility of exposing mouse buttons and modifier keys, which would allow you to annotate the events. Another thing we could do is transmit only when things change: instead of recording the mouse as video, you can send the coordinates to the other side, and it knows where the mouse is. This saves power and CPU.
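(A sketch of that second idea, assuming a hypothetical “capturedmousechange” event with surfaceX/surfaceY coordinates, relayed over an app-chosen data channel:)

  // Hypothetical event name and fields, per the proposal presented.
  const pc = new RTCPeerConnection();
  const cursorChannel = pc.createDataChannel("cursor");
  const controller = new CaptureController();
  const stream = await navigator.mediaDevices.getDisplayMedia(
    { video: true, controller });
  controller.addEventListener("capturedmousechange", (e) => {
    // Assumed convention: coordinates are -1 when the cursor has
    // left the captured surface.
    cursorChannel.send(JSON.stringify({ x: e.surfaceX, y: e.surfaceY }));
  });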

Questions?

Bernard: In a scenario where you’ve removed the mouse, why would we need to send anything at all? There are codecs that allow you to just send a new image when something changes, such as the AV1 keyframe format, using the screen content coding tools.

Elad: The idea is that you would have to scan the image over and over again to discover if something has changed. Or I’m not sure I understood the question.

Harald: This spec does not define how to send the mouse events over the wire. Would it be easy to do that? JSON-serialize them?

Elad: I don’t know if it would be simple; it depends how much time engineering has. But yes, it would be possible to do later. It is not necessary to do it ahead of time, because you could just send the coordinates on a separate channel that is not standardized [e.g. data channels]. If we wanted to standardize, we could always move them to a standardized pipe later.

Harald: But I imagine it is the same app on both sides, so I’m not sure there needs to be a standard way to transmit coordinates.

Jan-Ivar: I wonder if there are statistics on what people are capturing. If they are capturing web surfaces then maybe we could just reuse existing mouse events?

Elad: We ran an experiment where tab capture increased from 16% to 30% and it may be even higher now, but we would like to push this even higher.

Jan-Ivar: This is not an issue for self capture, right, because you can always look at mouse events.

Elad: But self capture is very rare.

Jan-Ivar: Right.

Elad: Only 0.8% chose the same tab, and I think that was user error. I don’t think other apps have any good reason not to exclude self-capture; Zoom probably doesn’t either.

What about also adding mouse clicks to this? I can’t think of a way that this could backfire.

Jan-Ivar: I think I have to think more.

Elad: Looking forward to more feedback, but I want to give time to Zoom. [It turns out the Zoom people have left the call, so the remaining minutes are spent on more discussion.] OK, does anyone want to give feedback on any other slides that we moved through quickly earlier?

Martin: I raised two issues earlier. The first is that element capture is another opportunity for fingerprinting if there is no consent involved. Which could be fine, but it’s worth noting. I don’t know if there’s any intent to require consent, so I’m guessing. But should there be some user consent for capture?

Elad: Yes, this builds on existing self-capture, so there is already a permission involved, and this API only transmutes an existing self-capture.

Martin: I am concerned that enabling the capture of occluded elements on screens creates a transparency-awareness problem, which is a new problem.

Elad: Would people here feel more comfortable if we only allowed this for web sites that already had microphone and video capture as a heuristic that this is a more trusted web site?

Martin: I would need to see the proposal and think more.

Jan-Ivar: I think we should not overload more things on top of that.

Elad: Just to clarify the motivation: we want to enable cool things for the user without creating privacy problems. I offered this only as a way to calm nerves about a feature that is more ground-breaking.

Jan-Ivar: My position remains that occlusion is a concern. I would rather see this problem addressed within getViewportMedia, which covers the dominant use case. We could temporarily pause capture, or add transparency, or… I think we should only capture what is visible to the user, so that an app cannot capture something not known to the user.

Elad: We’re out of time but thank you very much. Let’s continue the discussions on the specs.