WebRTC WG / Media WG Joint Meeting

25 April 2022


Present: Bernard_Aboba, Chris_Cunningham, Chris_Needham, dom, Elad_Alon, Eric_Carlson, Francois_Daoust, Greg_Freedman, Harald_Alvestrand, Jan-Ivar_Bruaroey, Jer_Noble, Tim_Panton, Tommy_Steimel, Youenn_Fablet
Chairs: Bernard Aboba, Chris Needham (Media WG), Harald Alvestrand (WebRTC WG), Jan-Ivar Bruaroey, Jer Noble
Scribes: cpn, jernoble

Meeting minutes

Capture Handle Actions & Media Session Actions

jan-ivar: Two specifications, Capture Handle Identity & Capture Handle Actions

<tidoust> Capture Handle Identity

<tidoust> Capture Handle Actions

jan-ivar: When presenting via a WebRTC Call, if the presentation is in another browser window, the user may background the browser tab/view which is running the presentation
… The use case would be an integrated solution, where video conferencing and presentation are going at the same time
… short of that, it would be good to be able to control the presentation through the browser view which is doing the call
… The goal is to standardize the actions which can be supported through presentation/capture pairs

youenn: One of Media Session's use cases, among others, is to control the page through e.g. a Picture-in-Picture window
… This seems very similar to what we are trying to achieve through Media Session actions.
… Except in this case, the actions would be sent to the capturing page, not to the presentation page
… One thing that is nice about Media Session Actions is that they are already in place and are supported
… It would be nice if we could re-use the existing supported API for WebRTC, rather than come up with our own
… If we were able to share or re-use that API, adoption would come for free.
… On the other hand, if it's required to adopt a new, capture-only API, while WebRTC clients may support it, sites which the user is capturing (e.g. presentation sites) may not add support for those actions
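
For readers unfamiliar with the API under discussion, the following is a minimal sketch of how a presentation page might register Media Session action handlers for slide navigation. Only `setActionHandler` and the `"nexttrack"` / `"previoustrack"` action names come from the Media Session API; `MediaSessionLike`, `SlideDeck`, `FakeMediaSession` and `registerSlideHandlers` are hypothetical names introduced for illustration, and mapping "track" actions onto slides is exactly the open question in these minutes, not an agreed design.

```typescript
// The two existing Media Session actions this sketch borrows for slides.
type SlideAction = "nexttrack" | "previoustrack";

// Only the part of the Media Session API used here. In a browser,
// navigator.mediaSession is expected to fit this shape (assumption).
interface MediaSessionLike {
  setActionHandler(action: SlideAction, handler: (() => void) | null): void;
}

// Hypothetical slide deck model; not part of any specification.
class SlideDeck {
  index = 0;
  constructor(readonly count: number) {}
  next(): void {
    this.index = Math.min(this.index + 1, this.count - 1);
  }
  previous(): void {
    this.index = Math.max(this.index - 1, 0);
  }
}

// Registers slide navigation on top of the "track" actions.
function registerSlideHandlers(session: MediaSessionLike, deck: SlideDeck): void {
  session.setActionHandler("nexttrack", () => deck.next());
  session.setActionHandler("previoustrack", () => deck.previous());
}

// Stand-in for the user agent so the sketch runs outside a browser.
class FakeMediaSession implements MediaSessionLike {
  private handlers = new Map<SlideAction, () => void>();

  setActionHandler(action: SlideAction, handler: (() => void) | null): void {
    if (handler) {
      this.handlers.set(action, handler);
    } else {
      this.handlers.delete(action);
    }
  }

  // Simulates a hardware key press or UA button; returns false if unhandled.
  trigger(action: SlideAction): boolean {
    const handler = this.handlers.get(action);
    if (!handler) return false;
    handler();
    return true;
  }
}
```

In a page, the call would presumably be `registerSlideHandlers(navigator.mediaSession, deck)`; whether a capturing page could ever trigger these handlers is the subject of the discussion that follows.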

jan-ivar: Media Session does appear to pave the way for useful actions, like "next" and "previous track"; however, these do not work on Google Slides today.
… Perhaps because these are more music or media related; is "next track" the same as "next slide"?
… We considered whether these were in-process buttons provided by the page; however, would those instead be UI presented by the UA itself?
… There are security concerns; for example, even with a user gesture requirement, it's the capturing page that is controlling the presentation page; there may be security implications to allowing these pages to communicate or control cross-site.

Elad: Things I've heard from web developers:
… 1. they don't really like UA provided controls; it clashes with their own UI
… And they can't provide a consistent UI across different browsers.
… 2. When a capturing site captures Google Slides, there's a login pattern where the same account must be signed into both tabs
… This is a pattern they require. Even if this API existed, it's not clear that sites would want arbitrary other sites to control their presentation

youenn: That would be something worth investigating with Google.

Elad: To make this work generally, the API has to provide more information, namely the origin that generated the message.

youenn: that's something that's already solvable through JavaScript
… We're targeting the 80% case, and allow JS to handle the 20%

jernoble: This sounds like something BroadcastChannel already provides

Elad: If you have multiple sessions with Google Slides, you don't want them all to respond
… so use capture handle identity, and capture handle actions that lets you talk directly to the thing you're capturing

jan-ivar: the concern is that this would only lead to siloing. Can we provide a baseline set of actions that need minimal setup, the 80-20 case?

Harald: With media actions there's an interop concern with different device buttons and applications

Harald: The goal would be to allow a page written by a Google developer to control presentations written by Microsoft Office or vice versa
… Something to consider is whether there should be a common registry of actions and models across different presentation types (e.g. Spotify vs. slideshows)
… And the Media Session actions have a lot of metadata about those actions (speed, seeking to particular time)

jan-ivar: two options: we could have two APIs, where developers would have to opt into both, with separate implementations, or there could be a single API that's driven by either hardware buttons or a capturing site

dom: I want to give other examples of where sites are using these APIs already. For example, when embedding a YouTube video, sites must use postMessage to communicate with the embedded player
… There has been a natural convergence on these APIs in a non-standard way.
… So this is an example of an existing situation where different sites/origins want to communicate actions to each other
… It would be useful to reduce the semantics across these use cases to a common set.

Elad: We should not go with an API shape that makes everything work with existing sites; there are security implications to allowing messages to be sent cross-origin

youenn: we need to study and enumerate those security issues and provide mitigations if necessary

cpn: Can we hear from someone from a Media Session perspective?
… Are we imagining a combined set of actions between media and presentation use cases?

eric_carlson: I can imagine a page wanting to provide both media actions and slide actions; so having separate actions for the two use cases would remove the possibility of confusion about which action to perform
… and we have already added new actions to the Media Session API

youenn: Agreed, you may want to "play/pause" media within a slide in a presentation
… The Media Session registry could handle that

jan-ivar: What does the "hangup" action do?

eric_carlson: It allows UAs to provide a "mute" or "hangup" action similar to the one a page would provide

jan-ivar: A conservative view would be that Media Session is narrowly about AV playback; however, "mute" and "hangup" are more about camera capture
… do people think we should re-use the "next track" and "previous track" actions to support page changes?

jernoble: web authors have wanted to reuse the media session API to move between slides, so seems reasonable to add actions for those cases

Elad: How do sites know what actions are supported across origins?
… e.g., how do sites know whether they should send the 'next track' or 'next slide' action?

youenn: for WebRTC, the site might need to know what actions are registered.
… Perhaps we need to provide that information through a new capture API

Elad: from the side of the site being captured, it's not confusing
… but from the capturing side, it could get confusing about which action should be sent
… what happens when the user hits the "next" button on their keyboard?

jernoble: The UA knows which actions have been registered so can route the user input from hardware controls accordingly
… You want the action to go to the frontmost; at least in one implementation it goes to the currently playing browser tab
… This is outside the spec. On iOS only one thing can play audio at a time, so it would be the most recently played browser tab
… For macOS, where multiple things can play audio, it would be the one that most recently started playing
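
The routing heuristic jernoble describes could be sketched as follows. This is an illustration under stated assumptions, not any user agent's actual implementation: the `Tab` shape, its field names and `routeAction` are hypothetical, and combining "has a registered handler" with "most recently started playing" in one function is an assumption made to keep the example small.

```typescript
// Hypothetical view of what a user agent might track per tab.
interface Tab {
  id: string;
  // Actions for which the page has registered a handler.
  registeredActions: Set<string>;
  // Timestamp of the last time playback started, or null if it never played.
  lastStartedPlayingAt: number | null;
}

// Picks the tab that should receive a hardware-key action: among tabs that
// registered a handler for it and have played, the one that most recently
// started playing. Returns undefined when no tab qualifies.
function routeAction(tabs: Tab[], action: string): Tab | undefined {
  let best: Tab | undefined;
  let bestTime = -Infinity;
  for (const tab of tabs) {
    if (tab.lastStartedPlayingAt === null) continue;
    if (!tab.registeredActions.has(action)) continue;
    if (tab.lastStartedPlayingAt > bestTime) {
      best = tab;
      bestTime = tab.lastStartedPlayingAt;
    }
  }
  return best;
}
```

On a platform where only one thing can play audio at a time, the same function reduces to "the most recently played tab", since at most one candidate is playing.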

dom: it seems to me we should try to figure out how to move forward with the broader discussion on whether application semantics can be exposed to the browser, and to sites
… part of the question is whether next/previous slide is something that could get traction. It's a question of feasibility: would sites implement it, and would browsers provide controls in their chrome?
… For website to website, there's a security framework question: can we delegate controls, and under what conditions?
… How to go about discussing more deeply?

jan-ivar: if media session wanted to move closer to capture actions, by using next/prev slide there'd have to be a current capture session. I can open issues on Media Capture Session if that's a way forward

eric_carlson: sounds good to me

chcunningham: I'll check with the Media Session team internally here; current editors have moved on, and I'll reach out to them to nominate a new editor
… if others want to edit the spec, that would be welcome

cpn: Are we seeing that control within a page can influence actions on the captured page?
… it's my understanding that media session API is to allow the UA to control a page; does this fit with the design of Media Session to allow another page to send actions rather than the UA?

eric_carlson: It does make sense for me.

jan-ivar: there are security implications; perhaps "toggle mic" is not the best thing to expose cross site
… there's also another argument that you can use Morse code (or similar) to communicate arbitrary data across
… however, for capture, there's already a lot of information flowing from the captured page to the capturer

dom: It's not just security across the two sites; it's also about the impact to the end user. This will require analysis of the risks the end user will face.

jan-ivar: This is why remote control of a site is out of scope for WebRTC.

dom: It is the recipient's understanding that the action is coming from the UA and not another site
… the expectations of the two may not match.
… this may not be a real issue, but it does need analysis.

Harald: if the event can come from multiple sources, the message should include enough information to tell the difference between the sources.

Elad: There are 3 levels: 1. knowing that this came from another origin, 2. knowing the origin that the message came from, and 3. knowing which user on that other origin issued the message.
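
Harald's point and Elad's three levels can be made concrete with a small data model. This is a hypothetical sketch: no specification defines `ActionSource`, `IncomingAction`, `provenanceLevel` or `shouldAccept`, and the policy of accepting only user-agent actions and explicitly trusted origins is an assumption chosen for illustration.

```typescript
// Hypothetical description of where an action came from.
type ActionSource =
  | { kind: "user-agent" } // hardware key or browser UI
  | { kind: "capturer"; origin?: string; user?: string }; // another site

interface IncomingAction {
  name: string;
  source: ActionSource;
}

// Maps a source onto the levels from the discussion:
// 0 = from the user agent, 1 = known to be from another origin,
// 2 = the origin is known, 3 = the issuing user on that origin is known.
function provenanceLevel(source: ActionSource): 0 | 1 | 2 | 3 {
  if (source.kind === "user-agent") return 0;
  if (source.origin === undefined) return 1;
  if (source.user === undefined) return 2;
  return 3;
}

// Example policy for a captured page (an assumption, not a proposal from
// the meeting): always accept user-agent actions, and accept cross-site
// actions only when the sending origin is known and explicitly trusted.
function shouldAccept(
  action: IncomingAction,
  trustedOrigins: ReadonlySet<string>
): boolean {
  const source = action.source;
  if (source.kind === "user-agent") return true;
  return source.origin !== undefined && trustedOrigins.has(source.origin);
}
```

Note that none of these levels addresses the concern raised next in the minutes: whether the action was initiated by the local user at all.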

Tim: I would like to refine that: above and beyond knowing that the message came from another site, we need to know that it came from a local user. How do we know that the event didn't originate outside the local machine, like another user on the call?

<dom> [shared control of slideset would actually be useful too]

Tim: We should be more distinct about whether we can prove that the local user was the origin of the message

Elad: And a user gesture requirement does not guarantee the intent

Tim: We do need careful thought about these potential security issues

Elad: That is why I think we need the remote site to adopt a specific API, as a caveat-emptor

Tim: We need more in the origin than just the origin, if that makes sense.

jan-ivar: We have an existing issue on whether we should extend Media Session to support new actions
… We would need a separate issue to track whether actions should be sent across origins.

cpn: Would we use the Media Session repo for these discussions?

jan-ivar: The questions raised are more for Media Session; to consider whether the scope of Media Session should be expanded to send actions from a page

Elad: What is the argument for using Media Session if we need specific adoption?
… are the two APIs truly similar enough to justify only a single API surface for both?

Harald: we have competing concerns: both functional concerns about having the correct thing happen when you press a button, and security concerns as well.

ACTION: capture these concerns and issues in the Media Session GitHub

ACTION: Chris to follow up internally about new editors for the Media Session specification itself.

cpn: what would be the timeline for this?

Harald: two weeks would be good; four weeks at the maximum

cpn: let's continue to work together across the two WGs.

Summary of action items

  1. capture these concerns and issues in the Media Session GitHub
  2. Chris to follow up internally about new editors for the Media Session specification itself.
Minutes manually created (not a transcript), formatted by scribe.perl version 185 (Thu Dec 2 18:51:55 2021 UTC).