WebRTC WG / Media WG Joint Meeting

Meeting minutes

Capture Handle Actions & Media Session Actions

jan-ivar: Two specifications, Capture Handle Identity & Capture Handle Actions

<steely-glint> Can someone PM me the webex password - somehow my password manager has got into a fight with the w3c's auth - Thanks.

<tidoust> Capture Handle Actions

jan-ivar: When presenting via a WebRTC Call, if the presentation is in another browser window, the user may background the browser tab/view which is running the presentation
… The use case would be an integrated solution, where video conferencing and presentation are going at the same time
… short of that, it would be good to be able to control the presentation through the browser view which is doing the call
… The goal is to standardize the actions which can be supported through presentation/capture pairs

youenn: One of the Media Session use cases, among others, is to control the page through e.g. a Picture-in-Picture window
… This seems very similar to what we are trying to achieve through Media Session actions.
… Except in this case, the actions would be sent to the capturing page, not to the presentation page
… One thing that is nice about Media Session Actions is that they are already in place and are supported
… It would be nice if we could re-use the existing supported API for WebRTC, rather than come up with our own
… If we were able to share or re-use that API, adoption would come for free.
… On the other hand, if it's required to adopt a new, capture-only API, while WebRTC clients may support it, sites which the user is capturing (e.g. presentation sites) may not add support for those actions
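
For reference, a page opts in to Media Session actions by registering handlers through `navigator.mediaSession.setActionHandler()`. The following is a minimal TypeScript sketch and was not presented at the meeting. `setActionHandler` and the "nexttrack"/"previoustrack" action names are part of the existing Media Session API; `ActionTarget`, `Deck` and `registerDeckHandlers` are hypothetical names, and mapping track actions onto slide changes is the open question being discussed, not agreed behaviour.

```typescript
// Structural stand-in for the one MediaSession method used here, so the
// sketch also runs outside a browser. In a page, pass navigator.mediaSession.
interface ActionTarget {
  setActionHandler(action: string, handler: (() => void) | null): void;
}

// Hypothetical slide deck state; not part of any specification.
interface Deck {
  index: number; // zero-based index of the current slide
  count: number; // total number of slides
}

// Assumption for illustration: reuse the existing track actions for slides.
function registerDeckHandlers(target: ActionTarget, deck: Deck): void {
  target.setActionHandler("nexttrack", () => {
    deck.index = Math.min(deck.index + 1, deck.count - 1);
  });
  target.setActionHandler("previoustrack", () => {
    deck.index = Math.max(deck.index - 1, 0);
  });
}
```

A presentation page adopting this would get UA-provided controls and hardware keys with no capture-specific code, which is the "adoption for free" argument.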

jan-ivar: Media Session does appear to pave the way for useful actions, like "next" and "previous track", however these do not work on Google Slides today.
… Perhaps because these are more music or media related; is "next track" the same as "next slide"?
… We considered whether these were in-process buttons provided by the page; however, would those instead be UI presented by the UA itself?
… There are security concerns; for example, even with a user gesture requirement, it's the capturing page that is controlling the presentation page; there may be security implications to allowing these pages to communicate or control cross-site.

Elad: Things I've heard from web developers:
… 1. They don't really like UA-provided controls; these clash with their own UI, and they can't provide a consistent UI across different browsers.
… 2. When a video-capturing site captures Google Slides, there's a login pattern where they require the same account to be signed into both tabs
… This is a pattern they require. It's not clear that, even if this API existed, sites would want arbitrary other sites to control their presentation

youenn: That would be something worth investigating with Google.

Elad: to make this work generally, the API has to provide more information; namely the origin that generated the message.

youenn: that's something that's already solvable through javascript
… We're targeting the 80% case, and allow JS to handle the 20%

jernoble: This sounds like something BroadcastChannel already provides

Elad: If you have multiple sessions with Google Slides, you don't want them all to respond
… so use capture handle identity, and capture handle actions that lets you talk directly to the thing you're capturing

jan-ivar: the concern is that this would only lead to siloing; can we provide a baseline set of actions that needs minimal setup, the 80-20 case?

harald: With media actions there's an interop concern with different device buttons and applications

Harald: The goal would be to allow a page written by a google developer to control presentations written by Microsoft Office or vice versa
… Something to consider is whether there could be a common registry of actions and models across different presentation types (Spotify vs. Slideshows, e.g.)
… And the Media Session actions have a lot of metadata about those actions (speed, seeking to particular time)

jan-ivar: two options: we could have two APIs, where developers would have to opt into both with separate implementations, or there could be a single API that's driven by either hardware buttons or a capturing site

dom: I want to give other examples of where sites are using these APIs already. For example, when embedding a YouTube video, sites must use postMessage to communicate with the embedded player
… There has been a natural convergence on these APIs in a non-standard way.
… So this is an example of an existing situation where different sites/origins want to communicate actions to each other
… It would be useful to reduce the semantics across these use cases to a common set.
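
The postMessage pattern mentioned above typically has the receiver check the sender's origin before acting on a message. The following is a minimal TypeScript sketch of that check, not any real player's protocol: the `PlayerCommand` shape, the `parseCommand` helper and the example origin are assumptions for illustration, and `IncomingMessage` only mirrors the `origin` and `data` fields of the browser's `MessageEvent`.

```typescript
// Subset of the browser's MessageEvent that the check needs.
interface IncomingMessage {
  origin: string;
  data: unknown;
}

// Hypothetical message shape; real embeds each define their own.
interface PlayerCommand {
  action: "play" | "pause";
}

// Returns a validated command, or null if the sender or payload is not accepted.
function parseCommand(
  msg: IncomingMessage,
  trustedOrigins: readonly string[]
): PlayerCommand | null {
  // Ignore senders that were not explicitly allowed.
  if (trustedOrigins.indexOf(msg.origin) === -1) return null;
  const data = msg.data;
  if (typeof data !== "object" || data === null) return null;
  const action = (data as { action?: unknown }).action;
  if (action === "play") return { action: "play" };
  if (action === "pause") return { action: "pause" };
  return null;
}
```

In a page this would be called from a `message` event listener on `window`. Each pair of sites currently has to agree on its own message vocabulary, which is the non-standard convergence dom describes.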

Elad: We should not go with an API shape that makes everything work with existing sites; there are security implications to allowing sending messages cross-origin

youenn: we need to study and enumerate those security issues and provide mitigations if necessary

cpn: Can we hear from someone from a Media Session perspective?
… Are we imagining a combined set of actions between media and presentation use cases?

eric_carlson: I can imagine a page wanting to provide both media actions and slide actions; so having separate actions for the two use cases would remove the possibility of confusion about which action to perform
… and we have already added new actions to the Media Session API

youenn: Agreed, you may want to "play/pause" media within a slide in a presentation
… The Media Session registry could handle that

jan-ivar: What does the "hangup" action do?

eric_carlson: It allows UAs to provide a "mute" or "hangup" action similar to the one a page would provide

jan-ivar: A conservative view would be that Media Session is narrowly about AV playback; however "mute" and "hangup" are more about camera capture
… would people think we should re-use "next track" and "previous track" actions to support page changes?

jernoble: web authors have wanted to reuse the media session API to move between slides, so seems reasonable to add actions for those cases

Elad: How do sites know what actions are supported across origins?
… e.g., how do sites know whether they should send the 'next track' or 'next slide' action?

youenn: for WebRTC, the site might need to know what actions are registered.
… Perhaps we need to provide that information through a new capture api
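
To make the discovery question concrete, the following TypeScript sketch shows how a capturing page might choose between a slide action and a track action if it could see which actions the captured page has registered. No such capture API exists in the minutes; the `supported` list, the "nextslide" action name and `pickForwardAction` are all assumptions for illustration.

```typescript
// Hypothetical: `supported` stands for the list of actions the captured page
// has registered, as some future capture API might expose it to the capturer.
function pickForwardAction(supported: readonly string[]): string | null {
  // Prefer a slide-specific action and fall back to the media one.
  const preference = ["nextslide", "nexttrack"];
  for (const candidate of preference) {
    if (supported.indexOf(candidate) !== -1) return candidate;
  }
  // The captured page handles neither action, so nothing should be sent.
  return null;
}
```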

Elad: from the side of the site being captured; it's not confusing
… but from the capturing side, it could get confusing about which action should be sent
… what happens when the user hits the "next" button on their keyboard?

jernoble: The UA knows which actions have been registered so can route the user input from hardware controls accordingly
… You want the action to go to the frontmost tab; at least in one implementation it goes to the currently playing browser tab
… This is outside the spec; on iOS only one thing can play audio at a time, so it would be the most recently played browser tab
… For macOS, where multiple things can play audio, it would be the one that most recently started playing
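
The routing described here can be sketched as follows in TypeScript. This is one possible reading of the behaviour, not any UA's actual implementation; `TabSession`, its fields and `routeAction` are hypothetical names.

```typescript
// Hypothetical per-tab record a UA might keep; field names are illustrative.
interface TabSession {
  id: string;
  registeredActions: readonly string[];
  lastStartedPlaying: number; // timestamp in ms; 0 if the tab never played
}

// Picks the tab that registered the action and most recently started playing.
function routeAction(tabs: readonly TabSession[], action: string): TabSession | null {
  let best: TabSession | null = null;
  for (const tab of tabs) {
    // Skip tabs that never registered a handler for this action.
    if (tab.registeredActions.indexOf(action) === -1) continue;
    if (best === null || tab.lastStartedPlaying > best.lastStartedPlaying) {
      best = tab;
    }
  }
  return best;
}
```

Because the choice depends only on what each tab registered and when it last started playing, a page cannot tell from the action alone which control the user actually pressed.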

dom: it seems to me we should try to figure out how to move forward with the broader discussion on whether application semantics can be exposed to the browser, and to sites
… part of the question is whether next/previous slide is something that could get traction. It's a question of feasibility: would sites implement it, and would browsers provide controls in their chrome?
… For website to website, there's a security framework question, can we delegate controls and under what conditions?
… How to go about discussing more deeply?

jan-ivar: if media session wanted to move closer to capture actions, by using next/prev slide there'd have to be a current capture session. I can open issues on Media Capture Session if that's a way forward

eric: sounds good to me

chcunningham: I'll check with the Media Session team internally here, current editors have moved on, and I'll reach out to them to nominate a new editor
… if others want to edit the spec, that would be welcome

cpn: Are we seeing that control within a page can influence actions on the captured page?
… it's my understanding that media session API is to allow the UA to control a page; does this fit with the design of Media Session to allow another page to send actions rather than the UA?

eric_carlson: It does make sense for me.

jan-ivar: there are security implications; perhaps "toggle mic" is not the best thing to expose cross site
… there's also another argument that you can use morse-code (or similar) to communicate arbitrary data across
… however, for capture, there's already a lot of information flowing from the captured page to the capturer

dom: It's not just security across the two sites; it's also about the impact to the end user. This will require analysis of the risks the end user will face.

jan-ivar: This is why remote control of a site is out of scope for WebRTC.

dom: It is the recipient's understanding that the action is coming from the UA and not another site
… the expectations of the two may not match.
… this may not be a real issue, but it does need analysis.

harald: if the event can come from multiple sources, the message should include enough information to tell the difference between the sources.

Elad: There are 3 levels: 1. knowing that this came from another origin, 2. knowing the origin that the message came from, and 3. knowing which user on that other origin issued the message.
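
The three levels can be modelled as a tagged union that a receiving page inspects before acting. This TypeScript sketch is illustrative only: none of these fields exist in Media Session, and `ActionSource`, `shouldAccept` and the policy of rejecting senders with no known origin are assumptions.

```typescript
// Hypothetical provenance attached to an incoming action.
type ActionSource =
  | { kind: "user-agent" } // hardware key or UA-provided UI
  | { kind: "other-origin" } // level 1: known only to be another origin
  | { kind: "known-origin"; origin: string } // level 2: sender origin known
  | { kind: "known-user"; origin: string; userId: string }; // level 3: user known

// Example policy: trust the UA, and other origins only if explicitly allowed.
function shouldAccept(source: ActionSource, allowedOrigins: readonly string[]): boolean {
  switch (source.kind) {
    case "user-agent":
      return true;
    case "other-origin":
      return false; // sender is unknown, so there is nothing to check against
    case "known-origin":
    case "known-user":
      return allowedOrigins.indexOf(source.origin) !== -1;
  }
}
```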

Tim: I would like to refine that: beyond knowing that the message came from another site, we need to know that it came from a local user. How do we know that the event didn't originate outside the local machine, like another user on the call?

<dom> [shared control of slideset would actually be useful too]

Tim: We should be more distinct about whether we can prove that the local user was the origin of the message

Elad: And a user gesture requirement does not guarantee the intent

Tim: We do need careful thought about these potential security issues

Elad: That is why I think we need the remote site to adopt a specific API, as a caveat-emptor

Tim: We need more in the origin than just the origin, if that makes sense.

jan-ivar: We have an existing issue to whether we should extend Media Session to support new actions
… We would need a separate issue to track whether actions should be sent across origins.

cpn: Would we use the Media Session repo for these discussions?

jan-ivar: The questions raised are more for Media Session; to consider whether the scope of Media Session should be expanded to send actions from a page

Elad: What is the argument for using Media Session if we need specific adoption?
… are the two APIs truly similar enough to justify only a single API surface for both?

Harald: we have competing concerns: both functional concerns about having the correct thing happen when you press a button, and security concerns as well.

ACTION: capture these concerns and issues in the Media Session github

ACTION: Chris to follow up internally about new editors for the Media Session specification itself.

cpn: what would be the timeline for this?

Harald: two weeks would be good; four weeks at the maximum

cpn: let's continue to work together across the two WGs.

– DRAFT –

25 April 2022
