W3C

– DRAFT –
WebRTC TPAC F2F - Day 2

13 September 2022

Attendees

Present
AlexanderFlenniken, BenWagner, Bernard, Byron, ChrisNeedham, Cullen, Dom, EeroHakkinen, Elad, EricC, Erik, Florent, Harald, Henrik, Jan-Ivar, JeffWaters, MarkFoltz, MartinT, Masaki, MichaelSeydl, Mike, Mike English, Peter, Philipp, RichardBarnes, riju, Tony, Tove, TuukkaToivonen, Youenn
Regrets
-
Chair
Bernard, hta, jan-ivar
Scribe
dom

Meeting minutes

Slideset: https://lists.w3.org/Archives/Public/www-archive/2022Sep/att-0000/WEBRTCWG-2022-TPAC.pdf

[Slide 75]

WebRTC Encoded Transform

[Slide 78]

Harald: encoded transform is fine for crypto, but not fine for other manipulation use cases

Issue #106: Add use cases that require one-ended encoded streams

[Slide 79]

Harald: several use cases where you want to connect a stream to somewhere else after processing
… not sure what a proper API would look like, so thought we should go back to requirements

youenn: looking at the use cases - they probably deserve different solutions
… e.g. webtransport probably shouldn't use peerconnection
… alternative encoders/decoders - sounds like a different API altogether
… metadata may be done prior to PC

Harald: encoded transform is a stream source connected to stream sink
… a one-ended stream has only one of these
… we have an ecosystem of media & encoders that people have gotten used to
… if we can plug into this ecosystem, it seems a better solution than creating novel solutions for this
… it might be that we decide that it's not the same ecosystem
… in which case we might kick the ball over to media

youenn: starting from use cases and then deriving requirements as done for WebRTC-NV would be useful to do here
… it's easier to derive APIs from requirements

harald: the SFU-in-browser is a technique to achieve the scalable video conferencing use case we discussed yesterday

youenn: describing use cases in more detail and then deriving requirements from there

jib: +1 on better description of use cases

Bernard: the NV use cases have no API that satisfies the requirements
… WebTransport doesn't support P2P; the only path is RTP

JIB: so the idea would be to expose an RTP transport to JS

Bernard: or make the datachannel a low-latency media transport, but there doesn't seem to be much stomach for that

Harald: we have a discussion scheduled on whether to consider a packet interface in addition to a frame interface
… We'll detail the use cases more to figure out if an extension of media stream is relevant or if we need something completely different

Issue #90: Pluggable codecs

[Slide 80]

Harald: we've been lying to SDP about what we're transporting

<martinthomson> what is ECE?

Harald: to stop lying, we need a mechanism to allow the app to tell the SDP negotiation that they're doing something else than the obvious thing

<fippo> probably EME (there was a comment + question in the slides)

youenn: this may lead us to a conclusion that encoded transform was a mistake

<martinthomson> ...I tend to think that this is already broken

youenn: the other possibility, you could state during negotiation that you're going to use app-specific transforms
… letting intermediaries know about this
… we tried to push this to IETF AVTCore, without a lot of success

Harald: maybe MMUSIC instead?

Cullen: it's worth trying again - slow movement has been the pattern in the past 2 years, not a signal

Bernard: the reason why SFrame has not moved in AVTCore is because nobody showed up, drafts were not submitted, and the area director is considering shutting down the SFrame WG

Youenn: I went to several meetings, tried to understand the submitted issues, but struggled to find solutions that would satisfy
… the work has been stalled for lack of consensus

herre: can we move forward without the dependency on IETF, by allowing the JS to describe its transform to the other party?

Youenn: encoded transform has a section on SFrame transform, which wasn't pointing to an IETF draft until recently

Harald: the scripttransform is fully under the app control, but it doesn't have a way to tell the surrounding system it changed the format
… we could add an API before the IETF work emerges

Martin: SFrame is very close to death, I expect some more work to be done though
… once you give script access to the payload, anything is possible
… this breaks the assumptions under which the encoder and packetization operate
… I don't think letting the script write the SDP is right; we need a model that makes sense, not sure what it would be

Youenn: we had a model: the traditional video pipeline with a break into it
… we could open it up more and expose more of the state of the pipeline
… we could expose e.g. bitrate if useful, based on use cases
… for pluggable codecs, you need to set a break before webrtc encoded transform & the track, and be able to set a special packetization

martin: you'd want to deal with raw media (the track), then do the encoding and the packetization

youenn: not sure we need all the breaks

Issue #31 & Issue #50: Congestion Control

[Slide 81]

Martin: none of this is necessary if you're looking at just mutating packets

Harald: not if the size or number of packets can change

Martin: some of it can be modeled as a network-specific MTU for the SFrame transform

Harald: the downstream would need to expose its MTU, and the SFrame transform would share its MTU upstream

Martin: but beyond, this is looking at the entire replacement of the chain
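Martin's suggestion of modeling transform overhead as a reduced MTU can be sketched numerically; the function name, overhead figure, and MTU value below are hypothetical, chosen only to illustrate the idea of the transform advertising a smaller MTU upstream:

```javascript
// Sketch: a transform that adds fixed per-packet overhead (e.g. an
// SFrame-style header) advertises a correspondingly smaller MTU to the
// stage upstream of it. Names and numbers are illustrative only.
function effectiveMtu(networkMtu, transformOverheadBytes) {
  const mtu = networkMtu - transformOverheadBytes;
  if (mtu <= 0) throw new RangeError("overhead exceeds network MTU");
  return mtu;
}

// e.g. a 1200-byte network MTU minus a 17-byte per-packet header
const upstreamMtu = effectiveMtu(1200, 17);
```

As discussed, this only models the simple case; it breaks down once the transform can change the number of packets rather than just their size.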

Youenn: the AR/VR use case is where data can be much bigger when you attach metadata
… one possible implementation is to do this with ScriptTransform to stuff metadata in the stream, as a hack
… not sure if we should accept this as a correct use of the API
… in such a use case, expanding the frame size means the bitrate is no longer correct
… the UA could instruct the encoder to adapt to the new frame size
… or we could expose new APIs
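The "stuff metadata in the stream" hack youenn describes could look roughly like the following: serialized metadata appended to each encoded frame's payload with a length trailer, stripped again on the receive side. This framing format is invented purely for illustration; as noted in the discussion, it inflates frames beyond what the encoder's bitrate target accounts for.

```javascript
// Sketch of the metadata-stuffing hack: append metadata to the encoded
// payload, followed by a 4-byte big-endian length so the receiver can
// strip it. The framing is invented for illustration only.
function stuffMetadata(payload, metadata) {
  const out = new Uint8Array(payload.length + metadata.length + 4);
  out.set(payload, 0);
  out.set(metadata, payload.length);
  new DataView(out.buffer).setUint32(
    payload.length + metadata.length, metadata.length);
  return out;
}

function stripMetadata(stuffed) {
  const view = new DataView(
    stuffed.buffer, stuffed.byteOffset, stuffed.byteLength);
  const metaLen = view.getUint32(stuffed.byteLength - 4);
  const payloadLen = stuffed.byteLength - 4 - metaLen;
  return {
    payload: stuffed.slice(0, payloadLen),
    metadata: stuffed.slice(payloadLen, payloadLen + metaLen),
  };
}
```

In an encoded transform worker, each frame's `data` would be replaced with the stuffed buffer before enqueuing, which is exactly the pattern whose legitimacy is being questioned here.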

<peter> Isn't the targetBitrate already in webrtc-stats?

martin: AR/VR is probably a wrong usage of ScriptTransform
… it would be better handled as a different type of MediaStreamTrack
… this points toward being able to build a synthetic media flow

martinthomson: it would seem better to look at it this way rather than through a piecemeal approach
… the AR/VR points toward synthetic media flows

Bernard: people have tried using the datachannel for AR/VR
… didn't work for A/V sync or congestion control
… they want an RTP transform
… the A/V stream helps with sync
… if you put it in a different flow, how do you expose it in SDP
… it's the only way available in the WebRTC model today

fluffy: on WebEx hologram, we do exactly what Martin describes
… we send a lightfield in a stream that looks like a video stream
… same for hand gestures etc
… all of this sent over RTP
… it's low bit-rate data, doesn't need to adapt like audio
… lightfield instead needs bandwidth adaptation
… this could apply to haptics, medical device data being injected in a media stream

TimP: part of our problem has been mapping all of this to SDP, for things created on the fly
… describing things accurately in SDP is a lost cause as we'll keep inventing new things

<martinthomson> Steely_Glint_: SDP is extensible....

TimP: we should be describing the way we're lying (e.g. we're going to add 10% to the bandwidth; it won't be H264 on the way through)
… without trying to describe it completely

Peter: I had proposed an RTP data mechanism a few years ago, which sounds similar
… we could have an SDP field to say this is arbitrary bytes
… or construct something without SDP

Martin: I was suggesting new type of RTP flows with new "codecs"
… browsers can't keep up with all the ways that SDP would be used; we should instead give a way for apps to describe their "codecs" via a browser API

Issue #99 & Issue #141: WebCodecs & WebRTC

[Slide 87]

youenn: Both WebRTC and WebCodecs expose similar states
… but there are differences e.g. in mutability

<jesup> I strongly agree with Martin's comments; these data-like should be "codecs", which allows for much more flexibility, specification, and interoperability

youenn: should we try to reconcile? should we reuse webcodecs as much as possible?

<Steely_Glint_> But we do need (in sframe) to allocate a suitable codec (say h264) - the 'generic' pass through drops that info

youenn: I propose we stick to what we shipped

DanSanders: from the WebCodecs side, that sounds like a good approach
… we don't have a generic metadata capability

harald: so we should document how you transform from one to the other
… it's fairly easy to go from webrtc to web codecs
… the reverse is not possible at the moment

<Bernard> Youenn: we can create constructors to build RTCEncodedVideoFrame from EncodedVideoChunk

herre: if we move to the one-ended model, this creates trouble in terms of ownership and lifecycle

youenn: we deal with that problem in Media Capture transform through enqueuing via cloning (which is effectively a transfer)

<peter> +1 to constructors for RTCEncodedVideoFrame/RTCEncodedAudioFrame

Bernard: re constructors: having constructors to get from one type to the other would allow conversion between the two

jib: your proposal doesn't address the mutability of metadata

youenn: the particular metadata I'm referring to aren't mutable

<Bernard> Harald: this model does not support the use cases we have been discussing.

youenn: can we close the issue, or should we wait until the architecture gets designed?

Harald: I hear support for the two-way transform

youenn: let's file an issue specifically about that and close these 2 issues

Issue #70: WebCodecs & MediaStream transform

[Slide 88]

[Slide 89]

DanSanders: proposal 1 is straightforward
… we don't have a metadata API for lack of a good enough technical proposal
… the mutation/cloning aspect is the challenge
… e.g. cropping may generate no longer accurate data about face detection
… it depends on what cloning does

peter: are we talking about how the metadata would go over the network?

youenn: here we're focusing on mediastreamtrack as a series of frames
… we don't have a good solution for moving it over the network as we discussed in the previous item
… the WebRTC encoder could be a pass-through for the metadata, but it's still up in the air - we welcome contributions

chris: in webcodecs, there is some request to expose H.265 SEI metadata for user-defined data

<miseydl> some meta information might be provided by the containerization of the video codec itself (NAL info etc.); would we populate that generic meta array with those infos?

chris: that would presumably be exposed alongside VideoFrame
… it would be useful to look at the use cases together

Dan: this is kind of low priority because of low multiplatform support
… if we have a metadata proposal that works, it could be used here

youenn: we had someone sharing such an approach - although it's codec specific

chris: we'll also continue discussing this at the joint meeting with Media

harald: metadata has some specific elements: timestamp, ssrc, dependency descriptors
… the last one obviously produced by the encoder
… mutable metadata - if constructing a new frame is very cheap, we don't need mutability

DanSanders: it's quite cheap, just the GC cost

Harald: we'll continue the discussion at the joint meeting & on github

Issue #143: generateKeyFrame

[Slide 90]

<Ben_Wagner> WebCodecs spec requires reference counting: https://www.w3.org/TR/webcodecs/#raw-media-memory-model-reference-counting

Peter: what about returning multiple timestamps?

youenn: that's indeed another possibility

<martinthomson> does it even need to return something?

youenn: but then the promise will resolve at the time of the last available keyframe

martinthomson: does it need to return anything, since you're going to get the keyframes as they come out?

youenn: it's a convenience to web developers to return a promise (which also helps with error reporting)

martinthomson: the promise resolves after the keyframe is available, which isn't the time you want

<miseydl> one could also use the timestamp to associate/balance keyframerequests, which is useful for various reasons.

youenn: it's resolved when the frame is enqueued, before the readablestream

martinthomson: this seems suboptimal if what you want is the key frame
… if frames are enqueued ahead of the keyframe

youenn: in practice, the expectation is that you'll be polling the stream; otherwise your app is broken

martinthomson: with machines that jank for 100s of ms

youenn: the promise can also be used to return an error, which I don't think can be validated asynchronously

martinthomson: that argues for a promise indeed; not clear that the timestamp return value is needed

fluffy: what you want to know is that the keyframe has been encoded; the timestamp is irrelevant

youenn: so a promise with the timing we said, but no timestamp

Peter: would it be reasonable to have an event when a keyframe is produced?

youenn: you do that by reading the stream and detecting keyframes

Peter: I like proposal 3 as a way to cover the situations you want

TimP: the way I recall it, the purpose of the timestamp was to help with encryption through SFrame for key changes

martinthomson: this can be done by waiting for a keyframe in the stream before doing the key change
… I also don't think it's strictly necessary to resolve the promise upon enqueuing

<jesup> +1 for proposal 3. Simple. Agree with mt

martinthomson: it could be done when the input has been validated

RESOLUTION: go with proposal 3 without returning a timestamp
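Per the resolution, a caller that needs the keyframe itself reads the encoded-frame stream until one arrives, rather than relying on the promise's resolution timing. A minimal sketch of that detection loop (the `type === "key"` check mirrors the encoded-frame shape; the helper name is hypothetical):

```javascript
// Sketch: read frames from an encoded-frame ReadableStream until a
// keyframe arrives, returning it. The promise from generateKeyFrame()
// would be awaited separately, mainly for error reporting.
async function nextKeyFrame(readable) {
  const reader = readable.getReader();
  try {
    for (;;) {
      const { value: frame, done } = await reader.read();
      if (done) throw new Error("stream ended before a keyframe arrived");
      if (frame.type === "key") return frame;
    }
  } finally {
    reader.releaseLock();
  }
}
```

This is why returning a timestamp was judged unnecessary: the app already observes the keyframe directly when it reaches the stream.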

Conditional Focus

[Slide 93]

Elad: screen sharing can happen in situations of high stress for the end user
… anything that distracts the user in that moment is unhelpful
… the API we're discussing is to help the app set the focus on the right surface

[Slide 94]

[Slide 95]

[Slide 96]

elad: still open discussion on default behavior when there is a controller

[Slide 97]

[Slide 98]

[Slide 99]

[Slide 100]

[Slide 101]

youenn: re task, we want to allow for the current task - there is no infrastructure for that, but implementations should be able to do that
… a bigger issue: Chrome and Firefox have a model where the screenshare picker always happen within the chrome
… it's very different in Safari - picking a window focuses the window
… so the behavior would be to focus back on the browser window
… being explicit on what is getting the focus would be better, so setFocusBehavior would be an improvement
… I don't think we should define a default behavior since we're already seeing different UX across browsers
… I would also think it's only meaningful for tabs - for windows, they could determine it at the time of the gDM call

elad: re different UX models, we could fall back to making that a hint
… re window vs tab, it may still be useful as a hint to adapt the picker

youenn: unlikely we would do something as complex

jan-ivar: I'm actually supportive of option 2
… regarding applicability to window - for screen recording apps, the current behavior hasn't proved helpful

youenn: but this could be done via a preset preference in the gDM call

jan-ivar: we could, although maybe a bit superfluous

jib: setFocusBehavior is a little more complicated, more of a constraint pattern with UA dependent behavior
… but don't feel very strongly
… but yeah, turning off focus by adding a controller doesn't sound great

RESOLUTION: setFocusBehavior as a hint with unspecified default applicable to tabs & windows
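A browser-only sketch of the resolved shape, based on the CaptureController proposal discussed here (not runnable outside a browser; the hint must be set before the getDisplayMedia promise settles):

```javascript
// Sketch: setFocusBehavior as a hint on CaptureController, applicable
// to both tabs and windows, with the default deliberately unspecified.
const controller = new CaptureController();
const promise = navigator.mediaDevices.getDisplayMedia({
  video: true,
  controller,
});
// A hint only: the UA may focus or not regardless.
controller.setFocusBehavior("no-focus-change");
const stream = await promise;
```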

youenn: deciding to not focus is a security issue - it increases the possibility of Web pages selecting a wrong surface
… since this lowers security, there should be guidelines for security considerations

Elad: should this be a separate doc?

youenn: let's keep it in screen-share

jib: +1 given that we're adding a new parameter to getDisplayMedia

<fluffy> Proposing cropTargets in a capture handle

Screen-sharing Next Steps

Slideset: https://lists.w3.org/Archives/Public/www-archive/2022Sep/att-0003/WEBRTCWG-2022-TPAC__1_.pdf

[Slide 104]

[Slide 105]

[Slide 106]

[Slide 107]

mark: setting the crop target on the capture handle - is that serializable / transferable ?

youenn: serializable

mark: then it could be transferred over the messageport

elad: but there is no standard format for that

youenn: re crop target serializability, +1
… I'm not sure yet about having cropTargets in capture handle
… it may require more data, e.g. different cropping for different sinks
… having app specific protocol might be a better way to start before standardizing a particular one
… re MessagePort, the security issues can be solved
… re content hint, I'm not convinced
… the capturer doesn't have to provide the hint, the UA can do it itself

elad: so 3 comments:
… - cropTargets may need more context (although my main use case is for a single cropTarget)

youenn: this could be dealt on a per-origin protocol agreement

elad: but that doesn't work with non-pre-arranged relationship

jan-ivar: this MessagePort would be a first in terms of going cross-storage (not just cross-origin) - definitely needs security review
… this could still be OK given how tied it is to user action and the existing huge communication path via the video sharing
… In the past, we've tried to piecemeal things by not having a MessagePort
… part of the feedback I've been getting is maybe to just have a MessagePort, as that would be simpler and help remove some of the earlier mechanisms we had to invent
… thank you for suggesting cropTargets to allow non-tightly-coupled capturee-capturer relationships
… I'm not sure if it's necessary if we're moving to a MessagePort

<youenn> @jib, window.opener can postMessage probably.

elad: I don't think a MessagePort could replace the capture handle, since it only works for cooperative capturee/capturer
… also the messageport alerts the capturee of an ongoing capture, with possible concerns of censorship
… I think we need to address them separately

hta: thanks for the clarification on MessagePort being orthogonal to CropTarget
… MessagePort is two-way where capture handle is one-way, this may have a security impact
… I think these 2 proposals are worth pursuing (as a contributor)
… not convinced yet about content hint
… should this be linked to a crop target instead?

elad: would make sense

TimP: I like all of this, and do like the multiple crop targets and notes
… the MessagePort shouldn't replace the rest of this, it's more complicated for many developers
… I like the 2 layers approach

fluffy: I find the security issues with MessagePort concerning without more details
… re trusting or not web sites for content hint - the capturer could determine it

elad: content hint helps with setting the encoder correctly

[Slide 108]

jib: I don't think there is new information to change our current decision, nor have I had enough time to consider this

Encoded transform

Issue #131 Packetization API

[Slide 80]

hta: would this cover packetization & depacketization?

youenn: we would probably need both, good point

Peter: we could add custom FEC to the list as a valid use case
… being able to send your own custom RTP header would be nice
… although that would be possible to put in the payload if you had control over it

richard: this points toward an API that transforms the packets à la insertable stream
… SPacket is simpler for encryption

Bernard: we need to be able to packetize and depacketize if we use it for RED or FEC
… you need to be able to insert packets that you recover

HTA: I don't think we can extend the encodedvideoframe for this, it's the wrong level
… we need an rtcencodedpacket object probably
… any impression on whether that's something we should do?
… do we have enough energy to pursue this?

Bernard: a bunch of use cases would benefit from this

Peter: I'm energetic on it

richard: +1
… esp if we focus on a transformation API

HTA: next steps would be writing up an explainer with use cases, and a proposed API shape

<rlb> happy to help, if youenn is willing to drive :)
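To make the packet-level interface concrete, here is a sketch of the kind of operation an RTCEncodedPacket-style API would enable: splitting an encoded frame into MTU-sized packets and reassembling them. The 3-byte header (frame id, packet index, packet count) is invented for illustration; a real API would carry this in RTP-level fields.

```javascript
// Sketch: application-level packetization of an encoded frame into
// MTU-sized packets, plus reassembly. The 3-byte header format
// (frame id, packet index, packet count) is illustrative only.
function packetize(frameId, payload, mtu) {
  const chunk = mtu - 3; // room left after the header
  const count = Math.max(1, Math.ceil(payload.length / chunk));
  const packets = [];
  for (let i = 0; i < count; i++) {
    const body = payload.subarray(i * chunk, (i + 1) * chunk);
    const pkt = new Uint8Array(body.length + 3);
    pkt[0] = frameId; pkt[1] = i; pkt[2] = count;
    pkt.set(body, 3);
    packets.push(pkt);
  }
  return packets;
}

function depacketize(packets) {
  packets.sort((a, b) => a[1] - b[1]); // order by packet index
  const total = packets.reduce((n, p) => n + p.length - 3, 0);
  const payload = new Uint8Array(total);
  let off = 0;
  for (const p of packets) {
    payload.set(p.subarray(3), off);
    off += p.length - 3;
  }
  return payload;
}
```

Custom FEC or RED, as raised in the discussion, would hook in at the same level: generating extra recovery packets on send and inserting recovered packets on receive.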

Action items & next steps

HTA: we had some serious architecture discussions on encoded media - I'll take the action item to push that forward
… Elad is on the hook for capture handle
… and we have 3 signed up volunteers for packetization

Bernard: we had good discussion on use cases we want to enable

JIB: we also closed almost all of the simulcast issues

Elad: I'm looking into a proposal for an element capture API to generate a mediastreamtrack without occluded content - it has security issues that we'll need to look into
… this will be discussed at a breakout session tomorrow at 3pm PT

HTA: we also have a joint meeting with Media WG on Thursday - we'll discuss metadata for video frames there

[adjourned]

Summary of resolutions

  1. go with proposal 3 without returning a timestamp
  2. setFocusBehavior as a hint with unspecified default applicable to tabs & windows
Minutes manually created (not a transcript), formatted by scribe.perl version repo-links-187 (Sat Jan 8 20:22:22 2022 UTC).