WebRTC WG Teleconference minutes 2023-10-17

Slideset: https://lists.w3.org/Archives/Public/www-archive/2023Oct/att-0002/WEBRTCWG-2023-10-17.pdf

Scribe: Henrik

Congestion control (Harald)

Network may be limited. Sending too much causes discards and is bad manners. The browser will police the sender.

What about EncodedTransform? Without it, the transport estimates for you, telling the encoders what the target is. But the transform changes things: frame sizes change. Therefore, the transform needs to know the target.

Proposal: add cancellable events for significant changes in available BW.

See slides for WebIDL and examples.
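As a rough sketch (not the slide WebIDL): a worker-side transform could listen for such an event and rebudget before the encoder overshoots. The "bandwidthestimate" event and its fields below are hypothetical placeholders, and this assumes the proposal makes the transformer an event target:

    // Worker side; onrtctransform and RTCRtpScriptTransformer are existing
    // API, the "bandwidthestimate" event and its fields are hypothetical.
    onrtctransform = (e) => {
      const transformer = e.transformer;
      let addedBytesPerFrame = 0; // overhead tracked by the transform

      transformer.addEventListener("bandwidthestimate", (event) => {
        // Hypothetical: subtract the transform's own overhead from the new
        // estimate (event.availableBitrate) before it reaches the encoder.
      });

      transformer.readable
        .pipeThrough(new TransformStream({
          transform(frame, controller) {
            // ... append metadata, update addedBytesPerFrame ...
            controller.enqueue(frame);
          },
        }))
        .pipeTo(transformer.writable);
    };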

Jan-Ivar: I don’t understand the use case. It seems that we’re expanding the role of the application from simply transforming to something more. The user agent should already see how much data was added by the transform, so why would we need to add BW information and allow modifying it? Is it justified?

Harald: Sending more data than there is room for is a bad thing, see previous slides. Letting the downstream decide what was added requires that the downstream can see both the incoming and outgoing sizes of the transform, and that the added outgoing information is consistent over time.

Youenn: I can see that you would want to override the user agent. But I think the user agent already knows the overhead that the transform is adding, so it can do something. What we need to understand is in which situations the user agent behavior is not good. A transform that is doing a lot of things can drop frames. I’m fine with letting the web page influence this, but I am not sure how, or whether this API will be easy for web developers to use. It’s not clear how practical BandwidthInfo is. But I think it is worth continuing the investigation.

Harald: So you think there is a case for onkeyframerequest, bringing that forward as a separate PR, correct?

Youenn: I think so, we should discuss it, but I see the use case and it is a straightforward boolean. That’s much easier than the BW info, since there are multiple parameters.

Harald: So we have a controversial and a non-controversial part; let’s separate them. But the case where the transform is not capable of doing the right thing is when the frame comes from another source, because then the source might be under app control, but not under control of the sender’s encoder. So the sender might do everything it can with its encoder, but if the encoder is not the source of the frame, we’re in trouble. For that use case, we need something like this.

Bernard: Question about the flow of the BW info. You get it via the onbandwidthestimate?

Harald: We should fire events and let the user read the state of the event. You have to read BW info when you get the onbandwidthestimate.

Bernard: So you make a copy and then call sendBandwidthEstimate, correct? So it’s not actually setting it on the wire?

Jan-Ivar: I’m trying to follow up on the sources. Part of my concern is that we don’t have consensus yet on whether this is the right API shape. But a question about the events: the network can change, but one of the use cases is that metadata can change. Is this meant to be a signal that the user agent can use punitively on JS that is producing too much data? Or is this plain data?

Harald: The sender is allowed to drop frames; that’s something we already agreed on. But this is surfacing the information so that the app can adjust, with a high probability that downstream does not later drop the frame. The BW can change fast of course, so there is never a guarantee. But if the transform adds metadata, for example about the silhouette or standstill that you use for background replacement, and it knows that this information will add 1 KB to the frame, then when it changes from not sending to sending this data, it can proactively tell the encoder: I will now add more stuff to your frames. This is why you might need to set this even when there is no signal to the transform.
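(For scale, with the 1 KB example: at 30 fps that is 1000 bytes × 8 bits × 30 frames/s = 240 kbps of extra payload the encoder’s target has to absorb.)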

Jan-Ivar: It might seem appropriate/useful to me to signal something about the caps to JS, but firing events at 60 frames per second seems unnecessary.

Harald: A lot of the time there will not be any change, so I think firing events is appropriate since it will only fire some of the time. You could read it just before deciding what to do, and that is perfectly reasonable.

Conclusion: Separate PR for key frame event. But maybe we don’t need an event for BW info, you can just read it. I can make those changes and come back in November.

Mediacapture-screenshare (Elad)

Elad is not here. Skipping for now.

Mediacapture Extensions

Henrik…

We have video frame counters; we should similarly add audio frame counters for the same reasons, like calculating the percentage of frames lost (e.g. to detect audio glitches). But in the audio case it’s also very interesting for audio quality to know about capture delay, so we should also measure the delay between capture and the audio frames being delivered. (Arguably you might want this for video too, but so far nobody has asked for it.) So here is the PR. We might want to modify it to say totalFrames instead of droppedFrames (you can calculate drops from totals by subtracting delivered), as this would be more consistent with the audio stats. But in general, can we move on and merge this PR, following up on this in the editor’s meeting?
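A minimal usage sketch, assuming the PR’s shape of per-track audio stats (the attribute and field names below are placeholders pending the editor’s meeting):

    // Hypothetical track.stats shape per the PR under discussion.
    const [track] = stream.getAudioTracks();
    const s = track.stats;
    const dropped = s.totalFrames - s.deliveredFrames;
    const dropPercent = (100 * dropped) / s.totalFrames;
    // Average capture-to-delivery delay, assuming totalCaptureDelay is a
    // cumulative sum in seconds over the delivered frames:
    const avgCaptureDelayMs = (1000 * s.totalCaptureDelay) / s.deliveredFrames;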

Jan-Ivar: Paul’s not here but he put some comments on the issue that it would be great if you could look at. But overall I applaud the move and think this is good.

Discussion around naming and clarifications around what “delivered” means. But the overall approach is not controversial.

Henrik: Delivered is when the frames are being handed off to the sinks. These are the exact same definitions as for the video frames.

Jan-Ivar: But this isn’t observable.

Henrik: No, but it covers the part of the pipeline up to the sink. For example, if there is a delay before that, and you then use a WebRTC peer connection, and the peer connection adds additional delay in encoding and sending, then WebRTC getStats would have to tell you about any additional delays there. So even though the exact delivery time is not observable, capture delay is hopefully a quite well-understood concept if we clarify it, and this is only an estimate anyway. I mean, if the user is experiencing 300 ms of delay but the API says 50 ms, then that’s clearly a bad implementation and we should file a bug to make the capture delay more accurate.
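For example, a sketch of composing sender-side delays, pairing the hypothetical capture stats above with fields that already exist in webrtc-stats:

    // totalPacketSendDelay and packetsSent are existing outbound-rtp stats;
    // track.stats is the hypothetical per-track proposal above.
    const captureMs =
      (1000 * track.stats.totalCaptureDelay) / track.stats.deliveredFrames;
    const outbound = [...(await pc.getStats()).values()]
      .find((r) => r.type === "outbound-rtp" && r.kind === "audio");
    const pacingMs =
      (1000 * outbound.totalPacketSendDelay) / outbound.packetsSent;
    const senderSideMs = captureMs + pacingMs; // rough approximation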

Youenn: In the webrtc stats we talk about samples; maybe it would make more sense to talk about samples here too, since audio and video are different. That may be what audio folks prefer.

Henrik: Actually, audio frames is what I was asked to use based on Paul’s input, and it is consistent with other audio APIs. Also, on a historical note, the webrtc stats using samples was a mistake; there’s actually an existing note about this explaining how the samples are normalized on the number of audio channels. So the webrtc stats using samples is actually misleading and not what is actually measured there, so we should use frames.

Conclusion: Overall approach makes sense, flesh out the details in the editors meeting

Grab bag: Racy devicechange event design has poor interoperability (Jan-Ivar)

Problem: enumerating devices can take 100+ ms, so the devicechange event and the enumeration result get out of sync. It becomes hard to reason about, leading to trial-and-error coding that eventually passes QA, but may pass due to unintended side effects and fail in other browsers.

Proposal: Include devices as a parameter to the devicechange event.
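To illustrate the race and the proposed fix (the devices attribute on the event is the proposal, not shipped; renderDeviceList is an app-defined placeholder):

    // Today: by the time the promise resolves, the device set may have
    // changed again, so the handler can render a stale list.
    navigator.mediaDevices.addEventListener("devicechange", async () => {
      renderDeviceList(await navigator.mediaDevices.enumerateDevices());
    });

    // Proposed (shape hypothetical): the event carries the snapshot that
    // triggered it.
    navigator.mediaDevices.addEventListener("devicechange", (event) => {
      renderDeviceList(event.devices);
    });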

Youenn: I’m wondering if we could try to deprecate enumerateDevices, but anyway, we have talked about trying to be more explicit about why the event fires (Jan-Ivar: that’s the next slide).

Harald: So this means that when you fire the event, you have already done the work to enumerate all devices, so it would probably fire later than today?

Jan-Ivar: I think the way we have written the algorithm, that information should already have been acquired, but yeah, otherwise there would be a delay.

Harald: I think that’s ok, it might lead to fewer events firing.

Guido: I’m ok with the change too.

Conclusion: No objection.

Grab bag: Should devicechange fire when the device info changes? (Jan-Ivar)

The spec says to fire when the set of devices available to the user agent has changed. But user agents already lie about this and fire when device info changes, or based on getUserMedia. So the question is, should we change the spec here or should we change Safari?

Proposal A, B, C (see slides). I think we should go with proposal A which is no change.

Youenn: You say it’s not web compatible, but somehow we shipped it, so it’s not clear it’s not web compatible. The spec talks about the devices the user agent has access to, and you could see a world where the user agent does not have access to any devices until the OS has been prompted about wanting to use them. So I think in that sense Safari is following the spec.

Jan-Ivar: Should we not have an event that makes auto switching easy?

Youenn: Yes that is interesting and we could have an event that said why it fired, that might be much easier to understand to web developers. But if it can be solved with enumerateDevices, then that is fine as well. But I think Safari is following the spec.

Jan-Ivar: Do you have a preference?

Youenn: Hmm. (Needs to think)

Guido: I’m more inclined to make the change more generic: devices available to the web page. But my main concern is the current wording about the set of devices available to the user agent changing. What if the label changes? Is that a change to the list of devices or not? I would like the event to fire if there is any change (such as a label, for the sake of argument), if the result changes. Anything that changes the result should fire an event. What do you think?

Jan-Ivar: Is that proposal C?

Guido: Well, not necessarily, because you focus on the case where the set of devices changes when you call getUserMedia. I’m not against firing it in that case; I’m inclined to fire on any change available to the web page. But what needs to be clarified is what a change to the set of devices means.

Jan-Ivar: OS level changes for example? In Safari’s case it is the user agent changing the labels.

Guido: So what does “set of media devices has changed” mean? One interpretation is that anything in the devices changed; is it the number of elements, or any change in the elements that are already there? My main concern is that I want the event to fire if anything changes, not just the number of devices. Can we update the wording?
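Guido’s point in code, roughly: treat “changed” as any field change visible to the page, not just a count change (compareKey is an illustrative helper):

    // Compare full enumeration results, labels included.
    const compareKey = (devices) =>
      JSON.stringify(devices.map(({ kind, label, deviceId, groupId }) =>
        [kind, label, deviceId, groupId]));
    let last = compareKey(await navigator.mediaDevices.enumerateDevices());
    navigator.mediaDevices.addEventListener("devicechange", async () => {
      const now = compareKey(await navigator.mediaDevices.enumerateDevices());
      if (now !== last) {
        last = now;
        // react: anything that changes the result fires the app's logic
      }
    });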

Jan-Ivar: That might be OK, it’s probably quite rare, I’m not that concerned, but that would be OK. My main concern here is if Safari is right in firing the event.

Henrik: Arguably the set of devices is not very relevant to the JS app; from the JS app’s perspective, the only thing that matters is what the result is and whether that result changes. So if one browser changes the set of devices but another browser doesn’t, even though that is different behavior, it isn’t necessarily a web compat issue, since only in the browser where something changed does the app need to respond. As long as the event firing is consistent within a browser, it should hopefully make sense.

Jan-Ivar: But the app may want to know if the user plugged in a device, and now the event could fire in some browsers without that changing. Maybe prompt or not.

Youenn: The user agent is in a good spot to judge whether the user might want to switch. There are cases where you may or may not want to auto switch; for example, AirPods might get automatically connected by being placed close to the MacBook. So maybe we could expose more information to the app instead.

Jan-Ivar: Perhaps we could have a device plugged event?

Conclusion: More discussion needed, but we seem to agree on the problems we need to solve.

Exposing decode errors (Philipp Hancke)

This was discussed at TPAC; the group was generally in favor of exposing this on the RTCRtpSender/RTCRtpReceiver rather than on the peer connection. Makes sense? -Yes.

A PR is presented adding RTCRtpSenderErrorEvent extending Error.
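A rough consumption sketch, following the PR’s direction of surfacing errors on the sender rather than the peer connection (the event name and fields are placeholders pending the PR):

    const sender = pc.addTrack(track, stream);
    // Hypothetical event; assumes the PR makes RTCRtpSender an event target.
    sender.addEventListener("error", (event) => {
      // e.g. log telemetry, or reduce resolution if HW encoding fell back
      // to SW; field names are placeholders.
      console.warn("sender error:", event.message);
    });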

Henrik: Spatial index and encoding index are not the same thing. You could have one encoding with multiple spatial indices, e.g. L3T1, or you could have three encodings with L1T1; these are different things, as we have one or multiple encoders.

Philipp: But spatial index is already used in other places.

Henrik: If so, that is a mistake. I have been cleaning up the code base in a lot of places where these two things get mixed up, so we definitely should not duplicate this confusion in new APIs.

Bernard: WebCodecs issue 669 describes the approach in WebCodecs, which is to use EncodingError for an issue with the data (e.g. the decoder can’t parse it) and OperationError for resource issues.

Philipp: We specifically want to know about SW fallback.

Youenn: Possible privacy issue. I think we should have a fingerprinting annotation on the PR, and ask the PING people.

Florent: You can imagine an application abusing the different settings already if it wanted to do fingerprinting, so this is not necessarily unique.

Youenn: If so we should add a note if we missed something earlier.

Jan-Ivar: I’m confused, when should this event fire? Only on SW fallback, or for other reasons?

Philipp: There are multiple reasons for falling back to SW; that’s one of the main reasons for wanting to know about it. For example, the HW queue gets overloaded.

Jan-Ivar: It would be good to express what the app is expected to do in response to the event.

Henrik: Is it always SW fallback?

Philipp: There can be cases where SW fallback.

Conclusion: More clarification and discussion needed.

setCodecPreferences vs unidirectional codecs (Philipp Hancke)

Some codecs, H.264 profiles in particular, are send-only or receive-only. The algorithm says to look at the send and receive directions, but it does not take the transceiver direction into account.

We need to take directionality into account. But if we do, we need to throw an exception in setCodecPreferences, or when setting the direction, if we get into an incompatible state.

Youenn: Can it say “I want this codec but only for send”? Hm, no.

Florent: Having the direction would make sense; I think a lot of issues come from codecs that cannot be decoded. I wonder if we should have more restrictions on how we handle codecs. Maybe have send codecs be a subset of receive codecs. It would fix some issues.

Harald: I think we should take directionality into account. I encountered some of the same problems when specifying the SDP negotiation PR we’ll get to later. We’ll need to look more into how JS has access to send and receive codecs separately.

Henrik: We already have that, with sender and receiver getCapabilities.
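These exist today, e.g. a page can intersect the two before calling setCodecPreferences (a sketch that ignores null returns and parameter-level nuances):

    // Per-direction capabilities are already exposed:
    const send = RTCRtpSender.getCapabilities("video").codecs;
    const recv = RTCRtpReceiver.getCapabilities("video").codecs;
    // For a "sendrecv" transceiver, keep only codecs usable both ways:
    const both = recv.filter((r) =>
      send.some((s) =>
        s.mimeType === r.mimeType && s.sdpFmtpLine === r.sdpFmtpLine));
    transceiver.setCodecPreferences(both);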

Bernard: I think in general it is ready for PR, then we can look at the PR and evaluate it.

Florent: You can still run into issues later on. (?)

Conclusion: Philipp will provide a PR

SDP codec negotiation (Harald)

The codec information needs to be presented before the negotiation. What I proposed was to make it possible to inject names that the platform does not understand. There is no need to change the SDP rules; that’s important.

Before sending data we have to choose which encoder to use. And if we transform the data, we need to tell the transform what payload type to use, because it is different from the incoming PT. The new PT does not have to be understood by the platform (an app-level PT). Similarly, on the decoder side you need to tell the depacketizer and decoder what PT to use. This PT the platform does have to understand.
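A purely illustrative sketch of that flow; the method and field names here are hypothetical stand-ins, not the PR’s actual surface:

    // Hypothetical: register an app-level codec the platform itself cannot
    // encode or decode, so that it gets a PT during SDP negotiation.
    const appCodec = { mimeType: "video/x-e2ee", clockRate: 90000 };
    transceiver.addCodec(appCodec); // hypothetical method

    // Sender side: the transform stamps outgoing frames with the negotiated
    // app-level PT (e.g. via hypothetical frame metadata). Receiver side:
    // that PT selects the depacketizer, and the transform restores a
    // platform-understood PT before the decoder sees the frame.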

Adding an API for this plugs a glaring hole. The issue we need to fix was presented in March; the first version of the PR was presented in June, with the conclusion to adopt the PR with details to be discussed. It was presented again at TPAC, where the summary said that there were arguments on both sides. Suddenly packetizers are up for discussion. So I have revised the PR based on my understanding of the TPAC discussion. Given the problems that need to be fixed, this approach seems to make sense.

But then the editors’ team started discussing abandoning this approach altogether. I said no, that is not reasonable. We have discussed this for 6 months. There are a number of properties of this solution that are the way they are because of specific requirements and needs, which would not have been addressed by the alternatives discussed during the meeting.

7 months and 3 presentations should be enough to get consensus to at least try it. This hampers progress. Can we move forward or do we need to abandon? If we can’t agree on anything in any amount of time, then this working group is a failure. So can we move forward?

Proposal: Instruct the editors team to merge the PR.

Bernard: From a process point of view, the WG already said to move forward. It is the job of the editors team to implement what the WG decided.

Jan-Ivar: But this hasn’t had a call for consensus. Some of this I support solving. But there appear to be 3 use cases: e2e encryption, adding metadata to existing frames, and third, codecs in JS. I think good arguments have been made that this is not the right API for number three. And other proposals with fewer methods were discussed in the issue.

Youenn: I think we agree that we should cover these issues, and I am really hoping that we can do it. I don’t have a strong opinion on whether we should expose JS encoders and plug them into a peer connection; we can investigate it, and I think there are use cases. But I think that belongs in a different spec about how to expose encoders and packetizers. We could use such APIs in totally orthogonal ways.

Philipp: Developers are already using transforms to solve real-world problems, and the working group is not addressing the use cases in a way that works better, so developers will continue doing what they are doing.

Harald: For the JS encoder, that part is not my favorite API either - we should make webrtc encoded frames constructible, which is a completely different part - but no matter what we do on how to construct the frames, we still need to SDP negotiate. This is the simplest possible API for this. Jan-Ivar’s suggestion has obvious deficiencies and would probably take another 7 months to discuss. This is unreasonable. We are better served by merging and iterating.

Jan-Ivar: I think my proposal is still an iteration on Harald’s API. It just removes the methods for use case 3, isolating it to the part it should solve. I think it’s a compromise we could work on.

Youenn: If the user agent does not know the encoder…

Henrik: It seems like there is a lot of concern about the API being misused. But I think we need to do all of these things anyway: if we’re transforming, then it is no longer the same PT and we are breaking SDP rules, and if we want to forward frames, then we need to say what the PT is, etc. If this API can be used for something else, and in the future there is a better way to do that, then that’s good. I really don’t understand why this is so controversial if we have to do something like this anyway, and it doesn’t make sense to me that we’re stalling for 6 months.

Jan-Ivar: I hear consensus around some of the use cases, but it seems very hacky to add some of these things.

Youenn: I agree we should fix E2E encryption; we just need a single attribute. But if we can’t agree on that, well, that’s life. Taking a frame from one peer connection and putting it in another peer connection is very similar to a pluggable encoder, so exposing an encoder might make more sense to solve this problem.

Henrik: Is the core of the disagreement that we’re enabling pushing frames without reading from the source when the app is in more control?

Harald: No this is about the SDP negotiation specifically.

Harald: What is the next step? Call for consensus?

Bernard: We can do a call for consensus. But in the past it has resulted in a bunch of issues that can take months to resolve.

Jan-Ivar: It doesn’t seem like we have consensus.

Bernard: But it may help clarify what we don’t have consensus about.

Jan-Ivar: If we can disentangle these things into use case 1, 2, or 3, then maybe we can make progress.

Conclusion: Try to solve use case 1 and 2 first, and then revisit use case 3 separately.