Meeting minutes
Introduction
Bernard: We'll talk about new things today, involving WebCodecs, what additional things people want, gaps, enable wider use
… We had breakouts yesterday
… We'll cover those topics here
… RtpTransport breakout discussed custom congestion control
… WebCodecs and RtpTransport on the encoding side
… Sync on the Web session. Interest to sync things like MIDI
… Any comments?
(nothing)
Reference Control in WebCodecs
Erik: Reference Frame Control, to repeat the breakout session. And Corruption Detection
… Reference Frame control: the goal is to be able to implement any reference structure we want. As simple an API as possible
… Make the encoder as dumb as possible
… Use as few bits as possible, don't get into how to do feedback etc
Eugene: We propose a new way to spec scalability modes for SVC
… This allows any kind of pattern of dependencies between frames
… Most encoders have a frame buffer abstraction
… For saving frames for future use
… getAllFrameBuffers() returns a sequence of all the FBs
… No heavy underlying resources
… Lets us extend video encode options, so say which frame goes to which framebuffer
… Signals in which slot the frame should be saved
… And dependencies between them
… This is only available under a new "manual" scalability mode
… Chromium implemented behind a flag for libav1, hopeful for libvpx, HW accel on Windows under DirectX12
Erik: Concrete example of how to use it. Three temporal layers
… Dependencies are always downwards
… We create a VideoEncoder with "manual", check the encoder supports this mode, then check the list of reference buffers, then start encoding
… There are 4 cases in the switch statement.
… To make this work, we had to make simplifications and tradeoffs
… We limit it to only use CQP
Bernard: Can I do per-frame QP?
Erik: Yes
… You have to do per-frame QP at the moment, CBR is a follow-up
… If the codec implements fewer reference buffers than the spec
… Don't support spatial SVC or simulcast
… We limit to updating only a single reference buffer for a single frame today
… H.264 and H.265 have a more complex structure for how they reference things. We model them only with long-term references
… We have some limitations around frame dropping
… To summarise the breakout, most people seem supportive
… We want to take this a step further, support more codecs, user needs to understand the limitations, so need to query, need to discuss an isConfigSupported() or MC API
… Fingerprinting surface, not really new, just a more structured way to look at data already there
… Need examples. Can do L1T3 today, need examples for what you can't do today
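
An illustrative sketch of the L1T3-via-manual-mode example walked through above, assuming the proposal's shape: the "manual" scalabilityMode, getAllFrameBuffers(), and the per-frame updateBuffer/referenceBuffers encode options are names from the discussion, not shipped API; sendChunk() is app-defined.

    const config = {
      codec: 'av01.0.04M.08',
      width: 1280,
      height: 720,
      scalabilityMode: 'manual',   // proposed new mode
      bitrateMode: 'quantizer',    // CQP: per-frame QP via encode options
    };
    const { supported } = await VideoEncoder.isConfigSupported(config);
    if (!supported) throw new Error('manual reference control not supported');

    const encoder = new VideoEncoder({
      output: (chunk, metadata) => sendChunk(chunk, metadata),
      error: (e) => console.error(e),
    });
    encoder.configure(config);
    const buffers = encoder.getAllFrameBuffers();  // proposed: slots, not contents

    let i = 0;
    function encodeL1T3(frame) {
      let options;
      switch (i % 4) {
        case 0:  // T0: save into buffer 0 (keyframe for the very first frame)
          options = { keyFrame: i === 0,
                      referenceBuffers: i === 0 ? [] : [buffers[0]],
                      updateBuffer: buffers[0] };
          break;
        case 1:  // T2: reference the last T0, not saved anywhere
          options = { referenceBuffers: [buffers[0]] };
          break;
        case 2:  // T1: reference the last T0, save into buffer 1
          options = { referenceBuffers: [buffers[0]], updateBuffer: buffers[1] };
          break;
        case 3:  // T2: reference the last T1, not saved anywhere
          options = { referenceBuffers: [buffers[1]] };
          break;
      }
      options.av1 = { quantizer: 30 };  // per-frame QP, required under CQP
      encoder.encode(frame, options);
      frame.close();
      i++;
    }
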
Jan-Ivar: There's a videoframebuffer id?
Eugene: Wanted to make it more explicit from a type point of view. The spec in future can say take buffers from a particular encoder instance. Can't take from strings
… It's a closed cycle
Jan-Ivar: Just a bikeshed, strange to have something called a buffer that isn't actually a buffer
Eugene: Open to renaming, e.g., add Ref at the end?
Erik: It represents a slot where you can put something, not the content
Bernard: The reference has to be the same resolution?
Eugene: Don't have anything for spatial scalability, each will have a separate buffer
… We wanted to have this interface, introduce spatial scalability in future
Bernard: Can do simulcast, but in the same way as WebCodecs, creating multiple encoders
… WebRTC can have one encoder do multiple resolutions
Corruption Likelihood Metric
Erik: Detecting codec corruptions, during transport etc that lead to visible artifacts, pixels on screen with bad values
… Add a new measurement that tries to capture this, using as little bandwidth and CPU as possible
… One implementation in mind, use an RTP header extension as side channel
… You randomly select a number of samples in the image and put them into an extension header
… The receiver takes the same locations in the image and looks at the sample values they see. If they differ, you have a corruption
… Not just a single sample value. You'll have natural distortions from compression, want to filter those out
… With QP, take an average around a location
… Don't want the stats value to be coupled to this particular implementation
… Allows us to do it completely receive side, e.g., with an ML model
… Proposal to put it in the inbound RTP RTCStats. Could be put in VideoPlaybackQuality. Same thing could apply to any video transmission system
… Looking for feedback or input
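
A sketch of how an app might poll such a metric if it lands in the inbound RTP stats; the field names totalCorruptionProbability and corruptionMeasurements are placeholders, not an agreed API.

    // Assumes pc is an existing RTCPeerConnection receiving video.
    async function logCorruptionLikelihood(pc) {
      const report = await pc.getStats();
      for (const stats of report.values()) {
        if (stats.type === 'inbound-rtp' && stats.kind === 'video' &&
            'totalCorruptionProbability' in stats && stats.corruptionMeasurements > 0) {
          const avg = stats.totalCorruptionProbability / stats.corruptionMeasurements;
          console.log(`average corruption likelihood: ${avg.toFixed(3)}`);
        }
      }
    }
    setInterval(() => logCorruptionLikelihood(pc), 1000);
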
Cullen: Sympathetic to this use case, concerned about the details. Concern about RTP header extension, doesn't get same security processing as the video, could reveal a lot of info, e.g., guess what the video was
… Privacy concern
Erik: That's correct. We'll rely on RFC6904 to do encryption of header extension in the initial impl
… Other wise you leak a small portion of the frame
Cullen: If you try to sample a screenshare video, large regions of black or white. Metrics for video quality, considered other options than just a few sampling points?
Erik: Yes, screen content is difficult, doesn't generalise as well.
… With 13 samples/frame it's good at finding errors
Cullen: How many samples are you thinking of using?
Erik: 13 actual samples we transmit
Harald: Thought about adding to VideoFrameMetadata instead of Stats?
Erik: That's the issue of exposing up to the application level. Won't do on all frames, maybe 1 frame / second. Could involve a GPU to CPU copy, so want to limit that
… Open to ideas on how to surface to the user after calculation
Harald: Sounds like we need to experiment
Bernard: The implementations didn't work, header extensions sent in the clear, so privacy issue if not fixed
… Want to think beyond WebRTC - MoQ, etc. Think about making metadata, e.g., playout quality, available so you can get it multiple ways
Erik: The 6904 is a stop-gap to start experimenting
… Not sure how to transmit in a general way the samples end to end
Bernard: Previous discussion on segment masks, metadata attached to the VideoFrame
Erik: Please comment in GitHub
Youenn: Hearing it's good to experiment. This can be shimmed, transform to get the data you want
… Considered doing that first, and would that be good enough?
Erik: Considered doing encoded transform, the QP is missing
… On a native level you can adapt the thresholds to get better signal to noise
… We do local experiments in the office, but want to see from actual usage
Audio Evolution
Paul: We'll talk about a few new things, some are codec-specific, some not
… Two items to discuss. New advances with Opus codec - 1.5 released this year, has new booleans we should take advantage of
… And we can improve spatial audio capabilities. For surround sound, music applications, etc
… Link to blog post that talks about the new features
… Some are ML techniques to improve quality under heavy packet loss
… With LBRR+DRED you get good quality with 90% packet loss
… To use recent Opus quality improvements, there's a decoder complexity number. In Opus codec you can trade CPU power for higher quality
… If you have complexity (0-10): if >=5 you get Deep PLC, very high quality PLC
… If 6 you get LACE, improves speech quality
… NoLace is more expensive on CPU
… Need a few megabytes. Not complex, geared to realtime usage
… Only works with 20ms packets and wideband bandwidth
… You'd have a complexity config
… It's decode-side only, no compatibility issue
… DRED - Deep Redundancy, you put latent information in every packet, can use the data in packet received to get the data you should have received
… Increase jitter buffer size, then decoder reconstructs. Requires change of API on encoder side. Reconstruct PCM from what you didn't receive
… New parameters when you encode the packet. Bitstream is versioned, so it will be ignored safely and not crash the decoder
Bernard: It's not trivially integrated in WebCodecs. What to do?
Paul: Add a second parameter to decode, with a dictionary, to enable this recovery scheme. It would be backwards compatible
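
A sketch of what that could look like; the opus complexity knob in the decoder config and the second argument to decode() are hypothetical shapes for the ideas discussed, not existing WebCodecs API.

    const decoder = new AudioDecoder({
      output: (audioData) => play(audioData),   // play() is app-defined
      error: (e) => console.error(e),
    });
    decoder.configure({
      codec: 'opus',
      sampleRate: 48000,
      numberOfChannels: 1,
      // Hypothetical: complexity >= 5 enables Deep PLC, 6 adds LACE, 7+ NoLACE.
      opus: { complexity: 7 },
    });

    // Normal path: decode received packets as usual.
    decoder.decode(receivedChunk);            // EncodedAudioChunk from the transport

    // Hypothetical recovery path: tell the decoder about the gap so it can
    // reconstruct the missing audio from DRED data carried in later packets.
    decoder.decode(nextChunk, { recover: { lostPackets: 3 } });
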
Harald: Does this introduce additional delay?
Paul: The second technique, that can reconstruct heavy packet loss, works like this
… On detecting packet loss, you increase latency a lot, up to 1 second. If it continues like that, you can still understand what's said
… If network conditions improve, go back to normal
Erik: Is the 20ms limit just with current implementation?
Paul: They say "currently", not clear in the blog post why it is
Erik: Typically you want long frame lengths
Eugene: Slide 36 says 2 new APIs needed. What are they?
Paul: One is indicating there was packet loss, but need something for where packet loss happened
Eugene: Feature detection
Paul: If the decoder doesn't understand the additional info, it's skipped
… If you change version, it won't break. That's designed into the bitstream
… Enable in the encoder, with DRED=true
Eugene: Don't need a configuration parameter
Paul: Affects the encoding scheme
Bernard: Config parameters in the scheme, some might affect WebCodecs
Improve spatial audio capabilities
Paul: Opus can now do new things. Opus is mono and stereo, then they tell you how to map multiple channels
… If the bytestream has channel mapping family 2 or 3, it's ambisonics. Using orientation and trigonometry maps you can reconstruct audio from different directions
… Straightforward to decode
… Trig maps can be done by the browser at this point
… Just need to know what mapping family it is
… 255 is interesting, can have up to 256 channels, you know what to do. Have an index, do custom processing in WASM
… App layer and the file need to understand each other
… Web uses a certain channel ordering, in Web Audio API
… Propose remapping channels, so you have a consistent mapping regardless of codec and container
… It's now specced in Web Audio
Paul: Proposal is to map everything to SMPTE. AAC would need remapping, but others not touched
… With ambisonics, decode and the app does the rest
… For decode and output, don't think the app should be doing that
… Proposal is to almost do nothing, just remap so it's consistent between the APIs
… Any concerns?
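
A sketch of the kind of remap being proposed, done in app code for illustration; the permutation maps an MPEG/AAC-style C,L,R,Ls,Rs,LFE layout to SMPTE L,R,C,LFE,Ls,Rs, and under the proposal the UA would do this instead.

    const SMPTE_FROM_AAC = [1, 2, 0, 5, 3, 4];   // dst index -> src index (illustrative)

    function remapToSmpte(audioData) {
      const frames = audioData.numberOfFrames;
      const planes = [];
      SMPTE_FROM_AAC.forEach((srcChannel, dstChannel) => {
        const plane = new Float32Array(frames);
        // With a planar format, plane N is channel N, so copying planes in a
        // permuted order performs the channel remap.
        audioData.copyTo(plane, { planeIndex: srcChannel, format: 'f32-planar' });
        planes[dstChannel] = plane;
      });
      return planes;   // SMPTE-ordered channel data
    }
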
Harald: How to map multiple channels in RTP?
… Need to tell which channels are coupled and which are mono
… Some implementations have something, not standardised
Harald: There are hacks, yes
Paul: So long as consistent with web platform, get channels in the index you expect, so don't have to tune the app code for different codecs
Jan-Ivar: What about on encoding? Also use SMPTE there?
Paul: On the web it's supposed to be that audio
Jan-Ivar: If you want to play it, not all platforms will support SMPTE playback
Paul: You'd remap the output
Bernard: In response to Harald, nothing in the formats from AOMedia. How to get it into WebRTC?
Paul: This is about getting it into WebCodecs, then figure out SDP
… There are draft RFCs about it
Bernard: There's no transport for the container
Weiwei: There are several spatial audio codec standards. Does it work with them?
Paul: All will be mapped to this order, but that's been the case for some time. Need to ensure all specs agree
Weiwei: In China, there's a spatial audio standard, will it work for them?
Paul: If you have channel info at the decoder level, you can remap and expose in the order you expect
Weiwei: We should look into it
IAMF
Paul: IAMF and object based audio, how to deal with it on the web?
… Web Codecs doesn't concern itself with the container
… Do we feel that using low level APIs for decoding the streams is enough, then render in script?
… DSP involved isn't complicated, just mixing, volume, panning
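
A sketch of that script-side rendering using Web Audio; the per-object gain and position would come from container metadata parsed by the app.

    const ctx = new AudioContext();

    // pcm: an AudioBuffer built from decoded AudioData for one audio object.
    function renderObject(pcm, { gain, x, y, z }) {
      const src = new AudioBufferSourceNode(ctx, { buffer: pcm });
      const g = new GainNode(ctx, { gain });
      const panner = new PannerNode(ctx, { positionX: x, positionY: y, positionZ: z });
      src.connect(g).connect(panner).connect(ctx.destination);
      return src;   // call src.start(when) to schedule it in the mix
    }
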
Eugene: Agree this is an advanced feature, so leave to the app
Bernard: More complicated than that. Things like the Opus 1.5 extensions
… IAMF can work with lots of codecs, but they want to do additional stuff
Paul: In that case, want to have WebCodecs work with it. Don't know if we want WebRTC WG do the work, more complications
Encoded Source
Guido: In WebRTC WG we want to support the ultra low latency broadcast with fanout use case
… UA must be able to forward media from a peer to another peer
… Timing and bandwidth estimates for congestion control
… Specifically, we want to support this scenario, where you have a server that provides the media. Large number of consumers
… Communication with server is expensive
… Assume communication between nodes is cheaper than communication to server
… Nodes can appear or disappear at any time
… Example, Two peer connections receiving data from any peer in the network.
… Use encoded transform to receive frames
… Depending on network conditions, you might want to drop frames
… When app decides what to forward, sends frame to multiple output peer connections
… Idea is you can fail over, be robust without requiring timeouts
… So can provide a glitch-free forward
… We made a proposal, patterned on RTCRtpEncodedTransform
… This is similar to single-sided encoded transform
… Got WG and developer feedback that we've incorporated
… Allows more freedom than encoded transform. You can write any frames, so it's easier to make mistakes. Would be good to provide better error signals
… It's less connected to internal control loops in WebRTC
… In addition to raw error handling we need bandwidth estimates, etc
… Basic example. We have a worker, a number of peer connections
… Each has a sender. For each sender we call createEncodedSource()
… This method is similar to replaceTrack()
… On receiver connection, we use RTCRtpScriptTransform
… On worker side, we receive everything, we use encoded sources. In the example, source has a writeable stream, a readable and a writeable
… For the receivers, can apply a transform
… Write the frame to all the source writers
… You might need to adjust the metadata
… Errors and signals that developers say would be useful include keyframe requests, bandwidth estimates, congestion control, error handling for incorrect frames
… e.g, timestamps going backwards
… Other signals are a counter for frames dropped after being written that the sender decided to drop
… Expected queue time once written
… To handle keyframe requests, there's an event
… On the writable stream, an event handler for the keyframe request
… For bandwidth we're proposing to use a previous proposal for congestion control from Harald
… Recommended bitrate
… Outgoing bitrate is already exposed in stats, convenient to have it here
… Have an event that fires when there's a change in bandwidth info.
… [shows BandwidthInfo API]
… Use with dropped frames, after written, and expected send queue time
… if allocated bitrate exceeds a threshold, add extra redundancy data for the frame
… [Shows API shape]
… Pros and cons. Similar pattern to encoded transform, simple to use and easy to understand
… Good match for frame-centric operations
… Allows zero timeout failover from redundant paths
… Easy to adjust or drop frames due to bandwidth issues
… It requires waiting for a full frame
… In future could be ReceiverEncodedSource
… Have fan-in for all the receivers
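
A sketch pulling the pieces above together; createEncodedSource(), the source's writable, the keyframe-request and bandwidth-info events, and transferring the source to a worker are all proposal shapes from the slides rather than settled API, and requestKeyframeUpstream()/adaptForwarding() are app-defined.

    // Main thread: one encoded source per outgoing sender; receivers use
    // RTCRtpScriptTransform so the worker sees incoming frames.
    const worker = new Worker('forwarder.js');
    for (const pc of outgoingPeerConnections) {
      const sender = pc.addTransceiver('video').sender;
      const source = sender.createEncodedSource();       // proposed, replaceTrack()-like
      worker.postMessage({ kind: 'source', source }, [source]);
    }
    for (const pc of incomingPeerConnections) {
      pc.ontrack = ({ receiver }) => {
        receiver.transform = new RTCRtpScriptTransform(worker, { kind: 'incoming' });
      };
    }

    // forwarder.js (worker): fan every incoming frame out to all sources.
    const writers = [];
    onmessage = ({ data }) => {
      if (data.kind !== 'source') return;
      writers.push(data.source.writable.getWriter());
      data.source.onkeyframerequest = () => requestKeyframeUpstream();
      data.source.onbandwidthinfochange = () => {
        // e.g. forward fewer layers when allocated bitrate drops, add
        // redundancy when there is headroom, watch frames dropped after write.
        adaptForwarding(data.source.bandwidthInfo);
      };
    };
    onrtctransform = async ({ transformer }) => {
      const reader = transformer.readable.getReader();
      for (;;) {
        const { value: frame, done } = await reader.read();
        if (done) break;
        // Metadata (e.g. timestamps) may need adjusting before forwarding.
        for (const writer of writers) writer.write(frame);
      }
    };
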
Jan-Ivar: In general I agree this is a good API to solve the forwarding data use case
… Seems to be a bit more than a source. Something you can assign to a sender in place of a track
… Once you associate a sender with a source, that can't be broken again?
Guido: Yes. A handle-like object
… I like it better with a method, can play track with a video element. But with this object there's nothing you can do with it
… There isn't a lot of advantage to having this object, e.g., to send to another sender
… We can iterate on the methods and finer details
… I prefer methods, as they create the association immediately
Jan-Ivar: That ends up being a permanent coupling
Guido: Can create and replace an existing one
Jan-Ivar: The permanent coupling ...
Guido: It's just an immediate coupling
… You can decouple it
… Can do the same approach as encoded transform if we think that's better
Youenn: Overall it's in the right direction. Similar feedback on the API shape, but we can converge
… Not a small API. Good to have shared concepts
… Encoded transform was very strict. Here we're opening the box. Have to be precise about error cases
… We're opening the box in the middle. Need to be precise how it works with encoded transform
… Improve the API shape and really describe the model and how it works. Implications for stats, encoded transform, etc.
… I have other feedback, will put on GitHub
… Let's go for it, but be careful about describing it precisely
Guido: So we have agreement on the direction
Harald: encoded transport has bandwidth allocation. Should try to harmonise the other part
Timing Model
Slideset: https://
Clarification needed for HTMLVideoElements that are playing a MediaStream
Harald: We recently added stats counters to MediaStreamTrack, and those should be reflected. Either that shouldn't exist or it should be consistent
Youenn: We should be able to compute video playback counters based on the track. Take the MST definition and define VideoPlaybackQuality in terms of that
Marcus: There's a different proposal that has total video frames in it and would increment them. So lean to proposal 2
Harald: So sounds like we should spec the behaviour. We're trying to unify the stats across sources
Bernard: We should try to specify it. Suggest we do it more generally via tracks, which is more work
Chris: Where to spec this? Agreement to try to specify the behavior; within VideoPlaybackQuality?
Youenn: Each source should describe how it creates video frame objects. You have different sources in different specs; describe how they create video frame objects, and have definitions of the counters as well
Dom: Need a burndown list for fixing all the specs to supply that info
Jan-Ivar: Agree we need to define them for each MediaStreamTrack source
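
A sketch of the two counters being compared, for a video element playing a MediaStream; track.stats here is the mediacapture-extensions proposal, and how the two relate is exactly what needs speccing.

    const video = document.querySelector('video');
    const [track] = video.srcObject.getVideoTracks();

    setInterval(() => {
      const q = video.getVideoPlaybackQuality();
      console.log('element:', q.totalVideoFrames, 'total,', q.droppedVideoFrames, 'dropped');
      if (track.stats) {   // MediaStreamTrackVideoStats, where implemented
        console.log('track:', track.stats.deliveredFrames, 'delivered,',
                    track.stats.discardedFrames, 'discarded');
      }
    }, 1000);
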
What is the timestamp value of the VideoFrame/AudioData from a remote track?
Bernard: Timestamp is a capture timestamp, not a presentation timestamp. Should we change the definition in WebCodecs? Can we describe more clearly this and the rVFC timestamp?
Eugene: For a video file, there's only a presentation timestamp
… For a camera, it's a capture timestamp by definition. Needs to be source-specific
Bernard: Where would you put the definitions, media-capture-main?
Youenn: Yes
Marcus: In the WebCodecs spec, there's no definition other than presentation timestamp. In Chromium, it starts at 0 and increments by frame duration
… It's unspecified what it contains. We have a heuristic in Chromium that puts the capture timestamp
… It's surfaced up to rVFC
… Shouldn't really be like that, it should be a presentation timestamp
Randell: The use of the terms presentation and capture timestamp is a bit arbitrary
… The fact that it comes from a file and is a presentation timestamp, and from a capture is a capture timestamp, isn't relevant. Just have a timestamp
Bernard: Want to move to next issue
Add captureTime, receiveTime and rtpTimestamp to VideoFrameMetadata
Marcus: Web apps that depend on the timestamp sequence, we want to expose capture time into VideoFrameMetadata
… Why? Capture time is async, and enables end to end video delay measurements
… In WebRTC, we prefer the ? timestamp
… The capture time in this context is an absolute measure
… With presentation timestamps it's not clear when they were measured
… Capture time can get the time from before the pipeline
… There are higher quality timestamps from the Capture APIs, we want to expose them
… PR 183 adds those, we refer to the rVFC text
… People didn't like that. Now we have 5 PRs
… We're trying to define this concept in media stream tracks. I'd place those in mediacapture-extensions
… webrtc-extensions, and mediacapture-transform, then repurpose #813 to add fields to VideoFrameMetadata registry
… That's the plan
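
A sketch of reading the proposed fields once they are in the VideoFrameMetadata registry; captureTime, receiveTime and rtpTimestamp mirror the rVFC definitions and would only be present for sources that have them.

    // In a worker: read frames from a remote track and inspect the metadata.
    const processor = new MediaStreamTrackProcessor({ track: remoteVideoTrack });
    const reader = processor.readable.getReader();

    async function measureDelay() {
      for (;;) {
        const { value: frame, done } = await reader.read();
        if (done) return;
        const { captureTime, receiveTime, rtpTimestamp } = frame.metadata();
        if (captureTime !== undefined && receiveTime !== undefined) {
          console.log('network + buffering delay (ms):', receiveTime - captureTime,
                      'rtpTimestamp:', rtpTimestamp);
        }
        frame.close();
      }
    }
    measureDelay();
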
Eugene: The main problem was video/audio sync
… Audio frames captured from the mic had one timestamp, and video frames from the camera had a different one, which was confusing for encoding configurations
… The change making VideoFrame timestamps be capture timestamps is an important change. Currently just Chromium behavior, want it to be specced behaviour
… So you can do A/V sync, otherwise any kind of sync is impossible
Paul: Reclocking, skew, compensate for latency, so everything matches
Eugene: Why not have the same clock in both places?
Paul: You take a latency hit as it involves resampling
Marcus: We don't have ??
<hta> ?? = AudioTrackGenerator
Paul: There is reclocking happening, otherwise it falls apart
Eugene: Need example code to show how to do it correctly, for web developers
Paul: Sure
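
A first cut at the kind of example being asked for, assuming both AudioData.timestamp and VideoFrame.timestamp carry capture times in microseconds (the behaviour being proposed); schedulePresentation() is app-defined.

    let lastAudioCaptureUs = 0;

    function onAudioData(audio) {           // e.g. from a MediaStreamTrackProcessor
      lastAudioCaptureUs = audio.timestamp;
      audio.close();
    }

    function onVideoFrame(frame) {
      const skewMs = (frame.timestamp - lastAudioCaptureUs) / 1000;
      // Positive skew: the frame was captured after the newest audio; delay
      // presentation (or the audio path) by roughly this amount for lip sync.
      schedulePresentation(frame, skewMs);
    }
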
What is the impact of timestamp for video frames enqueued in VideoTrackGenerator?
Youenn: VideoTrackGenerator timestamp model isn't defined
… Not buffering anything. Each track source will define
… Timetamp not used in any spec on the sync side
… We define timestamp per track source
… Video track sink, there's a difference between WebKit and Chromium in implementation
… If spec says nothing, means we don't care about the timestamp
Bernard: Are those statements about what happens true or not?
Harald: Video element has a jitter buffer
Bernard: So the statements seem accurate.
Expectations/Requirements for VideoFrame and AudioData timestamps
Bernard: What if you append multiple VideoFrames with the same timestamp? Does VTG just pass it on, look for dupes?
Jan-Ivar: Yes, garbage-in, garbage-out
Youenn: It's the sink that cares about the timestamp
Bernard: Something to make clear in the PR
Jan-Ivar: Need to consider someone sending data over the channel
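
A sketch of the app's side of that contract: keep timestamps sensible when writing into a VideoTrackGenerator, since the generator just passes frames through and it is the sinks that care.

    const generator = new VideoTrackGenerator();
    const writer = generator.writable.getWriter();
    // generator.track can then be played, recorded, or sent over a peer connection.

    let nextTimestampUs = 0;
    async function pushFrame(image, durationUs) {
      const frame = new VideoFrame(image, { timestamp: nextTimestampUs, duration: durationUs });
      nextTimestampUs += durationUs;   // keep timestamps strictly increasing; duplicates are garbage in
      await writer.write(frame);       // the generator takes ownership of the frame
    }
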
Playback and sync of tracks created by VideoTrackGenerator
Bernard: HTMLVideoElement: no normative requirement, might happen or not
… Describes issues with losing sync, need to delay one to get sync back, etc
… Want to be more specific about this. A jitter buffer potentially in HTMLMediaElement. How does it work and what does it take into account?
… It's suggested it's more difficult for remote playout. In RTP, it's used to calculate the sender/receiver offset
… What's going on inside the black box?
Youenn: Is it observable? With gUM, the tracks are synchronised. In other cases, we have separate tracks
Jan-Ivar: Depends where the source comes from. MediaStream is a generic implementation for different sources
<hta> Jan-Ivar: Very old language on synchronization might be outdated.
Bernard: Thinking about remote audio and video. Need receive time and capture time from same source
Harald: Looked at this code recently. For a single AudioTrack and VideoTrack from the same peer connection with the same clock source, WebRTC tries to synchronise
Marcus: MediaRecorder sorts samples
Paul: Similar in FF if you have multiple microphones, we consider it a high level API so it should work
Youenn: Spec should clarify there are some cases you should do it, other cases it's impossible
Bernard: If I'm writing a WebCodecs+WebTransport app, is there something I can do to make it work?
Paul: Implement jitter buffers
Marcus: If you have capture times from all streams, you can sort in JS
Youenn: make sure from same device
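
A minimal sketch of the "implement a jitter buffer / sort in JS" suggestion for a WebCodecs + WebTransport pipeline, assuming sender capture times and the receiver clock are already aligned (or an offset has been estimated).

    const TARGET_DELAY_MS = 150;

    class JitterBuffer {
      constructor() { this.queue = []; }   // items: { captureTime, payload }
      push(item) {
        this.queue.push(item);
        this.queue.sort((a, b) => a.captureTime - b.captureTime);
      }
      // Release everything that has waited at least TARGET_DELAY_MS, in order.
      popReady(nowMs) {
        const ready = [];
        while (this.queue.length && nowMs - this.queue[0].captureTime >= TARGET_DELAY_MS) {
          ready.push(this.queue.shift());
        }
        return ready;
      }
    }
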
Jan-Ivar: If you have VTG, would it affect playback?
Bernard: You have capture time from sender
Chris: Next steps, schedule more time for this discussion?
Bernard: Good idea, yes