W3C

– DRAFT –
Media and Entertainment IG

10 December 2024

Attendees

Present
Ali_C_Begen, Bernd_Czelhan, Casey_Occhialini, Chris_Lorenzo, Chris_Needham, Daniel_Silhavy, Eric_Carlson, Francois_Daoust, Hisayuki_Ohmata, Jer_Noble, John_Riviello, Kaz_Ashimura, Mark_Young, Nigel_Megitt, Ryo_Yasuoka, Tatsuya_Igarashi, Thasso_Griebel, Yuriy_Reznik
Regrets
-
Chair
Chris_Lorenzo, Chris_Needham, Tatsuya_Igarashi
Scribe
cpn, kaz, tidoust

Meeting minutes

Slideset: https://www.w3.org/2011/webtv/wiki/images/b/bb/2024-12-10-MEIG-SVTA-MSE-Meeting.pdf

Introduction

Chris: This is a joint meeting with W3C and SVTA.

Daniel: We're SVTA / DASH-IF members, and we want to share feedback around MSE in the context of media player implementations
… These are pain points and issues we see as developers. Maybe we're doing something wrong, in which case you could provide your feedback
… We want to improve existing implementations, and we want to understand what you're working on: EME, other APIs
… I'd like to join calls more frequently in future

[Slide 3]

Daniel: We want to discuss each topic in turn

[Slide 4]

Thasso: I work for CastLabs, leading the player team there. We have experience dealing with MSE etc

Daniel: I'm with Fraunhofer Fokus, lead developer of Dash.js, and co-chair of the SVTA players and playback WG
… Ali and Yuriy are chairs with me

Chris: This is the Media & Entertainment IG, which focuses on industry coordination, use cases, and requirements. This group can't develop specs.
… I also co-chair the Media WG, which does the standards-track spec development for MSE, EME, WebCodecs, Media Capabilities, etc.

[Slide 6]

Daniel: We have various groups and subgroups in SVTA. DASH-IF and SVTA merged

Discussion items

[Slide 8]

Daniel: We have a general structure for the discussion: how it's working today, implementation issues and implications, workarounds in players, and suggested improvements to MSE and related use cases

Buffer capacity

[Slide 9]

Daniel: Every media player buffers data. Create SourceBuffers and append data
… The app can define the size of the forward and backward buffers
… The forward buffer has a trade-off with latency
… A limitation we have is memory, i.e., buffer capacity. There's no API to query how much data we can append to the buffer
… We schedule a request for a media segment, but we get QuotaExceeded if there isn't sufficient capacity
… What would improve the behaviour is to have a way to query the capacity, then we can delay appending and the fetching of the segment
… What we do today is wait for the error, then reduce the max possible buffer
… And adjust the backward and forward buffer. This would help every player with scheduling segment downloads
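
[For illustration, a minimal JavaScript sketch of the workaround Daniel describes. The constant and function names here are made up for this example, not taken from dash.js:]

    // Append a segment; on QuotaExceededError, prune the back buffer
    // behind the playhead and retry once the async remove() completes.
    const BACK_BUFFER_SECONDS = 30; // illustrative back-buffer target

    function appendSegment(video, sourceBuffer, data) {
      try {
        sourceBuffer.appendBuffer(data);
      } catch (e) {
        if (e.name !== 'QuotaExceededError') throw e;
        const pruneEnd = Math.max(0, video.currentTime - BACK_BUFFER_SECONDS);
        if (pruneEnd > 0) {
          sourceBuffer.addEventListener('updateend', () => {
            sourceBuffer.appendBuffer(data); // retry after space is freed
          }, { once: true });
          sourceBuffer.remove(0, pruneEnd);  // asynchronous
        }
        // A real player would also reduce its target forward buffer here.
      }
    }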

Jer: I wrote the MSE player implementation in WebKit. Do you want a general idea of how much room is available without doing appends first? Are you asking for remaining buffer size or something more general?

Daniel: I'd be fine with total buffer size. It depends on the segment duration. Having a sense of how much data I can append, combined with bitrate info, would help me understand if I can append or not

Nigel: In implementations, is the buffer size constant, or does it vary over time during playback?

Jer: In our implementation, it's somewhat constant. But we wouldn't want to design ourselves into a corner. We have to deal with requests from the system to jettison memory, which motivated Managed Media Source
… I would not want fixed buffer size to be a requirement in the spec

Nigel: Does that imply it's preferable to return how much space there is right now?

Jer: That answer isn't a guarantee. The system could detect a low memory condition and that would change the answer
… Any such API couldn't provide a guarantee

Eric: And we wouldn't want an API that encourages apps to poll

Thasso: On fixed buffer size, we used to have an implementation where the buffer size was dynamic. The ability to deal with dynamic buffers is something that needs to be supported from the player's perspective
… How would it be expressed to the client? Media time, memory?
… For my use cases, I'd want to poll this infrequently, just before downloading the next segment

Daniel: I had the same comment. If you schedule the request, you can decide if you want to query data

Jer: What's the expectation, when you hit the memory limit?
… When we designed the APIs, when you get QuotaExceeded, you purge the back buffer to make room for the forward buffer. That wouldn't change this
… I'm not sure what the benefit would be. You have the data in JS, and as you reach the end of the forward buffer, you need to append the downloaded data to prevent a stall
… It shouldn't be so expensive you can't do it a few seconds ahead of time
… If you have at least a minute of forward buffer, then as you get close to the end you'd have to purge the back buffer to append more data. Is that a problem? Do you want us to handle purging the back buffer for you?

Thasso: No, but I would be fine with it. This goes in the direction of MMS. I could be OK with depleting certain parts of the buffer but not others
… Most of my pain points are with TVs and STBs
… We frequently discuss that we like having MediaSource buffers we can retain in memory, and we'd rather not have another buffer implementation client side
… The machine needs the memory in the end; it doesn't matter if it's on the JS or the MSE side
… Want to avoid splitting it into the two worlds
… MSE appends take time, so finding the right moment can be challenging, and the time could be 2 or 200 ms before the frame can be rendered

Jer: I'm mostly familiar with high powered devices, much more powerful than TV or STBs typically
… For our implementation, an append doesn't have to be an entire media segment. If you're trying to keep the forward and back buffers full, with our implementation you can break the buffer into pieces to keep playback uninterrupted
… Don't know about TV implementations, if they have enough memory

Jer: You can append parts of mdat and moov boxes, but it does require an init segment first

Daniel: We have CMAF low-latency chunks, where what you get from the fetch API doesn't align with CMAF chunk boundaries

Jer: The MSE segment parser loop understands partial appends, and the reset step requires an init segment first
… There's no requirement that each appended chunk is a complete box

Thasso: A lot of implementations require full mdat box structures

Jer: There's no requirement for that, but some implementations might require it
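
[For illustration, a sketch of chunked appends from fetch(), as discussed above. It assumes an init segment has already been appended, and that the implementation accepts appends that don't align with chunk boundaries, which, as Thasso notes, not all do:]

    // Stream a segment and append each network chunk as it arrives.
    async function streamSegment(sourceBuffer, url) {
      const response = await fetch(url);
      const reader = response.body.getReader();
      for (;;) {
        const { done, value } = await reader.read();
        if (done) break;
        await appendAndWait(sourceBuffer, value);
      }
    }

    // appendBuffer() is asynchronous; wait for updateend before appending more.
    function appendAndWait(sourceBuffer, chunk) {
      return new Promise((resolve, reject) => {
        sourceBuffer.addEventListener('updateend', resolve, { once: true });
        sourceBuffer.addEventListener('error', reject, { once: true });
        sourceBuffer.appendBuffer(chunk);
      });
    }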

Kaz: This proposed API could be harmful, e.g., allowing attackers to crash systems, so we should be careful and discuss the pros and cons

Jer: Yes, as an internet exposed browser we worry about fingerprinting. If you have a dynamic buffer size API, you could use it for cross-site user tracking
… Could be like a super-cookie

Nigel: It's interesting you could append partial segment data to the buffer, but at the moment it feels like no client code would know that's a good idea to do
… If it's a strategy to fill the buffer as much as possible, and the client has a whole segment and half a segment would fit, there's no information back from the API to suggest that

Jer: It's not an unrecoverable error though. Some ideas: I could imagine relaxing the requirement, so that an append that exceeds the quota is still accepted, but you can't append again until some flag is cleared
… If some implementations require a full segment to be appended, an API could be more flexible with its buffer size requirements: accept the buffer but don't allow further appends until a flag is cleared, e.g., by a remove command

Nigel: A call to append could return the number of bytes successfully appended.

Box parsing

[Slide 10]

Daniel: We append ISO BMFF boxes, and many players have their own box parser to extract information useful for the player or app
… Example is EMSG box, for use by player events
… As of today, MSE doesn't support parsing boxes and dispatching to the player
… It's done in JS. WASM could be an option
… With low-latency streaming, we parse MOOF and MDAT boxes and try to append complete MOOF+MDAT combinations
… Suggest an API to allow clients to register to receive the boxes
… Examples: EMSG, PRFT for latency adjustment, ELST. We have to parse MOOV to get the correct timescale value
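
[For illustration, a minimal sketch of the kind of JS box parsing players do today. Real parsers also handle nested containers, 64-bit sizes, and full box headers; segmentData stands in for an ArrayBuffer the player has fetched:]

    // Walk top-level ISO BMFF boxes and report type, offset, and size.
    function* parseBoxes(buffer) {
      const view = new DataView(buffer);
      let offset = 0;
      while (offset + 8 <= view.byteLength) {
        const size = view.getUint32(offset); // big-endian, per ISO BMFF
        const type = String.fromCharCode(
          view.getUint8(offset + 4), view.getUint8(offset + 5),
          view.getUint8(offset + 6), view.getUint8(offset + 7));
        yield { type, offset, size };
        if (size < 8) break; // size 0 ("to end") and 1 (largesize) not handled
        offset += size;
      }
    }

    for (const box of parseBoxes(segmentData)) {
      if (box.type === 'emsg') {
        // parse the emsg payload and dispatch a player event
      }
    }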

Jer: An arbitrary MP4 parser with WebCodecs lets you create your own player
… As long as the STB supports WebCodecs. WebCodecs provides low level access to audio and video decoders
… Render to a canvas, preferably GL-backed.

Chris: The question of box parsing has come up before in the context of WebCodecs
… The preferable approach was thought to be JavaScript, as it offers flexibility, and JS parsing was considered performant enough not to need a browser-level API.

https://github.com/w3c/media-and-entertainment/issues/108 <- Media Containers API issue

https://github.com/w3c/webcodecs/issues/24 <- WebCodecs containers API issue

Chris: On emsg specifically, WebKit has the DataCue API. In this IG a while ago, we were looking at how we would surface emsg parsing through DataCue events.
… That work kind of stalled. We didn't have enough active contributors pushing this forward.
… If people are interested, I would suggest to get together and get that moved forward.
… It's a very targeted solution for emsg events, either triggered immediately or at some point on the timeline.
… It wouldn't do the general box parsing that you're talking about, though.

https://github.com/WICG/datacue/blob/main/explainer.md https://wicg.github.io/datacue/ <- DataCue explainer and early spec draft
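
[For reference, the rough shape proposed in the DataCue explainer; this is a draft, not a shipped cross-browser API, and the emsg field names below are illustrative:]

    // Surface a parsed emsg as a cue on a metadata text track.
    const track = video.addTextTrack('metadata');
    const cue = new DataCue(
      emsg.presentationTime,                  // start time (seconds)
      emsg.presentationTime + emsg.duration,  // end time
      emsg.messageData,                       // arbitrary value
      emsg.schemeIdUri);                      // type identifier
    cue.addEventListener('enter', () => {
      // react when playback reaches the event
    });
    track.addCue(cue);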

Thasso: We're very interested. Essentially, it means we end up with an MSE implementation that we do ourselves.

Thasso: A software implementation based on WebCodecs sounds like a good idea. But we're still lacking a lot of features, e.g., DRM
… A simple approach would be to register a listener for any box type; the browser wouldn't need to do heavy lifting

Jer: I see a couple of problems here. One is an API returning an arbitrary box, especially if it's one the implementation doesn't understand
… There are use cases I'd like to address. EMSG is one of them
… Other case is 608/708 caption data, given regulatory requirements
… Those are embedded in the media stream, but not elevated to the subtitle rendering
… So we see websites doing the parsing themselves. But that might not be accomplished using a box parsing API, as they're muxed in the mdat

Nigel: At TPAC we talked about potentially adding subtitles and captions to MSE, but then the question is how do you know on the output side which mdats to pull out
… so you do what you need to for the player code.

<Zakim> nigel, you wanted to mention that this could be helpful for subtitle/caption decoding from MSE

Nigel: When you say register for ISO BMFF boxes, it's not any mdat you want, it's some particular mdat

Thasso: I agree, there's the general problem of not every implementation understanding the boxes, and an issue with nested boxes
… Maybe we could say, for CMAF content all boxes defined there are supported by spec, so I can pull them out

Daniel: Suggest following up offline

Codec information

[Slide 11]

Daniel: You have the changeType method. In dash.js we save the codec info in a variable
… It's not possible to ask for the current codec string, so you have to maintain it yourself. Suggest adding an API

Jer: We had an idea in WebKit, to pull codec information from the VideoTrack. It's relevant for MSE clients and for HLS and file-based downloads
… changeType requires passing a complete codec string. We've seen cargo-culting or magic strings being used for AAC or H.264. How do you know which codec string to use with Media Capabilities?
… Needs info out of band. An API to get the codec string as understood by the browser. Some interest from browser vendors to do this
… We've heard from other clients that what they really want is a timeline-based set of information: start, end, properties
… It's an interesting use case. We want to solve aspects of this. Please bring to the WG
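
[For illustration, a sketch of what a player has to do today, as Daniel describes: track the codec string itself and check a new one via Media Capabilities before calling changeType(). mediaSource is assumed to be an open MediaSource; the codec strings and dimensions are examples:]

    let currentType = 'video/mp4; codecs="avc1.640028"'; // tracked by the player
    const sourceBuffer = mediaSource.addSourceBuffer(currentType);

    async function switchCodec(newType) {
      const info = await navigator.mediaCapabilities.decodingInfo({
        type: 'media-source',
        video: {
          contentType: newType,
          width: 1920, height: 1080,
          bitrate: 5000000, framerate: 30,
        },
      });
      if (!info.supported) return;
      sourceBuffer.changeType(newType);
      currentType = newType; // MSE can't report this back, so we remember it
    }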

Dynamic addition of SourceBuffers

[Slide 12]

Thasso: You could have a MediaSource session and maintain a number of buffers in the session
… The issue is the inability to manage the number of buffers. I want to turn off audio, but I can only mute it. Once a buffer is removed it's gone; you get an error after adding it back
… Use cases for removing a buffer: turn off audio fully, or turn off video fully
… A text track I definitely want to turn off. Some players have workarounds, which are difficult to maintain. For audio, pushing silence isn't so complicated
… Modelling a black frame with H.264 is not too bad, but becomes more complex for other codecs
… We want more dynamic behaviour when adding or removing buffers
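
[For illustration, the pattern Thasso describes failing. Per the discussion, removing a buffer works but re-adding one later errors on a number of implementations; mediaSource and audioBuffer are assumed to already exist:]

    // Turn audio off entirely rather than just muting it:
    mediaSource.removeSourceBuffer(audioBuffer);

    // ... later, trying to bring audio back:
    const newAudio =
      mediaSource.addSourceBuffer('audio/mp4; codecs="mp4a.40.2"');
    // On several implementations this throws or playback errors out,
    // hence workarounds like muting and appending silence instead.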

Daniel: The IETF MoQ group is looking at low latency, which is hard with current implementations if you need to append dummy data

Jer: There are two related efforts in the Media WG. One is the behaviour when you hit a gap in video data: continue playing and catch up, don't stall. That would solve some of these use cases

https://github.com/w3c/media-source/issues/160 <- MSE playback through unbuffered ranges

Jer: For the bigger issue, there's a solution: having multiple SourceBuffers in a MediaSource that aren't currently active
… Tracks are associated with the media element. Once a buffer is removed from the active source buffers list, it should have no impact on playthrough
… Shouldn't have to feed black frames through
… I don't think Chromium has that yet. But it would unblock this use case
… It exists in the spec but not all implementations yet

Multiple Source Buffers

[Slide 13]

Thasso: This is a related use case. We implemented HLS interstitials and ran into a problem: content conditioning isn't perfect, and timelines can overlap
… Want to use MSE as our buffer, and make use of it later
… The problem is how to do this even with a virtual buffer. If timelines overlap, you need to be very precise. currentTime isn't accurate enough, as updates only come every 250 ms. So the workaround is to use requestAnimationFrame() and poll the time to work out when to append
… We want to get rid of the data earlier. In the best-case scenario, I wouldn't do the switching myself but would be able to schedule it: when done with video track 1, play number 2, then go back to 1 if there are no gaps
… It's hard to deal with overlapping timelines on the client side

Jer: A couple of ideas. You shouldn't have to poll currentTime, you can use synthetic TextTrackCue events for example
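
[For illustration, a sketch of the approach Jer suggests: a cue on a hidden metadata track fires an event at the splice point instead of polling currentTime. spliceTime is illustrative:]

    const track = video.addTextTrack('metadata', 'splice points');
    track.mode = 'hidden'; // hidden cues still fire enter/exit events

    const cue = new VTTCue(spliceTime, spliceTime + 0.1, 'splice');
    cue.addEventListener('enter', () => {
      // switch to the interstitial SourceBuffer here
    });
    track.addCue(cue);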

Thasso: We've tried things like that, but the timing is not accurate on all implementations

Jer: We've heard this use case before, with HLS interstitials. A MediaSource you can detach and re-attach later. Designed to solve use case of switching to differently encoded content.
… Could be used to play interstitial content, without having to reappend the original data. Only requirement is to seek back to the main timeline position when you do the switch
… The issue we heard from implementers is there may not be enough memory in low-end implementations to support multiple MediaSource instances. Multiple video buffers would have a similar problem, leading to more QuotaExceeded errors

Thasso: I think the limitation on embedded devices isn't necessarily the memory, it's how they initialise the hardware resources

Daniel: That's why we did the virtual buffer in dash.js

Jer: Detachable MediaSource. You have main content attached to the media element. If you want to preload ad insertion in a second MediaSource
… Use an audio element instead, to avoid the implementation instantiating an embedded codec. It's technically allowed by the spec, but needs some experimentation on STBs to see if it would work
… And it would require an implementation of detachable MediaSource

Summary

[Slide 16]

Daniel: I want to join calls more frequently, and we can file GH issues

Chris: Thank you for bringing these topics, we'll follow up, your input is welcome.
… You mentioned potential presentations at the OSMART workshops; we'd be happy to do something like that.

Daniel: yes, let's talk about that offline as well

[adjourned]

Minutes manually created (not a transcript), formatted by scribe.perl version 196 (Thu Oct 27 17:06:44 2022 UTC).