Media WG F2F - Day 2/2

20 Sep 2019



Andreas Tai (IRT)
Anssi Kostiainen (Intel)
Chris Cunningham (Google)
Chris Needham (BBC)
Cyril Concolato (Netflix)
David Baron (TAG)
David Singer (Apple)
Eric Carlson (Apple)
Francois Daoust (W3C)
Gary Katsevman (Brightcove)
Glenn Adams (Skynav)
Greg Freedman (Netflix)
Hiroshi Kajihata (Sony)
Jer Noble (Apple)
Judy Brewer (W3C)
Mark Watson (Netflix)
Matt Wolenetz (Google)
Mounir Lamouri (Google)
Narifumi Iwamoto (Intel)
Nigel Megitt (BBC)
Richard Winterton (Intel)
Rijubrata Bhaumik (Intel)
Sangwhan Moon (TAG)
Scott Low (Microsoft)
Tess O'Connor (TAG)
Yongjun Wu (Amazon)
Francois Beaufort (Google)
Chair: Jer, Mounir
Scribes: tidoust, cpn, markw, chcunningham, mfoltzgoogle


DataCue and TextTrackCue / joint session with TimedText

See MSE, DataCue and TextTrackCue slides

cyril: There have been a few discussions this week on DataCue, TextTrackCue, I put together a few slides to summarize them.
... DataCue, goal is to expose in-band data to applications.
... The major use case is emsg

cpn: I would say it's not only about in-band events.
... For user agents that natively support HLS or DASH, events could be in the manifest.
... We also want to be able to support arbitrary objects that applications may want to synchronize.

cyril: DataCue essentially flows from the media to the application.
... TextTrackCue flows the other way around.
... You let the browser do the synchronization, where the cue should be displayed, but the application prepares the cue.
... And then MSE for TextTrack is enabling end-to-end synchronized processing and rendering, from the container to the display.
... Browser vendors do not like additional parsing, which may be an issue here.
... [showing diagram of MSE/EME pipeline]

cpn: Different points of view of how much parsing would be done by the user agent.
... For certain well-known event formats, the user agent could expose a structured object, or just the raw thing.
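
To illustrate the parsing the user agent (or the app, if only raw bytes are exposed) would do for a well-known format, here is a hedged sketch of a parser for a DASH 'emsg' box (version 0, per ISO/IEC 23009-1), the main in-band event format discussed; the function name and returned field names are illustrative, not from any spec:

```javascript
// Minimal sketch: parse a version-0 'emsg' box from a Uint8Array.
// Layout: size(4) type(4) version(1) flags(3) scheme_id_uri\0 value\0
//         timescale(4) presentation_time_delta(4) event_duration(4) id(4) message_data
function parseEmsgV0(bytes) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const size = view.getUint32(0);
  const type = String.fromCharCode(bytes[4], bytes[5], bytes[6], bytes[7]);
  if (type !== 'emsg') throw new Error('not an emsg box');
  if (bytes[8] !== 0) throw new Error('only version 0 handled in this sketch');
  let offset = 12; // past size, type, version, flags
  const readString = () => {
    let end = offset;
    while (bytes[end] !== 0) end++;
    const s = String.fromCharCode(...bytes.subarray(offset, end));
    offset = end + 1; // skip the null terminator
    return s;
  };
  const schemeIdUri = readString();
  const value = readString();
  return {
    schemeIdUri,
    value,
    timescale: view.getUint32(offset),
    presentationTimeDelta: view.getUint32(offset + 4),
    eventDuration: view.getUint32(offset + 8),
    id: view.getUint32(offset + 12),
    messageData: bytes.subarray(offset + 16, size),
  };
}
```

A structured object like this return value is roughly what a DataCue could surface directly, sparing the app the byte-level work.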

cyril: TextTrack for MSE would hand the parsing to the JS, but the user agent would still have the cues and handle the synchronization. We don't even need to expose the time to the JS app, because it's the same time when the event is handed over to the app and when it comes back for rendering.

glenn: The call to the app would be synchronous?

cyril: Either way.

glenn: Asynchronous may be simpler, you'd need a handle.

cyril: The whole thing seems similar to WebCodecs.

padenot: In fact, that's the exact opposite.
... Things are still fuzzy though
... We did some experiments, that went well.

cyril: Yes, you don't depend on "time marches on"

jer: In a way, for custom parsing, the user agent could just produce a DataCue and the JS app would create the right cue that gets fed back into the rendering
... Strawman: push data in MSE, get metadata samples exposed as DataCue, without needing to expose any specific interfaces to define.
... I'm just throwing that out as an alternative.

[discussion on putting decryption out of the picture since events are not encrypted]

jer: In summary, a mechanism to add custom support for currently unsupported timed events?

cyril: Yes

padenot: Metadata has been a problem for years. People routinely demux things in JS, e.g. to get ID3 out of it. Easy for MP3s.
... The load is not really complex in that case.

jer: It does require specific knowledge about timed events formats.

padenot: Yes, and the UA already parses the data, so double-parsing happens here.

cyril: Question is what's next?

cpn: We have a DataCue repo in the WICG. This would be great input there.
... In the IG, we ran use cases and requirements. This proposal gives us much more of what we're looking for.
... The ability to get events ahead of time is baked in. Separated from the triggering of the cue.
... I'm suggesting you use the explainer in the repo to iterate on this design and shape the API.

andreas: I think it could be one option. Your slides show that everything is connected. Possibly something that needs to be discussed together. I'm just worried that if it's just in the DataCue repository, it might be limited to the topic.

nigel: What's needed there is that the architectural components need to be separated.
... I have a similar point: it would be a real shame if we did all of this and didn't solve the synchronization aspects.
... One thing I'm conscious of is that the rendering side for audio sends samples to your digital audio converter. The rendering for video puts pixels in your video buffer. The rendering model for TextTrack is to parse JSON, create DOM fragments, and apply styles, which seems to take longer. I'm wondering if we need to do that earlier, to prepare things in advance.

David: Isn't that a bit tricky?

cyril: If the only place that is allowed to change the CSS comes from the in-band data.

jer: There's always a validation and caching problem.
... The web browsers are meant to render things very quickly.
... I don't know what the requirements are.

nigel: We've discussed thresholds. We sort of ended up with 20ms.
... The metric is to measure when the text is there on the screen.

jer: One of the problems we have is JS.
... One of the points is that TextTrackCue v2 absorbs JS to process the cue.

cpn: It's not only for text track cue placement, so need a general solution

jer: We do as much as we can to keep things out of JS for this reason, not to have to get back to the main thread.
... That said, I ran an experiment. The average latency is 4ms on main browsers between when the event is triggered and when it's handled by JS.
... So things may already have been addressed.

wolenetz_: Couple of questions. We're trying to get reliable synchronization between in-band events and what? JS?

nigel: The strong sync requirement is related to output.
... There's good sync for handling of the input.
... The metric we need is for the output.
... Changing display of a subtitle caption.
... Or real-time sync with audio handled with Web Audio API.

cpn: There's also the ad-insertion case where the event triggers a switch in the video.

jer: It's not the time required to parse the metadata. It's more about display and rendering.

wolenetz_: If I understand correctly, to get strong sync, we need either to offload the custom processing of data cues to JS on a separate thread, or media types that the MSE parser should understand. There's a large gap between the two.

jer: At the MSE level, the sync issue seems less important than rendering

GregFreedman: There seem to be two things here: TextTrackCue v2 and this thing. If it's all done in advance, do we really need MSE?

jer: The thing is that there may be timed events in the media stream already and you don't want to do the demux twice.

nigel: Alternatively, it would also make sense to push all of our components through the same pipeline.

jer: Yes, we are kind of combining 2 discussions in one use case. It's good to have an overview picture.

nigel: Wondering if there's a model we can think of to change the firing of cues.
... Instead of going through the "time marches on" algorithm and the browser's idea of where the time probably is.
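
To make the "time marches on" behaviour nigel refers to concrete, here is a hedged, heavily simplified sketch of one pass of cue activation (names are illustrative; the real HTML algorithm does much more, including ordering and event dispatch):

```javascript
// One simplified pass: given the previous and current playback times,
// decide which cues become active or inactive.
// A cue counts as active when startTime <= t < endTime.
function cueTransitions(cues, previousTime, currentTime) {
  const wasActive = c => c.startTime <= previousTime && previousTime < c.endTime;
  const isActive = c => c.startTime <= currentTime && currentTime < c.endTime;
  return {
    entered: cues.filter(c => !wasActive(c) && isActive(c)),
    exited: cues.filter(c => wasActive(c) && !isActive(c)),
    // Cues jumped over entirely between the two times; the real
    // algorithm handles these "missed" cues specially.
    missed: cues.filter(c => !wasActive(c) && !isActive(c) &&
                             previousTime < c.startTime && c.endTime <= currentTime),
  };
}
```

The browser's idea of "where the time probably is" enters through the two time arguments, which is exactly the part the discussion above wants more control over.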

chcunningham: Through the MSE proposal, you'd have a separate source buffer for the text track data?

cyril: I would think so

chcunningham: Is this imagining new types of metadata that don't exist yet, or exposing metadata that already exist?

cyril: All the specs exist to do that. In practice, not a lot of people do that.
... I don't know about others.

jer: One possible use case is 608 captions.
... 608 will carry things in-band. Currently, in Safari, it shows up as a TextTrack.
... That would be one type of currently existing text track.

cpn: On that note, there is this "Sourcing In-band Media Resource Tracks from Media Containers into HTML" document that is in an unofficial state.
... Is it something that we'd want to rationalize?

cpn: From my reading, there are a number of things referenced from HTML where it's not clear whether they are supported.
... E.g. Media fragments, advanced media fragments.
... The fact that it is not REC is a concern for me.

wolenetz_: We in Chrome have not shipped widely in-band parsing through MSE. Was behind a flag, now the old code is removed.

jer: Neither Webkit.

<cpn> https://www.w3.org/TR/media-frags/

<cpn> https://www.w3.org/TR/2011/WD-media-frags-recipes-20111201/

Yongjun: We always use out-of-band. In-band has never worked in our experience.

wolenetz_: Getting back to the DataCue use case, seen some use cases around emsg.
... If there could be agreement about specific types, would that satisfy most needs?

jer: It seems that it's more difficult to implement than the naive approach to expose events when you get them.

nigel: I feel that the distinction between in-band and not in-band is not always clear.
... MPEG-DASH is fetching audio, video and text tracks from separate URLs for instance.
... Is that in-band or not in-band?
... There are schemes in DVB for putting TTML in transport stream. Mandated for set-top boxes in nordic areas.
... If you fragment that and send that through, you'd like that to be exposed.
... On the other end, the BBC always does its captioning stuff out of band.

chcunningham: Two worlds. Broadcast use cases would like to leverage in-band. With the DASH question, it seems it should have a clear answer.

yongjun: Both are possible in DASH. You can put things in-band but no one does that in practice.

David: DASH manifest is essentially in-band. The fact that it's processed in JS is secondary.

[side discussion on the definition of in-band and out-of-band]

andreas: Appropriate rendering of cues is a priority. For me TextTrackCue v2 goes in the right direction.

chcunningham: Yes, I'm trying to understand what the priority needs are.

andreas: The question is how to separate the different activities in different groups.
... WICG DataCue repository. A TextTrackCue proposal for WICG.
... PAL mentioned yesterday that he wanted to make responsibilities clear.

cpn: The new generic cue proposal would go in WICG.

hober: Yes, the main goal is to end up with updates on WebVTT specs

cpn: We kind of need a place where we can do the overall architectural piece.
... I guess we could use one or the other to do that.

jer: It seems that the architectural discussion does not need to generate technical spec. It could belong in Media & Entertainment IG.
... The DataCue portion, end goal is to do it here.
... For generic TextTrackCue, end goal is Timed Text

cpn: How would the interaction with WHATWG work?

hober: The currently envisioned working mode with WHATWG is to have them react when needed on CG repos.
... This room seems like a good place to discuss effective changes.

nigel: I'm just recognizing that there is lot of media activity going on in different groups. No group chartered to do horizontal reviews of media specs.

hober: Unofficially, the Media WG is the right group to do that.

jer: It's going to be the job of Mounir and I to coordinate discussions.
... To make sure that the different groups are aware of discussions when needed.

Yongjun: CMAF, WebM, other file formats, what's the integration story? Do we cover all of them?

nigel: In Timed Text, we have liaisons with a bunch of external organizations.
... e.g. CMAF might say "subtitles shall have IMSC1"

tidoust: regarding the sourcing in-band tracks, is anyone interested in working on it?

jer: it would naturally fit in scope for this group

tidoust: I'm wondering if there are people willing to update the document, and whether there's implementer interest should updates be made.

[silence heard]

[Media WG charter allows for group to take the spec on board through DataCue]

Media Source Extensions v.Next

See Media Source Extensions repo

Matt: For MSE v.Next, we currently are trying to figure out the editors. I'm happy to edit MSE
... Netflix will find someone, also Microsoft will try to find someone
... Is anyone else interested?
... How to discuss MSE on calls? Last time, we had dedicated calls

Jer: We'll rotate which specs need attention for the monthly calls, and can have topic specific calls
... If MSE needs more time, we'll figure it out

Matt: Some maintenance work is happening on the W3C repo, and incubation in WICG repo with branches for each v.Next feature
... Would like a better idea of the process for incubating v.Next features, and how to merge upstream

Jer: I suggest upstreaming the existing WICG work as the starting point, then we can do PRs against the newly upstreamed spec
... Versioning will be the hard part

Matt: It's more complex for MSE than EME, as there are some old things
... How can we simplify? Will follow up with the team about how to manage the branches and v.Next
... The only incubation feature with a shipped implementation is codec switching
... We have some tests in WPT for that feature
... There's a clarification added to MSE for codec parameters for addSourceBuffer and canChangeType: browsers are not required to accept them, and we're relaxing Chrome's requirements around that
... Are there IPR considerations around v.Next MSE?
... Can we bulk upstream everything from WICG into the W3C spec?
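
The codec-parameter relaxation described above suggests a defensive app-side pattern: probe type strings from most to least specific before creating a SourceBuffer. A hedged sketch (the helper name is illustrative; the probe function is injectable here so the logic runs standalone, but in a page it would typically be MediaSource.isTypeSupported):

```javascript
// Return the first candidate MIME/codec string the probe accepts,
// or null if none is supported.
function pickSupportedType(candidates, isTypeSupported) {
  for (const type of candidates) {
    if (isTypeSupported(type)) return type;
  }
  return null;
}
```

An app might call this with ['video/mp4; codecs="avc1.64001f"', 'video/mp4'] so it still works against implementations that reject full codec parameter strings.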

Francois: No problem to merge upstream. At some point we'll publish FPWD. We have full IPR commitment with the Rec
... So don't worry for now. Publishing FPWD will trigger call for exclusions

Matt: The tests are in the same media source repo, is this the expected procedure? Do we want a folder for v.Next features?

Mounir: We could keep that, for backwards compatibility. Want to avoid breaking stuff. Can put things into MSE, don't see the need to separate them

Francois: I think the people working on the tests prefer to avoid versioning

<mounir> ACTION: mounir to talk to foolip to double check whether versioning is needed for MSE v2 WPT

Matt: We'll continue using Respec. What were the problems with EME regarding Respec?

Mounir: There's no problem

<tidoust> ACTION: tidoust to exchange with wolenetz on setting up MSE repo, updating boilerplate, ReSpec, etc.

Matt: There was some related MSE discussion around reducing overhead for applications that take media, containerise it, only for MSE to decontainerise and play
... There's a proposal for adding new byte stream formats to allow injection of demuxed or raw audio and video frames.

Matt: We can follow this up after the meeting
... Yesterday, we discussed the latency hint. It seems this is meant to describe what happens after decode; these actions don't need to depend on what the source was
... I think a latency hint on the media element is for playback. We could think of use cases wanting to tie this hint to MSE behaviour, e.g., play through gaps
... Don't want to bind that hint to any of those things. Also garbage collection regimes
... Prefer to see this done on the media element, rather than MSE

Mounir: This seems to agree with the conclusion from yesterday

Matt: The next proposal, working on a prototype, is using MSE from a dedicated or shared Worker context
... Had a demo at FOMS, found some severe problems with it, related to implementation rather than the spec
... As it stands, the prototype doesn't use a new MediaSourceHandle object, it uses a URL to communicate the identity from the worker context to the main context
... Should have more to say, and an improved demo, by FOMS

Mounir: Do you have service worker in scope? Is it working in the prototype?

Matt: Service worker is different, it's about intercepting requests and servicing from a cache
... It's unrelated to Workers which is about threading

Mounir: I think it would make sense to use it from Service Worker

Jer: SW are shared across pages, what's the use case?

Mounir: I see the SW as the thing that does networking for the page, it's alive when the page is closed. Enables some offline use cases. When the page tries to play, it can serve segments from the SW cache
... I think there's benefit of doing that. From a spec point of view, SW is a kind of Worker
... Is SW out of scope for the spec? What do other implementers think?

Paul: I think it's not good, I don't see MSE in SW
... I'd have to see real use cases, I'm skeptical

Jer: There are implementation restrictions. The difficulty would be connecting the two processes
... I agree with Paul, we'd have to have concrete use cases to judge this proposal against

Mounir: Do we have many APIs not available in SW in the platform?

Paul: Yes

Matt: There seems to be consensus that exposing MSE in SW increases complexity, source buffers and GC issues. I agree about seeing use cases that can't be polyfilled

Mounir: Can you fetch from a DedicatedWorker?

Jer: Yes, then the fetch is interposed by a SW if it exists

Paul: Yes, and that does not mean there is going to be a lot of memory copying

Matt: I believe Facebook does the fetching from a Worker context and hands it off to the main context for MSE

Matt: I'm also working on eviction policies
... Please read the proposal
... A use case is game streaming, where low latency live is critical, and want to minimise delay by having a single keyframe, infinite GOP
... MSE doesn't work well with that, it has to buffer everything, a keyframe and all dependent frames are treated as one unit for buffering and GC
... I'm working on simplifying this to the core. Should seeking be allowed? What about seekable and buffered ranges
... I'll have a prototype implementation in Chrome to look at in more detail at next FOMS

Paul: What's the use case for seeking?

Matt: Seeking to the live head, if you've fallen behind.
... If you're using the infinite GOP mode, the keyframe may already have been dropped from the buffered range, so it may not be available for seeking
... There's potential for race conditions between seek, decode, and playback
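
The seeking constraint Matt describes can be sketched in a few lines. This is a hedged, illustrative model (field names are made up): with an infinite GOP, a seek can only start from a keyframe at or before the target, and if that keyframe has already been evicted the seek would stall or be disallowed:

```javascript
// Given buffered samples sorted by time, each with an isKeyframe flag,
// find the latest keyframe at or before the seek target.
// Returns null if no usable keyframe remains in the buffer.
function seekEntryPoint(samples, target) {
  let entry = null;
  for (const s of samples) {
    if (s.time > target) break;
    if (s.isKeyframe) entry = s;
  }
  return entry;
}
```
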

Jer: Could solve this in the spec by disallowing seek with infinite GOP
... Could set playbackrate to Infinite to catch up
... Decode as fast as possible

Matt: Could disallow it, or allow but stall if seeking to a range without a nearby keyframe
... I'm investigating the complexities in Chrome
... There's a policy that could collect everything before the currently playing GOP
... GOP is codec specific, so we'll need to update the proposal and spec to be less specific
... I'd like to get help with purgeable or pre-emptive GC
... This could be used to prevent the UA from running out of memory, by not waiting for an explicit remove call
... I would like help with the spec for that
... Not all implementations may be able to do that, and we wouldn't want it to become the default mode
... Jer, would appreciate your help with that

Jer: OK

Jean-Yves: It's hard to know before implementing, there can be nasty surprises
... so it's hard to comment on what we may need. I'm looking forward to seeing the prototype

Matt: What's a keyframe, something that's signalled as such, or something that actually is a keyframe?
... Issue #156. When MSE was first worked on, createObjectURL created URLs that were auto-revoking
... The implementation would revoke immediately, so couldn't use in a later event handler
... From discussion at FOMS, Firefox still does this.
... Now there's a createFor method for auto-revoking object URLs
... If we're using these from a Worker context to communicate to a main thread media element, there's a race condition if we use the original form of object URLs
... Chrome doesn't do auto-revocation currently, so there's an issue with media elements and objects being kept alive
... Working on a new form where things can be removed, and delay auto-revocation.
... One complexity from auto-revocation is that it's diverged from the MSE spec; we'll need to coordinate with the File URL folks

Matt: Issue #160 discusses ways to solve how an app can tell an implementation what to do when it hits a buffered range gap
... Solving interop issues, as well as trying to prevent stalling, and make the implementation more relaxed with respect to gaps
... And seeking forward in infinite GOP. Would like this in v.Next implementation, but not at top of my priority list

Jer: The new editors could take a pass through the issue list, and bring the list to the group for further triage

Matt: Sounds good to me

Jean-Yves: For MSE v.Next, the most requested feature we see is dealing with missing data or gaps, and eviction policy for low latency video

<Zakim> mounir, you wanted to talk about MSE in Workers a bit more

Jean-Yves: There was a bug from David at BBC, it would stall on one browser and not on another
... having a uniform approach to dealing with gaps: should we wait for data to be appended, or should we skip over it?
... In HLS.js, if they see a gap, they seek over it

ChrisN: Want to keep all viewers at the live playhead as much as possible

Jer: How does it interact with I frames?
... The spec says you must pause at the end of the buffered range. Could specify a time limit

Matt: Some kinds of gaps may not be full gaps, maybe the audio could play through but not have enough video

jernoble: there are two kinds of gaps: known by the application, and unknown. one potential to solve the application-case would be to allow the application to explicitly coalesce ranges.

Matt: Should we coalesce the buffered ranges? The app would have to poll for unexpected buffered ranges

Jean-Yves: If the gap is small and will be ignored, should we reflect this in the buffered ranges?

Jer: I think we do already, it's a CPU problem to poll for the buffered ranges
... If we decide to add spec language on which ranges to skip, we'll also specify how the buffered ranges would reflect them.
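
One way the coalescing idea from this discussion could look, as a hedged sketch (the tolerance value and function name are illustrative, not spec'd): gaps at or below a tolerance simply disappear from the ranges the application sees.

```javascript
// Merge buffered ranges whose separating gap is <= gapTolerance seconds.
// Ranges are { start, end } objects; input order doesn't matter.
function coalesceRanges(ranges, gapTolerance) {
  const sorted = [...ranges].sort((a, b) => a.start - b.start);
  const out = [];
  for (const r of sorted) {
    const last = out[out.length - 1];
    if (last && r.start - last.end <= gapTolerance) {
      last.end = Math.max(last.end, r.end); // absorb the small gap
    } else {
      out.push({ start: r.start, end: r.end });
    }
  }
  return out;
}
```

Whether the reported ranges should reflect this coalescing, or stay faithful to the actual appended data, is exactly the open question above.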

Matt: Two ways of looking at it. One idea is to let the media element continue to describe what the playback behaviour would be
... Or maybe the sourceBuffer is the place to look at the gaps and see how they've been coalesced.
... Proposal didn't allow apps to report the gaps

Jean-Yves: If you have no video but audio can play through, you don't want to have to wait for the video

Jer: Should we have different gap-skipping behaviour for audio vs video tracks?

Jean-Yves: Gaps within the same sourceBuffer are reflected in the source's buffered range. Then there are gaps due to missing data at the intersection of two buffered ranges; this is data that will not come

Jer: The ability for a client to bridge gaps on a source buffer basis...

Matt: Most MSE players use one track per SourceBuffer, but there's no notion of per-track buffered ranges in a multi-track source buffer, so you'll see gaps
... Should file an MSE issue to get some notion of track buffered
... Useful for implementations using muxed content

Jer: CPU usage was high due to requirement to create new buffered range objects from a polling loop
... HTML seems to have changed such that bufferedRanges doesn't require a new object to be created, may want this in MSE as well

Matt: That would help

<tidoust> See the definition of the buffered attribute in HTML
... and note "The buffered attribute must return a **new** static normalized TimeRanges object"
... completed with the warning "Returning a new object each time is a bad pattern for attribute getters and is only enshrined here as it would be costly to change it. It is not to be copied to new APIs."
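
The better pattern HTML now recommends (returning the same object until the ranges actually change) can be sketched as follows; this is an illustrative app- or engine-side memoization, not an API from any spec:

```javascript
// Wrap a "read the current ranges" function so repeated calls return the
// same object until the underlying ranges change, so a polling loop
// doesn't churn allocations and can use identity comparison.
function makeBufferedGetter(readRanges) {
  let cachedKey = null;
  let cachedValue = null;
  return () => {
    const ranges = readRanges();
    const key = JSON.stringify(ranges);
    if (key !== cachedKey) {
      cachedKey = key;
      cachedValue = ranges;
    }
    return cachedValue;
  };
}
```
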

Matt: There's room for improvement for the app to tell the implementation what to do. Should it stop, or let time march forward, or skip to the earliest buffered thing, lots of options.

Matt: I'd like some concrete use cases. Keeping up with the live edge is a good one
... May not be solved by what's proposed so far. I haven't had time to look at this, concrete proposals are welcome.

Jean-Yves: Eviction policy: you can only evict when you get new data

Jer: It's bad at the end of video playback, where we hold onto the buffered data unnecessarily. It has been requested by people at Apple concerned about memory usage on limited-memory devices
... This one might be worth prioritising by the editors

Matt: We experimented in Chrome with pre-emptive eviction, but didn't see much improvement in the playback metrics

Jean-Yves: Also out-of-band evictions

Jer: We can't change behaviour of existing applications

Matt: Bad for apps already tuned to the existing eviction policy

<jya> for information: sourcebuffer.buffered needing to return the same object if it hasn't changed

Matt: The newer eviction policies would certainly be more aggressive

Media Source Extensions in Workers

Mounir: I talked to one of the editors of SW, his rule of thumb for exposing APIs is to expose everything unless it's a foot-gun
... createObjectURL is disallowed in SW, because it's linked to the lifetime of the Worker
... What's the latest on how we pass the data back to the page from the worker?
... An object URL or a transferable object?
... for any kind of Worker

Matt: The Worker creates an MSE object URL and postMessages it to the main thread (although can have transitive Workers)
... It's just a string
... We can't use createObjectURL from a SW, so would prevent us from using this API from a SW
... Could we create a MediaSource object from a SW, and what would it mean to attach this to a media element?
... Trying to attach to multiple media elements will fail

Mounir: I'm worried that we design something that can't be used from this new part of the platform

Jer: Seems to be a question for the SW group. If we created a transferrable object, I'd see issues with SW lifetimes and long-lived objects

Mounir: I want to avoid having MSE unavailable in SW come as unexpected behaviour

Jer: We could file an issue on whether to expose a MediaSource in a SW context, then discuss with the SW group
... This could also be a WebIDL issue, would need to check that

Matt: What scenario where MSE in SW is required, that couldn't just be solved by SW as proxy and offline cache?
... It would have to be some small amount of media, because of buffering in SW cache

Paul: One scenario is extremely tight real-time video where you want to avoid context switches. But this may be better solved in WebCodecs

Mounir: It could be that scenarios arise in future, so we need a strong argument not to do it

Jer: I disagree. This is going to be hard to specify, so we do need use cases to drive it
... It'll need a lot of spec language

Paul: It's similar to putting AudioContext into SW, or WebGL, why do it?

Matt: The object-URL-based approach doesn't lock us into something where we can't later use a transferrable MediaSource

Jer: We can use Issue #236 to collect use cases

Media Source Extensions - Demuxed and raw frames

See [Proposal] Allow Media Source Extensions to support demuxed and raw frames

Jean-Yves: [introduces demuxed and raw frames proposal] Lots of settings are not present in the current proposal, e.g. for image decoding you need decoded size and display size. Also crypto may be per-sample, etc.

Matt: How would encryption information be transmitted with this proposal?
... Extensibility is a good question. A web app on a TV might have a very long lifetime, what if content providers need an extension to the format, what happens with previous implementations?

MarkW: Need a way to append with raw data

Jean-Yves: I see this as an important feature

Matt: This is a problem people hit all the time, remuxing in JS just to pass to MSE
... Extensibility is a concern, but how valid is the concern? We could add a new byte stream format to the registry

Yongjun: So we're changing MSE and EME to more of a frame player?

Jean-Yves: It's enabling use without a container

Yongjun: The container needs to be there for timestamps and init data, we still need these

Jean-Yves: You also need the decoding timestamp for H.264. It's used in MSE to determine if you have a gap

Yongjun: Try to avoid using the term "frame", prefer "access unit", it's a more comprehensive term

Matt: In MSE we use the term "coded frames"

Jean-Yves: yes, "sample" and "coded frame"

Matt: MSE tries to abstract itself from specific codecs

[discussion of terminology, PES packets]

Yongjun: What about pass through mode?

Jean-Yves: All this deals with compressed data, which may contain video, audio, text. At this stage we don't know

<jya> Gecko calls them MediaRawData

Matt: Thank you, looking forward to collaborating with you on MSE v.Next

Jer: Thank you too

Media Playback Quality

See Media Playback Quality repo

jer: Relatively small spec. Pretty solid, not many issues.

chcunningham: Copied and pasted from MSE.
... Chrome never launched the API. We have the webkit prefixed things though, which is weird. I'd like to fix that.
... Some of the issues are pretty straightforward. Some of them could trigger backward incompatible change

chcunningham: For issue #7, I think we should, because it has shipped in different browsers.
... For issue #3, apps may be interested in hearing about changes instead of polling, which could be done with an event, but not possible with a read-only object that obviously cannot change.

chcunningham: Any appetite to changing the API and adding an event, or something else? I don't have strong opinions.
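
One backward-compatible shape for this, as a hedged sketch (names are illustrative and `getQuality` is injectable so the logic runs standalone; in a page it would be the element's getVideoPlaybackQuality): keep the polled dictionary, and layer change notification on top by diffing snapshots.

```javascript
// Returns a tick function to call on each poll; invokes onChange only
// when the dropped/total frame counters actually moved.
function watchPlaybackQuality(getQuality, onChange) {
  let last = null;
  return () => {
    const q = getQuality();
    if (last && (q.droppedVideoFrames !== last.droppedVideoFrames ||
                 q.totalVideoFrames !== last.totalVideoFrames)) {
      onChange(q, last);
    }
    last = q;
  };
}
```

A real event would replace the poll loop entirely, but this shows the diffing a UA (or polyfill) would do either way.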

GregFreedman: I like the idea of having an event, e.g. when it's dropping.

padenot: Can we create a new member?

jya: With regards to the buffered range, the comment is that if it's an attribute, then you should always return the same object.

padenot: Maybe that's the right road. What constitutes a meaningful change may be hard. We observe dropped frames on load, when it does not really matter.

jer: You're thinking of avoiding false signals?

padenot: Yes.

Richard: Resizing windows can also cause dropped frames

chcunningham: It's interesting to get into that even if we don't add any kind of eventing.

jya: Example of playing a video at twice the rate. Only one every 2 frames gets displayed. Are the other ones dropped or not dropped?

chcunningham: They were reported as dropped frames. And that's bad. Not a signal of hardware performance.

jer: The spec is pretty clear here that something is dropped when it misses the display.
... It didn't miss the deadline, therefore it's not dropped.

padenot: Yes, it's not composited.

chcunningham: To close on the issue, there seems to be an appetite for eventing but not for breaking backward compatibility, I'll take an action to come up with a proposal.

jer: Interface of read-only objects is fairly complicated. A dictionary would be better.
... An option is to have a callback when a value passes a certain threshold.

<scribe> ACTION: chcunningham to work with jer and propose an API that does some sort of eventing in a backward compatible way, and that converts the VideoPlaybackQuality object to a dictionary

GregFreedman: What happens when video is not in the foreground?

chcunningham: Chrome's smart about this.
... You should not be observing troubles in that situation.
... Are there any other dropped frame cases that we should talk about?

jer: Good question for the content table.

GregFreedman: Sometimes, we see dropped frames when the video starts. Not sure whether that needs to be reported.

jya: For me, the smoothest bit should be the start. That's the only time when we can guarantee that we have the info.
... To fire canPlayThrough, we wait until we have 5 seconds and 10 frame buffers

tidoust: What happens with Picture-in-Picture?

chcunningham: That's a good question. To be looked at.
... Moving on to corrupted frames.

jya: Does it ever happen?

chcunningham: That's my point. Chrome does not have that notion.

jya: We don't have the concept of corrupted frame either.

mounir: As far as I can tell, no one has.

chcunningham: I propose that we remove corruptedFrames from the spec.

PROPOSED RESOLUTION: Remove corruptedFrames from Media Playback Quality

RESOLUTION: Remove corruptedFrames from Media Playback Quality

[discussion on variable framerates]

jya: If we're late, we skip forward to the next keyframe. Chrome, for example, will actually pause. We'll always try to play the audio. Similar behavior to what the Flash plugin used to do.

chcunningham: That makes sense.
... The case I wonder about is people using MSE for live.

jya: It actually makes more sense for real-time playback, because audio is the most important there.
... For 4K videos, Firefox may drop a good number of frames. But then Chrome pauses the video, so no dropped frames. Which is better?

chcunningham: We should file a GitHub issue and discuss that there

jya: It would be good to define what a smooth media experience is.

jer: Whatever we decide, I believe I should be able to convince the team that does this to adjust behavior.

mounir: I just want to remind people that changing object to dictionary may be backward incompatible. Interfaces are exposed today. Mostly used for feature detection. I don't personally care because we didn't ship the interface. But others may.
... I don't know if you wanted that to happen.

padenot: We may have counters of usage. I can check.
... We removed the moz-prefixed properties.

mounir: We have the webkit- prefixed properties. It is used in practice.

Media Session

See Media Session

jer: A new API incubated in WICG to provide integration with platform media controls
... play, pause, skip etc. intercepted by JS and implemented by page

mounir: Chrome launched MediaSession on desktop for hardware keys
... working on UI in Chrome that would also use those keys
... stop action added to spec
... playback state too, so you can define how long the playback is, playback rate and position
... can design UI with scrubber
... spec changes not in Chrome yet. Planned.
... working on plugin to enable Websites to benefit from this
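The spec changes mounir lists (playback state, position/rate, the stop action) can be sketched against the Media Session API as specified; wiring stop to pause-and-rewind is an illustrative choice, not mandated:

```javascript
// Sketch: expose playback state and position so the UA can draw a scrubber,
// and register the basic action handlers.
function setupMediaSession(mediaSession, video) {
  mediaSession.playbackState = 'playing';
  mediaSession.setPositionState({
    duration: video.duration,        // how long the playback is
    playbackRate: video.playbackRate,
    position: video.currentTime,
  });
  mediaSession.setActionHandler('play', () => video.play());
  mediaSession.setActionHandler('pause', () => video.pause());
  mediaSession.setActionHandler('stop', () => { video.pause(); video.currentTime = 0; });
}

// Guarded so this is a no-op outside a browser that implements the API.
if (typeof navigator !== 'undefined' && 'mediaSession' in navigator) {
  setupMediaSession(navigator.mediaSession, document.querySelector('video'));
}
```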

paul: Gecko shipping this too

jer: WebKit also interested

<tidoust> Issue #233 Add "seek to start" and "seek to live" actions

cpn: doing live content, 24/7, segmented into shows
... have feature to restart from start of the show within live stream
... also seek to live position
... would be interesting to us to support these actions too
... more generally, what kind of actions should be added to MS beyond current set
... current set based on fixed duration on demand
... want to use with live content too

jer: do you present the live stream as an infinite-duration stream, or with a duration for the current show?

cpn: latter - DASH stream with available time range

yongjun: Do you have all the future segments in the DASH manifest

<scribe missed answer>

jer: also up to page to implement the behaviour when a hardware button is pressed

cpn: concern is "next track" and "previous track" buttons might mean next / previous program vs beginning / live position in current

mounir: actually previous track often used to go back to beginning of current thing
... try to use this API to expose hardware buttons without implying what they are
... on Android we expose these keys to Android MediaSession
... if something not supported ... ?
... don't know if 'start' and 'live' are common
... talked about defining names for these

cpn: issue would then be the UI

jer: may not have control of the UI to demonstrate that a button would be skip-to-live
... e.g. can't change label of skip buttons on YouTube
... but if you have a touch bar we might be able to label it - similar for PiP
... range of possible UIs means it would be hard to require buttons like this to be reflected in UI

mounir: have limited set of icons we can use

jer: however, if you let the UI know what actions are supported the browser can choose what best to show

cpn: so we would say we want
... next track / live track

mounir: recommend that. You'll get something on those platforms which have suitable buttons

jer: did discuss enum value for skip
... don't want to open it to arbitrary string
... limited set of localizable values
... e.g. select skip button and choose from set of allowed labels
... one of those could be skip-to-live
... could look into that, then BBC could prioritize the skip-to-live button and let the UI show it
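A sketch of how a page could opt into a skip-to-live action if one were added. "seektolive" is a hypothetical action name under discussion, not in the spec; the try/catch relies on setActionHandler() throwing a TypeError for action names the UA does not recognize:

```javascript
// Register a set of Media Session action handlers, skipping any action name
// the UA rejects; returns the list of actions that were accepted.
function registerActions(mediaSession, handlers) {
  const registered = [];
  for (const [action, handler] of Object.entries(handlers)) {
    try {
      mediaSession.setActionHandler(action, handler);
      registered.push(action);
    } catch (e) {
      // Unknown action name; fall back to page-level UI for this control.
    }
  }
  return registered;
}
```

A page could then prioritize its own skip-to-live button only when the UA refused the action.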

mounir: replied to issue

<tidoust> Issue #191 TAG Feedback: of all the potential metadata...?

cpn: In the TAG feedback, Travis asks why we pick artist / title / album
... my comment: those are very music-track specific. What are they for? Display? Might want more general purpose display fields
... or is the platform making semantic use ? e.g. recommendation based on artist. This is a whole space of media metadata.
... there are semantic web vocabs for this, schema.org

mounir: why specific to music: when Chrome did this mostly focussed on music. reflected priorities of the time. can change
... presentation vs semantic: not semantic, but do pass the information back to the OS on mobile
... metadata matters. Watch etc. tries to display based on what it thinks is important
... maybe it will favor artist over album. can't do that if you supply line 1, line 2, ...
... but actually we just try to display it all
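cpn's mapping concern made concrete: a radio show has to be squeezed into the music-centric fields. The mapping choices below (presenter to artist, series to album) are illustrative assumptions, and different pages could map differently, which is exactly the inconsistency being discussed:

```javascript
// Build a MediaMetadata init dictionary from a radio-show description.
// Field choices are a guess at the "closest fit", not anything specified.
function radioShowToMediaMetadataInit(show) {
  return {
    title: show.episodeTitle,
    artist: show.presenter,     // presenter name crammed into "artist"
    album: show.seriesTitle,    // series name crammed into "album"
    artwork: [{ src: show.imageUrl, sizes: '512x512', type: 'image/png' }],
  };
}

// Guarded: MediaMetadata only exists in browsers implementing Media Session.
if (typeof navigator !== 'undefined' && 'mediaSession' in navigator) {
  navigator.mediaSession.metadata = new MediaMetadata(
    radioShowToMediaMetadataInit({
      episodeTitle: 'Episode 3', presenter: 'Jane Doe',
      seriesTitle: 'Morning Show', imageUrl: '/art.png',
    }));
}
```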

cpn: concern is that if I have radio show I need to do my own mapping and then rendering will be different on different devices

jer: same information might be displayed in multiple places and we can't say e.g. how much space
... that said, artist / title / album could be generalized
... right now implementations are simple, though. space to improve

mfoltzgoogle: how many lines do you get today

mounir: today we show everything
... notifications more complex

mfoltzgoogle: if we add more semantic tags, we can't display them all at once. Who prioritizes? Browser, page?

jer: existing problem with actions - no priority score on which actions page thinks are more important. UA chooses
... only choice page has is binary to advertize support for the action or metadata or not
... don't think there is an alternative
... if we had schema.org syntax I'd hope we could use it
... how much metadata is on an iTunes track, for example? It doesn't all show up: we already have this problem.
... it doesn't get harder if we add more data now

mounir: schema.org - weren't aware of this enough when writing spec. did start a project to use schema.org as default values for media session in Chrome
... not shipping soon, but wonder if we should incorporate into Media Session spec. Not sure how easy this would be ?

cpn: very interesting for us

mounir: yep, looking into it in Chrome. Can show a nice UI today, but just the title

cpn: if we annotated using schema.org we'd like all UAs to pick that up

jer: are you saying an alternative to broadening the schema for metadata would be just to use page-level schemas?

mounir: MediaSession is imperative not declarative, but schema.org is targeted at search engines
... YouTube does not provide schema.org data if you are not a search engine
... need to have a better understanding or relationship between schema.org and the content metadata
... don't know if TAG would like this

cpn: Travis referenced schema.org

mounir: but it's a strange JSON-LD thing that browsers don't support
... don't want to copy-paste from there. good point. we need to look into this

cpn: would we be constrained by Android MediaSession ?

mounir: no, we would just convert if we are missing anything
... mostly title / artist

tidoust: can really go a long way describing these things with just a few base properties from schema.org "Thing"
... can already do a lot. CreativeWork underneath Thing
... very useful way to normalize everything in the world in a shallow structure
... including artwork
... more specific Things too with further properties
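If a page already carries schema.org JSON-LD, the bridge mounir and tidoust discuss might look like the sketch below; the property choices (name to title, creator to artist, partOfSeries to album) are assumptions for illustration, nothing the Media Session spec defines:

```javascript
// Map a schema.org CreativeWork-shaped JSON-LD object (as found in a page's
// <script type="application/ld+json">) onto a MediaMetadata init dictionary.
function schemaOrgToMetadataInit(ld) {
  return {
    title: ld.name || '',
    artist: (ld.creator && ld.creator.name) || '',
    album: (ld.partOfSeries && ld.partOfSeries.name) || '',
    artwork: ld.image ? [{ src: ld.image }] : [],
  };
}
```

The open question in the room is whether the UA should do this mapping itself as default values, or leave it to the page.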

mounir: my read is that noone else has anything to add

paul: there are some issues raised by Mozilla, #235, #237, #238, others (?)

mounir: #237 is WebIDL Boris bug

paul: #238 editorial, there is a patch
... PR#235
... constructor change needed everywhere

mounir: <merged the patch>

paul: Issue #234 is more substantial: MediaSessionActionHandler doesn't work for seek operations

mounir: had this issue with Permissions API and ended up using object
... made me very sad
... permissions API you have a descriptor and WebIDL - object bypasses the WebIDL

jer: same as generic TextTrackCue - a derived IDL interface can't accept a different type for the same method as its superclass

eric: you can use any
... we had a tag in the base class that you have to key off

paul: I'd like the same solution for all occurrences of this problem. object was used in the past

mounir: should we ask TAG ?

jer: might also be a question for IDL people

paul: let's punt until we have the relevant people
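For reference, the direction the shipped Media Session spec eventually took for issue #234 is a details dictionary passed to the handler, so seek actions can carry extra fields (seekTime, fastSeek) without changing the callback's IDL signature. A sketch of a handler for the "seekto" action:

```javascript
// Build a "seekto" handler that honors the details dictionary the UA passes.
// Falls back to setting currentTime when fastSeek isn't requested/available.
function makeSeekToHandler(video) {
  return details => {
    if (details.fastSeek && typeof video.fastSeek === 'function') {
      video.fastSeek(details.seekTime);   // approximate but fast seek
    } else {
      video.currentTime = details.seekTime;
    }
  };
}
```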

Picture in Picture

See Picture-in-Picture

mlamouri: pip api is launched in chrome
... proprietary (different) api launched in safari as well
... Mozilla has chosen to defer

jer: apple's api is not entirely the same, but it is compatible

mounir: companies have built polyfill to use either api
... the api nowadays is not very controversial
... considered a skip button, didn't pan out
... considering an auto-pip behavior
... chrome has some code for auto-pip, incomplete, behind flag
... also looking at integrating media-session and pip in chrome (integration detail)
... we should also discuss arbitrary content in pip
... lets talk about v1 topics first
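The polyfill pattern mounir refers to typically feature-detects the standard Picture-in-Picture API and falls back to WebKit's presentation-mode API. A minimal sketch, with error handling elided:

```javascript
// Enter PiP via whichever API the browser exposes.
function enterPiP(video, doc) {
  if (doc.pictureInPictureEnabled && video.requestPictureInPicture) {
    return video.requestPictureInPicture();            // standard API
  }
  if (typeof video.webkitSetPresentationMode === 'function' &&
      video.webkitSupportsPresentationMode('picture-in-picture')) {
    video.webkitSetPresentationMode('picture-in-picture'); // Safari API
    return Promise.resolve();
  }
  return Promise.reject(new Error('Picture-in-Picture not supported'));
}
```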

scott: issue #119
... we want to define when the controls should show in the pip window

jer: clarifying, the api would let the page tell the UA what controls to show

mounir: we (UA) may not know perfectly what will be shown

scott: the idea - when there are action handlers associated, is that when we should show the buttons for those actions?
... or should it be up to ua

jer: up to UA
... say UA is trying to implement but has no control over what shows up in pip window
... if they can control, then they can look at the installed handlers

mounir: site could disable the action via media session to hide controls in pip
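mounir's point in code: passing null to setActionHandler() signals that an action is unsupported, which would let the UA hide the matching control in the PiP window. The action names here are just examples:

```javascript
// Register handlers for enabled actions and clear (null) the rest, so the UA
// only surfaces controls the site actually supports.
function syncPiPControls(mediaSession, handlers, enabledActions) {
  for (const [action, handler] of Object.entries(handlers)) {
    mediaSession.setActionHandler(action,
      enabledActions.has(action) ? handler : null);
  }
}
```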

jer: issue #167, someone wants to flip camera rendering to simulate mirror mode while capturing in pip

mounir: this is v2
... the way its so far implemented we ignore any transforms

eric: it could theoretically be implemented

mounir: UAs don't have a clear strategy to go about it

jer: if they need to do this without a spec change they could do video -> canvas -> transform -> media stream -> video -> pip
... crazy, but doable

fbeaufort: spotify folks use this canvas trick to display a pop up video player

paul: that's very inefficient
... lots of main thread work, memory thrashing
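jer's video → canvas → media stream → video → PiP chain, spelled out. As paul notes this burns main-thread time, so it is a sketch of the workaround, not a recommendation; the redraw loop and canvas.captureStream() are the standard browser pieces involved:

```javascript
// Draw one mirrored frame of the source video into a 2D canvas context.
function mirrorIntoCanvas(ctx, video, width, height) {
  ctx.save();
  ctx.translate(width, 0);
  ctx.scale(-1, 1);                       // horizontal flip = mirror mode
  ctx.drawImage(video, 0, 0, width, height);
  ctx.restore();
}

// Wire the chain: redraw every frame, capture the canvas as a MediaStream,
// feed it to a second video element, and request PiP on that element.
function startMirroredPiP(srcVideo, doc) {
  const canvas = doc.createElement('canvas');
  canvas.width = srcVideo.videoWidth;
  canvas.height = srcVideo.videoHeight;
  const ctx = canvas.getContext('2d');
  const draw = () => {
    mirrorIntoCanvas(ctx, srcVideo, canvas.width, canvas.height);
    requestAnimationFrame(draw);
  };
  draw();
  const pipVideo = doc.createElement('video');
  pipVideo.srcObject = canvas.captureStream();
  return pipVideo.play().then(() => pipVideo.requestPictureInPicture());
}
```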

mounir: issue #163 about maximizing the window
... not a good idea for security reasons

eric: i agree

mounir: issue #166, add option to go full screen
... like the idea, but the UA would show its own controls, which sites may not like.

jer: is this bug asking for full screen on closing the pip, or go straight from pip -> full screen?

mounir: the latter

markf: has full screen been considered as a media session control?

paul: on android, if you tap on pip window in android there are os controls to go back to previous non-pip state, which may have been full screen

mounir: here it looks like user wants to do full screen from pip. real solution is to have a full screen media session action

chrisn: what would be the interaction if you requested full screen while pip is up?

mounir: if you click full screen in the player while pip is open outside, we swap to full screen

chrisn: similar question for remote playback

mounir: i think we don't re-inline window if you initiate remote playback

chrisn: so pip window would remain, just empty state?

mounir: yes, ideally we should fix that

jer: i think its a UX choice. designers call what to do with pip window on fullscreen/remote playback

mounir: we could add some non-normative suggestions for these

<scribe> ACTION: mounir to add non-normative suggestions for behavior for pip window if remote/full screen engaged from main player

jer: issue 156, integration with HTML

mounir: we got this feedback for a number of specs on moving from wicg to media wg

mounir: something they suggested was to merge into html spec to avoid monkey patching
... i'm opposed. we should avoid monkeypatching, but opposed to merging into the html spec because it's already a fairly large spec and it will split the pip spec into two places
... what does the group think

chrisn: i've seen a tag issue on this, concern about monkey patching in general

paul: html spec is already so big

mounir: what I hear is we could consider changing interfaces, but we generally don't want to?

paul: yes, i oppose changing the interface for theoretical purity
... i see dominic's point, but media is special. everything hangs off the media element, but we have a ton of specs that are different/complicated extending this outside html
... his concern is monkey patching, not extending the interfaces?
... can we do the opposite? take html media element to a different spec?

mounir: open to it, but suspect its not up for debate

jer: at some point ian conceded we could move to a separate spec, but it never happened
... agree it is a possibility

paul: is it self contained enough?
... could we grab a clean section?

eric: it does seem possible

jer: the media section is ~70 pages of the 900+ pages of the multipage html spec

mounir: so the room seems opposed to merging things back to html; we're open to moving media out of html

<scribe> ACTION: jer to discuss moving media out of html w/ hober

mounir: moving on to pip v2
... first topic: who is interested in an API that can do more than show a video?
... bbc, microsoft

jer: is netflix not interested?

greg: we tested it, but it's perceived as not a net benefit to users. still a possibility, but slim

jer: does netflix app on ipad do auto-pip

jya: yes it does, when you press home button

greg: netflix pip testing in general, haven't made final conclusion, but lean toward reserving the toolbar space for other things

mounir: most of feedback we got initially was for non-video additions to pip
... new buttons, etc (e.g. mute)
... Youtube, twitch, similar feedback
... folks clearly wanted to customize the window and the existing api was too restrictive

greg: let me add, web UI felt strongly about pip (positive), but it didn't improve our streaming metrics

mounir: netflix may not be the core pip use case - many cannot multitask with a pip movie and something else

mark: we have folks who are interested, will continue to provide input, but can't commit to a roll out at this point

mounir: we tried initially to do custom controls
... thought this would solve most issues (mute, full screen) etc
... but it had drawbacks: couldn't let folks use their own icons
... we pivoted to an API that lets you put anything in a pip window
... not a pop-up, still requested via pip api
... API looked weird, but it was feasible
... worried about dev experience, moving objects between documents
... most ua's reset video when it leaves a document
... for EME, would mean resetting keys (possibly not fixable)
... next: could we take any part of the dom and show in pip (similar to full screen today)
... killed this idea, pip window would inherit screen/window attributes of its parent, makes position/sizing very hard

jer: have you considered doing presentation API -> present to pip window?

markf: yes, in second screen group, explored a second window object
... resolved some of the issues for opening second window
... but still needed a post msg api to talk between the windows
... seemed to high a barrier to adoption

eric: js in the current page couldn't access the pip

jer: sites that might adopt pip v2, might also adop presentation api for casting
... they might just work together automatically. post msg could work the same for remote cast as to a separate window?

markf: there is a second browser context loaded, but presentation API users mostly run in 2UA mode, where cast is not necessarily presenting the same content as the original player

mounir: presentation api even for going full screen on a separate window seems difficult for developers
... the latest thinking (early draft), is to do an element that you could use to write some content, like an iframe, but not. would have different window/screen instances.
... can integrate with cast and full screen api
... aim: keep dev experience smooth as possible, let sites say the player is this special object, which changes how its rendered depending on its pip/fullscreen/inline mode
... very similar to iframe-seamless - old idea that was never implemented
... its a very big change for pip strategy
... but hope it resonates with how sites actually position players

eric: off top of head, sounds like a huge amount of work to impl
... looking at incremental benefit vs other issues, I'm not sure cost-vs-benefit is worth it

jer: maybe easier if it were actually an iframe?

eric: ppl tried to do this with magic iframe, it was a disaster

mounir: the inner frame would be a different execution context, but you could see/manipulate the doc from parent frame

jer: it might be easier impl/spec if we didn't try for seamless iframe here
... exploring an alternative: the original proposal seemed very spec heavy. but leveraging an iframe instead, letting it go full screen, could solve problems
... if we had said earlier that the only thing that can fullscreen is a document element, this would solve a lot of problems for mac UI (full screening inner elements is awkward for multi-desktop)
... if we do the same thing for pip, we may avoid similar issues

mounir: this implies every site would wrap a player in an iframe
... sites would not do this because it adds load latency
... having this seamless element is easier for sites to use, just harder for us to implement

markw: the amount of work required to impl needs to match the value

tidoust: could we simply do overlays on pip?

jer: our HI teams won't agree to even less invasive proposals
... no way they will agree to this

mounir: also considered a pip where we could paint whatever you want, but no interaction

jer: pip 1.1 could be to take a video element with associated gl canvas on top?

mounir: similar to what we proposed, sending part of the dom
... would require us to create a window with web content
... would reintroduce the issue where screen/window info is wrong

jer: if we had generic text cue, we would make it work in pip
... but not interactive

ericc: yes
... some open questions still

jer: so canvas overlay on pip will come naturally from our impl of generic text cue

mounir: this won't be enough for some sites
... in summary, apple seems to think this is too complex
... mozilla goes back to "defer" position
... technical aspect?

paul: I see the value, but seems very complex
... if it were implemented, it seems useful, but it seems very hard

jer: take a look at the history of magic iframe (failed)

eric: it was maybe a mode for iframe, definitely a webkit thing, not sure of time, caused all kinds of issues

Autoplay Policy Detection

See Autoplay Policy Detection repo

padenot: On the Web, videos are now often blocked from playing automatically, with sound.
... You can know you're blocked by calling play(). When playback is disallowed, the promise is rejected.
... This requires having a source, and requires mutating the state of the media element.
... Parties have requested an API to not mutate the video element, to know if autoplay would be allowed.
... For example, fetching different media with subtitles if autoplay is not allowed.
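The status-quo detection padenot describes can be sketched: call play() and inspect the returned promise. This requires a loaded source and mutates element state, which is exactly the problem:

```javascript
// Probe whether autoplay is allowed by attempting playback. Resolves to
// 'allowed', 'disallowed', or 'unknown'. Mutates the element (plays then
// pauses), which is the drawback motivating a dedicated API.
function probeAutoplay(video) {
  const attempt = video.play();
  if (attempt === undefined) return Promise.resolve('unknown'); // older UAs
  return attempt
    .then(() => { video.pause(); return 'allowed'; })
    .catch(err => (err.name === 'NotAllowedError' ? 'disallowed' : 'unknown'));
}
```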

dbaron: Seems to be a property of the media, if it has sound.

padenot: Not a property of the content, whether it's muted.

Tess: Safari detects whether the media has an audio track to determine autoplay.

mounir: The use case from FOMS is to ask whether to show an ad with sound or without sound, the checks take too much time.

padenot: 2 ways to achieve this: per-document or per-element.

[Reviewing proposal from TPAC 2018]

padenot: API would return what play() would return.
... Metadata is important. You don't know if you have an audio track or not without it.
... Webaudio has an event. Needs to know if it's slow to start, or won't start at all.
... document-level API resolves this.
... there is a new readonly attribute on document that returns a value for the enum.
... allowed, allowed-muted, disallowed, unknown.
... unknown is for a per-media-element policy.
... Firefox can ban all autoplay (even w/o sound).
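The per-document proposal sketched against the FOMS ad use case. The `autoplayPolicy` attribute on document and its enum values (allowed, allowed-muted, disallowed, unknown) are from the proposal under discussion here, not a shipped API:

```javascript
// Pick an ad creative based on the document-level autoplay policy.
// 'unknown' means a per-element policy applies and the element must be checked.
function chooseAdCreative(doc) {
  switch (doc.autoplayPolicy) {
    case 'allowed':       return { video: true,  muted: false };
    case 'allowed-muted': return { video: true,  muted: true  };
    case 'disallowed':    return { video: false, muted: true  };
    default:              return { video: false, muted: true  }; // 'unknown'
  }
}
```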

jer: So does Safari.

padenot: We need a thing that returns a value.
... 2 ways. A sync API, readonly attribute that hangs off the document.

jer: Answer can change over time based on user interaction. Answer will change from call to call.

mwatson: Will the spec define enum values, what autoplay means?
... Make it clear to developers what is allowed and not allowed.

mounir: Would like to go into a state where it's better defined (like muted).

jer: [Question about user gesture] It's not universal. We didn't have a definition of user activation until recently.
... would allow us to make a normative reference.

jya: Mozilla used to have a banner for autoplay. We replaced it with a user gesture.

mwatson: Sites broke because behavior was not well defined.

Tess: Distinction between defining the browser behavior, and writing the spec.

jya: This one is designed for our users, versus what content providers want.

mounir: User activation will define how long the gesture is allowed, through a Promise chain.
... today when you click, some browsers will work in event handler, others will use a timeout.
... musta's work will use a timer or propagate in a setTimeout.
... just exploration at the moment.

padenot: Contention is shape of API. Pros and cons.

mounir: sync is clear and simple. con is that policy impl is complicated, overhead.
... pro for async: does not require prior computation. Con is more complicated, some delay.

jya: With async, user actions may change the answer.
... Answer could change in an event loop.

mounir: Permissions API has an event. Site can be aware of change if they want to.

jya: if you had a sync attribute, and event, how would Chrome implement it.
... Whenever someone reads it, it returns unknown; read it again, it returns a different value.
... we could add a new state since unknown means check the element.

mounir: Today Chrome has to do a lot of work to get the autoplay policy, we have settings, we have a pre-seed list.
... we have an override mechanism for users, and a pre-seed list.
... we need the content settings, we need the database, and we need the pre-defined values.
... today it's every time we go to a page. We would like to do this lazily.
... there is some expectation that paused stays true when you start play().
... could change since play returns a Promise.
... a browser in the future could need an async API. The only problem here is a microtask.

ericc: Doesn't the browser need it for the autoplay attribute?

mounir: We could do a roundtrip in the browser. Chrome doesn't use the attribute until it's visible, no timing expectation.

<Zakim> markw, you wanted to ask another question about the definition of autoplay, not to comment on the a/synchronous issue

David_Baron: I feel like it's close to the boundary

padenot: This is implementable when doing complicated things, by getting info on page load.
... Already a cross-process message. We hit the database when you fetch the website.
... We send this one bit, autoplay allowed or not. User won't have a delay.

Tess: Unclear that autoplay processes can be improved by disk access. People seem to do well with small databases.

jernoble: The stakes of this argument are low. We don't have a clear answer for design principles.

Tess: Action item to make principles clearer.

jernoble: World with more databases, more processes to load a page. If something intrinsically requires a DB, or a fetch, then a Promise is the right choice.
... If it doesn't require a network access.
... It may not need a Promise.

mounir: Permissions API, TAG asked us to make the API async. If query, which is much simpler than autoplay, is async, why would autoplay be sync?
... autoplay is more complex than permissions. More states than a key-value entry.
... Permissions had to be async, because it's a database access. Not reasonable to dump autoplay DB into renderer.
... If we have a good reason to believe it will evolve, then async.

jya: Complexity of the implementation in Chrome.
... If improving means that the result will be slower, it's not an improvement.

David_Baron: Implementer needs are not always lowest. Our ability to evolve the platform, don't make one thing perfect at the cost of other attributes.

padenot: The solution is already implemented. As a user it works.

mounir: It slows down the loading of Chrome. It increases memory usage. Load data we don't need.

jernoble: We care about page load time, from a fresh launch. If something affects page load time, someone will care.
... Even if the memory usage is amortized over page loads.
... We will need to take care of edge cases with an async API, it's not free

mounir: On iOS you need to wait for a user gesture.

Sangwhan_Moon: From a user perspective, sync makes more sense. Impl complexity argument doesn't make a lot of sense, personally.
... Would making it async introduce anti-patterns, reading multiple times?

jya: Typical use, if <blah> then loading my page.

Sangwhan_Moon: What if it took 200ms?

mounir: Most implementations would call play on the result.

jya: Typical: can I autoplay, or show a popup to allow autoplay.
... news page, please click to enable your sound.

mounir: Would a delay be negative for the user? Ads use case, need to show right away.

jya: Ease of implementation, add a message if the policy should change. Event when the policy is known.

mounir: We won't do that.

David_Baron: Writing content for Chrome, others using event and not looking at value.

mwatson: What is not an autoplay? When the user has requested playback.
... I've put up a play button, the user experience could be a sequence of videos.
... Our definition of autoplay needs to account for that. That's an experience we want to deliver.

Summary of Action Items

[NEW] ACTION: mounir to add non-normative suggestions for behavior for pip window if remote/full screen engaged from main player
[NEW] ACTION: chcunningham to work with jer and propose an API that does some sort of eventing in a backward compatible way, and that converts the VideoPlaybackQuality object to a dictionary
[NEW] ACTION: jer to discuss moving media out of html w/ hober
[NEW] ACTION: mounir to talk to foolip to double check whether versioning is needed for MSE v2 WPT
[NEW] ACTION: tidoust to exchange with wolenetz on setting up MSE repo, updating boilerplate, ReSpec, etc.

Summary of Resolutions

  1. Remove corruptedFrames from Media Playback Quality
[End of minutes]

Minutes manually created (not a transcript), formatted by David Booth's scribe.perl version 1.154 (CVS log)
$Date: 2019/09/24 08:45:29 $