W3C

Media & Entertainment IG, Timed Text WG, Media WG Joint Meeting

15 October 2020

Attendees

Present
Ali Begen, Andreas Tai, Atsushi, Barbara Hochgesang, Chris Needham, Cyril Concolato, Eric Carlson, Franco Ghilardi, Francois Daoust, Gary Katsevman, Germain Souquet, hober, Jim Helman, Joey Parrish, John Riviello, John Simmons, Kaz, Louay Bassbouss, Mark Watson, Masaru Takechi, Mounir Lamouri, Nigel Megitt, Peng Liu, Philippe Le Hegaret, Pierre-Anthony Lemieux, Rob Smith, Satoshi Fujitsu, Takio Yamaoka, Tohru Takiguchi, Will Law, Zachary Cava
Regrets
-
Chair
Chris
Scribe
cpn, tidoust

Meeting minutes

Agenda bashing

cpn: 2 hours for this, we may not need the whole time. 3 main topics to cover. We'll start with an update from Nigel on Timed Text and audio description, then the TextTrackCue proposal. Finally, I'll give an update on where DataCue is.
… There was the possibility to talk about TextTrack support in MSE. I don't know if anyone on the call is able and willing to talk about that today; feel free to chime in if you are!

Matt_Wolenetz: I'm available if people have questions about TextTrack support in Media Source Extensions.

cpn: Excellent, maybe we'll do that near the end of the call then

Audio Description profile

nigel: Chris introduced this topic as being from the Audio Description CG. That's where it originated. Since then, it has moved to the Rec track in the Timed Text WG.
... Not a lot of progress since the document transitioned, I must say.
... Additional energy from people would be most welcome.
... [showing ADPT specification]
... It's an exchange format for audio description scripts and mixing instructions.
... There are some examples in the specification. Based on TTML
... [going through examples in the spec]
... You can end up with complicated instructions to control the gain and so on.
... The mixing could be done on the client, or server-side to generate a separate audio track; that does not matter.
... The sort of implementation that we've published here has some video and showcases the benefits of the approach.
... One is to adjust the relative volume of the audio description compared to the main program.
... Another is making this available to assistive technologies.
... [goes through Adhere demonstrator of client side AD with ADPT]
... All of the features that we're using here are in TTML2. This is a profiling activity.
... We need a substantive part of it.
... It appears that it should be quite easy to get this done and into standards space, but it just needs some energy, and I've had other priorities.
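
A minimal sketch of the client-side mixing path described above, assuming Web Audio is used to duck the programme audio and mix in a separately fetched description clip (the element lookup, clip URL and gain values are illustrative, not from the spec):

    // Sketch only: duck the main programme and mix in an audio description clip.
    const ctx = new AudioContext();

    const programme = document.querySelector('video')!;        // main programme element (assumed)
    const programmeGain = ctx.createGain();
    ctx.createMediaElementSource(programme).connect(programmeGain).connect(ctx.destination);

    const description = new Audio('description-0001.mp3');     // illustrative AD clip
    const descriptionGain = ctx.createGain();
    ctx.createMediaElementSource(description).connect(descriptionGain).connect(ctx.destination);

    // When a description is due, lower ("duck") the programme and bring up the description.
    function playDescription(atTime: number) {
      programmeGain.gain.setValueAtTime(0.3, atTime);          // gain values are illustrative
      descriptionGain.gain.setValueAtTime(1.0, atTime);
      description.play();
    }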

cpn: My main question is implementation support. Presumably, all can be done through JS?

nigel: Yes, this is what we use here, through the Web Speech API, and so on. It would be kind of nice if Web Audio finally got to Rec, but Web Speech is obviously not being worked on for now. There is a whole area of discussion to be had about the need to provide speech fonts.
... From a BBC perspective, it would be nice to be able to provide a BBC voice!
... Something to think about.
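
A rough illustration of the Web Speech route mentioned here: a description cue's text could be spoken when the cue becomes active (the metadata track, cue times and text are assumptions for the sketch):

    // Sketch only: speak a description when its cue becomes active.
    const video = document.querySelector('video')!;
    const adTrack = video.addTextTrack('metadata', 'Audio description');

    const cue = new VTTCue(12.0, 16.5, 'She opens the door and looks outside.'); // illustrative
    cue.onenter = () => {
      const utterance = new SpeechSynthesisUtterance(cue.text);
      utterance.rate = 1.1;          // could be driven by the script's mixing instructions
      speechSynthesis.speak(utterance);
    };
    adTrack.addCue(cue);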

plh: Web Audio is actually done. We're just wrapping things up; Rec should be by the end of the year or slightly afterwards.

cpn: What about a speech synthesis API?

plh: Not in scope of a Working Group. Not aware of any recent discussion on the topic to be honest.

cpn: Any other perspective, from implementors perhaps?

RobSmith: This strikes me as something similar to what we're doing with metadata. Isn't there going to be some latency issue as you need to download remote assets?
... Don't you have synchronization issues?

nigel: Thanks for the question. That anticipates the point I wanted to raise. Fetching resources is one potential issue, and the second is synchronization of playback.
... The sensitivity of timing for audio is relatively high. You may end up missing the beginning or having some dialogue at the end.
... It still works pretty well.
... [giving it a try with another demo]
... The browser code had to catch up. That illustrates the point very well. If you rely on more local processing or fetching of resources, you may run into problems. That's a good argument in favor of native support for the whole thing.

JohnRiv: Similar to what we've been discussing in the Web Media API guidelines. What's the recommended way for the user agent to handle synchronization between video playback and Web Audio?

nigel: It's a real-world issue in some devices and a good question.
... If we were to do it from a BBC perspective, we'd probably have to do it in a pre-rendered mode, on the server.
... And we would lose some of the benefits of it.
... Additional energy to develop the spec would be welcome. People in the group would probably be willing to do some implementation work.
... If it's just me pushing for it and nobody else, it probably should not happen. It would just be a shame.

cpn: What could we do to change that?

nigel: We have a number of companies represented in the call. Come and have a chat with me!
... It does not need much, but it does need some.

Proposal for a new Text Track interface

Tess: Eric and I first pitched this at FOMS a couple of years ago
… iterated on it, got feedback at TPAC last year
… we've made a lot of progress
… I'll talk about the problem we're trying to solve.
… Today, if you want to deliver captions to the browser, you can use WebVTT for delivery and display,
… or you can do it all yourself
… A lot of people do it themselves, not using WebVTT.
… There are conversion costs: with a large library of content with subtitles, it would be expensive to convert to VTT while preserving fidelity
… Storage costs. Also YouTube does dynamic caption generation, and the current API doesn't cater to that use case
… Bespoke is costly. You have to write your own renderer and maintain it.
… Accessibility costs, different jurisdictions have requirements, e.g., for user customization of caption display
… VTT served to the browser gets that for free, from the device
… You have to manage that yourself if you roll your own
… Platforms have picture-in-picture, which works with built-in caption support
… Performance costs. If you rely on the browser's built in media stack, including captions, you're more likely to get frame accurate display of captions
… [Shows Mac OS accessibility preferences page]
… At TPAC 2019, we proposed to split the built-in browser feature in half
… to decouple the native display code from the VTT parser and processor
… We came up with an abstract JSON model, generate a JSON blob and hand to the browser
… Should be straightforward to generate from common caption formats
… [Shows JSON example]
… It looks like a serialization of HTML in JSON.
… Feedback was why not just use HTML?
… Why not hand it a document fragment?
… If you're already rolling your own caption support, you have code that generates HTML
… This could be a cleaner way to hook into your existing code, and as a polyfill for older browsers

Eric: We decided that we could take what we have in the spec now, and make some minor modifications to get where we want to go
… In the current spec, TextTrackCue doesn't have a constructor.
… So we defined a constructor taking start and end time, and a document fragment (rather than text, per the VTTCue constructor)
… We move getCueAsHTML() from VTTCue down to the base class
… We want to make changes to HTML, CSS Pseudo-Elements, and WebVTT
… We don't think it makes sense to allow just anything in the document fragment, so we define some processing requirements for what's actually allowed
… For the UA to apply the styles from the user's settings, the author needs to be able to identify the parts of the document fragment that represent the text of the cue and the cue background, so you can style these differently
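
A sketch of how an author might use the constructor and attributes described above; this is the proposal rather than a shipping API, so the exact constructor signature and attribute handling should be treated as assumptions:

    // Sketch of the proposed API: build a fragment, mark which parts are the cue
    // background and the cue text, and hand it to the new TextTrackCue constructor.
    const video = document.querySelector('video')!;
    const frag = document.createDocumentFragment();

    const background = document.createElement('div');
    background.setAttribute('cuebackground', '');   // proposed attribute: the cue background

    const text = document.createElement('span');
    text.setAttribute('cue', '');                   // proposed attribute: the cue text
    text.textContent = 'Hello, world.';

    background.appendChild(text);
    frag.appendChild(background);

    // Proposed constructor: start time, end time, DocumentFragment (not VTT text).
    const cue = new TextTrackCue(10.0, 12.5, frag);
    video.addTextTrack('captions', 'English').addCue(cue);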

Joey: Shaka Player supports some smart TV platforms from before VTTCue, where TextTrackCue had a constructor
… How would we be able to detect the new vs the very old TextTrackCue?

Tess: You could check for getCueAsHTML on the prototype, yes
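
One possible check along those lines (a sketch; an application's actual detection logic may differ):

    // Very old TV platforms exposed a TextTrackCue constructor without getCueAsHTML();
    // the proposed interface has both, so the prototype distinguishes them.
    const supportsNewTextTrackCue =
      typeof TextTrackCue === 'function' &&
      'getCueAsHTML' in TextTrackCue.prototype;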

Eric: We want to move pseudo-elements defined in WebVTT CSS extensions to CSS Pseudo-elements
… In WebVTT we need to make some minor changes - moving from derived class to base class
… We're proposing a very limited subset of HTML to be allowed in the document fragment
… Needs discussion. In the WebKit prototype, we allow br, div, img, p, rb, rt, rtc, ruby, and span elements
… These are styled with ::cue and ::cue-region pseudo-elements
… We recognize the cue and cuebackground attributes
… It's inserted in the shadow DOM under the video element, so it's not accessible to script
… This is an extension to the web API we have now to try to make a more flexible arrangement for captions and subtitles
… We can do it with small updates to HTML, CSS, WebVTT

Nigel: This is really exciting. The end result looks good, interesting approach
… There seems to be a core of this, which is a semantic equivalence between concepts in HTML and things you want to customize via system settings
… Whenever the subject of "what is a subtitle" comes up, we get different answers. What is the element to which it's reasonable to apply system level styling?

Eric: The system settings let a user customize the display in terms of the text: color, font, outline, etc, and the background around the text
… This is based on requirements we have
… We looked at the requirements and settings and tried to come up with the simplest model we could
… It worked to define two different parts of the cue. Because it's based on tags in the document fragment, there's a lot of freedom for the author to define how they want the styles to be applied, or not be applied
… We also use styles defined in CSS. When a user defines their style in the system settings, they pick a font size. That's the size to use for a cue unless the cue has a defined size
… So it's an override for what's in the cue itself
… The model gives flexibility to allow user to define their needs, and to the author

Tess: Part of the goal is to make authoring as simple as we can, while making it possible to adapt the display to regulatory requirements, with flexibility to achieve a desired layout
… This seems to be the smallest change to HTML to achieve that
… You have the ability to have your own styles, complex interplay. We think this is the right sweet spot.

Nigel: The change to HTML is to add cue and cuebackground attributes. Do these tell the customization mechanism where the cue and background are?

Eric: Exactly
… We have a demo, but it doesn't work right now. We took some existing BBC videos where they currently render captions themselves
… By injecting scripting, I could override their JS-based rendering. I looked at the structure of the markup they use and modified it slightly by adding these attributes
… Their polyfill uses a VTTCue, but I could use the new API and made the cues look the same as they do now
… When rendered natively, when I change my local caption preferences, those are reflected in the way they're rendered, just by adding those attributes

Nigel: How does it identify if a background is set? Every element has a default background colour, even if it's the default. How do you detect that?

Eric: We just use CSS, the fragment is in the Shadow DOM, but it behaves as if it were anywhere else in the DOM

Nigel: How do you know if there's something you mustn't override?

Eric: We generate a UA stylesheet based on the user settings. If the user has specified a precedence, we make it the most important rule and let CSS handle it

Tess: The UA !important always wins, so this is how it works

<Zakim> nigel, you wanted to ask about the derivation of the semantic model equivalence in the HTML - what's a "subtitle" etc?

<Zakim> Matt_Wolenetz, you wanted to "ask about conversion costs: are they less with this approach"

Matt: I'd like to know more context around how document fragments would be easier to convert from legacy subtitle formats vs using VTT cues

Tess: We're looking at the existing ecosystem. People already have code that does this, generating HTML to add to the page
… It's a small change for existing libraries: instead of creating a VTTCue, generate the HTML and pass it to the TextTrackCue constructor
… There are caption formats that don't translate with full fidelity to VTT, so conversion is lossy. In those cases, conversion to HTML can be lossless

<nigel> +1

Tess: Even if conversion can be lossless, there are also storage costs for pre-converting these things. Either you convert on the fly, duplicating effort over time, or you double the storage needs
… The storage costs can be prohibitive for some services

Matt: If there were a full fidelity conversion using VTTCue, would the use case be solved similarly?

Eric: The issue is that existing polyfills work by converting from a native format to a document fragment

<atsushi> sorry, but I have a conflicting joint meeting (i18n+CSS) at the top of the next hour; let me follow the rest of the TTWG+Media joint meeting via the minutes.

Eric: Then they make a VTTCue that's only used for timing purposes
… When the cue events fire, they take the document fragment and insert it into the DOM and remove it from the DOM
… in order for such a polyfill to switch to VTT, they'd have to write new code to generate VTT instead of a document fragment. So this seemed a better impedance match
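
A sketch of the polyfill pattern Eric describes, where the VTTCue carries no payload and only drives timing while the page does its own rendering (the overlay element and cleanup strategy are assumptions):

    // Sketch only: VTTCue used purely for timing; rendering done by the page itself.
    const overlay = document.getElementById('caption-overlay')!;   // app-provided element
    function scheduleCaption(track: TextTrack, start: number, end: number, fragment: DocumentFragment) {
      const timer = new VTTCue(start, end, '');                    // empty text, timing only
      timer.onenter = () => overlay.replaceChildren(fragment.cloneNode(true));
      timer.onexit = () => overlay.replaceChildren();
      track.addCue(timer);
    }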

Matt: I wondered if this would help with the storage and conversion problem.

<Zakim> mounir, you wanted to hear Tess and Eric's thoughts about second screen use cases when using HTML as an input

Mounir: At TPAC two years ago, Mark Foltz and I met with people at the time. The use case is second screen.
… With a simple format we could pass with the video to a second screen.

<cyril> are the slides posted somewhere

Mounir: HTML makes it more difficult to do that. Have you considered this?

Eric: Would getCueAsHTML() not work?

Mounir: Having HTML as input means that in second screen scenarios (e.g., cast devices), the cast device doesn't know about HTML
… Previously it would be easier to handle the JSON. Requiring the second screen to render HTML could be an issue as some screens cannot do that

Eric: It's a good question. There'll be trade-offs with any solution. One (not great) option is to render locally and send over an image. You could also convert it to some intermediate representation.
… It seems to us that most of the uses won't require that simplification, it's a tradeoff that made sense in the big picture

Chris: How to continue the discussion?

Tess: We have an explainer, we can draft some spec language
… That will help with clarifications. We could take the document to WICG to start with, if it matures we could bring to Media WG

Tess: One thing I like about this is that it's refactoring existing spec text, not lots of new spec text
… I'm hoping the actual amount of spec changes needed is minimal

Nigel: There's also the Text Track CG.

DataCue API proposal

cpn: This is an update on the WICG activity that we have around the DataCue API. It is an API proposal to allow apps to create timed metadata cues to be triggered on the media timeline.
… It is consistent with the existing HTML TextTrack API.
… This is also a proposal to expose in-bands metadata tracks.
… We've been collecting use cases for this. The most straightforward is lecture recording with a slideshow where the timed metadata cues update the slides in sync with the video.
… Also video synchronized with a map, which Rob has been working on (WebVMT).
… Then there's client-side dynamic content insertion where you may want to trigger some video overlay at some point in time, and the timed cues tell you when.
… Playback metrics reporting is another use case to track how far playback has progressed.
… Video playback with overlays, such as in sport events.
… Also live linear programme events (BBC has "now" and "next" for instance)
… Historically, the DataCue API was in HTML, implemented in WebKit, and then dropped from HTML.
… More recently, some interest to surface in-band tracks in MSE.
… Strong interest from external communities such as CTA WAVE and DASH-IF to surface in-band tracks.
… and in particular emsg boxes.
… This was brought to the Media & Entertainment IG in 2018. Since then, we set up a WICG project to develop the incubation.
… For the in-band timed metadata cues, the primary feedback that we got is interest for emsg boxes. But there are other formats as well.
… One of the goals is to reduce scalability issues for distributors by allowing metadata to be distributed along with the audio/video, so they don't have to scale separate servers.
… Also makes it easier to integrate with usual streaming server pipeline.
… Additionally, apps may want to generate their own timed metadata cues.
… What we haven't discussed is having some kind of API that can be used to surface in-band caption formats such as CTA 608/708.
… There's some speculative text about how you might go about that, but that's not something that we have actually discussed in the incubation group.
… If anybody's interested in that aspect, we'd be interested to discuss.
… We're proposing this as a browser API because, as we talked to companies developing for embedded devices, they want to reduce the amount of parsing that they have to do in JS.
… Having to do some extra work to extract segments duplicates some of the work that the user agent is already doing. Can we rely on the user agent instead?
… And then there's the other argument around developer ergonomics. We heard this in the previous discussion with VTTCue. Since it has a constructor, you can use it to create your own timed metadata cues, using serialization/deserialization. But that's a workaround.
… Three kinds of cues:
… 1. instantaneous: start and end time are equal. There is a slight issue right now where cues of that type may be missed.
… 2. cue with known start time and end time. Typically how captions are working.
… 3. Also cues with a known start time but an unknown duration which may become known at a later point in time, or remain active indefinitely.
… Example of a video with a map track, or captions in a live stream where you don't necessarily know when it's going to end.
… In terms of the timing aspects, these are the 3 types we're looking at.
… The proposal is to re-introduce the DataCue API, based on the WebKit implementation with one minor modification.
… The actual data that is carried in the DataCue is in a value attribute, with a type field that tells you what type of value it is. This is useful in particular for the in-band case where the application will have to look at this.
… The data field that used to exist seems no longer needed. Discussion is open on whether we deprecate/remove it.
… Unrestricted end time to account for the type 3 cues mentioned above.
… We have a PR open for review against the HTML spec related to that. I'd be interested about feedback on how best to move that forward.
… Specific proposal, assuming that there is implementor support for it: standardize DataCue, probably in the Media WG once it's ready to transition out of incubation.
… I already mentioned the unbounded cue duration.
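
A sketch of what the proposal looks like from an application's point of view, assuming the WebKit-style constructor with the value/type arrangement described above (type strings, payloads and times are illustrative):

    // Sketch only: app-created timed metadata cues using the proposed DataCue.
    const video = document.querySelector('video')!;
    const track = video.addTextTrack('metadata', 'Slides');

    // Known start and end time (type 2 above).
    track.addCue(new DataCue(30.0, 90.0, { slideUrl: 'slide-02.png' }, 'org.example.slides'));

    // Known start, unknown duration (type 3 above): proposed unrestricted end time.
    track.addCue(new DataCue(90.0, Infinity, { slideUrl: 'slide-03.png' }, 'org.example.slides'));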

<RobSmith> https://github.com/whatwg/html/pull/5953

cpn: There was a previous change to HTML that we made, to recommend firing cue events within 20ms of their position on the media timeline. My understanding is that Chrome has been working on that; I don't know the exact current status.

<RobSmith> Unbounded TextTrackCue pull request

cpn: And then we have a proposal to define a standardized mapping to media in-band events, in particular emsg boxes in ISO BMFF / CMAF. We're working on this in collaboration with DASH-IF, who have a processing model around such events.
… The Sourcing In-band Media Resource Tracks from Media Containers into HTML document is referenced from HTML, but it is not being maintained and does not really have a natural home. In a way, that would be the logical place to document the standardized mapping, but it's not clear whether what it contains already is actually implemented.
… Get to something that could be referenced more normatively?
… I talked about short duration cues that could be missed.
… The solution to that is to use cue enter/exit event handlers. This is fine when the app creates the cues. It's much more difficult when the user agent generates the cues.
… And not clear how the application can be aware of new cues.
… Then there needs to be some work on what we call on-receive processing for emsg cues.
… If the app needs to load some resources linked to a cue, that suggests that there are two steps: a preparation phase and then a phase where the cue actually affects the page.
… Early exposure would be good.
… The group is looking at options for ways to handle that.
… One is an event when the user agent parses a cue. Consistent with existing media libraries such as Shaka player.
… Another would be to add an early dispatch hint that would make the cue fire a bit in advance.
… That's all I wanted to present.
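
For the app-created case, the enter/exit handling mentioned above might look like the following sketch; how an application learns about cues that the user agent itself creates is exactly the open question, so the addtrack/cuechange wiring here is only one assumed approach:

    // Sketch only: reacting to metadata cues as they become active.
    const video = document.querySelector('video')!;
    video.textTracks.addEventListener('addtrack', (event) => {
      const track = (event as TrackEvent).track as TextTrack;
      if (track.kind !== 'metadata') return;
      track.mode = 'hidden';                 // cue events only fire for hidden/showing tracks
      track.addEventListener('cuechange', () => {
        for (const cue of Array.from(track.activeCues ?? [])) {
          // handle the cue payload here; zero-duration cues can still be missed this way
        }
      });
    });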

<Zakim> Matt_Wolenetz, you wanted to "discuss integration with MSE specification of DataCue API. Separately, Chrome has web_test evaluating 20 ms cue firing expectation now IIUC"

Matt_Wolenetz: I wasn't at TPAC last year. Not fully up to speed on this. Interested with intersection with MSE.
… For the emsg extraction, what MSE implementations might need to do is populate cues from in-band messages; how would they determine the start and end times of emsg cues?
… [going into details of earliest presentation time and timeline]
… At what point do we know, interoperably, that we don't need to parse an emsg box because the time is already past?
… Should we support both versions of emsg boxes?
… I'm having trouble understanding how the emsg times relate to movie times.

cpn: In terms of the versioning, both 0 and 1 would be of interest, I believe.
… Timeline mapping varies between the two versions.

<Zakim> zacharycava, you wanted to comment on the timeline mapping piece

zacharycava: You're right, we're allowing messages before the timestamp gets defined. We should remove that.
… Some association between the message and the chunk seems doable.

cpn: I think it's fair to say that we've been looking more at the message payload, but we haven't looked so much into how the timing aspects would need to work.

Matt_Wolenetz: The specification for MSE gives a lot of flexibility to implementations about how and when they parse things and start.
… There might be interoperability issues depending on how and when we expose parsed cues
… Are we in agreement about a potential MSE integration with emsg timestamp offset?
… If you're shifting audio/video by 10s, should we shift emsg start time by 10s as well?
… My assumption would be to make it more consistent with MSE, but that might come as a surprise to applications with legacy content that wouldn't expect things to happen this way.
… Feature detection is another big question here.
… All of the various types of cues. How can I understand which types are supported? Is there a registry? Is it arbitrary?

cpn: I know that the DASH-IF is collecting info on the different types that are in use. Some of that is application-specific.
… We would not necessarily expect user agents to understand the message formats. The user agent could hand over the unparsed message and the application would do the parsing.
… Mounir was wary about adding another layer of parsing last year, which could trigger interoperability issues.
… The user agent wouldn't need to parse the message type, and could just pass the message over to the application.

Matt_Wolenetz: In the proposal, there is a sample where the message is filtered by the UA. That presumes that there is some specification for the UA to do that.

cpn: I think that part may need updating. Part of the thought process we went through.

JerNoble: Two cents about Matt's questions. Yes, you would want to time shift cues.
… Ad insertion use case would require you to match the media timestamps.
… The WebKit DataCue implementation does not support emsg boxes; its data primarily comes from HLS. Different characteristics and different payload.
… Just wanted to point out particularities of the WebKit implementation.

zacharycava: Second Jer's comment on shifting times.
… I also wanted to provide context. In Hulu, we do actually implement extensions on top of MSE/EME. Something like DataCue is what we explored previously just so that we could expose in-band data without requiring JS code to handle the parsing. Very small payload for memory on low-memory devices.
… I just want to say that there is a lot of practicality in this.
… Especially in constrained environments.

Matt_Wolenetz: I would like to know where I can discuss feature detection. What DataCue types might need support: HLS-produced ones? ID3? emsg boxes? We might need something like "canPlayType". I imagine that this would be needed.

JerNoble: I'm entertained by the idea of exposing some feature detection mechanism for DataCue.

cpn: OK, thanks for the feedback, Matt. We'll be looking into that in the incubation group. I'd like to invite you to participate there.

TextTrack support in MSE

cpn: Is there anything that you could share about TextTrack support in MSE?

Matt_Wolenetz: [audio chopped]. We removed experimental support for TextTrack because of issues.
… Many sites have already developed their own out-of-band solutions for presenting the tracks. This is not necessarily a justification for not including it.

cyril: I'm wondering how this would interact with the new TextTrackCue proposal?
… I could imagine MSE handing over an unparsed cue to the app, the app doing the parsing and creating the DocumentFragment, and then the UA would render it.

<nigel> +1

cyril: Is that acceptable?

JerNoble: I don't know what the benefit is about using MSE instead of fetching raw text and doing parsing on the side.

cyril: CMAF supports in-band text tracks; supporting this is beneficial. The new TextTrackCue proposal is also designed so the UA can avoid doing the parsing.

JerNoble: Is your expectation that the UA would parse the TTML?

cyril: I don't think that the UA would have to parse the TTML at all.

JerNoble: Ah, got it, you're talking about TTML in MP4. I see.

Will_Law: We want MSE to do all the component parsing. That's the architecture we'd like to see.

cpn: Thank you everybody for joining today. The DataCue incubation continues in WICG. I may reach out to people directly. Thanks for the updates on TextTrackCue, and thank you Nigel for the part on Audio Description!
… That's it for today. More meetings coming up next week, please check the schedule!

Minutes manually created (not a transcript), formatted by scribe.perl version 123 (Tue Sep 1 21:19:13 2020 UTC).