From W3C Wiki

This well-attended session discussed the idea of a new kind of Text Track Cue that would allow any fragment of HTML+CSS to be used to modify the display on a target element in synchronisation with a media element's timeline. The idea originated with Opera and has also been discussed in TTWG and WHATWG. The cue's payload data would be a fragment of HTML. Its default onenter() handler would attach that fragment to the target element (e.g. a video), for example to update the captions or subtitles, and its onexit() handler would remove or clear it by some means not yet specified. dash.js, for example, uses this approach to present subtitles now, but the only browser-supported cue type is VTTCue, which has to be overloaded to do this, and its extra VTT style attributes are ignored. A cleaner solution would be a supported generic cue type that does not include an initial list of VTT-specific styles.
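As a rough illustration of the enter/exit behaviour described above, the following TypeScript sketch models a cue with an HTML payload and default handlers that attach and clear it. Everything named here (HtmlCueSketch, CueTarget, updateCues) is hypothetical and is not an existing or proposed browser API. The payload is attached unsanitised purely to keep the sketch short, and the sketch assumes at most one cue is active at a time.

```typescript
// Hypothetical generic cue: timing plus an HTML payload, with no
// VTT-specific style attributes. NOT an existing browser API.
interface HtmlCueSketch {
  startTime: number; // seconds, inclusive
  endTime: number;   // seconds, exclusive
  html: string;      // HTML fragment to display while the cue is active
}

// Stand-in for the display surface, e.g. an element overlaid on the video.
interface CueTarget {
  innerHTML: string;
}

function isActive(cue: HtmlCueSketch, t: number): boolean {
  return cue.startTime <= t && t < cue.endTime;
}

// Default "enter" behaviour: attach the fragment (unsanitised in this sketch).
function onEnter(cue: HtmlCueSketch, target: CueTarget): void {
  target.innerHTML = cue.html;
}

// Default "exit" behaviour: clear whatever was attached.
function onExit(target: CueTarget): void {
  target.innerHTML = "";
}

// Run exit then enter handlers for a playback position change prev -> now.
// Exits run first so that a cue starting exactly where another ends is
// not cleared immediately after being attached.
function updateCues(
  cues: readonly HtmlCueSketch[],
  target: CueTarget,
  prev: number,
  now: number,
): void {
  for (const cue of cues) {
    if (isActive(cue, prev) && !isActive(cue, now)) onExit(target);
  }
  for (const cue of cues) {
    if (!isActive(cue, prev) && isActive(cue, now)) onEnter(cue, target);
  }
}
```

With a cue spanning 1 s to 3 s, moving the playback position from 0 s to 1.5 s would attach that cue's fragment, and moving past 3 s would clear it.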

Previous discussions have suggested that this approach could have unintended consequences and that the HTML would need to be sanitised. For example, it would need to be clear that the changes do not create new navigation contexts, and there may be risks associated with running scripts embedded in the HTML or with loading external resources such as images. Those risks were not enumerated.

Different views were expressed, not all in agreement:

  • this can be done already using the current text track interface and VTTCue, by setting the Text Track's kind attribute to "metadata" and handling onenter() with bespoke JavaScript that attaches the result of getCueAsHTML().
  • use of "metadata" tracks subverts the intent behind the label "metadata" when the purpose of the tracks is to display content.
  • a browser-native implementation that does not require JavaScript to attach the HTML payload to a target element would give browsers an opportunity to optimise display rendering, for example by pre-rendering, in order to achieve the required cue timing.
  • for a solution based on JavaScript and metadata tracks, the browser may need to expose the user's preferences for showing or not showing subtitles, so that non-native JavaScript implementations know what to do.
  • a native implementation could allow browsers to use default player controls: bespoke/custom subtitle handling is generally likely to go alongside custom player controls.
  • a generic solution would allow formats other than VTT to be used more easily. For example, it is easier to translate from TTML to HTML than from TTML to VTT. Conversely, it should also be easy to implement VTT by translation to HTML. This would move the effort for browser implementors away from handling VTT and allow more agility in changing file formats, since file parsing and processing into HTML can be done in JavaScript.
  • to handle the risks, a sandbox approach could be used, similar to what is available for iframe (or indeed by simply using an iframe and its sandbox directly).
  • browsers should not be expected to auto-load external resources such as images from HTML attached in this way, because that could create a privacy issue: for example, it would tell the image source server which point in the related video the user has reached.
  • but also in opposition to that: browsers should be expected to auto-load external resources attached in this way, since the origin of the source data can be considered 'trusted'.
  • we should not try to subset HTML into a 'safe' set but rather build up a required and safe set from a blank starting point; the former approach has been tried before and it caused problems.
  • the idea of HTMLCue is a good one from an architectural perspective.
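The sandbox suggestion and the concern about external resources could be combined roughly as follows. This TypeScript sketch, offered as one possible reading rather than an agreed design, wraps a cue's HTML payload in an iframe with an empty sandbox attribute (which, among other restrictions, disables scripts) and adds a Content-Security-Policy that blocks resource loads other than inline styles. The function names and the choice of policy are assumptions made for illustration.

```typescript
// Escape a string for use inside a double-quoted HTML attribute value.
function escapeAttr(value: string): string {
  return value.replace(/&/g, "&amp;").replace(/"/g, "&quot;");
}

// Hypothetical helper: build markup for a sandboxed iframe whose document
// is the cue payload. The policy below is an assumption, not a requirement
// recorded in the discussion.
function sandboxedCueMarkup(cueHtml: string): string {
  // Block all resource loading by default; permit inline CSS only.
  const csp = "default-src 'none'; style-src 'unsafe-inline'";
  const doc =
    `<!doctype html>` +
    `<meta http-equiv="Content-Security-Policy" content="${csp}">` +
    cueHtml;
  // A bare `sandbox` attribute applies every sandbox restriction.
  return `<iframe sandbox srcdoc="${escapeAttr(doc)}"></iframe>`;
}
```

A caller would still need to position and size the iframe over the video, and a sandboxed frame does not inherit the host page's styles, so this trades styling convenience for isolation.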

Two actions were noted:

  1. File a feature request to expose user preferences about subtitles to JavaScript
  2. Work on a prototype using the TextTrack kind="metadata" and highlight any limitations or issues discovered
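A starting point for such a prototype might look like the sketch below, which wires a cue's enter and exit handlers to an overlay element using getCueAsHTML(). CueLike, OverlayLike and wireCue are hypothetical names; the interfaces are minimal stand-ins intended to match the parts of VTTCue and Element that the sketch uses, so that it can be exercised without a browser.

```typescript
// Minimal stand-in for the parts of VTTCue used here (hypothetical name).
interface CueLike {
  onenter: ((...args: any[]) => any) | null;
  onexit: ((...args: any[]) => any) | null;
  getCueAsHTML(): unknown; // a DocumentFragment in a browser
}

// Minimal stand-in for the overlay element (hypothetical name).
interface OverlayLike {
  replaceChildren(...nodes: any[]): void;
}

// Attach the cue's rendered fragment on enter and clear the overlay on exit.
function wireCue(cue: CueLike, overlay: OverlayLike): void {
  cue.onenter = () => overlay.replaceChildren(cue.getCueAsHTML());
  cue.onexit = () => overlay.replaceChildren();
}

// Possible browser usage (not executed here; `video` and `overlay` are
// assumed to be existing elements):
//   const track = video.addTextTrack("metadata");
//   const cue = new VTTCue(1, 3, "Hello <b>world</b>");
//   wireCue(cue, overlay);
//   track.addCue(cue);
```

Because the track kind is "metadata", the browser does not render the cues itself; what is displayed, and whether the user's subtitle preferences are honoured, is left entirely to the page's script.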

Other next steps include summarising the HTMLCue proposal in a clear document.