Copyright © 2020 W3C® (MIT, ERCIM, Keio, Beihang). W3C liability, trademark and permissive document license rules apply.
This document collects use cases and requirements for improved support for timed events related to audio or video media on the web, where synchronization to a playing audio or video media stream is needed, and makes recommendations for new or changed web APIs to realize these requirements. The goal is to extend the existing support in HTML for text track cues to add support for dynamic content replacement cues and generic data cues that drive synchronized interactive media experiences, and improve the timing accuracy of rendering of web content intended to be synchronized with audio or video media playback.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at https://www.w3.org/TR/.
The Media & Entertainment Interest Group may update these use cases and requirements over time. Development of new web APIs based on the requirements described here, for example, DataCue, will proceed in the Web Platform Incubator Community Group (WICG), with the goal of eventual standardization within a W3C Working Group. Contributors to this document are encouraged to participate in the WICG. Where the requirements described here affect the HTML specification, contributors will follow up with WHATWG. The Interest Group will continue to track these developments and provide input and review feedback on how any proposed API meets these requirements.
This document was published by the Media & Entertainment Interest Group as an Interest Group Note.
GitHub Issues are preferred for discussion of this specification.
Publication as an Interest Group Note does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
The disclosure obligations of the Participants of this group are described in the charter.
This document is governed by the 1 March 2019 W3C Process Document.
There is a need in the media industry for an API to support arbitrary data associated with points in time or periods of time in a continuous media (audio or video) presentation. This data may include:
For the purpose of this document, we refer to these collectively as media timed events. These events can be used to carry information intended to be synchronized with the media stream, used to support use cases such as dynamic content replacement, ad insertion, presentation of supplemental content alongside the audio or video, or more generally, making changes to a web page, or executing application code triggered at specific points on the media timeline of an audio or video media stream.
Media timed events may be carried either in-band, meaning that they are delivered within the audio or video media container or multiplexed with the media stream, or out-of-band, meaning that they are delivered externally to the media container or media stream.
This document describes use cases and requirements that go beyond the existing support for timed text, using TextTrack and related APIs.
The following terms are used in this document:
The following terms are defined in [HTML]:
activeCues, currentTime, enter, exit, oncuechange, onenter, onexit, TextTrack, TextTrackCue, timeupdate, setTimeout(), setInterval(), requestAnimationFrame()

The following term is defined in [HR-TIME]:
The following term is defined in [WEBVTT]:
Media timed events carry information that is related to points in time or periods of time on the media timeline, which can be used to trigger retrieval and/or rendering of web resources synchronized with media playback. Such resources can be used to enhance user experience in the context of media that is being rendered. Some examples include display of social media feeds corresponding to a live video stream such as a sporting event, banner advertisements for sponsored content, and accessibility-related assets such as large print rendering of captions.
The following sections describe a few use cases in more detail.
A media content provider wants to allow insertion of content, such as personalised video, local news, or advertisements, into a video media stream that contains the main program content. To achieve this, media timed events can be used to describe the points on the media timeline, known as splice points, where switching playback to inserted content is possible.
The Society for Cable and Television Engineers (SCTE) specification "Digital Program Insertion Cueing for Cable" [SCTE35] defines a data cue format for describing such insertion points. Use of these cues in MPEG-DASH and HLS streams is described in [SCTE35], sections 12.1 and 12.2.
This use case typically requires frame accuracy, so that inserted content is played at the right time, and continuous playback is maintained.
A media content provider wants to provide visual information alongside an audio stream, such as an image of the artist and title of the current playing track, to give users live information about the content they are listening to.
HLS timed metadata [HLS-TIMED-METADATA] uses in-band ID3 metadata to carry the artist and title information, and image content. RadioVIS in DVB ([DVB-DASH], section 9.1.7) defines in-band event messages that contain image URLs and text messages to be displayed, with information about when the content should be displayed in relation to the media timeline.
The visual information should be rendered within a hundred milliseconds or so to maintain good synchronization with the audio content.
MPEG-DASH defines a number of control messages for media streaming clients (e.g., libraries such as dash.js). These messages are carried in-band in the media container files. Use cases include:
Reference: M&E IG call 1 Feb 2018: Minutes, [DASH-EVENTING].
A subtitle or caption author wants to ensure that subtitle changes are aligned as closely as possible to shot changes in the video. The BBC Subtitle Guidelines [BBC-SUBTITLE] describe authoring best practices. In particular, in section 6.1, authors are advised:
"[...] it is likely to be less tiring for the viewer if shot changes and subtitle changes occur at the same time. Many subtitles therefore start on the first frame of the shot and end on the last frame."
The NorDig technical specifications for DVB receivers for the Nordic and Irish markets [NORDIG], section 7.3.1, mandate that receivers support TTML in MPEG-2 Transport Streams. The presentation timing precision for subtitles is specified as being within 2 frames.
Another important use case is maintaining synchronization of subtitles during program content with fast dialog. The BBC Subtitle Guidelines, section 5.1 says:
"Impaired viewers make use of visual cues from the faces of television speakers. Therefore subtitle appearance should coincide with speech onset. [...] When two or more people are speaking, it is particularly important to keep in sync. Subtitles for new speakers must, as far as possible, come up as the new speaker starts to speak. Whether this is possible will depend on the action on screen and rate of speech."
A very fast word rate, for example, 240 words per minute, corresponds on average to one word every 250 milliseconds.
A user records footage with metadata, including geolocation, on a mobile video device, e.g., drone or dashcam, to share on the web alongside a map, e.g., OpenStreetMap.
[WEBVMT] is an open format for metadata cues, synchronized with a timed media file, that can be used to drive an online map rendered in a separate HTML element alongside the media element on the web page. The media playhead position controls presentation and animation of the map, e.g., pan and zoom, and allows annotations to be added and removed, e.g., markers, at specified times during media playback. Control can also be overridden by the user with the usual interactive features of the map at any time, e.g., zoom. The rendering of the map animation and annotations should usually occur within a hundred milliseconds or so to maintain good synchronization with the video. However, a shot change that instantly moves to a different location would require the map to be updated simultaneously, ideally with frame accuracy.
Concrete examples are provided by the tech demos at the WebVMT website.
A content provider wants to provide synchronized graphical elements that may be rendered next to or on top of a video.
For example, in a talk show this could be a banner, shown in the lower third of the video, that displays the name of the guest. In a sports event, the graphics could show the latest lap times or current score, or highlight the location of the current active player. It could even be a full-screen overlay, to blend from one part of the program to another.
The graphical elements are described in a stream or file containing media timed events for start and end time of each graphical element, similar to a subtitle stream or file. A graphic renderer takes this data as input and renders it on top of the video image according to the media timed events.
The purpose of rendering the graphical elements on the client device, rather than rendering them directly into the video image, is to allow the graphics to be optimized for the device's display parameters, such as aspect ratio and orientation. Another use case is adapting to user preferences, for localization or to improve accessibility.
This use case requires frame accurate synchronization of the content being rendered over the video.
Media content providers often cover live events where the timing of particular segments, although often pre-scheduled, can be subject to last minute change, or may not be known ahead of time.
The media content provider uses media timed events together with their video stream to add metadata to annotate the start and (where known) end times of each of these segments. This metadata drives a user interface that allows users to see information about the current playing and upcoming segments.
Examples of the dynamic nature of the timing include:
During a live media presentation, dynamic and unpredictable events may occur which cause temporary suspension of the media presentation. During that suspension interval, auxiliary content, such as the presentation of UI controls and media files, may be unavailable. Depending on the specific user engagement (or not) with the UI controls and the time at which any such engagement occurs, specific web resources may be rendered at defined times in a synchronized manner. For example, a multimedia A/V clip along with subtitles corresponding to an advertisement, and which were previously downloaded and cached by the UA, are played out.
This section describes gaps in existing web platform capabilities needed to support the use cases and requirements described in this document. Where applicable, it also describes how existing web platform features can be used as workarounds, and any associated limitations.
The DataCue API has been previously discussed as a means to deliver in-band media timed event data to web applications, but it is not implemented in all of the main browser engines. It is included in the 18 October 2018 HTML 5.3 draft [HTML53-20181018], but is not included in [HTML].
WebKit supports a DataCue interface that extends HTML5 DataCue with two attributes to support non-text metadata: type and value.
interface DataCue : TextTrackCue {
  attribute ArrayBuffer data; // Always empty

  // Proposed extensions.
  attribute any value;
  readonly attribute DOMString type;
};

type is a string identifying the type of metadata:
| WebKit DataCue metadata type | Description |
|---|---|
| "com.apple.quicktime.udta" | QuickTime User Data | 
| "com.apple.quicktime.mdta" | QuickTime Metadata | 
| "com.apple.itunes" | iTunes metadata | 
| "org.mp4ra" | MPEG-4 metadata | 
| "org.id3" | ID3 metadata | 
and value is an object with the metadata item key, data, and optionally a locale:

value = {
  key: String
  data: String | Number | Array | ArrayBuffer | Object
  locale: String
}

Neither [MSE-BYTE-STREAM-FORMAT-ISOBMFF] nor [INBANDTRACKS] describes the handling of emsg boxes.
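As an illustration, a web application running on WebKit could filter these cues by type and key. This is a sketch only: the interface is WebKit-specific, and the returned object shape and handler name are our assumptions. TIT2 and TPE1 are the standard ID3v2 frame IDs for title and lead artist.

```javascript
// Sketch: extract "now playing" information from WebKit DataCue
// objects on a metadata text track, assuming ID3 cues ("org.id3").
function extractNowPlaying(cue) {
  if (cue.type !== 'org.id3') return null; // only handle ID3 metadata
  const { key, data } = cue.value;
  if (key === 'TIT2') return { title: data };   // ID3v2 title frame
  if (key === 'TPE1') return { artist: data };  // ID3v2 lead artist frame
  return null; // other frames ignored in this sketch
}
```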
On resource constrained devices such as smart TVs and streaming sticks, parsing media segments to extract event information leads to a significant performance penalty, which can have an impact on UI rendering updates if this is done on the UI thread. There can also be an impact on the battery life of mobile devices. Given that the media segments will be parsed anyway by the user agent, parsing in JavaScript is an expensive overhead that could be avoided.
Avoiding parsing in JavaScript is also important for low latency video streaming applications, where minimizing the time taken to pass media content through to the media element's playback buffer is essential.
[HBBTV] section 9.3.2 describes a mapping between the emsg fields described above and the TextTrack and DataCue APIs. A TextTrack instance is created for each event stream signalled in the MPD document (as identified by the schemeIdUri and value), and the inBandMetadataTrackDispatchType TextTrack attribute contains the scheme_id_uri and value values. Because HbbTV devices include a native DASH client, parsing of the MPD document and creation of the TextTracks is done by the user agent, rather than by application JavaScript code.
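Under this mapping, an application could locate the TextTrack for a given event stream by inspecting inBandMetadataTrackDispatchType. The space separator used below to join scheme_id_uri and value is an assumption for illustration; consult [HBBTV] for the exact dispatch type format.

```javascript
// Sketch: find the metadata TextTrack created by the user agent for
// a given DASH event stream. The "schemeIdUri + ' ' + value" format
// is assumed here, not normative.
function findEventTrack(textTracks, schemeIdUri, value) {
  for (const track of textTracks) {
    if (track.kind === 'metadata' &&
        track.inBandMetadataTrackDispatchType === schemeIdUri + ' ' + value) {
      return track;
    }
  }
  return null; // no track for this event stream
}
```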
TextTrackCues with unbounded duration

It is not currently possible to create a TextTrackCue that extends from a given start time to the end of a live media stream. If the stream duration is known, the content author can set the cue's endTime equal to the media duration. However, for live media streams, where the duration is unbounded, it would be useful to allow content authors to specify that the TextTrackCue duration is also unbounded, e.g., by allowing the endTime to be set to Infinity. This would be consistent with the media element's duration property, which can be Infinity for unbounded streams.
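A small helper illustrates the situation: for on-demand content a finite duration can serve as the cue end time, while for live streams duration is already Infinity, which the change described here would allow as an endTime value directly. The helper name is ours, not part of any API.

```javascript
// Sketch: choose a cue end time meaning "active until end of stream".
// For unbounded live streams, HTMLMediaElement.duration is Infinity;
// the proposal would let a cue's endTime carry that value too.
function cueEndTime(mediaElement) {
  return Number.isFinite(mediaElement.duration)
      ? mediaElement.duration  // on-demand: known, finite duration
      : Infinity;              // live: unbounded
}
```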
In browsers, non-media rendering is handled through repaint operations at a rate that generally matches the display refresh rate (e.g., 60 times per second), following the user's wall clock. A web application can schedule actions and render web content at specific points on the user's wall clock, notably through Performance.now(), setTimeout(), setInterval(), and requestAnimationFrame().
In most cases, media rendering follows a different path, be it because it gets handled by a dedicated background process or by dedicated hardware circuitry. As a result, progress along the media timeline may follow a clock different from the user's wall clock. [HTML] recommends that the media clock approximate the user's wall clock but does not require it to match the user's wall clock.
To synchronize rendering of web content to a video with frame accuracy, a web application needs:
The following sub-sections discuss mechanisms currently available to web applications to track progress on the media timeline and render content at frame boundaries.
Cues (e.g., TextTrackCue and VTTCue) are units of time-sensitive data on a media timeline [HTML]. The time marches on steps in [HTML] control the firing of cue DOM events during media playback. Time marches on is specified to run "when the current playback position of a media element changes", but how often this should happen is unspecified. In practice, it has been found that the timing varies between browser implementations, in some cases with a delay of up to 250 milliseconds (which corresponds to the lowest rate at which timeupdate events are expected to be fired).
There are two methods a web application can use to handle cues:

- Add an oncuechange handler function to the TextTrack and inspect the track's activeCues list. Because activeCues contains the list of cues that are active at the time that time marches on is run, it is possible for cues to be missed by a web application using this method, where cues appear on the media timeline between successive executions of time marches on during media playback. This may occur if the cues have short duration, or if a long-running event handler function delays the processing of cue events.
- Add onenter and onexit handler functions to each cue. The time marches on steps guarantee that enter and exit events will be fired for all cues, including those that appear on the media timeline between successive executions of time marches on during media playback. The timing accuracy of these events varies between browser implementations, as the firing of the events is controlled by the rate of execution of time marches on.
          An issue with handling of text track and data cue events in HbbTV was reported in 2013. HbbTV requires the user agent to implement an MPEG-DASH client, and so applications must use the first of the above methods for cue handling, which means that applications can miss cues as described above. A similar issue has been filed against the HTML specification.
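The two methods above can be sketched as follows. The handler wiring is plain JavaScript; the track and cue objects would come from a media element's text tracks, and the function names are illustrative.

```javascript
// Method 1: track-level handler. Short cues that start and end
// between successive runs of "time marches on" may never appear in
// activeCues, so they can be missed entirely.
function watchTrack(track, onActiveCues) {
  track.oncuechange = () => {
    onActiveCues(Array.from(track.activeCues));
  };
}

// Method 2: per-cue handlers. enter/exit events are guaranteed to
// fire for every cue, though their timing accuracy still depends on
// how often "time marches on" runs.
function watchCue(cue, onEnter, onExit) {
  cue.onenter = () => onEnter(cue);
  cue.onexit = () => onExit(cue);
}
```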
timeupdate events from the media element

Another approach to synchronizing rendering of web content to media playback is to use the timeupdate DOM event, and for the web application to manage the media timed event data to be triggered, rather than use the text track cue APIs in [HTML]. This approach has the same synchronization limitations as described above, due to the 250 millisecond update rate specified in time marches on, and so is explicitly discouraged in [HTML]. In addition, the timing variability of timeupdate events between browser engines makes them unreliable for the purpose of synchronized rendering of web content.
Synchronization accuracy can be improved by polling the media element's currentTime property from a setInterval() callback, or by using requestAnimationFrame() for greater accuracy. This technique can be useful where content should be animated smoothly in synchronicity with the media, for example, rendering a playhead position marker in an audio waveform visualization, or displaying web content at specific points on the media timeline. However, the use of setInterval() or requestAnimationFrame() for media synchronized rendering is CPU intensive.
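A sketch of the polling technique, assuming a known frame rate. The frameIndex helper rounds to mitigate (but not eliminate) the double-precision limitation of currentTime discussed below; the loop itself only runs in a browser.

```javascript
// Approximate the frame index for a given currentTime. Adding half a
// frame before truncating reduces rounding errors at frame
// boundaries, but currentTime still cannot identify frames exactly.
function frameIndex(currentTime, frameRate) {
  return Math.floor(currentTime * frameRate + 0.5);
}

// Poll currentTime on each animation frame and re-render only when
// the (approximate) frame changes. CPU intensive, as noted above.
function startSyncLoop(video, frameRate, render) {
  let lastFrame = -1;
  function tick() {
    const frame = frameIndex(video.currentTime, frameRate);
    if (frame !== lastFrame) {
      lastFrame = frame;
      render(frame); // e.g., move a playhead marker
    }
    if (!video.paused && !video.ended) requestAnimationFrame(tick);
  }
  requestAnimationFrame(tick);
}
```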
[HTML] does not expose any precise mechanism to assess the time, from a user's wall clock perspective, at which a particular media frame is going to be rendered. A web application can only look at the media element's currentTime property to infer the frame being rendered and the time at which the user will see the next frame. This has several limitations:

- currentTime is represented as a double value, which does not allow individual frames to be identified, due to rounding errors. This is a known issue.
- currentTime is updated at a user-agent defined rate (typically the rate at which time marches on runs), and is kept stable while scripts are running. When a web application reads currentTime, it cannot tell when this property was last updated, and thus cannot reliably assess whether this property still represents the frame currently being rendered.
          This section describes recommendations from the Media & Entertainment Interest Group for the development of a generic media timed event API, and associated synchronization considerations.
The API should allow web applications to subscribe to receive specific types of media timed event cue. For example, to support MPEG-DASH emsg and MPD events, the cue type is identified by a combination of the scheme_id_uri and (optional) value. The purpose of this is to make receiving cues of each type opt-in from the application's point of view. The user agent should deliver only those cues to a web application for which the application has subscribed. The API should also allow web applications to unsubscribe from specific cue types.
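No such API exists yet; the following is a hypothetical sketch of the opt-in model, with illustrative names throughout. The user agent (or a polyfill) would call dispatch() only for cue types that have a subscription.

```javascript
// Hypothetical subscription registry for media timed event cues,
// keyed by scheme_id_uri and optional value. All names here are
// illustrative; no such interface is currently specified.
class CueSubscriptions {
  constructor() {
    this.handlers = new Map();
  }
  static key(schemeIdUri, value) {
    return schemeIdUri + '\u0000' + (value ?? '');
  }
  subscribe(schemeIdUri, value, handler) {
    this.handlers.set(CueSubscriptions.key(schemeIdUri, value), handler);
  }
  unsubscribe(schemeIdUri, value) {
    this.handlers.delete(CueSubscriptions.key(schemeIdUri, value));
  }
  // Deliver a cue only if its type has been subscribed to.
  dispatch(cue) {
    const handler =
        this.handlers.get(CueSubscriptions.key(cue.schemeIdUri, cue.value));
    if (handler) handler(cue);
  }
}
```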
To be able to handle out-of-band media timed event cues, including MPEG-DASH MPD events, the API should allow web applications to create and add timed data cues to the media timeline, to be triggered by the user agent. The API should allow the web application to provide all necessary parameters to define the cue, including start and end times, cue type identifier, and data payload. The payload should be any data type (e.g., the set of types supported by the WebKit DataCue).
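Pending such an API, one workaround in current browsers, assuming VTTCue support, is to carry the event payload as JSON text in a hidden metadata track cue. The function and callback names below are ours, for illustration only.

```javascript
// Workaround sketch: schedule an application-parsed MPD event as a
// hidden metadata cue. Assumes VTTCue is available; handleEvent is
// an assumed application callback.
function addMpdEventCue(video, mpdEvent, handleEvent) {
  let track = video.mpdEventTrack;
  if (!track) {
    track = video.addTextTrack('metadata', 'MPD events');
    track.mode = 'hidden'; // fire cue events without rendering
    video.mpdEventTrack = track;
  }
  const cue = new VTTCue(mpdEvent.startTime, mpdEvent.endTime,
                         JSON.stringify(mpdEvent.data));
  cue.onenter = () => handleEvent(JSON.parse(cue.text));
  track.addCue(cue);
  return cue;
}
```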
For those events that the application has subscribed to receive, the API should:
The API should provide guarantees that no media timed event cues can be missed during linear playback of the media.
We recommend updating [INBANDTRACKS] to describe handling of in-band media timed events supported on the web platform, possibly following a registry approach with one specification per media format that describes the details of how media timed events are carried in that format.
We recommend that browser engines support MPEG-DASH emsg in-band events and MPD out-of-band events, as part of their support for the MPEG Common Media Application Format (CMAF) [MPEGCMAF].
To support cues with unknown end time, where the cue is active from its start time to the end of the media stream, we recommend that the TextTrackCue interface be modified to allow the cue duration to be unbounded.
We recommend that the API allows media timed event information to be updated, such as an event's position on the media timeline, and its data payload. Where the media timed event is updated by the user agent, such as for in-band events, we recommend that the API allows the web application to be notified of any changes.
In order to achieve greater synchronization accuracy between media playback and web content rendered by an application, the time marches on steps in [HTML] should be modified to allow delivery of cue onenter and onexit DOM events within 20 milliseconds of their positions on the media timeline.
Additionally, to allow such synchronization to happen at frame boundaries, we recommend introducing a mechanism that would allow a web application to accurately predict, using the user's wall clock, when the next frame will be rendered (e.g., as done in the Web Audio API).
Thanks to François Daoust, Charles Lo, Nigel Megitt, Jon Piesing, Rob Smith, Peter tho Pesch, and Mark Vickers for their contributions and feedback on this document.