User:Geguchi/Ad Insertion

This page is used for internal collaboration only and is not ready for general media TF review.

Editing is currently in progress.

Use Cases

This section outlines ad insertion use cases that are not adequately addressed by the current MSE APIs.

High Level Use Cases

1) Seamless Content Insertion

2) Seamless Content Replacement

For the following use cases, assume that:

1) The user wishes to play content A (or a portion of A), then perform a seamless switch to content B (or a portion of B). A seamless switch is defined as continuous playback with no visible or audible pause during the transition from A to B, whether due to late buffering, late codec initialization, or other causes.

2) The user agent is given enough advance notice that a seamless switch is theoretically possible.

3) A and B may have been encoded with different encoders or with different settings that cannot be synchronized. This may occur for reasons such as:

a) A and B originate from different sources.

Example: A is primary content that originates from an MVPD. B is ad content that originates from one of many vendors outside the MVPD. The encoder models and settings used to encode A and B may differ, since it is impractical for the MVPD to synchronize encoder settings and models across all ad vendors.

b) A and B were encoded at different times.

Example: An MVPD encodes A using encoder model M1. At a later time, the MVPD decides to begin using encoder model M2 across its organization. The MVPD encodes content B using M2. Details such as track IDs may differ between M1 and M2. Due to the size of the MVPD's content library, it is impractical to re-encode all legacy content.

Use case 1) Seamless switch with different number of tracks

Priority: HIGH

A is multiplexed content with x tracks of type T. B is multiplexed content with y tracks also of type T. x != y

Example: A is main content with English/Spanish audio, B is an ad with only English. Ad content often contains a smaller number of language tracks than main content.

Use case 2) Seamless switch with different track IDs

Priority: HIGH

A is multiplexed content with x tracks of type T. B is also multiplexed content with x tracks of type T. The track IDs of A differ from those of B. x > 1

Example: A is main content with English/Spanish audio with track IDs 1 and 2. B is an ad with English/Spanish audio with track IDs 1 and 3. Since ad content is typically encoded independently from primary content and track IDs are not standardized, there is no guarantee they will use consistent schemes for designating track IDs.

Use case 3) Seamless switch with different codecs

Priority: HIGH

A uses codec C1, B uses codec C2.

Examples:

A is main content with H264 video and Dolby audio. B is an ad with H264 video and AAC audio. Dolby support is common among set-top boxes, and primary content is often encoded with Dolby audio. Dolby audio is not as common among ad content.
A is main content with HEVC video and AAC audio. B is an ad with H264 video and AAC audio. As HEVC gains adoption, ad and primary content will often be a mixture of HEVC and H264.

Switching across profiles and levels does not typically cause issues, since "higher" profiles/levels are generally supersets of lower profiles/levels.

Use case 4) Seamless switch between multiplexed and demultiplexed content

Priority: LOW

A consists of demultiplexed tracks. B consists of multiplexed tracks.

Use case 5) Seamless switch between different byte stream formats

Priority: LOW

A is of a different byte stream format from B.

Example: A is formatted as mp4. B is formatted as m2ts. This is common when dealing with legacy VoD content.

Specification Gaps

This section outlines gaps in the current specification in addressing the above use cases.

The use cases encounter difficulty due to step #3 of the initialization segment received algorithm (Section 3.5.8). Step #3 reads:

3. If the first initialization segment received flag is true, then run the following steps:

1. Verify the following properties. If any of the checks fail then run the end of stream algorithm with the error parameter set to "decode" and abort these steps.

* The number of audio, video, and text tracks match what was in the first initialization segment.

* The codecs for each track, match what was specified in the first initialization segment.

* If more than one track for a single type are present (ie 2 audio tracks), then the Track IDs match the ones in the first initialization segment.
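The three checks quoted above can be sketched as a small function. This is only an illustrative sketch, not the specification's algorithm: the `TrackInfo` shape and the `findMismatches` name are hypothetical, codecs are compared as plain strings, and tracks are paired by ID when a type has more than one track. Where the real algorithm runs the end of stream algorithm with a "decode" error, this sketch only reports the reasons.

```typescript
type TrackKind = "audio" | "video" | "text";

// Hypothetical summary of one track described by an initialization segment.
interface TrackInfo {
  id: number;
  kind: TrackKind;
  codec: string;
}

// Returns the reasons a later initialization segment would fail the checks
// against the first one. An empty array means the checks pass.
function findMismatches(first: TrackInfo[], next: TrackInfo[]): string[] {
  const problems: string[] = [];
  const kinds: TrackKind[] = ["audio", "video", "text"];
  for (const kind of kinds) {
    const a = first.filter((t) => t.kind === kind);
    const b = next.filter((t) => t.kind === kind);
    // Check 1: the number of tracks of each type must match.
    if (a.length !== b.length) {
      problems.push(`${kind}: track count ${a.length} != ${b.length}`);
      continue;
    }
    if (a.length > 1) {
      // Check 3: with several tracks of one type, the track IDs must match.
      for (const t of a) {
        const match = b.find((u) => u.id === t.id);
        if (!match) {
          problems.push(`${kind}: track ID ${t.id} missing from new segment`);
        } else if (match.codec !== t.codec) {
          // Check 2: codecs must match (paired by ID here, an assumption).
          problems.push(`${kind}: codec ${t.codec} != ${match.codec}`);
        }
      }
    } else {
      // Check 2 for a single track of this type: IDs are not compared.
      const [ta] = a;
      const [tb] = b;
      if (ta && tb && ta.codec !== tb.codec) {
        problems.push(`${kind}: codec ${ta.codec} != ${tb.codec}`);
      }
    }
  }
  return problems;
}

// Use case 2: same number of audio tracks, different track IDs.
const contentA: TrackInfo[] = [
  { id: 1, kind: "audio", codec: "AAC" },
  { id: 2, kind: "audio", codec: "AAC" },
];
const contentB: TrackInfo[] = [
  { id: 1, kind: "audio", codec: "AAC" },
  { id: 3, kind: "audio", codec: "AAC" },
];
console.log(findMismatches(contentA, contentB));
// → ["audio: track ID 2 missing from new segment"]
```

Under these assumptions, use case 1 trips the track count check, use case 2 trips the track ID check, and use case 3 trips the codec check.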

A possible workaround (Workaround A) might be to utilize multiple MediaSource objects, but this approach has multiple issues:

1) Initiating the switch using TextTrackCue is not guaranteed to be frame accurate. Testing suggests that current implementations are not frame accurate.
2) The user agent is unaware of the switch until the time of the switch and therefore cannot perform buffering or decoder initialization ahead of time.
3) User Agent implementations may not support multiple MediaSource objects. Section 2.2 specifies this as a "quality of implementation issue."

A second workaround (Workaround B) might be to initialize SourceBuffers for all possible audio and video codecs. This approach has multiple issues:

1) Issues #1 and #2 of Workaround A still apply.
2) User Agent implementations may not support enough SourceBuffers. Section 2.2 specifies this as a "quality of implementation issue."
3) The number of required SourceBuffers may be unmanageably large, especially when considering use cases that involve a combination of multiplexed/demultiplexed content and multiple possible codecs.
4) The complexity required for a user to implement this solution is high.
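To give a rough sense of how quickly the SourceBuffer count in Workaround B can grow, the sketch below enumerates the distinct stream configurations for a small catalog. It rests on a simplifying assumption that is not stated in this page: one SourceBuffer is pre-created per distinct combination of byte stream format, multiplexing mode, and codec set. The `Catalog` type and `enumerateConfigs` function are hypothetical, and the labels are the informal names used in the examples above, not real MIME type strings.

```typescript
// Hypothetical description of the content an application expects to play.
interface Catalog {
  formats: string[];
  videoCodecs: string[];
  audioCodecs: string[];
}

// Lists one label per configuration that would need its own SourceBuffer
// under the one-buffer-per-combination assumption.
function enumerateConfigs(c: Catalog): string[] {
  const configs: string[] = [];
  for (const f of c.formats) {
    // Demultiplexed content: one buffer per codec and track type.
    for (const v of c.videoCodecs) configs.push(`${f} video-only ${v}`);
    for (const a of c.audioCodecs) configs.push(`${f} audio-only ${a}`);
    // Multiplexed content: one buffer per video/audio codec pair.
    for (const v of c.videoCodecs) {
      for (const a of c.audioCodecs) configs.push(`${f} muxed ${v}+${a}`);
    }
  }
  return configs;
}

const configs = enumerateConfigs({
  formats: ["mp4", "m2ts"],
  videoCodecs: ["H264", "HEVC"],
  audioCodecs: ["AAC", "Dolby"],
});
console.log(configs.length); // → 16
```

The count here is formats × (video codecs + audio codecs + video codecs × audio codecs). The sketch ignores differing track counts and track IDs, each of which would multiply the total further.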

A third workaround (Workaround C) might be to add a JavaScript container format parser to Workaround B. This approach has multiple issues:

1) The complexity required for a user to implement this solution is very high. The JavaScript code must parse the container format, demultiplex tracks, and remap track IDs in order to work around model limitations.
2) For the multi-codec use case, issues #1-4 of Workaround B still apply. This approach reduces the number of necessary SourceBuffers, but may still require more SourceBuffers than the user agent supports. Switches are not guaranteed to be frame accurate.
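The track ID remapping step in Workaround C could begin with a lookup table like the one sketched below. This covers only the table construction, not container parsing or rewriting. Matching tracks by language is an assumption made for illustration (this page does not say how tracks would be paired), and the `AudioTrack` type and `buildTrackIdMap` function are hypothetical.

```typescript
// Hypothetical summary of an audio track found by a container parser.
interface AudioTrack {
  id: number;
  language: string;
}

// Maps each ad track ID to the main-content track ID with the same language.
// Ad tracks whose language is absent from the main content are left unmapped.
function buildTrackIdMap(main: AudioTrack[], ad: AudioTrack[]): Map<number, number> {
  const idByLanguage = new Map<string, number>();
  for (const t of main) idByLanguage.set(t.language, t.id);

  const remap = new Map<number, number>();
  for (const t of ad) {
    const target = idByLanguage.get(t.language);
    if (target !== undefined) remap.set(t.id, target);
  }
  return remap;
}

// Use case 2: main content uses track IDs 1 and 2, the ad uses 1 and 3.
const remap = buildTrackIdMap(
  [{ id: 1, language: "en" }, { id: 2, language: "es" }],
  [{ id: 1, language: "en" }, { id: 3, language: "es" }],
);
console.log(remap.get(3)); // → 2
```

An application would still need to rewrite the IDs inside every initialization and media segment of B, which is where most of the complexity noted above comes from.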