Following its 105th meeting (see Press Release), MPEG has concluded its study of the carriage of Timed Text in the ISO Base Media File Format (MP4). The study resulted in draft standards for the carriage of WebVTT and TTML content that have reached Final Draft stage (FDAM 2 for 14496-12/15444-12 and FDIS for 14496-30). They are considered complete and have been submitted to National Bodies for final vote. This post gives an overview of these draft documents.
MP4 basics and timed-text related specifications
An MP4 file is logically made of tracks. An MP4 track is a logical structure organized into samples and sample descriptions. A sample carries information that is valid from a given time and for a given duration. Samples are continuous (no gap in time between samples) and non-overlapping (the end of a sample is the start of the next sample). This layout has useful properties; in particular, it allows random access into the track. A sample description carries information that is valid for the duration of several samples, typically for the whole track.
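To make the random access property concrete, here is a minimal sketch of a track as a list of samples. The names `Sample`, `sample_start_times` and `find_sample` are ours, chosen for illustration, and are not part of the file format. Because samples are continuous and non-overlapping, each start time is just the sum of the preceding durations, and a binary search finds the sample valid at any time.

```python
from bisect import bisect_right
from dataclasses import dataclass
from itertools import accumulate


@dataclass
class Sample:
    duration: int  # in track timescale units (illustrative)
    data: bytes


def sample_start_times(samples):
    """Start time of each sample: the running sum of the preceding durations."""
    ends = list(accumulate(s.duration for s in samples))
    return [0] + ends[:-1]


def find_sample(samples, t):
    """Index of the sample valid at time t, or None if t is outside the track."""
    if not samples:
        return None
    starts = sample_start_times(samples)
    track_end = starts[-1] + samples[-1].duration
    if t < 0 or t >= track_end:
        return None
    # No gaps and no overlaps, so exactly one sample contains t.
    return bisect_right(starts, t) - 1


track = [Sample(3, b"a"), Sample(2, b"b"), Sample(3, b"c")]
print(find_sample(track, 4))  # 1: the second sample covers [3, 5)
```

A real reader would take the durations from the track's timing tables instead of recomputing the running sum on every lookup.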
The amendment to Part 12 covers the basic syntax and semantics for a set of new text track types for a broad range of timed text formats. In particular, two track types have been defined: the ‘text’ type for track content that results in text rendering only; and the ‘subt’ type for track content that may result in text and graphics rendering.
Part 30 provides specific guidance for two popular timed text formats defined by W3C, Timed Text Markup Language (TTML) and Web Video Text Tracks (WebVTT), enabling use of those formats in contexts such as MPEG-DASH or HTML5 Media Source Extensions.
Carriage of WebVTT
In a nutshell, WebVTT content is carried in MP4 files using tracks of type ‘text’. The WebVTT header and metadata are logically carried in the sample description, while WebVTT cues are carried in samples.
To enable carriage of overlapping WebVTT cues in MP4 tracks, WebVTT cues are split into non-overlapping cues and gathered into samples, as explained below. MP4 parsers will typically do the reverse operation so that the carriage in MP4 is transparent to the application. More generally, the carriage has been designed so that WebVTT content is identical after a round trip through MP4 (import, then export), including comments and text content that is not valid according to the syntax but can be processed by a conformant WebVTT parser.
As an example, if you take the cues as depicted in the figure above, the cues will be split and organized into samples as depicted in the figure below. Cue 1 is split into 2 cues (1a and 1b, the boundary being at time 3). Cue 1a becomes Sample 1. Cue 2 is split into 3 cues (2a, 2b and 2c, the boundaries being at times 5 and 8). Cue 1b and Cue 2a form Sample 2. Cue 2b forms Sample 3. And so on.
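The splitting step can be sketched in a few lines of Python. This is an illustration of the idea, not the algorithm text from the specification. The cue times are assumptions chosen to match the boundaries described above (Cue 1 from 0 to 5, Cue 2 from 3 to 10), and Cue 3 (8 to 12) is hypothetical. Every cue start or end is a sample boundary, and each sample holds the fragments of all cues active between two consecutive boundaries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cue:
    ident: str
    start: int
    end: int
    text: str


def split_into_samples(cues):
    """Return (start, end, fragments) tuples with no gaps and no overlaps.

    Fragments are (ident, text) pairs: timing lives only at the sample level.
    An interval with no active cue yields an empty fragment list.
    """
    boundaries = sorted({c.start for c in cues} | {c.end for c in cues})
    samples = []
    for start, end in zip(boundaries, boundaries[1:]):
        active = [(c.ident, c.text) for c in cues if c.start < end and c.end > start]
        samples.append((start, end, active))
    return samples


def merge_samples(samples):
    """Reverse operation: recover each cue's (start, end) from its fragments."""
    spans = {}
    for start, end, fragments in samples:
        for ident, _text in fragments:
            first, _ = spans.get(ident, (start, end))
            spans[ident] = (first, end)
    return spans


# Assumed timings; only the boundaries at 3, 5 and 8 come from the example.
cues = [Cue("1", 0, 5, "one"), Cue("2", 3, 10, "two"), Cue("3", 8, 12, "three")]
samples = split_into_samples(cues)
for start, end, fragments in samples:
    print(start, end, [ident for ident, _ in fragments])
```

With these timings, the first three samples are Cue 1 alone on [0, 3), Cues 1 and 2 on [3, 5), and Cue 2 alone on [5, 8), which matches Samples 1 to 3 in the example. The merge step here identifies fragments by cue id for simplicity; cues without an id would need another way to be matched.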
Additionally, special care has been taken to avoid duplicating cue timing information. In particular, cue start and end times are stored at the sample level, not at the cue level, so the track can be edited with the same tools as a video track, even when cues contain internal timestamps.
In terms of physical bytes, WebVTT data (header, cues, …) is wrapped into ISO structures called boxes. There are boxes for the header and metadata, cue ids, cue settings, cue payloads, and text found between cues (such as comments).
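For readers unfamiliar with boxes, the following sketch shows the basic layout: a 32-bit big-endian size (which counts the 8-byte header), a four-character type, then the payload, which may itself contain boxes. The four-character codes used here (`vttc` for a cue, `iden`, `sttg` and `payl` for its id, settings and payload) are the ones we understand Part 30 to define; treat them as assumptions and check them against the final text. The helper names `box`, `cue_box` and `parse_boxes` are ours.

```python
import struct


def box(fourcc, payload):
    """Serialize one box: 32-bit size (header included), 4-char type, payload."""
    code = fourcc.encode("ascii")
    if len(code) != 4:
        raise ValueError("box type must be four ASCII characters")
    return struct.pack(">I4s", 8 + len(payload), code) + payload


def cue_box(payload_text, ident=None, settings=None):
    """Wrap one cue; the id and settings boxes are only written when present."""
    children = b""
    if ident is not None:
        children += box("iden", ident.encode("utf-8"))
    if settings is not None:
        children += box("sttg", settings.encode("utf-8"))
    children += box("payl", payload_text.encode("utf-8"))
    return box("vttc", children)


def parse_boxes(data):
    """Yield (type, payload) for each box found at the top level of data."""
    offset = 0
    while offset + 8 <= len(data):
        size, code = struct.unpack_from(">I4s", data, offset)
        # Sizes 0 and 1 (open-ended and 64-bit boxes) are not handled here.
        if size < 8 or offset + size > len(data):
            raise ValueError("malformed box")
        yield code.decode("ascii"), data[offset + 8:offset + size]
        offset += size


sample = cue_box("Hello", ident="1", settings="align:start")
(kind, body), = parse_boxes(sample)
print(kind, list(parse_boxes(body)))
```

A sample made of several simultaneous cues would simply be several such cue boxes written one after another.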
Tools are being developed to import/export WebVTT content into ISO files, such as MP4Box of the GPAC project.
Carriage of TTML
In a nutshell, TTML content is carried in MP4 files using tracks of type ‘subt’. A TTML sample carries an entire XML document, and may also contain or reference additional resources such as images or fonts, potentially shared between documents/samples.
Regarding timing, each TTML document contains the elements to be presented during the validity period of its sample, but the time values used in the document are relative to the start of the track. In other words, 00:00:00 in the XML does not mean the start of the sample in which the document is carried. So in the example below, the document describes rendering between times 00:31:00 and 00:32:00, but the sample may start before 00:31:00 and may end after 00:32:00. Authors should be aware that elements associated with times outside the sample's validity period will not be rendered.
<tt>
  <body>
    <div>
      <p begin="00:31:00" end="00:32:00">31-32 minutes</p>
    </div>
  </body>
</tt>
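As a rough way to check a document against its sample, the sketch below intersects each timed element with the sample's validity interval, both expressed on the track timeline in seconds. It is a simplification under several assumptions: only `hh:mm:ss` clock values are handled, namespaces are omitted as in the example above, nested timing is ignored, and clipping an element to the sample interval is our reading of "not rendered outside the sample validity". The function names are ours.

```python
import xml.etree.ElementTree as ET


def to_seconds(clock):
    """Convert an hh:mm:ss clock value to seconds (other TTML forms unsupported)."""
    hours, minutes, seconds = clock.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def rendered_intervals(ttml, sample_start, sample_end):
    """Return (text, start, end) for the part of each element inside the sample."""
    result = []
    for element in ET.fromstring(ttml).iter():
        begin, end = element.get("begin"), element.get("end")
        if begin is None or end is None:
            continue
        # Document times are relative to the track start, like the sample times.
        lo = max(to_seconds(begin), sample_start)
        hi = min(to_seconds(end), sample_end)
        if lo < hi:
            result.append((element.text, lo, hi))
    return result


doc = (
    '<tt><body><div>'
    '<p begin="00:31:00" end="00:32:00">31-32 minutes</p>'
    '</div></body></tt>'
)
# A sample valid from 00:30:00 to 00:33:00 fully contains the paragraph.
print(rendered_intervals(doc, 1800, 1980))
# A sample ending at 00:30:00 renders nothing from this document.
print(rendered_intervals(doc, 0, 1800))
```

An authoring tool could run a check like this to warn when an element falls partly or wholly outside the sample that carries it.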