Bug 13357 - <video>: Additional AudioTrack.kind categories are needed to identify tracks where audio descriptions are premixed with main dialogue.
Summary: <video>: Additional AudioTrack.kind categories are needed to identify tracks ...
Status: RESOLVED FIXED
Alias: None
Product: HTML WG
Classification: Unclassified
Component: LC1 HTML5 spec
Version: unspecified
Hardware: PC Windows XP
Importance: P2 enhancement
Target Milestone: ---
Assignee: Silvia Pfeiffer
QA Contact: HTML WG Bugzilla archive list
URL:
Whiteboard:
Keywords: a11y, a11ytf, media
Depends on: 12544
Blocks:
Reported: 2011-07-25 21:14 UTC by Bob Lund
Modified: 2012-10-14 08:39 UTC
CC: 14 users

Description Bob Lund 2011-07-25 21:14:23 UTC
Problem statement:

Audio description tracks can come in two forms. In one form, the audio description is a separate track from the video main dialogue. In another form, used by satellite and cable television, audio descriptions are premixed with the main dialogue track. U.S. Cable (and Canada) follow [1] where section 6.4 Visually Impaired (VI) specifies The VI associated service is a complete program mix containing music, effects, dialogue, and additionally a narrative description of the picture content. The current HTML5 specification Editors Draft defines one kind of audio description track, description, which identifies an audio track of the first form, i.e. a separate audio description track. Content authored for satellite and cable television (audio descriptions pre-mixed with the main dialogue) that is delivered to an HTML5 browser also needs to be identified by AudioTrack.kind categories. 

Specification sections affected:

HTML5 A vocabulary and associated APIs for HTML and XHTML section AudioTrackList and VideoTrackList objects.

One suggested solution:

Define two new AudioTrack.kind categories in the section "AudioTrackList and VideoTrackList objects": "main+description", the main dialogue track with embedded audio descriptions, and "translation+description", a translated version of the main track with embedded audio descriptions.

[1]"ATSC A/53: Part-5 standard for audio coding & delivery" http://www.atsc.org/cms/standards/a53/a_53-Part-5-2010.pdf
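As a sketch of how a page might use the proposed categories, the following assumes the "main+description" and "translation+description" kind values suggested above; `audioTracks` is any array-like of objects exposing a `kind` property (such as `HTMLMediaElement.audioTracks` in a supporting browser):

```javascript
// Hypothetical sketch: pick a described audio track, preferring a
// separate description track, then the premixed forms proposed here.
function pickDescribedTrack(audioTracks) {
  const preferred = ["description", "main+description", "translation+description"];
  for (const kind of preferred) {
    for (const track of audioTracks) {
      if (track.kind === kind) return track;
    }
  }
  return null; // no described audio available
}
```

A user agent's own track-selection UI could apply the same preference order.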
Comment 1 Michael[tm] Smith 2011-08-04 05:12:16 UTC
mass-move component to LC1
Comment 2 Ian 'Hixie' Hickson 2011-08-19 19:47:06 UTC
How is this kind distinguished in existing video formats?

That is, what types would this map to in H.264, WebM, or Ogg files?
Comment 3 Silvia Pfeiffer 2011-08-20 03:36:20 UTC
(In reply to comment #2)
> How is this kind distinguished in existing video formats?
> 
> That is, what types would this map to in H.264, WebM, or Ogg files?

In Ogg such a track would just be under the role "main", because such mixed content is not the best way to publish content - it's better to keep the components separate. Once mixed, it's basically impossible to get back to the individual parts and, e.g., remove the audio description again. As such, it becomes the new "main". It has also not been a common case to deal with yet.

This doesn't quite match the user experience, though: where such content is delivered, it has to be made clear to the user that the main content carries this extra information. In the past I have even seen it mentioned in the title displayed above the video and explicitly called out in the description, to tell the user what to expect from the content.

In a situation where a program has to pick an adequate piece of content, e.g. content that actually has an audio description, just using kind=main is indeed insufficient. I'm prepared to suggest to Ogg to add the use of "+" to the roles attribute to enable use cases such as role="audio/main+audio/audiodesc" or role="video/main+video/sign".
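The "+" role syntax proposed here for Ogg could be consumed as follows; this is a hypothetical sketch of a parser for combined roles such as role="audio/main+audio/audiodesc" (the syntax was only a suggestion to Ogg at this point, not adopted):

```javascript
// Hypothetical sketch: split a combined Ogg role attribute into its
// component media-type/role pairs, e.g.
// "audio/main+audio/audiodesc" -> [{type,"audio",name:"main"}, ...]
function parseCombinedRole(role) {
  return role.split("+").map(part => {
    const [type, name] = part.split("/");
    return { type, name };
  });
}
```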
Comment 4 Bob Lund 2011-08-25 20:21:03 UTC
(In reply to comment #3)
> (In reply to comment #2)
> > How is this kind distinguished in existing video formats?
> > 
> > That is, what types would this map to in H.264, WebM, or Ogg files?
> 
> In Ogg such a track would just be under the role "main" because such mixed
> content is not the best way to publish content - it's better to keep such
> content separate. Once mixed, it's basically impossible to get back to the
> individual parts and e.g. remove the audio description again. As such it is the
> new "main". It has also not been a common case to deal with yet.
> 
> This doesn't quite match with the user experience though and where such content
> is delivered, it has to be made clear to the user that the main content has
> such other information in it. In the past I have seen it even mentioned in the
> title displayed above the video and explicitly described in the description to
> point out to the user what they have to expect from the content.
> 
> In a situation where a program has to pick an adequate piece of content, e.g.
> content that actually has an audio description, just using kind=main is indeed
> insufficient. I'm prepared to suggest to Ogg to add the use of "+" to the roles
> attribute to enable use cases such as role="audio/main+audio/audiodesc" or
> role="video/main+video/sign".

In DASH, which is still a work in progress, it appears <role> will be used to identify whether a track is alternate or supplementary. Another <role> in the same audio set can identify main or translation.
Comment 5 Bob Lund 2011-08-29 16:08:45 UTC
An FCC Report and Order regarding video descriptions has been issued (http://transition.fcc.gov/Daily_Releases/Daily_Business/2011/db0825/FCC-11-126A1.pdf). It cites the requirement to support pass-through (paragraph 20) and references the use of the secondary audio program for carrying video descriptions (paragraph 28). These are requirements to support descriptions that are embedded in the program's main dialogue (and received as a single composite track by the receiver). I think this is strong motivation to add a new category to AudioTrack.kind to identify audio programs with embedded descriptions, as proposed in this bug.

Comment 6 Silvia Pfeiffer 2011-08-30 00:41:27 UTC
Bob, you are referring to the "pass-through" rule and the references to secondary audio programming in the FCC ruling. What this refers to is the use of multichannel video programming distributors (MVPDs), which provide a second audio channel, separate from the main one, in which additional audio content can be distributed. Most of the concerns raised in the FCC document are about the current use of the secondary channel for a second-language audio track, and that a requirement to provide audio descriptions would mean stopping that service, which is not acceptable to a different part of the MVPDs' audience. All of the discussion in the FCC document is about MVPDs, i.e. about distributing audio descriptions in a secondary audio track. Nowhere do I see a discussion of mixed audio tracks. The only mixing that happens is in the receiving device, for display purposes.

So, I don't really follow your argument that this is a reason for introducing a new @kind value.


In contrast, the ATSC Digital Television specification indeed allows a visually impaired service associated with the main service to be either a single channel (which is what is typically implemented) or a complete mix of all program elements (i.e. main plus audio description). See

http://www.atsc.org/cms/standards/a_54a_with_corr_1.pdf (6.6.2.3, 6.6.4.3)
and
http://www.atsc.org/cms/index.php/standards/document-download/doc_download/13-a52b-digital-audio-compression-standard-ac-3-e-ac-3-revision-b (p. 117, full_svc flag)
However, it seems that most implementations only do a single channel. Do you know of actual implementations out there that do such mixed delivery? And do they actually mix the channels, or do they dedicate, for example, the center channel to audio descriptions, thus retaining the separation?
Comment 7 Bob Lund 2011-08-30 01:34:54 UTC
(In reply to comment #6)
> Bob, you are referring to the "pass-through" rule and references to the use of
> secondary audio programming in the FCC ruling. What this refers to is the use
> of multichannel video programming distributors (MVPDs), which provide a second
> and separate audio channel to the main one in which additional audio content
> can be distributed. Most of the concerns raised in the FCC document are about
> the current use of the secondary channel for a second language audio track and
> that a requirement to provide audio descriptions would require stopping that
> service which is not acceptable for a different part of the MVPDs audience. All
> of the discussion in the FCC document are about MVPDs, i.e. about distributing
> audio descriptions in a secondary audio track. Nowhere do I see a discussion
> about a mixed audio tracks. The only mixing that happens is in the receiving
> device for display purposes.

No, this is not the case. There are only two audio channels - main and secondary. Audio descriptions are mixed with the main dialogue by the programmer and carried in the secondary audio channel. This channel is a mix of dialogue and description. Receivers only tune one of the audio channels. There is no mixing in the receiver.

> 
> So, I don't really follow your argument that this is a reason for introducing a
> new @kind value.

The current @kind does not support identifying an audio channel that contains both dialogue and descriptions.

Bob
Comment 8 Mark Vickers 2011-08-30 23:24:26 UTC
I checked with our experts on audio description tracks and got the following answer:

"From an International (European) perspective it is correct that Video Descriptions may be delivered separately for downstream mixing, or Descriptions may be delivered pre-mixed. However, in the US and Canada the only technique in use or contemplated at this time is pre-mixed and delivered as an alternate "complete main" audio channel. I think the W3C needs to account for the differences between how this is done in different parts of the world."

I believe this is because US and Canadian client devices (TVs & set-top boxes) currently only switch between the main audio track and the secondary audio programming (SAP) track. In the US, this often switches between English and Spanish dialog. The client devices never mix the two audio tracks. Therefore, new audio description tracks will be pre-mixed so the client devices can continue to switch between the tracks with no audio mixing at the client.
Comment 9 Clarke Stevens 2011-11-09 22:39:22 UTC
At the F2F meetings in Santa Clara, the HTML WG appeared to support this bug. The resolution suggested by the Media Pipeline task force (MPTF) is the same one suggested in the original bug report. Here is a link to the MPTF requirement:

http://www.w3.org/2011/webtv/wiki/MPTF/MPTF_Requirements#R1._Combined_Main_.2B_Description_Audio_Track

The MPTF is open to other solutions as well.
Comment 10 Mark Vickers 2011-11-09 23:06:03 UTC
(In reply to comment #9)
> The resolution suggested by the Media Pipeline task force (MPTF) is the same
> one suggested in the original bug report.

To be clear, the referenced MPTF document actually describes two alternate solutions, both of which were discussed at the F2F. Option A is the same as in the original bug report. Option B is a more general alternative solution.

Option A: Define two new Category values:
   "main+description" - pre-mixed main audio track and audio descriptions
   "translation+description" - pre-mixed translated audio track and audio descriptions

Option B: Make Category a list, allowing other combinations (e.g. video with main and sign)
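Under Option B, checking a track would mean testing for membership in the category list rather than comparing a single token. A hypothetical sketch, assuming a space-separated string serialization (the exact serialization was not decided in this bug):

```javascript
// Hypothetical sketch of Option B: kind is a list of categories.
// Returns true if the track carries every requested category.
function hasKinds(track, ...wanted) {
  const kinds = track.kind.split(" ");
  return wanted.every(k => kinds.includes(k));
}
```

For example, a premixed described track could carry kind="main description" and match both "main" and "description".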

Coincidentally, the W3C just posted a promotional video that includes a version with descriptive audio in the currently unsupported pre-mixed format, illustrating the need for this bug to be fixed:
   http://www.w3.org/2011/11/w3c_video_described.html
Comment 11 Silvia Pfeiffer 2012-02-09 05:23:37 UTC
(In reply to comment #10)
> Option B: Make Category a list, allowing other combinations (e.g. video with
> main and sign)

I don't mind Option B, though I hope it occurs infrequently.


> Coincidentally, the W3C just posted a promotional video that includes a version
> with descriptive audio in the currently unsupported pre-mixed format,
> illustrating the need for this bug to be fixed:
>    http://www.w3.org/2011/11/w3c_video_described.html


Note that that video is an alternative to the original video, because it is re-mixed and longer than the original. If you're using it as an example where this video's audio track would be published separately from the original video as an AudioTrack through the multitrack API, then that's unrealistic and would not happen, because the two would not synchronize.
Comment 12 Ian 'Hixie' Hickson 2012-02-10 00:39:28 UTC
My intent here is to not add anything until there is at least one format that supports this, since there is no point the HTML spec saying to do something that can never happen.
Comment 13 Bob Lund 2012-02-13 21:55:26 UTC
(In reply to comment #12)
> My intent here is to not add anything until there is at least one format that
> supports this, since there is no point the HTML spec saying to do something
> that can never happen.

Premixed audio descriptions for the visually impaired plus main dialogue, as required by the recent FCC ruling
(http://transition.fcc.gov/Daily_Releases/Daily_Business/2011/db0825/FCC-11-126A1.pdf), are specified in AC-3 audio used in MPEG-2 TS. This spec
(http://www.atsc.org/cms/standards/a_52-2010.pdf) defines how the presence of this premixed audio track is signaled: in the AC-3 bit stream information syntax (section 5.3.2), the field bsmod = 2 signals the visually impaired audio stream (see Table 5.7 in section 5.4.2.2), and in the AC-3 descriptor, the field full_svc = 1 signals a full audio service (section A4.3), meaning the track includes the main dialogue audio.

This standard is adhered to in North American broadcast channels. These AC-3 signals could also exist when these channels are redistributed over IP.
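The AC-3 signaling Bob cites could map onto the proposed kind values roughly as follows. The field names mirror the spec text (bsmod from the bit stream information, full_svc from the AC-3 descriptor), but the mapping itself is an illustration, not normative:

```javascript
// Hypothetical sketch: derive an AudioTrack.kind value from the AC-3
// signaling cited above. bsmod = 2 marks the visually impaired
// service; full_svc = 1 marks a full (complete mix) service.
function kindFromAc3(bsmod, fullSvc) {
  if (bsmod === 2) {
    // A full VI service is the premixed main dialogue plus
    // descriptions; otherwise it carries descriptions only.
    return fullSvc ? "main+description" : "description";
  }
  if (bsmod === 0) return "main"; // complete main service
  return ""; // other associated services are outside this bug's scope
}
```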
Comment 14 contributor 2012-07-18 04:33:40 UTC
This bug was cloned to create bug 17797 as part of operation convergence.
Comment 15 Silvia Pfeiffer 2012-10-06 10:31:00 UTC
EDITOR'S RESPONSE: This is an Editor's Response to your comment. If
you are satisfied with this response, please change the state of
this bug to CLOSED. If you have additional information and would
like the Editor to reconsider, please reopen this bug. If you would
like to escalate the issue to the full HTML Working Group, please
add the TrackerRequest keyword to this bug, and suggest title and
text for the Tracker Issue; or you may create a Tracker Issue
yourself, if you are able to do so. For more details, see this
document:   http://dev.w3.org/html5/decision-policy/decision-policy-v2.html

Status: Accepted
Change Description:
https://github.com/w3c/html/commit/7edfb30d9b448d80c66c0a5d95a98445d635a2db
Rationale: accepted WHATWG change