Accessibility of Media Elements in HTML 5 Gathering

01 Nov 2009

See also: IRC log


Present: Judy_Brewer, Dick_Bulterman, Sally_Cain, Eric_Carlson, James_Craig, Michael_Cooper, Marisa_DeMeglio, John_Foliot, Geoff_Freed, Ken_Harrenstien, Philippe_le_Hégaret, Ian_Hickson, Chris_Lilley, Charles_McCathieNevile, Matt_May, Frank_Olivier, Silvia_Pfeiffer, Janina_Sajka, Felix_Sasaki, David_Singer, Joakim_Söderberg, Hironobu_Takagi, Victor_Tsaran, Doug_Schepers
Chair: Dave and John
Scribe: Chaals, Silvia, fsasaki, Chris




<Hixie> http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2009-July/021125.html

<chaals> ScribeNick: chaals

<scribe> Scribe: Chaals

<dsinger> wondering where a few people are: silvia, judy, and a few others


JF: I am John Foliot. I want to show the Captioning service from Stanford at some point.
... work on accessibility services at Stanford

JS: Janina Sajka, chair WAI Protocols and Formats. Want to present some thoughts about the need for a controls API

SC: Sally Cain, RNIB. Member of W3C PF group. Echoing what Janina said, want to talk about audio description

AT: Hiro from IBM lab Tokyo. We have been working on this stuff for a decade. Am a general chair for W4A. Want to introduce our audio description stuff.

MM: Matt May, Adobe. Want to talk about what we learned from accessibility in Flash, and the authoring tools context

MC: Michael Cooper, work as WAI staff. We're interested in W3C technology having accessibility baked in. I am interested in using existing stuff rather than new things where effective.

Frank: Microsoft, here to listen

KH: Ken Harrenstien, Google. Work on captioning systems - would like to show a bit of what we have done, and have tried to avoid standards work for ages (but failed now ;) )

Joakim: researcher at Ericsson, everyday job is on indexing (photo tagging, media services, ...)
... co-chair of media annotation group at W3C
... want to talk about what we have done at Ericsson.

FS: Felix Sasaki, from University of Applied Sciences in Potsdam. Teaching metadata and in media annotation workshop. Not presenting in particular

Marisa: DAISY Consortium developer (we make Digital Talking Book standards)
... here to learn
... can present a bit about DAISY and possibilities with HTML5

DB: Dick Bulterman, co-chair of SYMM (group at W3C doing SMIL)
... I have been working on this for 14 years :( Researcher at CWI in Amsterdam, interested in authoring systems for multimedia presentations.
... would like to talk about what SMIL has done, maybe demo a captioning system we have for YouTube and some separate caption streams allowing 3rd-party personalisation

IH: Ian Hickson, Google, HTML5 editor.

CL: Chris Lilley, W3C Hypertext CG cochair, CSS and SVG group, etc. Want to make sure that whatever we can do will be usable in SVG as well as HTML. Interested in i18n question - make sure you can have different languages.

PLH: Philippe le Hegaret, W3C. Responsible for HTML, and video, within W3C. Late-arriving participant in timed text working group. Hoping to get that work finished, have a demo of timed text with HTML5.
... Didn't hear anyone wanting to present the current state in HTML5

JF: Also representing Pierre-Antoine Champin from LIRIS (France), who are working on this.

EC: Eric Carlson, Apple. Mostly responsible for engineering HTML5 media elements in WebKit

CMN: Chaals, Opera. In charge of standards, hope not to present anything but interested in i18n and use of accessibility methods across different technologies without reinventing wheels

DS: Dave Singer, Apple, head of multimedia standards. Interested in building up a framework that gets better accessibility over time.

SP: Silvia Pfeiffer, work part time for Mozilla, trying to figure out how to get accessibility into multimedia (including looking at karaoke and various other time-based systems). Have some demo stuff to show, but want to review the requirements we have...

GF: Geoff Freed, NCAM, want to talk about captions and audio description in HTML5

JB: Judy Brewer, head of WAI at W3C. Interested in how the options for accessible media affect the user, especially when there are multiple options.

DS: Please do not speak over each other, speak clearly and slowly so interpreters and scribe can follow.

agenda bashing

DB: SMIL current experience

[we get: DAISY, Geoff, Google, Timed Text, Stanford Captioning, Silvia, Matt]

DS: go like this:

1. Geoff - about multimedia

2. John, Stanford

3. Ken, Google stuff

4. Marisa, Daisy

James Craig rolls in from Apple

5. Silvia, stuff she has done

6. Matt (Flash)

7. Dick - SMIL-based captioning

8. Philippe, Timed Text

2.5. Pierre's video

<dsinger> geoff: how do we do your presentation?

Geoff Freed

GF: Want to talk a little about my concerns.
... we also did some Javascript real-time captioning
... We have been playing around at NCAM with some captioning stuff that might work with video as is in HTML5, using real-time captioning.
... we have been testing by stripping captions from a broadcast stream and embedding them in a page using javascript. We have a way to change channels - e.g. to have live captioning for an event like this
... or it could be used as a way to stream caption data over different channels.
... not the ideal way, it involves some outside work
... would like to see something like a caption element rather than having to inject stuff with JS because that is probably not the most efficient way.
... We have a demo, I will send a screenshot

John Foliot and Stanford workflow for Captioning

JF: My role is assisting content providers to get accessible content online. Video has been a problem for some time.
... Feedback from people with Video was that they found it expensive and difficult to actually make it happen when they were just staff doing something else, not video experts.
... We made a workflow that allows staff to get captioned video online more or less by magic.
... We set up a user account, and they can upload a media file.


JF: We have contracted with professional transcription companies (auto-transcription was not accurate enough for us) to do transcripts.
... $5/minute to get 24-hour turn-around, $1.50 / minute for 5-day turnaround - $90 per hour of video.
... System allows us to use multiple transcription services. We have created some custom dictionaries (e.g. people's names, technical terms we use, and so on)
... Content owners can also add special terms if they use something odd, to improve accuracy.
... If you already have a transcript you can upload that instead. (We then do the remaining work for free)
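
As a quick check of the figures above, the per-hour cost follows directly from the per-minute rates. This is only a sketch of the arithmetic; the function and constant names are illustrative, and the rates are the ones quoted in the minutes.

```python
# Per-minute transcription rates quoted in the minutes (US dollars).
RATE_24_HOUR = 5.00   # 24-hour turnaround
RATE_5_DAY = 1.50     # 5-day turnaround


def transcription_cost(minutes_of_video: float, rate_per_minute: float) -> float:
    """Return the transcription cost for a video of the given length."""
    return minutes_of_video * rate_per_minute


one_hour = 60  # minutes
print(transcription_cost(one_hour, RATE_5_DAY))    # cost of one hour, 5-day turnaround
print(transcription_cost(one_hour, RATE_24_HOUR))  # cost of one hour, 24-hour turnaround
```

The 5-day rate reproduces the $90 per hour mentioned above; the 24-hour rate works out to $300 per hour.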

JF: Upload file, and we generate multiple formats - FLV, MP4, MP3 (which is sent to transcript company).
... email is sent to transcription company when we have the file, and one to content producer so they can start putting their content online even if they haven't yet received the captions.
... When transcription is done it is returned by the transcription company into the web interface.
... Then we do some automatic timestamp generation to turn transcript into various formats.
... User gets an email saying they have their captions, and we give them some code to copy that incorporates the captions into the web.
... This is not quite shrink-wrapped and painless, but it is pretty close. You still have to shift some files yourself from server to server, but we are working on automating these steps too.
... We rolled out the system at the end of the summer, have some users now and are talking to campus content creators to roll it out widely.
... Scalability of production is important. Stanford makes maybe 200 hours of video / week, and a bunch of that has archival value. So we need to be able to run it simply, and scalably.
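
The automatic timestamping step described above ends with timed segments being written out in caption formats. A minimal sketch of one such conversion, assuming SRT as the target and assuming segments arrive as (start seconds, end seconds, text) tuples. The function names are hypothetical; this is not the Stanford or Docsoft code.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    millis = round(seconds * 1000)
    hours, rem = divmod(millis, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments) -> str:
    """Turn (start, end, text) tuples into the text of an SRT file."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        timing = f"{srt_timestamp(start)} --> {srt_timestamp(end)}"
        blocks.append(f"{index}\n{timing}\n{text}\n")
    # SRT separates cue blocks with a blank line.
    return "\n".join(blocks)


print(to_srt([(0, 2.5, "Welcome to the lecture."), (2.5, 6, "Today: accessible video.")]))
```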

SP: You mentioned a company that does timestamping

JF: Docsoft http://docsoft.com

SP: Open source?

JF: Nope. Shrinkwrap product based on Dragon. Speech reco not close enough for datamining, but good enough to automate timestamping completely.
... We focus on the datamining aspect of this as much as the accessibility

JC: Have you experimented with the UI in the video player for this?

JF: Not at this point. Also talking to reelsurfer about this. We are looking at it.

SP: Linking directly - should mention that there is a W3C Media Fragments group, looking at standards to directly address into a time with a URI

JF: We see people putting stuff onto YouTube - and that will use our caption materials if they are there. We are giving people the ability to do this stuff...

<scribe> ACTION: John to provide a link that shows us some of how the stanford system works and looks for users [recorded in http://www.w3.org/2009/11/01-media-minutes.html#action01]

<ChrisL2> Please let us have your sample code for embedding, so we can see what people are using today

Pierre Antoine

Video is presented (slideset and commentary)

<ChrisL2> Wonder what "speaker diarization" is ?

SP: Tracking who is speaking when

Ken Harrenstien, Google


(Note that the video *relies* on the captions)

KH: Video is spoken and signed in different languages. Trying to clarify it isn't for a handful of poor deaf people, but for everyone.

(Luckily everyone here gets that)

PLH: Are the captions burned into the file or separate?

KH: Separate. I will explain that...
... A bit about numbers. Google has a very large number of videos, and everything has to work at scale.
... Numbers are important. Going back to what John was saying, we show there is value for a lot of people. Now we have numbers like how many people are using captions on YouTube.
... those numbers can drive the adoption of captioning.
... we offer video searching of caption data.

(demo of actually searching, digging into a particular term's occurrence in a video)
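
The kind of lookup shown in the demo can be sketched as a search over timed caption cues that returns the offsets where a term occurs. This is an illustration only, assuming cues are (start seconds, end seconds, text) tuples; it says nothing about how Google's index actually works.

```python
def find_term(cues, term):
    """Return the start times (in seconds) of cues whose text contains the term."""
    needle = term.lower()
    return [start for start, _end, text in cues if needle in text.lower()]


cues = [
    (0.0, 4.0, "Welcome to the captioning demo."),
    (4.0, 9.0, "Captions make video searchable."),
    (9.0, 12.0, "Thanks for watching."),
]
print(find_term(cues, "caption"))  # offsets a player could seek to
```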

PLH: Is there anything in the usage numbers about how much people use video and if captioning has an influence on it?

KH: Don't have numbers to give, but they are good enough that I get more support in Google than I used to :)
... want to let people see the numbers for their own video.

[Shows translation of captioning]

PLH: Speech to text, or caption as a source?

KH: Taking the captions as a source.

SP: Has Google included the offset search for YouTube into the Google engine?

KH: Yep. That was what I just showed.

SP: Captions can be used to improve search results. Is Google using them for indexing videos?

KH: Yes

DB: If I post a video, and someone else wants to caption later, how does that work? Can anybody caption anybody else's video?
... can I restrict who uses the content?

KH: There are several websites that allow this. You can do it with or without permission.

DB: So I effectively make a new copy of the base video?

KH: Sort of depends on how you do it. For YouTube you need to be the owner - or contact the owner.

JF: There are other third-party tools that pull tools and transcript from separate places.
... The value of captions gets attention when people see the translation of captions

KH: Behind the scenes...


SP: You use a new XML format instead of SRT - for some particular reason?

KH: We control it. SRT isn't always so sweet....
... we actually produce various formats (incl. srt among others). It's easy to add new ones, so we don't care which format people have.

http://video.google.com/timedtext?v=aC_7NzXAJNI&lang=en&tlang=ru is a live-generated translation
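
The URL above carries its parameters in the query string. A small sketch of reading them and requesting a different translation, assuming that `lang` names the source caption language and `tlang` the translation target (the minutes do not spell this out) and that the endpoint behaves as it did in the demo. The helper name is hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

URL = "http://video.google.com/timedtext?v=aC_7NzXAJNI&lang=en&tlang=ru"


def query_parameters(url):
    """Return the query parameters of a URL as a flat dict."""
    return {key: values[0] for key, values in parse_qs(urlsplit(url).query).items()}


def with_target_language(url, tlang):
    """Return the same URL with the (assumed) translation target replaced."""
    parts = urlsplit(url)
    params = query_parameters(url)
    params["tlang"] = tlang
    return urlunsplit(parts._replace(query=urlencode(params)))


print(query_parameters(URL))
print(with_target_language(URL, "de"))
```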

Multi-channel caption streamer.


Caption text is stripped from the broadcast and injected in real time using Javascript.

various UI controls on top.

Transcript button can also give you a full log.

Break - 17 minutes.

<ChrisL2> ScribeNick: ChrisL2

<chaals> [Please do not use camera flash]

Marisa gives a presentation

John introduces a guest account for the Stanford facility to experiment with the transcription service

Marisa DeMeglio, DAISY

[Marisa introduces DAISY with a demo]

<jcraig> http://captiontool.stanford.edu/ user: W3C_HTML5 pass: Accessible!

DAISY uses SMIL to synchronise the audio with the html text

MD: DAISY 4 being defined adds forms and video. Has an authoring and a distribution side. Can HTML5 be used for distribution?
... Native browser support for audio and video is good, but we see barriers in html5, for example the extensibility. How do we add sidebars, for example? Needs more roles than are built in.
... Want to avoid downgrading the content to represent it

JS: DAISY did a demo with two filmings of Henry V, scene by scene sync for comparative cinematography plus avatar for lip readers. Still interest in multiple synced media levels?

MD: Will be able to answer that next week

JS: Not clear if it fits in HTML 5 or 5.1 or 6

MD: Interested to use SMIL and HTML5 together to get that synchronisation
... also forms and annotations
... constrained by implementations

Silvia_Pfeiffer, Demo and Proposals.... and Everything

SP: Demos with three Firefox builds - standard, with screenreader, and custom patched to add native accessibility support
... Need for signing captions for hard of hearing and deaf
... audio descriptions
... textual transcripts for screen reader or braille
... All of these create multiple content tracks. Multiple audio, text, and video tracks for a composite media element
... need a standard JavaScript interface to control this
... and content negotiation reduces download to just the items of interest
... Text tracks are special and much more valuable outside the multimedia container. Easier to search, index, etc. than when buried in media
... aids editability, crowdsourcing, CMS integration
... sharable

Proposes an itext element with @lang, @type, @charset, @src, @category and @display

<silvia> https://wiki.mozilla.org/Accessibility/HTML5_captions

sent to whatwg and html5 list, not much discussion there, but feedback directly on the wiki

[demo with video and selectable captions, subtitles]

SP: Issue is that all timed text tracks are treated identically. So next proposal identifies some categories, not all supported in the implementation
... but these are the categories in use based on 10 years of collecting them

[demo with compiled firefox, rather than using scripting]

(prettier interface compared to the scripted one, same functionality, captions are transparent background now like proper subtitles)

CL: Question on encodings vs display

PLH: Why is this an issue, Mozilla does this

SP: Uses HTML existing code for layout

PLH: default captioning language

SP: display="auto" selects an auto language negotiation. other options are none and force
... user can override author choice, but author should be able to express their design too
... second proposal has a grouping element itextlist to express common categories etc rather than repeating them
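
One way to picture the display="auto" negotiation is as a match between the user's ordered language preferences and the languages of the available tracks. The sketch below is an assumption about how such a negotiation could work, not the algorithm in the proposal or in the patched Firefox build. Tracks are modelled as plain dicts, tracks marked "none" are skipped, and "force" is left out because the minutes do not define its behaviour.

```python
def primary_subtag(language):
    """Return the primary subtag of a language tag, e.g. 'de' for 'de-CH'."""
    return language.split("-")[0].lower()


def pick_track(tracks, preferred_languages):
    """Pick the first display="auto" track matching the user's preferences in order."""
    candidates = [track for track in tracks if track["display"] == "auto"]
    for language in preferred_languages:
        for track in candidates:
            if primary_subtag(track["lang"]) == primary_subtag(language):
                return track
    return None  # nothing suitable: show no text track


tracks = [
    {"lang": "en", "display": "auto"},
    {"lang": "de", "display": "auto"},
    {"lang": "fr", "display": "none"},
]
print(pick_track(tracks, ["fr", "de-CH", "en"]))
```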


dom interface allows to see if a given itext element is active

PLH: Are you generating events?

SP: Yes, onenter and onleave for new caption blocks or segments, so can listen for that
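
The enter/leave behaviour can be sketched as a check of which cue boundaries were crossed between two playback positions. This is a simulation of the idea under stated assumptions, not the proposed DOM interface: cues are (start, end, text) tuples and the function name is made up for the example.

```python
def cue_events(cues, previous_time, current_time):
    """Return ('enter' | 'leave', text) pairs for cue boundaries crossed
    while playback advanced from previous_time to current_time."""
    crossed = []
    for start, end, text in cues:
        if previous_time < start <= current_time:
            crossed.append((start, "enter", text))
        if previous_time < end <= current_time:
            crossed.append((end, "leave", text))
    # Report events in time order; the sort is stable, so a 'leave' recorded
    # before an 'enter' at the same instant stays first.
    crossed.sort(key=lambda event: event[0])
    return [(kind, text) for _time, kind, text in crossed]


cues = [(1.0, 3.0, "first caption"), (3.0, 5.0, "second caption")]
print(cue_events(cues, 0.0, 2.0))
print(cue_events(cues, 2.0, 4.0))
```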

PLH: charset because srt does not indicate the charset?

SP: yes. Some formats don't provide this
... charset is optional, some formats self describing and don't need it
... no registered mime type for srt

[discussion on where SRT is defined]


SP: currentText api shows the currently displayed text, so a script can manipulate the text or display it
... also a currenttime interface
... works the same for external text tracks or for ones in the media container
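
What a currentText-style lookup has to do is find the cue that covers the current playback time. A minimal sketch, assuming cues are sorted by start time and do not overlap; the function is illustrative and is not the proposed API.

```python
from bisect import bisect_right


def current_text(cues, time):
    """Return the text of the cue active at `time`, or None if there is none.

    `cues` is a list of (start, end, text) tuples sorted by start time.
    """
    starts = [start for start, _end, _text in cues]
    index = bisect_right(starts, time) - 1  # last cue starting at or before `time`
    if index >= 0:
        _start, end, text = cues[index]
        if time < end:
            return text
    return None


cues = [(1.0, 3.0, "first caption"), (4.0, 6.0, "second caption")]
print(current_text(cues, 2.0))
print(current_text(cues, 3.5))
```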

JC: So can do searching or access text that will be displayed later

[demo of v2 using scripting implementation, changing language on the fly]

[demo of firefox with a screen reader, firevox]

[demo had audio descriptions for vision impaired, and text to speech audio description. uses aria.]

aria active region

SC: Will that use the default screenreader, or only the one in the browser?

SP: the default

CMN: How do you add audio?

SP: needs native support in the browser with an interface so it's the same for internal and external sources ... audio and video need to be synchronised
... dynamic composition on server is recommended way to do that

CMN: Issue with third-party signing track, third party has no access to server where the video lives

SP: text is special, but audio could be treated like that too. To be discussed
... needs a spec

CMN: Signed video is similarly special to subtitle text

JC: Can the audio pause the video for long descriptive text (for time to read it or have it spoken)

FO: If the text is inside the container and there is external text too what happens?

(we agree that the priority needs to be defined)

FO: User needs to decide, no one size fits all solution

SP: yes we need the flexibility there and have the api to make it workable

JS: Resource discovery

CMN: So need to think about an API that finds internal as well as external tracks and treats them uniformly

HT: Can itext elements be added or changed dynamically?

SP: yes

[SP demonstrates a video hosting site with Ogg audio, video, and subtitle/captioning/transcript support]

(all done in HTML5 and script)

http://oggify.com/ under development

SP: Like youtube except with open source and open formats and script plus HTML5

FS: Where does the timing info come from?

SP: From the subtitle file, they all have start and end times
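
To illustrate where the timing comes from, here is a small sketch that reads start and end times out of SRT-style text. SRT has no formal specification, so this assumes the common layout (a counter line, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` line with three-digit milliseconds, then the text); the parser is illustrative, not the one used in the demos.

```python
import re

TIMING = re.compile(
    r"(\d+):(\d+):(\d+)[,.](\d+)\s*-->\s*(\d+):(\d+):(\d+)[,.](\d+)"
)


def parse_srt(text):
    """Return a list of (start_seconds, end_seconds, caption_text) tuples."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                h1, m1, s1, ms1, h2, m2, s2, ms2 = (int(g) for g in match.groups())
                start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
                end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
                # Everything after the timing line is the caption text.
                cues.append((start, end, "\n".join(lines[i + 1:])))
                break
    return cues


SAMPLE = "1\n00:00:01,000 --> 00:00:03,500\nHello\n\n2\n00:00:04,000 --> 00:00:06,000\nWorld\n"
print(parse_srt(SAMPLE))
```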

JS: Nested structural navigation is important. chapters, sections etc
... access to next scene, next act would be good

SP: Titled text tracks have DVD-style chapter markers
... linear though, not hierarchical due to limitations of flat file

MD: Yes DAISY infers structure from heading levels
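
Inferring nested structure from heading levels, as MD describes, can be sketched with a stack that tracks the currently open sections. This is a generic illustration with made-up names, not DAISY's implementation; it assumes headings arrive in document order as (level, title) pairs with levels starting at 1.

```python
def build_outline(headings):
    """Turn a flat list of (level, title) pairs into a nested outline."""
    root = {"title": None, "level": 0, "children": []}
    stack = [root]  # chain of currently open sections, outermost first
    for level, title in headings:
        node = {"title": title, "level": level, "children": []}
        # Close every open section that is at the same level or deeper.
        while stack[-1]["level"] >= level:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root["children"]


headings = [(1, "Act 1"), (2, "Scene 1"), (2, "Scene 2"), (1, "Act 2")]
outline = build_outline(headings)
for act in outline:
    print(act["title"], [scene["title"] for scene in act["children"]])
```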

SP: Complex to bring in generic HTML and then display it on another HTML stream .. security issues
... media fragments wg is specifying how to jump to named offsets as well as time offsets
... not finished yet

JS: Direct access also for bookmarking as well as flipping through

SP: Chapter markers and structure exposes the structural content of the video, for speed reading among others. can do it with URIs so bookmarkable and can be in the history
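
Addressing a time offset with a URI could look like the sketch below. It assumes a fragment of the form `#t=start,end` with plain seconds, which is one of the syntaxes the Media Fragments group was working on; since that work was unfinished at the time of the meeting, treat the syntax and the helper name as assumptions.

```python
from urllib.parse import parse_qs, urlsplit


def temporal_fragment(uri):
    """Return (start, end) in seconds from a '#t=start,end' fragment, or None.

    `end` is None when only a start offset is given.
    """
    values = parse_qs(urlsplit(uri).fragment).get("t")
    if not values:
        return None
    start_text, _comma, end_text = values[0].partition(",")
    start = float(start_text) if start_text else 0.0
    end = float(end_text) if end_text else None
    return start, end


print(temporal_fragment("http://example.com/video.ogv#t=10,20"))
print(temporal_fragment("http://example.com/video.ogv#t=30"))
```

Because the offset lives in the URI, such a link can be bookmarked and appears in the browser history, which is the point SP makes above.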

Matt_May, Adobe, Flash experience

[Matt talks about history of accessible captioning in Flash]

MM: Two minimal requirements, flash support in video since Flash 6 and later the ability to insert cue points to associate captions with the video, in Flash 8
... several attempts to create captioning, but they were unsynced so unsuccessful. Result was lack of adoption and a thousand hacks to try and do it
... cue points got us closer, reliable feature
... starting with flash 8, reliable caption sync but no standard way to do it. usually embedded in the container so hard coded fonts, done in actionscript. Content buried inside script
... Came to realisation that inside-only was a naive approach, looked for alternatives. In flash 9 we supported timed text dfxp
... can associate flv_playback_caption component, takes an external timed text file for captions
... used an existing standard, tt was there
... not re-re-re-re inventing the wheel
... third parties can build captions and authoring software
... hopeful that other formats adopt dfxp as well
... Breaking out the captions, as Silvia and Ken mentioned, is important for any standard. Can embed but that's just the first step. Only download the required captions. Allows third parties to add their own content. Crowdsourcing captions
... dealing with third parties adding captions later
... For html5, also important to have captions *inline* in the html document itself. Not complex to add

CL: SMIL also found a need to have text inline as an option

DB: One option

SP: most of my demos are at http://www.annodex.net/~silvia/itext/

MM: Everyone is familiar with magpie

[we aren't]

MM: Shows this is well tried territory, so HTML5 should be able to use existing authoring tools and workflows to do this

<jcraig> MAGpie is a captioning tool from NCAM

MM: So please consider existing solutions

<jcraig> http://ncam.wgbh.org/webaccess/magpie/

JC: Many content providers of video have no idea how captioning works, as other people do it

M: Using captions in AJAX, inserting via the DOM in real time. Issues of scrolling.

[lunch break]


<silvia> Dick Bulterman (sp?) will talk about SMIL

<scribe> ScribeNick: sylvia

Dick_Bulterman, CWI, SMIL Text

<dsinger> ScribeNick: silvia

chaals: I will scribe *for a bit* :-)

<chaals> Scribe: Silvia

"Supporting Accessible Content" - lessons from SMIL 1/2/3/

co-chair of SMIL working group

head of distributed & interactive systems group at CWI

interest for long time in working with multimedia

temporal & spatial synchronisation

take-home message: a11y isn't about a particular media format

it is about supporting selectivity among peer encodings

e.g. different encoding of same content for different situations, e.g. when driving/reading/conference

it is about a coordination mechanism to manage selection of media streams

difficulty is that they change over time

it is about providing 'situational accessibility' support

nobody wants special-purpose tools

<plh> --> http://www.w3.org/2009/Talks/1031-html5-video-plh/Overview.xhtml plh's slides

we should solve the problem for everybody

we need to make the temporal and spatial synchronisation explicit to be able to do the complex things

what is accessible content?

what kind of things would need to be done with a video object

it could be a svg object or another object, too

- want to add subtitles

- want to add captions and labels

labels are a cheap and simple way to communicate what is being visible in the video

"you are about to see my son play an instrument"

- line art & graphics

it would be nice to have a uniform model for all types of objects

- audio descriptions

- semantic discriminators

people want things at different levels of detail at different times

Some experiences from our SMIL work:

- not all encodings will be produced by the same party

e.g. even while Disney owns the video, they may not be the ones to create the captions

- not all content will be co-located

e.g. a video may be on one server, but content enrichments will be on many different servers

if you want highly synchronised audio and video, they practically have to be in the same file

the network delays can easily add up to make it impossible to synchronise them

but you can create some things on different servers

- there may be complex content dependencies

- each piece of content may not be aware of the complete presentation

*SMIL Support for Alternative Content*

SMIL 1.0 in 1998

<switch> : selection based on system test attributes

(language, bitrate, captions)

support alternative selection of parallel tracks

demo of MIT Prof. Walter Lewin

in SMIL:

<video src="MITguy" … />

<switch>

<text src="A" systemLanguage="nl" systemCaptions="on"/>

<text src="B" systemCaptions="on"/>

</switch>

<ChrisL2> there is an implicit PAR around that example so the video and the switch play in parallel

document order is only fallback - user preference dominates it
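
The selection rule for a switch can be sketched as "take the first child whose test attributes all hold for the current user settings". The code below is a simplification under stated assumptions: children are modelled as dicts, attribute values are compared for plain equality, and the richer matching rules SMIL defines for systemLanguage are ignored.

```python
def select_switch_child(children, settings):
    """Return the first child whose test attributes all match the settings."""
    for child in children:
        tests = child.get("tests", {})
        if all(settings.get(name) == value for name, value in tests.items()):
            return child
    return None  # no child is acceptable, so nothing is rendered


children = [
    {"src": "A", "tests": {"systemLanguage": "nl", "systemCaptions": "on"}},
    {"src": "B", "tests": {"systemCaptions": "on"}},
]
dutch_viewer = {"systemLanguage": "nl", "systemCaptions": "on"}
english_viewer = {"systemLanguage": "en", "systemCaptions": "on"}
print(select_switch_child(children, dutch_viewer)["src"])
print(select_switch_child(children, english_viewer)["src"])
```

With captions switched off in the settings neither child matches, so document order only matters among the children the user's preferences allow.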

*SMIL 2.0 (2001)*

- custom test attributes

- added <excl> tag for pre-emptive inclusion of content

<excl> provides support for audio descriptions

event-based activation

demo of a video, which pauses on a scheduled pre-empt to wait for an audio description to be displayed

when that audio description is done, the video continues
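
The effect of pausing the video for each audio description can be sketched as simple timeline arithmetic: every description adds its own length to the overall presentation. The model below is an illustration with made-up names and numbers, not SMIL's timing engine; it assumes descriptions are (video time in seconds, description length in seconds) pairs.

```python
def presentation_duration(video_duration, descriptions):
    """Total running time when the video pauses for every description."""
    return video_duration + sum(length for _at, length in descriptions)


def wall_clock_time(video_time, descriptions):
    """Wall-clock time at which a given video time is reached."""
    paused = sum(length for at, length in descriptions if at < video_time)
    return video_time + paused


descriptions = [(10.0, 5.0), (30.0, 8.0)]
print(presentation_duration(60.0, descriptions))
print(wall_clock_time(20.0, descriptions))
```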

*SMIL 3.0 (2008)*

- number of different extensions

- <smilText>: another timed text format - allows embedded hyperlinks, allows style sheets, allows motion, allows fine-grained text

streamable labels, captions, mW events

- smilState: allows coordination via data model

- timed, decentralized metadata

- media pan & zoom (temporal focus, e.g. Ken Burns effect, coupled with audio description)

demo of a web page with three buttons used to influence the presentation

* SmilText *

Why not simply reuse DFXP?

- it was not intended to be embedded in SMIL

- it isn't a streaming format

- it doesn't allow mix of absolute/relative/event timing

- it doesn't handle motion text

- layout + style processing are idiosyncratic

SMIL needed to support live TV streaming and supporting live captioning wasn't possible with DFXP

smilText was explicitly designed to map well with DFXP

there is still an easy mapping

- smilText is functionally compatible, with a direct mapping to DFXP

- smilText is also a direct replacement for RealText

*What will HTML5 need?*

* video object can't always determine timeline

- need external-to-video notion of temporal/spatial coordination

* Simple media control (start/stop/pause) are not rich enough

- impossible to enumerate all of the things that you may want to start/stop/pause in parallel

it might be a good idea to create a middleware that handles e.g. pausing across all involved elements

* need to support embedded and external companion content

*T/S coordination info: where?*

temporal/spatial coordination should go into a script? a web page header? a flash object? into SMIL?

a companion media object?

- very laborious for fine-grained timing

in script controlling directive activation?

- probably not, cause it doesn't scale

In companion synchronization specification

- good for extensibility

- options:

— fully external

— timesheet

— internal

timesheets are like style sheets for synchronising timing

html+time is an example

there is no right answer - but it is important to have all the flexibility

*Editing Complex Presentations*

Demo: GRiNS (1996-2004)

authoring sw for smil presentations

- interactive navigation

- scalable presentations

demo: BBC did a 40min newscast with a structured view and direct access using SMIL

a structured view gives you all the possibilities of gaining the presentation in the way that you want it

*Adding Captions to 3rd Party Videos*

An Aside:

- helping the community to share comments on videos that other people own

demo: Ambulant Captioner

*Adding Captions & Labels (and Context)*

After selection, add comments

helps caption authoring

does predictive timing on your captions

it helps people produce timing more easily

*Putting Navigation into Captions*

captions are being used to provide navigation

temporal hyperlinking

intra-clip navigation

inter-clip navigation

*More about SMIL & Accessibility*

SMIL 3.0 Book:

find it on xmediaSmil.net

captioning tool: www.ambulantPlayer.org/smilTextWebApp/

together anywhere anytime project: www.ta2-project.eu

smilText: code.google.com/smiltext-javascript/

ugliest demo section :-)

JS SMIL Player: ambulantPlayer.org

timesheets: w3.org/TR/Timesheets

re-usable timing

that ends the presentation of DB on SMIL


joakim: is SMIL used?

MMS is based on SMIL

Quicktime media player

<ChrisL2> s/www./http:\/\/

digital signage

quicktime on the desktop had a smil implementation

windows media player

windows media player uses it to pre-empt national security events


joakim: playlist of product presentations?

supermarkets in France or Finland use SMIL for kiosks

interactive selectivity is one of the big things there

having the target of a hyperlink change over time

we see a lot of ways of deployment of SMIL, but it's been frustrating that it's not been used more

things move incredibly slowly

even with dozens of years of experience with interactive multimedia, we still only have a <video> tag in html

SMIL has the power that easy things can be done easily, but also difficult things in a simple way

finishes Dick's presentation

<plh> --> http://www.w3.org/2009/Talks/1031-html5-video-plh/Overview.xhtml plh's slides

next: Philippe on the state of timed text (DFXP)

shows a html5 video demo

demo: www.w3.org/2009/Talks/1031-html5-video-plh/Overview.xhtml#(2)

<ChrisL2> http://www.w3.org/2009/Talks/1031-html5-video-plh/

Timed Text

came out of the TV world

started in 2003

original idea was to have an authoring format for the web

as Adobe used it in Flash, it became increasingly a delivery format

demo: NCAM flash player with DFXP



<tt … xmlns:tts="http://www.w3.org/ns/ttml#styling">

…

<p begin="0s" end="10s">

This word must be

<span tts:color='red'>red</span>

<br />and this one

<span tts:color='green'>green</span>.

</p>

…

</tt>





highly controversial use of name spaces
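
A sketch of how a script might read the timing and styling out of a document like the example above, showing the namespace handling involved. The complete document here is a reconstruction: the default namespace `http://www.w3.org/ns/ttml` and the `body`/`div` wrappers are assumptions, since only a fragment survives in the minutes.

```python
import xml.etree.ElementTree as ET

TT = "http://www.w3.org/ns/ttml"            # assumed default namespace
TTS = "http://www.w3.org/ns/ttml#styling"   # styling namespace from the example

DOCUMENT = f"""<tt xmlns="{TT}" xmlns:tts="{TTS}">
  <body><div>
    <p begin="0s" end="10s">This word must be <span tts:color="red">red</span>
    <br/>and this one <span tts:color="green">green</span>.</p>
  </div></body>
</tt>"""


def timed_paragraphs(document):
    """Return (begin, end, [span colours]) for every <p> in the document."""
    root = ET.fromstring(document)
    result = []
    for p in root.iter(f"{{{TT}}}p"):
        # Namespaced attributes are looked up as {namespace}localname.
        colours = [span.get(f"{{{TTS}}}color") for span in p.iter(f"{{{TT}}}span")]
        result.append((p.get("begin"), p.get("end"), colours))
    return result


print(timed_paragraphs(DOCUMENT))
```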

*online captioning*

# Parameters: frame rate, ... (SMPTE, SMIL)

# Styling: XSL FO 1.0, CSS 2

# Layout and region

# Timing model: SMIL

# Basic Animation: SMIL, SVG

# Metadata

example timed text document

it's possible to use it in a streaming context, but you have to be careful what additions you make to the file

test suites


one demo is with HTML5 using javascript to synchronise with the <video> element

the list of which features are supported in which player is given at http://www.w3.org/2009/05/dfxp-results.html

Adobe and MS players are still prototypes

JW player's support is disastrous

WGBH support is also not that great

I don't know what the plan is to update those implementations

plh shows different web browsers supporting html5 and MS/Flash implementation in test interface

*Recent Progress*

finishing on the testing

published last call

dynamic flow still needs testing

we're waiting on the implementation from Samsung

trying to become W3C recommendation by Dec 2009

but we need the dynamic flow implementations first

SP: when did you last update the HTML5 DFXP implementation

plh: I need to update it a bit, but there is cool stuff that can be done with DFXP

finishes plh demo

next up joakim

"Media Annotations Working Group - overview"

we started a year ago

I'm co-chair, Felix used to be staff contact


- facilitate metadata integration for media objects in the Web, such as video, audio and images

- means is to define an Ontology and API for metadata

we're part of the Web Video work in W3C

we're trying to re-use what exists

vision is to make it easy to use

the ontology relates the different metadata formats

the mapping between different formats is provided by the working group

some formats in sight:

XMP, DublinCore, ID3 etc

*Definition of metadata properties for multimedia objects*

example properties

* ma:contributor

* ma:language

* ma:compression


*Relating properties to existing formats*

gives an example where ma:contributor maps to media:credit@role in YouTube Data API protocol

ma:contributor maps to dc:creator in XMP

semantic mappings: exact, related to, more specific/more general

syntactic mappings e.g.

unicode string, also given
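
The mapping idea can be pictured as a lookup table from an "ma:" property to its counterpart in each source format. The sketch below only contains the two mappings mentioned above; the table layout and function name are illustrative and are not the working group's ontology or API.

```python
# Mappings taken from the examples in the presentation; the structure is illustrative.
MAPPINGS = {
    "ma:contributor": {
        "YouTube Data API": "media:credit@role",
        "XMP": "dc:creator",
    },
}


def map_property(ma_property, target_format):
    """Return the format-specific name for an 'ma:' property, or None if unknown."""
    return MAPPINGS.get(ma_property, {}).get(target_format)


print(map_property("ma:contributor", "XMP"))
print(map_property("ma:contributor", "YouTube Data API"))
```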

*API for Media Resources 1.0*

example for ma:contributor property

consists of id and role

API was published 2 weeks ago and is in first public working draft


- General

— reading is easy part, but how to (or if to) write "ma:" properties into media files

— getting verification of mappings (needs to be based on actual usage, not on the specifications)

- specific to accessibility

— how does the media annotations approach fit to a11y needs?

— are there a11y related attributes that are missing?


provides links to home page, requirements, ontology, and API spec

Marisa: how would you use this?

DAISY is using the book ontology

we're interested in this ontology

Felix: it's easy to create the mapping table, but to get the feedback on mapping ontologies to each other is difficult

also, it's different what is being written into the spec and what is used in the wild

Doug: are they mis-using some of the bits in the ontology for other reasons?

Felix: no, just using subsets

ends joakim's presentation

Ian's presentation next

"Where are we in HTML5 with <video>?"

- html5 has an audio and a video element

- basically the same element

- defined as a single abstract concept of a media element

- the ui is basically up to the browser

- common codec is a challenge


what we're seeing there is: the part above the controls is the video element

the part below the video is SVG - it could use HTML div/buttons etc

controls in this example are all scripted

when you hit the play element, it sends video.play() and then the video starts playing

there is scripted access to the loudness control

if you enable the video UA controls, they are also shown

browser UI and scripted UI are in sync, so if you silence the video through either, the other reacts

the API gives you further information

e.g. state of network

playback/buffering state

you can seek

it basically supports streaming content

browser is exposing a slowly moving window into the video

playback rate can be changed

you can make it loop
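The behaviour described above can be sketched as a toy state model, assuming one shared state object that both the browser UI and a scripted UI listen to. This is Python and not the HTML5 API itself: `MediaModel` and its method names are hypothetical, and the event names are only labels for the sketch.

```python
from typing import Callable, List


class MediaModel:
    """Toy stand-in for the shared playback state of a media element."""

    def __init__(self, duration: float) -> None:
        self.duration = duration
        self.current_time = 0.0
        self.paused = True
        self.muted = False
        self.playback_rate = 1.0
        self.loop = False
        self._listeners: List[Callable[[str], None]] = []

    def on_change(self, listener: Callable[[str], None]) -> None:
        # Each UI (browser-provided or scripted) registers one listener.
        self._listeners.append(listener)

    def _notify(self, what: str) -> None:
        for listener in self._listeners:
            listener(what)

    def play(self) -> None:
        self.paused = False
        self._notify("play")

    def set_muted(self, muted: bool) -> None:
        self.muted = muted
        self._notify("volumechange")

    def seek(self, t: float) -> None:
        # Clamp the target to the valid range of the resource.
        self.current_time = min(max(t, 0.0), self.duration)
        self._notify("seeked")

    def advance(self, wall_seconds: float) -> None:
        """Move the playhead as if wall_seconds of real time had passed."""
        if self.paused:
            return
        t = self.current_time + wall_seconds * self.playback_rate
        if t >= self.duration:
            if self.loop and self.duration > 0:
                t = t % self.duration   # wrap around when looping
            else:
                t = self.duration       # stop at the end otherwise
                self.paused = True
        self.current_time = t
        self._notify("timeupdate")
```

Because every UI observes the same object, muting through either one is seen by the other, which is the synchronisation shown in the demo.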


the goal was not to do SMIL

if you needed the whole support of SMIL, you'd use SMIL

DB: we need to be careful about going down this road
... similar things need to be called the same only if they are completely the same


2-3 ways that a11y is built into the API

* tracks that are built into the media resource

not currently exposed in the API

the browsers are expected to expose that if it's available

* javascript overlays

missing the cue ranges API now

* <source> element is not just used for different codecs, but also for different bitrate/quality videos

Next version is expected to have something similar to what silvia suggested <itext>

the main thing blocking this now is that the things already in the spec aren't even implemented

what's already in the spec needs to be implemented solidly before more spec is added

a test suite for this is still missing

* extensibility

HTML5 has successful extensibility mechanisms
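Since the cue ranges API mentioned under javascript overlays is still missing, the following is only a sketch of what a script would need from it, written in Python. `CueRange`, `active_cues` and `transitions` are hypothetical names, and the half-open interval convention is an assumption.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class CueRange:
    start: float  # seconds, inclusive (assumed convention)
    end: float    # seconds, exclusive (assumed convention)
    text: str


def active_cues(cues: List[CueRange], t: float) -> List[CueRange]:
    """Return the cues whose range contains playhead position t."""
    return [c for c in cues if c.start <= t < c.end]


def transitions(
    cues: List[CueRange], prev_t: float, t: float
) -> Tuple[List[CueRange], List[CueRange]]:
    """Return (entered, exited) cues between two playhead positions.

    Cues lying wholly between prev_t and t are not reported; a real
    design would have to decide how to handle them.
    """
    before = active_cues(cues, prev_t)
    after = active_cues(cues, t)
    entered = [c for c in after if c not in before]
    exited = [c for c in before if c not in after]
    return entered, exited
```

An overlay script would show the text of each entered cue and hide each exited one.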


CL: <video> implementation being incomplete may also be because the spec is incomplete, so extending the spec would be better than waiting for full implementation

DB: you need to bring people in and you need a level of functionality that is more attractive than just video playback
... you need to have a roadmap

Ian: what Silvia is proposing is pretty much what we need

DB: srt timing model is different to SMIL and different from others
... if we come up with multiple timing models, that don't work together - in particular with multitrack audio/video - it might be better to change the timing model
... the impression I have is that we may need to look at <video> more in depth and extend it more, instead of making it too small
... is missing the discussion on the timing model

Ian: my understanding is that it's using the timing model that people are expecting

DB: there are many practical issues for the choice of the timing model - we think <video>'s timing model is too restricted
... we may have a means to extend this better

Doug: from what I understand of HTML5, no decision has been made on the choice of the timing model and SMIL still fits into it

Ian: the design of the element was based on the idea that any timed actions that need to be done would use SMIL

Doug: the SMIL stuff that is in SVG could be reused in HTML5 - and since browser vendors like to reuse things, they are inclined to reuse that

Michael: no support currently for resource discovery?

Ian: not yet, but it seems silvia has a plan

Marisa: is there any approach to HTML5 where people are creating similar things to a digital talking book with overlays and synchronised SMIL scripts etc?

Ian: if the goal is to specifically synchronise the playback of audio and video, then the approach should be SMIL

joakim: I think media discovery would be a perfect way to use media annotations, e.g. two different versions of a video

Michael: basically the a11y of the video file is left to the video format
... there are characteristics that you need to know about, whether it has captions etc, so given that we have the capacity for scripted controls, then the selector should be in the API

Ian: silvia's itext has that as a proposal

Matt: how do we determine what belongs in the API and what not ?

There's a process for how to extend HTML5

<Hixie> my answer was "that's a judgement call, there's no general rule" :-)

<plh> --> http://www.w3.org/2009/10/W3C-AccessibilityPA.pdf Dick Bulterman slides

<fsasaki> scribe: fsasaki

now judy brewer presentation

judy: interested in quality of acc. support
... current support in HTML5 may be insufficient for captions etc.
... proliferation of different caption etc. formats is a problem
... silvia described high-level requirements for captions etc., liked that
... best practices are good, but need an overall approach of sets of requirements
... and description of relations between existing standards / approaches

<plh> --> http://media.w3.org/2009/10/ACAV.ogv ACAV Project video

judy: html5 joint taskforce might be a place to define what those requirements should be
... Geoff said that html5 caption solutions are insufficient
... captions should be in html5 like video in html5, that is its own element
... deaf community features are not sufficiently represented
... srt is easy to write with text editor
... but that might mean choosing fast gain over long-term benefit

silvia: a clarification: itext element tries to be format independent
... srt could be a good baseline format, not XML , but that can be discussed
... allows for linking to any format that your browser supports
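To make the "easy to write with a text editor" point concrete, here is a minimal sketch of reading srt in Python. It assumes the common layout of a counter line, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` timing line and one or more text lines per block; `parse_srt` is a hypothetical helper and real files may carry variations it does not handle.

```python
import re
from typing import List, Tuple

_TIME = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def _seconds(stamp: str) -> float:
    """Convert an HH:MM:SS,mmm timestamp to seconds."""
    m = _TIME.fullmatch(stamp.strip())
    if m is None:
        raise ValueError(f"bad timestamp: {stamp!r}")
    hours, minutes, secs, millis = (int(g) for g in m.groups())
    return hours * 3600 + minutes * 60 + secs + millis / 1000


def parse_srt(text: str) -> List[Tuple[float, float, str]]:
    """Return (start, end, caption text) tuples, one per srt block."""
    cues = []
    # Blocks are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        # Skip the numeric counter line if present.
        if lines and "-->" not in lines[0]:
            lines = lines[1:]
        if not lines:
            continue
        start, _, end = lines[0].partition("-->")
        cues.append((_seconds(start), _seconds(end), "\n".join(lines[1:])))
    return cues
```

The whole format fits in a few lines of parsing, which is the attraction of srt as a baseline; it also shows how little structure there is beyond start time, end time and text.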

judy: Geoff said audio and video should be on the same level, e.g. audiodesc and videodesc element
... time is perfect for html5 to have good solution with caption-specific elements
... externally referencing captions is another need, formulated by Geoff

dick: understand why geoff said that external captions might be good

<ChrisL2> http://www.evertz.com/resources/eia_608_708_cc.pdf

<plh> --> http://www.w3.org/2009/10/MarisaDeMeglio.DAISY.pdf DAISY slides

dick: but would be a shame if we had two mechanisms (internal and external) to handle the same thing

<plh> --> http://www.w3.org/2009/10/html5 access mawg-20091101.ppt Joakim's slides

now presentation by dave singer

dave: good acc. needs three things: good specs, uptake by authors and users and user agents
... it is easy on one of the three
... we can do better than TV, not replicating what is there in the non-web world
... timed acc. problems. e.g. captions in audio, video contrast
... some people need high, some people low contrast
... general time management: acc. by flipping information with the media
... rate preferences slower than normal rate sometimes necessary for acc.
... question of untimed acc. : having a transcript
... also untimed: longdesc, fallback not only for non-support, but also for support of e.g. video, but not for a specific user
... a question: inside or outside the video container?

David Singer, Apple

dave: inside: media container can have overlay time tracks.
... no mechanism for outside avail. yet, so synchronization with inside is easier to achieve
... meeting users needs: select the resource that the user needs
... choices: by preference, or by action, or both?

hypothesis: out of scope is a user preference repository
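A minimal sketch of the "by preference, or by action, or both" choice, in Python. Everything here is hypothetical: the `Alternative` type, the feature labels and the rule that an explicit user action overrides stored preferences are assumptions made for illustration, not something agreed in the discussion.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Alternative:
    url: str
    features: FrozenSet[str] = frozenset()  # e.g. {"captions"}


def select(
    alternatives: List[Alternative],
    preferred: FrozenSet[str],
    chosen_url: Optional[str] = None,
) -> Optional[Alternative]:
    """Pick one alternative for the user."""
    # Assumed rule: an explicit user action wins over preferences.
    if chosen_url is not None:
        for alt in alternatives:
            if alt.url == chosen_url:
                return alt
    if not alternatives:
        return None
    # Otherwise take the alternative covering the most preferred
    # features; on a tie the earlier-listed alternative is kept.
    return max(alternatives, key=lambda a: len(a.features & preferred))
```

Note that the preferences are passed in by the caller, in line with the hypothesis that a preference repository is out of scope.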

<ChrisL2> [Judy explains seizure disorders]

dave describing possible choice approaches for user needs, see slide 8 of presentation

<ChrisL2> Scribe: Chris

dave: it matters who renders captions: the web engine or somewhere else

silvia: could it all be done by web engine?

ian: depends on media framework used
... we can expose API, but depends on media framework if it is possible or not

dave: scripted accessibility

Janina: you might prefer additional content depending on the specific part of the media

dave: in HTML5, the approach is to link to SMIL for synchronization
... in future we want to have a video in different areas of a page

dick: several ways of doing things
... be careful of controlling things: scripted control vs. declarative control
... "scripting" is not always the term you want to use

chaals: worth seeing in html5 how to get there

James: flash already supports some of these things

dave: about sign language, there was just one code

silvia, felix: has been solved in latest version of bcp47

dave: summary: what do we need what is not in HTML5, what should be in a best practices document?
... cue ranges are very important
... describing user preferences, probably informative in HTML5 view, since in other specs
... and script access to control features of the media
... and CSS media queries

silvia: about cue ranges
... everybody means something else talking about it
... making the requirements for it clear would help

dave: look not only in captioning, but at the big picture

james presentation

james: interest in universal design for all users
... including technical needs, e.g. related to bandwidth
... looked into content selection without knowing preferences

dave: web developers asked for "how do we find out if a user uses a screen reader"
... currently, SWF / Flash are the only means to achieve that
... that has some security implications
... in CSS media query you download everything, different than content selection before download
... potential that user wants to share preferences with certain web servers
... if there are methods in video element like getCurrentCaptions, that would have security implications as well
... certain security restrictions could be more lax for certain users

shepazu: "cross origin resource sharing" is the way we are going to do this
... a question: what do people think about privacy concerns?

james: thought of some preferences, e.g. color-blindness, which you might not want to convey via your browser

dave: very important aspect

matt: heuristics can be used to determine whether a user uses a screen reader

<plh> --> http://www.w3.org/2009/10/dws-access-workshop.ppt Dave's slides

matt: even flash is no guarantee that information is not conveyed

dave: platform integration is a problem, e.g. if the screenreader does not find a play-button
... scripting does not integrate well with platform specific heuristics to find things

janina's presentation

janina: important that API can access control information
... what is the default list of controls being exposed
... if it is left to developer to come up with controls, there will be a small set of acc. controls

doug: I am editor of DOM 3 events
... hardware controls are a way of bypassing the problem

james: having video controls standardized methods would help

<silvia> http://www.marcozehe.de/2009/06/11/exposure-of-audio-and-video-elements-to-assistive-technologies/

james: system could have its own key etc. usages

silvia: there is acc. of controls by shortcuts implemented by browser vendors
... important for all users, not only acc.
... through javascript interfaces there is also access to buttons
... that kind of control is on the roadmap for firefox

@@: for webkit as well

janina: have browser folks looked into structural navigation?
... e.g. subscene to subscene, ...
... could be very important in a 3-hour physics presentation

silvia: the @@@@ website works in this, that is navigation markers
... not hierarchical yet, nobody has looked into that yet

janina: DAISY did that to some extent

silvia: if we have time-aligned text tracks, that can go back into video
... both search and a structural overview are important scenarios

dick: producer and consumer aspect are important, latter e.g. user annotations

dave: likely that we will get automated scene detection etc., not manual annotation

janina: there are also use cases for manual creation

john: needs to be a method to do the automatic way too

dick: in SMILtext, people used links in captions

Sally_Cain, RNIB, Audio Description

<chaals> [+1 for IPTV and other TV-based standards being important liaisons to keep in mind]

sally: doing audio description, but also looking into IPTV
... audio description often underrepresented, but is very important
... our broadcasting team says, audio description is mixed in content, or separate
... think that separation is better, or at least awareness that sometimes it is mixed
... using existing guidance from here is good, but how do we integrate work in ISO or ETSI?
... routes of delivery are various, e.g. eTest, eAssignments etc. These also need to be taken into account

Hironobu_Takagi, IBM

takagi-san: work is part of japanese government program
... only 0.5% of TV programs have audio descriptions
... huge expectation to provide more audio description also on the web
... cost is most important in audio description
... requires special expertise, human skilled narrator
... text-to-speech has become very good today

<chaals> [impressive demo of emotionally inflected Text to Speech]

takagi-san: project is NICT (Media Accessibility platform)

Takagi-san shows video describing the benefits of Media Access. platform

dave: all Text-to-speech is done automatically?

Takagi-san: yes, we just have language tagging
... we use also a kind of ruby
... we also plan to use emotional markup
... choices in audio description are: human prerecorded voice, prerecorded tts, server-side tts, client-side tts
... not sure yet what is most appropriate

<ChrisL2> http://www.w3.org/TR/emotionml/ Emotion Markup Language (EmotionML) 1.0

Takagi-san: content protection is another requirement

john: how do you do the synchronization?

Takagi-san: used windows time seeking functions from windows media player

silvia: you have high-quality speech synthesis
... could be a web-service: user sends a text description, turns that into audio, potentially on a different server

dick: source from TTS: where does it come from?

Takagi-san: from a script

dave: somebody typed in a scene description

dick: there are legal issues about changing the content flow e.g. for Disney

Takagi-san: yes, discussing always with content providers like Disney
... legal issue is really important
... that is why we work with Japanese government

janina: this is a problem to be solved on government level, not industry

short break, short discussion after that

wrap up discussion

dave: need a paper or something describing how to get good acc. into html5

michael: made some notes related to that, and about existing solutions
... many people said that there are non-acc. use cases as well

<chaals> Scribe: Chaals

MC: Need to gather use cases and requirements
... gather the non-accessibility use cases
... and the existing solutions.
... There are proposals about technology - which are proposals, which are solutions
... How do we make sure that the technology makes it possible to meet WCAG requirements (not limited to them, but at least getting to that level)

[MC is staff contact for WCAG]

DS: Non-accessibility uses: I am a bit hard of hearing, and there will be stuff I cannot catch. I rewind, mute, and watch the captions. Would like to be able to call the captions for the last section in parallel. Is that a non-accessibility?

JS: Another one is the ability to slow audio as a way to increase comprehensibility

JB: Don't think that is a non-accessibility use case.

JC: The semantics of "accessibility" is a bit in flux.
... preference for low-bandwidth - is that accessibility?

<silvia> https://wiki.mozilla.org/Accessibility/Video_a11y_requirements

CMN: Question - where to do this? PF? HTML Accessibility TF? somewhere else? I suggest HTML accessibility task force

SP: Have written a requirements doc for Mozilla that would be a good basis.

[/me has a brief skim and thinks it is a very good basis to steal things from]

Doug: Should we start from something like this?

MC: We need a "how to do accessibility" and then gap analysis of HTML and what it does now (and what it needs)

DB: We can comment on a document that does this from the SYMM group.

<dsinger> http://www.w3.org/WAI/PF/html-task-force

JS: Makes a lot of sense that we take the work in the HTML Accessibility Task Force.

MC: Concerned that the HTML Accessibility task force has a heavy load.
... concerned that the current make-up of the task force lacks video expertise

JS: Maybe a group within that group would take the work.

Doug: People copy code. Maybe we should also look at the things that people can already do in HTML 5 and show how to do what is already possible.

SP: There is a fear that at the next level of HTML5 there is no way we can put more into it. How serious is the prospect of a complete freeze?

<shepazu> creating tutorials also helps discover gaps in functionality

IH: The consideration is not serious - the spec will continue to evolve. The rate-limiter is how fast the browsers implement. It might be that we make a feature-freeze and the work goes into HTML 5+some ... right now the video implementations are still flaky, and interoperability is still poor. By the time that is solved, I see no reason not to add better stuff.

CL: Heard you say that you have code on a private branch. Is there a chicken-and-egg problem?

SP: The trial implementation is available to influence the spec already.

[Process discussion about our process]

<plh> --> http://www.w3.org/2009/10/VideoA11yWorkshop.txt Silvia's slides

JC: Even if we are going with e.g. itext, should we consider requirements on media formats (e.g. the need for dealing with some external captions and some internal captions)?

[several]: Yes

CMN: Is there anyone prepared to take on the editing of a requirements and use cases document?

JB: Think we should follow the idea of working in the HTML access task force

DaveS: Think we should try to make it happen there.

MC: This means we need people to join the task force, which means being a member of HTML-WG

SP: I've done requirements work and implementation for itext and tried to validate it. So far still seems good, but I am happy that this goes to others looking and figuring out if it is what we need, or what we need to change.

<dsinger> ACTION: everyone to look at joining the HTML accessibility TF [recorded in http://www.w3.org/2009/11/01-media-minutes.html#action02]

<janina> HTML A11y TF is at:

<janina> http://www.w3.org/WAI/PF/html-task-force

<dsinger> ACTION: everyone to review Silvia's requirements document [recorded in http://www.w3.org/2009/11/01-media-minutes.html#action03]

SP: think itext seems to be going down the right track, but will be making sure that I am not just going off on my own and ignoring what others want and do.

JB: Right. I am concerned about fragmentation - but that is what we keep an eye on in the process of going forward.

DB: Sounds like so far you are happy with what you have - and agree that it's nice. The part of the process that brings things to working groups for comment I want to be sure that there is a way to comment and say changes are needed, if that should be the case.
... also worried about what we do with SRT etc, who owns the problem, how it fits in with re-using the technology, etc.

SP: Sure. This is happening in the HTML WG.

<Zakim> MichaelC, you wanted to say if we want video to be a focused group within HTML accessibility task force, need all the interested people to join (call for participation should go

KH: I'm new to the process, so...
... It sounded like Ian wanted something implemented before it could be accepted as part of the spec.
... So the process is "have an idea, implement, get others to implement". But if we implemented it, do we not have to get anyone else to do so?

IH: Stuff in the spec has to be implemented better before we get to adding more stuff. It definitely helps to have experimental implementations to decide what should be the standards. In practice, they get shipped, people start relying on it, and then we are stuck with not breaking that. Which isn't ideal, but the reality is somewhere in the middle

DSing: We don't want the specs or implementations to get too far ahead of each other.

Doug: If a few pages do something weird, we can insist they change...

IH: Depends on who/what they are.
... poster child was canvas. We found some serious problems, and we had to figure out how to change it. We changed some of it, but then we found more that would cause huge problems to fix it further, and we were kind of stuck with this.

PLH: I don't believe HTML5 can move forward without addressing video accessibility. We have to figure that out. Best we can do is review proposals and provide feedback, not just follow the first implementation and bless it without thinking.
... But it will not be acceptable to simply move forward without getting it right.

DaveS: John and I are volunteering to take responsibility for video accessibility within the HTML Accessibility task force at least for now - chase people into it, get documents together, etc.

<scribe> ACTION: Silvia to put links to existing content into the task force wiki [recorded in http://www.w3.org/2009/11/01-media-minutes.html#action04]

<scribe> ACTION: DaveS and JohnF to take responsibility to drive this work into existence in the HTML Accessibility Task Force [recorded in http://www.w3.org/2009/11/01-media-minutes.html#action05]

THANKS to John, and to Dave, for making it happen.


Summary of Action Items

[NEW] ACTION: DaveS and JohnF to take responsibility to drive this work into existence in the HTML Accessibility Task Force [recorded in http://www.w3.org/2009/11/01-media-minutes.html#action05]
[NEW] ACTION: everyone to look at joining the HTML accessibility TF [recorded in http://www.w3.org/2009/11/01-media-minutes.html#action02]
[NEW] ACTION: everyone to review Silvia's requirements document [recorded in http://www.w3.org/2009/11/01-media-minutes.html#action03]
[NEW] ACTION: John to provide a link that shows us some of how the stanford system works and looks for users [recorded in http://www.w3.org/2009/11/01-media-minutes.html#action01]
[NEW] ACTION: Silvia to put links to existing content into the task force wiki [recorded in http://www.w3.org/2009/11/01-media-minutes.html#action04]
[End of minutes]

Minutes formatted by David Booth's scribe.perl version 1.135 (CVS log)
$Date: 2009/11/02 16:53:40 $
