W3C

– DRAFT –
Media & Entertainment IG meeting

06 June 2023

Attendees

Present
Andreas_Tai, Chris_Lorenzo, Chris_Needham, Ewan_Roycroft, Francois_Daoust, Hisayuki_Ohmata, igarashi, Kazuyuki_Ashimura, Kinji_Matsumura, Nigel_Megitt, Ryo_Yasuoka, Tatsuya_Igarashi
Regrets
-
Chair
ChrisL, ChrisN, Igarashi
Scribe
cpn, kaz

Meeting minutes

Introduction

ChrisN: Main topic is DAPT
… Some discussion on the group charter with the APA WG
… expected collaboration on media requirements
… discussion during TPAC in September

ChrisN: Charter should go to the AC soon, I hope

Kaz: I'll bring it to the team strategy meeting next Tuesday

Dubbing and AD Profile of TTML2

<kaz> Dubbing and Audio description Profiles of TTML2 WD

Nigel: No slides, but here's the document
… This is a requirements document
… This is work on an exchange format for workflows for producing dubbing scripts and audio description scripts
… It defines in an exchangeable document the mixing instructions to produce a version of the video with AD mixed in
… Wanted to get this work going for a number of years, we had a CG with not enough momentum
… Combining the AD and dubbing work created more momentum
… This is a profile of TTML2. TTML2 provides tools for timed text documents, it has lots of features, e.g., styling for captions or subtitles
… We don't particularly use those styling features, but we do use the semantic basis of TTML, which makes the work of creating this spec a lot easier
… The use cases are in the requirements document from 2022
… The spec document is on the Rec track, in TTWG, intended to meet the requirements in the requirements doc
… The diagram shows where it fits in. An audio program with key times. Then you might be creating a description of images, a script that describes what's in the video at relevant times
… Or you might transcribe then translate it, to create a dubbing script
… In both cases, you have a pre-recording script
… You might record it with actors, or use text-to-speech
… Then create instructions for mixing
… You end up with a manifest file that includes everything spoken and times, with links to recorded audio and mixing instructions

ChrisN: Is this mainly used in a production setting, what about for playback in a client?

Nigel: In dubbing workflows, what localisation teams do in the production domain is send the media to someone to do the dubbing
… The script is useful when making edits. They also send it to a subtitling house, to create a subtitle file that would be presented to the audience
… The language translation differs significantly between dubbing and subtitles
… If the words don't match, it's terrible. Cyril Concolato is the co-editor, he described having this experience
… He couldn't have the subtitles with the dubbed audio, not a good experience
… Once you have the translated timed text, you can send it to be turned into subtitles
… Then it's about changing styling, showing the right number of words at a time, shot changes, etc
… Because they have a common source for translation, you don't have to pay to get that done twice
… If you have the script and mixing instructions available, in cases where people can't see the image, you can have client side rendering of the AD
… That allows you to change the relative balance of AD and programme audio
… If you have the text available, you don't have to render as audio, it could be a Braille display
… They get the description of the video through the fingers, and dialog through their ears
… There's cognitive load using AD, to distinguish the description from the dialog
… Some people can't hear at all, so having all the video description available as text would help people who can't see or hear
… Then you get a reasonable description of the entire program

Nigel: We have a working draft document, steadily being updated
… DAPT is the spec document. The intent is that the user can understand how it works without being an expert in TTML
… We use TTML for the underlying structures. We expect it to be useful for tool implementers and player implementers, rather than for people creating documents by hand in a text editor, which would be hard work
… Transcripts of pre-existing media and Scripts for media to be created
… Let's look at some examples
… It's an XML document with metadata in the head, and body with main content
… You could create empty <div> elements in the body. TTML has ideas in common with HTML, like head, body, p
… From an AD, you have a p with start and end time, describing something in the video image
… Care has been taken to be clear about language and the source of language. Important to know what state we're in
… It uses the xml:lang attribute and a content profile designator. Example of a transcript of the original language audio in the programme
… In the AD example, the source of the language is original, so it's not a translation
… If I create an audio recording of this description, say clip3.wav, I can reference it with an audio element
… So there's a paragraph with an animated gain value, going from 1 to 0.39
… This is commonly used in AD, to lower the programme audio volume before the AD enters, then return the gain to 1
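[Editor's illustration, not taken from the draft: roughly what such a script event might look like, with namespace declarations omitted. The element nesting, timings and the ducking semantics are approximations; only the clip3.wav name and the 1 to 0.39 gain values come from the discussion above.]

  <div xml:id="adEvent1" begin="25s" end="28s" xml:lang="en">
    <p>
      A woman in a red coat crosses the bridge.
      <!-- recorded rendering of this description -->
      <audio src="clip3.wav"/>
      <!-- lower the programme audio as the description starts, then restore it -->
      <animate begin="0s" end="0.5s" tta:gain="1;0.39" fill="freeze"/>
      <animate begin="2.5s" end="3s" tta:gain="0.39;1" fill="freeze"/>
    </p>
  </div>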
… Another example, we use a tta:speak attribute to direct a player to use text to speech
… You can include the audio as base64 encoded data. One challenge is identifying WAV audio, as it doesn't have a proper MIME type
… You can send them to a player for playback
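[Editor's illustration, not from the draft: the two playback options just mentioned, namespace declarations again omitted. tta:speak is named in the discussion; the attribute value, the embedded-audio markup and the base64 content are assumptions.]

  <!-- ask the player to synthesise this text rather than play a recording -->
  <p begin="30s" end="33s" tta:speak="normal">
    She pauses at the far side and looks back.
  </p>

  <!-- or embed the recorded audio in the document as base64 data;
       note WAV has no single well-established MIME type -->
  <audio>
    <source>
      <data type="audio/wave" encoding="base64">UklGRiQAAABXQVZFZm10IBAAAAABAAEA...</data>
    </source>
  </audio>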
… For dubbing, there's metadata for character names (and actor names), using existing TTML
… Once translated, the script type changes in the metadata, to show it as a translation
… The original can be kept, which is important to be able to refer back. Or for multiple translations
… A pivot language can be used to go from Finnish to English, then English to Hebrew, for example
… If you get words translated strangely, you can go back and adjust
… Get lip-sync timings right
… The single document gets worked on, updated to reflect the status
… As it's an open standard, as opposed to non-standard or proprietary formats, we hope to create a marketplace for interoperability between tools
… That's the benefit to having a W3C spec
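[Editor's illustration, not from the draft: the character metadata and script-type metadata described above. The ttm:agent markup is standard TTML; the daptm namespace, the daptm:scriptType attribute and its values are guesses at the draft's vocabulary, and the names and language are invented.]

  <tt xmlns="http://www.w3.org/ns/ttml"
      xmlns:ttm="http://www.w3.org/ns/ttml#metadata"
      xmlns:daptm="http://www.w3.org/ns/ttml/profile/dapt#metadata"
      xml:lang="fr"
      daptm:scriptType="translatedTranscript">
    <!-- scriptType would have indicated an original transcript before translation -->
    <head>
      <metadata>
        <!-- character and actor names, using existing TTML agent metadata -->
        <ttm:agent type="person" xml:id="actor1">
          <ttm:name type="full">Example Actor</ttm:name>
        </ttm:agent>
        <ttm:agent type="character" xml:id="character1">
          <ttm:name type="alias">NARRATOR</ttm:name>
          <ttm:actor agent="actor1"/>
        </ttm:agent>
      </metadata>
    </head>
    <!-- body with the translated script events -->
  </tt>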

Nigel: We have a data model (UML class diagram) in the spec
… A DAPT script has metadata, it can contain characters and styles
… We may remove styles applied to particular characters; there's debate about whether that needs to be in the document
… Script events are the timed entities. The three main things are: the script, which contains script events, which in turn contain text
… You can apply mixing instructions, audio recording or text to speech
… Those are the main entities. The rest of the spec describes them in detail
… It explains how the data model maps to the TTML, e.g., a <div> element with an xml:id
… A p element represents text. You can have styles and metadata
… You can have audio styles to trigger synthesised audio, describe if original language or translation
… The audio is designed to be implementable using Web Audio - not the text to speech; the Web Speech API isn't a Rec, and it's problematic to use here as typical implementations can't bring the speech into a Web Audio context
… It goes out directly via the OS
… But there are other ways to do it, e.g., online services where you post the text and they return audio
… There's a conformance section that describes formal stuff
… Because it's a TTML2 profile, there's formal stuff in the Appendices on what's required, and optional, and extension features
… TTML2 allows extensions, as "features" of specifications. This is helpful for designing CR exit criteria, as we know the features
… So conformance criteria, and tests, are easy to generate

<nigel> DAPT

<nigel> DAPT-REQS

Nigel: We're in working draft now, and next stage is Candidate Recommendation
… Getting wide review is important, so it's the perfect time to review and give feedback
… I think it's in a reasonable state to do that
… If you have questions or comments, please open an issue in GitHub, or comment on any open issues
… You can email the TTWG mailing list too, or directly with me or Cyril
… Next steps, we'll respond to all review comments, we'll be starting horizontal review soon, and sending liaisons to groups in the TTWG charter

ChrisN: Who are the external groups, and which companies?

Nigel: Some tool suppliers have been very positive about having an open standard format. They have a common problem in how to serialize their work
… If an organisation like BBC or Netflix wants to specify an AD or dubbing script as a deliverable from a provider, they can require the standard format
… The alternative, ingesting spreadsheets or CSVs, is painful

Kaz: Thank you, this is an interesting and useful mechanism
… I'm wondering about word order within sentences. Japanese has a different ordering: subject, object, verb
… So if we simply dub or translate English to Japanese, subtitles tend to split into smaller chunks, subject first
… So how to deal with that kind of word order difference?
… Second question: could this work with SSML as well as the current text spec, so a TTML engine can handle the text information well?

Nigel: On the second point, there's good advantage to using SSML. We have an issue (#121)
… It's a challenge to mix in with the audio content
… Somebody else working on a similar problem, spoken presentation in HTML, has found the same issue
… They considered different ways to proceed; multiple attributes or a single attribute
https://www.w3.org/TR/spoken-html/
… I'd be interested to know how to do that well

Kaz: I need to organise a workshop or event on voice. This should be good input, so let's work with the strategy team on that

Nigel: Yes, that would be good
… On your first question, some things are covered by TTML, e.g., different writing directions and embedding
… Japanese language requirements like ruby and text emphasis can be handled
… But your point is more about the structure of the language itself
… We didn't feel the need to say anything, as it just contains whatever the translator generates
… Is there more we need to do?

Kaz: I don't know; maybe you can use the begin and end attributes to specify the timing of the Japanese words to be played

Nigel: Yes, you could do that

Nigel: Nesting timed text elements inside these objects is permitted. p elements can contain spans, and spans can have timing
… That's important for adaptation
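[Editor's illustration of the nested timing just mentioned; the text and times are invented. In TTML, the span begin times are offsets within the parent paragraph.]

  <p begin="40s" end="44s">
    <span begin="0s" end="2s">First part of the line,</span>
    <span begin="2s" end="4s"> then the rest of it.</span>
  </p>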

ChrisN: This is really good progress, now ready for review

Nigel: I see people from Japan are here, if you have particular use cases, I would be interested in your review feedback
… Some things may not be obvious to me as a non-Japanese speaker

Kaz: We have people from NHK here today
… NHK Open House was on Sunday, some are working on sign language synchronised with TV content, they may be interested in this mechanism and modalities like sign language

Nigel: What may be attractive is that it's extensible; you could present the translation text to a sign interpreter

ChrisN: Also could be interesting from an object-based media point of view
… Anything to follow up?

Zoom call for the July meeting

Kaz: We need to switch to Zoom, so I'll allocate a call for the July meeting

[adjourned]

Minutes manually created (not a transcript), formatted by scribe.perl version 210 (Wed Jan 11 19:21:32 2023 UTC).