15620 – <track> WebVTT: Rename "rules for its interpretation" to something more explicit

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Status:	RESOLVED FIXED

Alias:	None

Product:	WHATWG
Classification:	Unclassified
Component:	HTML
Version:	unspecified
Hardware:	PC All

Importance:	P2 normal
Target Milestone:	Unsorted
Assignee:	Ian 'Hixie' Hickson
QA Contact:	contributor

Reported:	2012-01-19 04:34 UTC by Silvia Pfeiffer
Modified:	2013-03-18 19:43 UTC
CC List:	5 users

Description Silvia Pfeiffer 2012-01-19 04:34:37 UTC

The WebVTT specification in its current format requires a WebVTT cue text parser to generate a tree of WebVTT node objects. In the case of WebVTT files that contain kind="metadata" data such parsing is not appropriate and the parser should not be required to build the VTT node tree. We don't currently offer a means in the WebVTT spec to obtain the original unparsed cue text.

Since WebVTT files should be usable independent of whether they are linked from an HTML <track> element or associated with a video through some other means, it would be best to have the @kind information inside the VTT file itself. For example, if somebody wants to use a metadata track with JSON, it's not appropriate that the parser create a node tree. We want authors to be able to create cue text for kind="metadata" WebVTT files without having to escape their custom content.

Comment 1 Ian 'Hixie' Hickson 2012-04-25 22:20:58 UTC

I'm not sure what the issue here is. The two paragraphs above seem like they're describing separate issues. Can you elaborate on what issue this bug is specifically about?

Comment 2 Silvia Pfeiffer 2012-04-30 00:59:54 UTC

It's about the problem of generating a tree of WebVTT node objects for @kind=metadata resources as described in the first paragraph. The need for a kind metadata element is covered in bug 15851. The JSON example explains the problem described in the first paragraph. Parsing JSON into a WebVTT node object tree will likely result in utter rubbish.

Comment 3 Ian 'Hixie' Hickson 2012-07-24 05:51:56 UTC

Why would you parse the cue from a metadata track into the tree if it wasn't in the WebVTT markup format? I don't understand the problem here at all. Could you elaborate on what specific requirement in the specification would need to change to alleviate the problem you describe? Maybe that would help me understand what the problem is.

Comment 4 Philip Jägenstedt 2012-07-24 08:50:54 UTC

(In reply to comment #0)
> The WebVTT specification in its current format requires a WebVTT cue text
> parser to generate a tree of WebVTT node objects. In the case of WebVTT files
> that contain kind="metadata" data such parsing is not appropriate and the
> parser should not be required to build the VTT node tree. We don't currently
> offer a means in the WebVTT spec to obtain the original unparsed cue text.

cue.text returns the unparsed text, right?

Comment 5 Silvia Pfeiffer 2012-07-25 03:43:29 UTC

The WebVTT parser specification at
http://dev.w3.org/html5/webvtt/#webvtt-parser-algorithm
states:

"47. Cue text processing: Let the text track cue text of cue be cue text, and let the rules for its interpretation be the WebVTT cue text parsing rules, the WebVTT cue text rendering rules, and the WebVTT cue text DOM construction rules."

I thought it implied that the WebVTT cue text parsing rules and the DOM construction rules *have* to be applied. However, it just says "let the rules for its interpretation be...".

The WebVTT syntax description specifies three different types of WebVTT cue payload: http://dev.w3.org/html5/webvtt/#cue-payload 

It's not explicitly stated that the application of the parsing rules depends on the type of WebVTT cue payload - it's this connection that I'm missing.

It would help to clarify that the cue text parsing rules http://dev.w3.org/html5/webvtt/#webvtt-cue-text-parsing-rules , DOM construction http://dev.w3.org/html5/webvtt/#webvtt-cue-text-dom-construction-rules and the cue text rendering rules http://dev.w3.org/html5/webvtt/#webvtt-cue-text-rendering-rules only apply to "WebVTT cue text" types of payload.

Comment 6 Ian 'Hixie' Hickson 2012-08-06 22:25:49 UTC

(In reply to comment #5)
> The WebVTT parser specification at
> http://dev.w3.org/html5/webvtt/#webvtt-parser-algorithm
> states:
> 
> "47. Cue text processing: Let the text track cue text of cue be cue text, and
> let the rules for its interpretation be the WebVTT cue text parsing rules, the
> WebVTT cue text rendering rules, and the WebVTT cue text DOM construction
> rules."
> 
> I thought it implied that the WebVTT cue text parsing rules and the DOM
> construction rules *have* to be applied. However, it just says "let the rules
> for its interpretation be...".

There is no implication in a spec. Don't read between the lines. :-)

The spec says what it means, and no more. If we were to start requiring that implementations do things that the spec doesn't explicitly require, then we'd end up in the 90s again with specs like HTML4 and with no interoperability.


> The WebVTT syntax description specifies three different types of WebVTT cue
> payload: http://dev.w3.org/html5/webvtt/#cue-payload 
> 
> It's not explicitly stated that the application of the parsing rules depends on
> the type of WebVTT cue payload - it's this connection that I'm missing.

There is no connection. The payload types are only authoring conformance requirements, they have no impact on user agents.


> It would help to clarify that the cue text parsing rules
> http://dev.w3.org/html5/webvtt/#webvtt-cue-text-parsing-rules , DOM
> construction
> http://dev.w3.org/html5/webvtt/#webvtt-cue-text-dom-construction-rules and the
> cue text rendering rules
> http://dev.w3.org/html5/webvtt/#webvtt-cue-text-rendering-rules only apply to
> "WebVTT cue text" types of payload.

The rules associated with parsing cue text are only used when the spec says they are used, specifically in two places: getCueAsHTML() and determining chapter names. The rules for rendering are used by the rendering requirements, if I recall correctly (and those rules, for WebVTT, apply the WebVTT rules directly without checking what the associated rules are, since they'd never apply if it wasn't WebVTT.)


I'm not really sure what it is you want the spec to say here.

Comment 7 Silvia Pfeiffer 2012-08-23 05:43:21 UTC

I'm not suggesting any change to the HTML spec - I was only talking about the WebVTT spec, in particular when used independently of the HTML spec.

Right now the WebVTT spec explicitly says:
"A WebVTT parser, given an input byte stream and a text track list of cues output, must decode the byte stream as UTF-8, with error handling, and then must parse the resulting string according to the WebVTT parser algorithm below."

Since not every WebVTT parser needs to do this, we need to be more careful here. We don't have a cue.text property or cue.getCueAsHTML() function in the WebVTT spec.


I've now captured the first half of the request in a separate bug 18657 , asking for Kind=xxx WebVTT metadata.

As a consequence of this, we should also mention that the parsing in the cue text parsing rules
http://dev.w3.org/html5/webvtt/#webvtt-cue-text-parsing-rules
only apply if you are dealing with a Kind of "Subtitles", "Captions", or "Chapters". For "Descriptions" or "Metadata" we need a simpler parsing algorithm that just pulls out the text.

Comment 8 Ian 'Hixie' Hickson 2012-10-26 23:38:35 UTC

(In reply to comment #7)
> I'm not suggesting any change to the HTML spec - I was only talking about
> the WebVTT spec, in particular when used independently of the HTML spec.

They're the same spec, as far as I'm concerned. The WebVTT section of the HTML spec just happens to be published separately at the moment for political reasons.


> Right now the WebVTT spec explicitly says:
> "A WebVTT parser, given an input byte stream and a text track list of cues
> output, must decode the byte stream as UTF-8, with error handling, and then
> must parse the resulting string according to the WebVTT parser algorithm
> below."
> 
> Since not every WebVTT parser needs to do this

Yes they do. What WebVTT parser wouldn't need to do this? I'm confused.


> As a consequence of this, we should also mention that the parsing in the cue
> text parsing rules
> http://dev.w3.org/html5/webvtt/#webvtt-cue-text-parsing-rules
> only apply if you are dealing with a Kind of "Subtitles", "Captions", or
> "Chapters". For "Descriptions" or "Metadata" we need a simpler parsing
> algorithm that just pulls out the text.

The cue text parsing rules apply to whoever wants to get cue text out of the cues. It can just as easily apply to Metadata tracks, Description tracks, or indeed types of tracks that <track> doesn't know about. I don't understand the request here.

Comment 9 Silvia Pfeiffer 2012-10-27 00:02:57 UTC

(In reply to comment #8)
>
> > As a consequence of this, we should also mention that the parsing in the cue
> > text parsing rules
> > http://dev.w3.org/html5/webvtt/#webvtt-cue-text-parsing-rules
> > only apply if you are dealing with a Kind of "Subtitles", "Captions", or
> > "Chapters". For "Descriptions" or "Metadata" we need a simpler parsing
> > algorithm that just pulls out the text.
> 
> The cue text parsing rules apply to whoever wants to get cue text out of the
> cues. It can just as easily apply to Metadata tracks, Description tracks, or
> indeed types of tracks that <track> doesn't know about. I don't understand
> the request here.

Are you saying that the WebVTT cue text parsing rules, the WebVTT cue text rendering rules, and the WebVTT DOM construction rules are optionally applied to tracks? If so, we should say so.

However, I read:
"To parse a string input supposedly containing WebVTT cue text, user agents must use the following algorithm."

This indicates to me that software that implements support for WebVTT must parse cue text using the WebVTT cue text parsing rules. This means for example that if a metadata track uses XML markup, any tag that does not fall within the given set for WebVTT cues (c, v, i, b, u etc) is dropped during parsing.
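The concern about XML markup can be illustrated with a rough sketch. It is not the spec's algorithm: the set of recognized tag names is an assumption taken from the list quoted above (plus ruby and rt), the helper name `drop_unknown_tags` is hypothetical, and details such as timestamp tags, end-tag matching, and character references are ignored.

```python
import re

# Assumption: the tag names mentioned in the thread ("c, v, i, b, u etc"),
# plus ruby/rt. The authoritative set is in the cue text parsing rules.
KNOWN_TAGS = {"c", "v", "i", "b", "u", "ruby", "rt"}

# Group 1: optional "/", group 2: tag name (up to whitespace, "." or ">").
TAG = re.compile(r"<(/?)([^\s>.]*)[^>]*>")


def drop_unknown_tags(cue_text: str) -> str:
    """Keep recognized tags, discard the rest, and keep all text between tags."""
    def keep_or_drop(match):
        return match.group(0) if match.group(2) in KNOWN_TAGS else ""
    return TAG.sub(keep_or_drop, cue_text)


# Hypothetical XML payload that a metadata track might carry.
xml_payload = '<ad id="42"><title>Buy <b>now</b></title></ad>'
filtered = drop_unknown_tags(xml_payload)
print(filtered)
```

Under these assumptions the custom `ad` and `title` elements vanish while the text and the `b` tag survive, which is the loss of structure the comment describes.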

Comment 10 Philip Jägenstedt 2012-10-29 08:48:53 UTC

(In reply to comment #9)
> (In reply to comment #8)
> >
> > > As a consequence of this, we should also mention that the parsing in the cue
> > > text parsing rules
> > > http://dev.w3.org/html5/webvtt/#webvtt-cue-text-parsing-rules
> > > only apply if you are dealing with a Kind of "Subtitles", "Captions", or
> > > "Chapters". For "Descriptions" or "Metadata" we need a simpler parsing
> > > algorithm that just pulls out the text.
> > 
> > The cue text parsing rules apply to whoever wants to get cue text out of the
> > cues. It can just as easily apply to Metadata tracks, Description tracks, or
> > indeed types of tracks that <track> doesn't know about. I don't understand
> > the request here.
> 
> Are you saying that the WebVTT cue text parsing rules, the WebVTT cue text
> rendering rules, and the WebVTT DOM construction rules are optionally
> applied to tracks? If so, we should say so.
> 
> However, I read:
> "To parse a string input supposedly containing WebVTT cue text, user agents
> must use the following algorithm."
> 
> This indicates to me that software that implements support for WebVTT must
> parse cue text using the WebVTT cue text parsing rules. This means for
> example that if a metadata track uses XML markup, any tag that does not fall
> within the given set for WebVTT cues (c, v, i, b, u etc) is dropped during
> parsing.

When inspected via TextTrackCue.getCueAsHTML(), yes, but the original text is available in TextTrackCue.text so that one can feed it into an XML/JSON/whatever parser.
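As a minimal sketch of that idea outside a browser, the following pulls the raw payload of each cue out of a WebVTT string and hands it to a JSON parser untouched, much like reading TextTrackCue.text. The sample file and the helper `raw_cue_payloads` are hypothetical, and the block splitting is a deliberate simplification rather than the WebVTT parser algorithm.

```python
import json

# Hypothetical metadata track whose cue payloads are JSON objects.
SAMPLE = """WEBVTT

00:00:01.000 --> 00:00:04.000
{"team": "home", "score": 1}

00:00:05.000 --> 00:00:09.000
{"team": "away", "score": 2}
"""


def raw_cue_payloads(vtt: str) -> list:
    """Return (timing line, raw payload) pairs without interpreting the payload."""
    cues = []
    # Simplification: blocks are separated by blank lines; the first is the header.
    blocks = vtt.replace("\r\n", "\n").strip().split("\n\n")
    for block in blocks[1:]:
        lines = block.split("\n")
        # An optional cue identifier line may precede the timing line.
        for i, line in enumerate(lines):
            if "-->" in line:
                cues.append((line, "\n".join(lines[i + 1:])))
                break
    return cues


# The payload goes to the JSON parser as-is; no cue text parsing is applied.
scores = [json.loads(payload) for _, payload in raw_cue_payloads(SAMPLE)]
print(scores)
```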

Comment 11 Silvia Pfeiffer 2012-10-29 13:30:27 UTC

(In reply to comment #10)
>
> When inspected via TextTrackCue.getCueAsHTML(), yes, but the original text
> is available in TextTrackCue.text so that one can feed it into an
> XML/JSON/whatever parser.

Agreed, but that's in the HTML spec and the WebVTT spec doesn't mention this possibility at all.

Comment 12 Ian 'Hixie' Hickson 2012-10-29 21:48:03 UTC

This is why they shouldn't be artificially split into two specs.

Comment 13 Silvia Pfeiffer 2012-10-29 23:26:25 UTC

Even if they were in the same spec, it'd be nice if it was explicitly stated that it was allowed to offer the content of a cue without applying the parser. This is right now a possibility that you have to assume when you read "TextTrackCue.text". The WebVTT spec doesn't mention this possibility at all - in fact it downright denies that use by saying:

"To parse a string input supposedly containing WebVTT cue text, user agents must use the following algorithm."

I'm only asking for a small amendment.

Comment 14 Philip Jägenstedt 2012-10-30 08:39:40 UTC

But what could a non-browser implementation of WebVTT do with a metadata track, which is what we're talking about here? It can't run the scripts that are needed to interpret the data...

Comment 15 Silvia Pfeiffer 2012-10-30 08:44:30 UTC

It can have its own scripts/processing of the data - it doesn't have to be a Browser that has to interpret the cue text. It's simply a means to transport time-aligned data and it's up to the player to make something of it.

For example, I can imagine the QuickTime player interpreting ads delivered in a WebVTT track with an MP4 file. Or a sports scoreboard.

Comment 16 Philip Jägenstedt 2012-10-30 09:05:12 UTC

(In reply to comment #15)
> It can have its own scripts/processing of the data - it doesn't have to be a
> Browser that has to interpret the cue text. It's simply a means to transport
> time-aligned data and it's up to the player to make something of it.
> 
> For example, I can imagine the QuickTime player interpreting ads delivered
> in a WebVTT track with an MP4 file. Or a sports scoreboard.

If such a file uses the QuickTime container to encode the timing (not WebVTT timestamp strings) and uses a custom format (not the WebVTT cue text syntax), in what sense is that still WebVTT?

Comment 17 Silvia Pfeiffer 2012-10-30 09:57:56 UTC

(In reply to comment #16)
>
> If such a file uses the QuickTime container to encode the timing (not WebVTT
> timestamp strings) and uses a custom format (not the WebVTT cue text
> syntax), in what sense is that still WebVTT?

Oh no, I wasn't talking about encoded inside a QuickTime container. I was just talking about the QuickTime player, which would receive a media file and synchronize a WebVTT file to it.

Comment 18 Philip Jägenstedt 2012-10-30 10:19:07 UTC

Oh, OK. If it has built-in special handling of some WebVTT files, how would it know how to interpret an arbitrary WebVTT file given to it? Is this a hypothetical scenario, or is some non-Web player actually interested in using WebVTT in this way?

Comment 19 Silvia Pfeiffer 2012-10-30 10:47:32 UTC

(In reply to comment #18)
> Oh, OK. If it has built-in special handling of some WebVTT files, how would
> it know how to interpret an arbitrary WebVTT file given to it? Is this a
> hypothetical scenario, or is some non-Web player actually interested in
> using WebVTT in this way?

That's the subject of another bug: https://www.w3.org/Bugs/Public/show_bug.cgi?id=18657

It's not hypothetical, after all: WebVTT is not just a file format for captions, but for other types of timed text content, too. Other players want to be able to interpret WebVTT files that contain metadata, descriptions or chapters, too.

Comment 20 Philip Jägenstedt 2012-10-30 12:40:54 UTC

To go slightly off-topic, I disagree that WebVTT is just another captioning format, it's a format developed specifically to be part of the Web platform by taking advantage of other parts of the platform, in particular for metadata tracks. When catering to a non-Web use case would make the format more complicated in any way, my vote is on not solving that use case. (We've had this discussion several times before and apparently aren't going to agree.)

As for the possible clarifications about parsing, I'm fine as long as the exact same rules apply to metadata tracks and non-metadata tracks, since that is already the case and the kind can be modified dynamically.

Comment 21 Ian 'Hixie' Hickson 2012-10-30 17:38:31 UTC

I'm with foolip on this. If a vendor wants to use WebVTT for their own proprietary magic that's fine, but it's not WebVTT, it's WebVTT+their extensions, and the spec text for that belongs in the spec for their extensions.

I don't see that there's anything to spec here.

Comment 22 Silvia Pfeiffer 2012-10-31 00:10:32 UTC

(In reply to comment #21)
> I'm with foolip on this. If a vendor wants to use WebVTT for their own
> proprietary magic that's fine, but it's not WebVTT, it's WebVTT+their
> extensions, and the spec text for that belongs in the spec for their
> extensions.

Agreed. It's, however, not what I have asked for.


(In reply to comment #20)
> When catering to a non-Web use case would make the format more complicated in
> any way, my vote is on not solving that use case.

I am not asking to make the format more complicated at all.

I am instead asking to be able to take the WebVTT spec as a stand-alone spec that does not require understanding the rest of the Web platform to be able to read it and interpret it correctly. I continue to come across issues in the spec where you say: no, it's not explained in this spec but in this other spec over here which we generally reference.

This is very frustrating.

I want to be able to point video accessibility developers at the WebVTT spec and have them rely on what is stated in there as being essentially complete. Yes, there are some links back to relevant sections in the HTML spec, which is ok, but we can't spec something in the WebVTT spec that we contradict in the HTML spec.

This is the case here.

The WebVTT spec says to ONLY interpret cue text with the cue text parsing rules - there is no statement that other interpretations are possible. In contrast, the HTML spec (implicitly) says that there is a way to get to the cue text without parsing it.

My concerns would be addressed by a simple clarification.

For example, if we changed step 47 in: http://dev.w3.org/html5/webvtt/#parsing

from

47. Cue text processing: Let the text track cue text of cue be cue text, and let the rules for its interpretation be the WebVTT cue text parsing rules, the WebVTT cue text rendering rules, and the WebVTT cue text DOM construction rules.

to something like

47. Cue text processing: Let the text track cue text of cue be cue text. The rules for interpretation of the cue text depend on the use cases of the player. HTML exposes the plain cue text, makes use of the WebVTT cue text parsing rules and the WebVTT cue text DOM construction rules to expose a DocumentFragment constructed from the cue text, and makes use of the WebVTT cue text rendering rules to render the DocumentFragment for WebVTT files of kind subtitle and captions.

Maybe we can even add a statement of extensibility:
Other specifications may decide to introduce other rules for the parsing and rendering of WebVTT cue text of different WebVTT kinds.

Comment 23 Ian 'Hixie' Hickson 2012-10-31 19:55:15 UTC

"the rules for interpretation" of the text of _all_ WebVTT cues is "the WebVTT cue text parsing rules", whether it's a chapter track, caption track, metadata track, or whatever. _All that means_ is that when the track is handled by e.g. the rendering section, or getCueAsHTML(), it uses the WebVTT cue text parsing rules; and if the track is a TTML track, then it has different parsing and interpretation rules. This has nothing to do with anything else.

Would it help if we called it something else? The "rules for interpreting the cue in case you have to render the cue or convert it to HTML for whatever reason", rather than just the "rules for interpreting the cue"?
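One way to picture this reading is a cue that stores its raw text together with a reference to its rules, where the rules run only when a rendering or conversion path asks for them. This is an illustrative sketch, not spec text: the `Cue` class, its `rendered` method, and the tag-stripping stand-in `webvtt_plain_text` are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Callable


def webvtt_plain_text(cue_text: str) -> str:
    """Crude stand-in for the WebVTT cue text parsing rules: strip all tags."""
    return re.sub(r"<[^>]*>", "", cue_text)


@dataclass
class Cue:
    text: str                    # raw cue text, always stored as-is
    rules: Callable[[str], str]  # associated rules, not applied on creation

    def rendered(self) -> str:
        """Only this path (rendering or conversion) consults the rules."""
        return self.rules(self.text)


cue = Cue(text='{"note": "<b>goal</b>"}', rules=webvtt_plain_text)
print(cue.text)        # raw payload, untouched by the rules
print(cue.rendered())  # rules applied only on request
```

A consumer that reads only `cue.text` never triggers the rules, so the payload format is irrelevant to it; a track in another format would simply carry a different `rules` callable.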


> I am instead asking to be able to take the WebVTT spec as a stand-alone spec
> that does not require understanding the rest of the Web platform to be able to 
> read it and interpret it correctly.

I do not share this goal. WebVTT is part of the Web platform, hence its name.

You can't interpret WebVTT without reading this and what it relies on:

   http://www.whatwg.org/specs/web-apps/current-work/#text-track-cue

I don't really understand what the confusion is here, though. Why would this affect standalone implementors in the slightest? I don't understand what sequence of events makes this ambiguous for them.

Comment 24 Silvia Pfeiffer 2012-10-31 21:06:35 UTC

(In reply to comment #23)
> Would it help if we called it something else? The "rules for interpreting
> the cue in case you have to render the cue or convert it to HTML for
> whatever reason", rather than just the "rules for interpreting the cue"?

Yes, that would make a big difference. As it stands, it looks like these rules are the only and exclusive way of interpreting cues.

Comment 25 Ian 'Hixie' Hickson 2012-10-31 22:14:00 UTC

Ok, I'll name it something more explicit.

Comment 26 Silvia Pfeiffer 2013-03-17 23:00:54 UTC

"rules for rendering the cue in isolation" were added in
http://html5.org/tools/web-apps-tracker?from=7747&to=7748
to fix this.

I am still struggling with this term.

I think it's because it includes both the "DOM construction" and the "rendering" of chapter cues.

In contrast, "rules for updating the display of WebVTT text tracks", which is for rendering of cues with video, focuses only on the rendering.

Would it be ok if I rearranged that a bit for chapters in the WebVTT spec and possibly renamed it (away from the confusing "rendering")?

Comment 27 Silvia Pfeiffer 2013-03-17 23:05:05 UTC

I should probably add that I expect chapters to be rendered by default in menus or navigation markers as part of the video controls. Just like any other TextTrackCue, however, a Web dev could render them also anywhere else on the page.

Comment 28 Ian 'Hixie' Hickson 2013-03-18 19:43:07 UTC

I don't understand what you are proposing.

The rules are only needed for UAs. There are two kinds of rules, rules for rendering cues together on a video, and rules for rendering cues in isolation. What's the problem?