This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 13943 - <track> The "bad cue" handling is stricter than it should be
Summary: <track> The "bad cue" handling is stricter than it should be
Status: RESOLVED WONTFIX
Alias: None
Product: WHATWG
Classification: Unclassified
Component: HTML
Version: unspecified
Hardware: Other
OS: other
Importance: P3 blocker
Target Milestone: Unsorted
Assignee: Ian 'Hixie' Hickson
QA Contact: contributor
URL: http://www.whatwg.org/specs/web-apps/...
Whiteboard: webvtt
 
Reported: 2011-08-29 08:35 UTC by contributor
Modified: 2012-07-18 18:42 UTC
6 users

Description contributor 2011-08-29 08:35:04 UTC
Specification: http://www.whatwg.org/specs/web-apps/current-work/multipage/the-video-element.html
Multipage: http://www.whatwg.org/C#parsing-0
Complete: http://www.whatwg.org/c#parsing-0

Comment:
The "bad cue" handling is stricter than it should be

Posted from: 83.218.67.122 by philipj@opera.com
User agent: Opera/9.80 (X11; Linux x86_64; U; Edition Next; en) Presto/2.9.186 Version/12.00
Comment 1 Philip Jägenstedt 2011-08-29 08:38:35 UTC
http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2011-June/031916.html

In short, the following cue should not be skipped:

1
2
00:00.000 --> 00:01.000
Bla

Suggest changing steps 32-40 to not sniff for "-->" in the input but simply try to "collect WebVTT cue timings and settings" and, if that fails, treat the line as the cue identifier. This still allows us to add extensions before the identifier line, but not after it.
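The recovery proposed here could be sketched roughly as follows (an illustrative Python sketch, not spec text; the regex and function name are hypothetical): the parser keeps trying to read a timing line, and any earlier line that fails becomes the cue identifier, last one winning.

```python
import re

# Illustrative sketch of the proposed recovery (hypothetical, not the
# spec's algorithm): lines that fail to parse as cue timings are treated
# as the cue identifier, with the last such line winning.
TIMING = re.compile(
    r'(?:\d+:)?\d{2}:\d{2}\.\d{3}\s*-->\s*(?:\d+:)?\d{2}:\d{2}\.\d{3}')

def parse_cue_block(lines):
    """Return (identifier, timing line, payload lines), or None to skip."""
    identifier = None
    for i, line in enumerate(lines):
        if TIMING.match(line):
            return identifier, line, lines[i + 1:]
        identifier = line  # not a timing line: treat it as the identifier
    return None  # no timing line at all: drop the whole block

# The two-identifier cue from the report is no longer skipped:
cue = parse_cue_block(["1", "2", "00:00.000 --> 00:01.000", "Bla"])
assert cue == ("2", "00:00.000 --> 00:01.000", ["Bla"])
```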
Comment 2 Ian 'Hixie' Hickson 2011-08-29 16:55:30 UTC
Why shouldn't it be skipped?
Comment 3 Philip Jägenstedt 2011-08-30 07:57:47 UTC
It's a relatively easy authoring mistake to make while copy+pasting cues around, but noticing that a single cue has gone missing in the output is very difficult.

The current spec makes it impossible to extend the syntax in a backwards-compatible way. The only extension it allows is commenting out individual cues, but why would we want to do that per cue rather than commenting out larger blocks of text with /* */ or similar?

(We also don't drop the entire cue when there are unknown settings in the timing line or unknown/misnested tags in the cue text.)
Comment 4 Ian 'Hixie' Hickson 2011-09-10 04:27:44 UTC
It makes all kinds of extensions possible; for example, it allows configuration default blocks, etc., all by just having the syntax not include a valid time range in the second line of the block. Your proposal (which, if I understand correctly, is to just keep trying to read a time line until it's successful?) would preclude that. So far, all the extensions proposed have been of this nature, rather than of the nature of adding features to a cue, which we can already do anyway by adding more settings after the time range (which is why invalid settings are ignored and don't cause the cue to be dropped).
Comment 5 Philip Jägenstedt 2011-09-10 16:31:42 UTC
(In reply to comment #4)
> It makes all kinds of extensions possible, for example it allows configuration
> default blocks, etc, all by just having the syntax not include a valid time
> range in the second line of the block. Your proposal (which if I understand
> correctly is to just keep trying to read a time line until it's successful?)
> would preclude that. So far, all the extensions proposed have been of this
> nature, rather than of the nature of adding features to a cue  which we can
> already do anyway, by adding more settings after the time range (which is why
> invalid settings are ignored, and don't cause the cue to be dropped).

I don't follow, at all. For default blocks in the beginning of the file, why would it include a timing line at all? It would be skipped entirely. For additional settings on cues on lines before or after the id line, dropping the cue entirely seems like terrible fallback.
Comment 6 Ian 'Hixie' Hickson 2011-09-14 22:19:46 UTC
Right now we can add comment blocks that can come anywhere in the file by using syntax like:

--> COMMENT
this is a comment
even if it includes
timing lines like 
00:00.000 --> 00:01.000
this; comment block
ends at next newline.

Or we can add default settings blocks, which can come anywhere in the file. For example, here we have a hypothetical default block which sets the defaults for the first hour and the second hour of the file:

--> DEFAULT
00:00.000 --> 01:00.000
A:start
01:00.000 --> 02:00.000
A:end

I don't understand why this extensibility would be less valuable than recovering from cases where a user has accidentally given two IDs to a cue. How common is that going to be, realistically speaking? It seems like a very odd error to optimise for.
Comment 7 Philip Jägenstedt 2011-09-16 12:02:41 UTC
Whether or not "-->" is a magic piece of syntax is really orthogonal to the issue here. After a blank line, just treat each line that does not contain "-->" as the id line, effectively using the last line before the timing line as the id.

In general, I think we should make the parser as lax as possible without sacrificing useful extensibility surface. If cues or parts of cues are dropped because of syntax errors, it's very likely that this will go unnoticed, since you typically won't be proof-watching the entire captions after making edits. We discussed more issues like this at OVC and will probably be filing bugs to relax the constraints on the timing format, unclosed left angle brackets (<) and perhaps & escapes. (This goes against my earlier insistence on not allowing commas as the decimal separator; at that time I wasn't really considering hand authoring.)
Comment 8 Ian 'Hixie' Hickson 2011-09-19 22:55:31 UTC
Oh, I misunderstood what you were asking for. You just want to keep looking for a line with the --> marker or a blank line, instead of only allowing the --> marker in the first or second line of a cue?

I don't really understand what authoring mistake is going to end up with that being necessary. Why do you think multiple IDs is a likely error? Surely most authors wouldn't provide any IDs at all.
Comment 9 Philip Jägenstedt 2011-09-20 09:22:46 UTC
(In reply to comment #8)
> Oh, I misunderstood what you were asking for. You just want to keep looking for
> a line with the --> marker or a blank line, instead of only allowing the -->
> marker in the first or second line of a cue?

Yes.

> I don't really understand what authoring mistake is going to end up with that
> being necessary. Why do you think multiple IDs is a likely error? Surely most
> authors wouldn't provide any IDs at all.

I don't think it's a likely error, I just think the parser should be robust and not throw away cues when it would be just as easy to recover.

(I would also like some consistency on this point, fixing just this while leaving other aspects of the parser very strict would admittedly be arbitrary.)
Comment 10 Ian 'Hixie' Hickson 2011-09-20 20:04:34 UTC
I don't understand your use of the terms "strict", "robust", and "recover". Allowing syntactically incorrect blocks isn't strict. Ignoring them is robust. Not ignoring the next block is how we recover.

Parsers for Web languages should be designed to be forward-compatible, which means ignoring content that doesn't match the syntax in a well-defined manner, so that future extensions can use these syntax "holes" to add new features in a predictable way. Parsers should handle common authoring errors in a way that matches author intent or that does nothing, but there is no need to recover from errors that aren't going to be common; it would just encourage authors to write bad code that might change meaning in the future. Parsers should avoid actively handling (i.e. not ignoring) author mistakes in ways that are likely to differ from author intent.

You have studied SRT data, so you have a good idea of what authoring mistakes are common; your advice here would be most welcome. However, if the case you are talking about here is not a common error, then I don't see any value (and I see some negatives) to trying to automatically work around it.
Comment 11 Simon Pieters 2011-09-27 12:37:14 UTC
I agree with Philip. The parser shouldn't draconianly drop cues for trivial authoring mistakes. I don't know if we need to polish the parsing of the id (though I don't mind that), but certainly the timestamp parsing needs polishing. From what I remember when looking at SRT content, it's not uncommon to have various mistakes in the timestamp. Usually you don't notice the error (until you validate the file or check a browser's error console). The parsing of timestamps should be DWIM (it doesn't need to be compatible with SRT implementations, though).

1:01.000 = 01:01.000
01:1.000 = 01:01.000
01:01,000 = 01:01.000
01:01.5 = 01:01.500
01:01.5000 = 01:01.500
01:61.000 = 02:01.000
etc
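The mappings above could be implemented with a sketch like the following (hypothetical, not spec text; the function name is mine): short fields are zero-padded, "," is accepted as the decimal separator, the fraction is truncated or padded to milliseconds, and overflowing seconds carry into minutes.

```python
import re

# Hypothetical sketch (not spec text) of DWIM timestamp parsing along the
# lines of the mappings above.
def lenient_timestamp(s):
    """Parse [hh:]mm:ss(.|,)fff leniently; return seconds as a float."""
    m = re.match(r'(?:(\d+):)?(\d+):(\d+)[.,](\d+)$', s)
    if m is None:
        return None
    hours = int(m.group(1) or 0)
    minutes, seconds = int(m.group(2)), int(m.group(3))
    millis = int(m.group(4)[:3].ljust(3, '0'))  # "5" -> 500, "5000" -> 500
    minutes += seconds // 60                    # 01:61.000 -> 02:01.000
    seconds %= 60
    return hours * 3600 + minutes * 60 + seconds + millis / 1000.0

assert lenient_timestamp('1:01.000') == lenient_timestamp('01:01.000') == 61.0
assert lenient_timestamp('01:01.5') == 61.5
```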
Comment 12 Silvia Pfeiffer 2011-09-28 01:39:02 UTC
> > I don't really understand what authoring mistake is going to end up with that
> > being necessary. Why do you think multiple IDs is a likely error? Surely most
> > authors wouldn't provide any IDs at all.
> 
> I don't think it's a likely error, I just think the parser should be robust and
> not throw away cues when it would be just as easy to recover.
> 
> (I would also like some consistency on this point, fixing just this while
> leaving other aspects of the parser very strict would admittedly be arbitrary.)

How many lines do you want to parse before expecting a line with "-->" in it? This has the potential danger that if somebody really screwed up their WebVTT file and it's a very long file, we end up parsing the whole file before we notice that there aren't any cues. Is that desirable?
Comment 13 Silvia Pfeiffer 2011-09-28 01:45:07 UTC
(In reply to comment #11)
> I agree with Philip. The parser shouldn't draconianly drop cues for trivial
> authoring mistakes. I don't know if we need to polish the parsing of the id
> (though I don't mind that), but certainly the timestamp parsing needs
> polishing. From what I remember when looking at SRT content, it's not uncommon
> to have various mistakes in the timestamp. Usually you don't notice the error
> (until you validate the file or check a browser's error console). The parsing
> of timestamps should be DWIM (doesn't need to be compatible with SRT
> implementations though).
> 
> 1:01.000 = 01:01.000
> 01:1.000 = 01:01.000
> 01:01,000 = 01:01.000
> 01:01.5 = 01:01.500
> 01:01.5000 = 01:01.500

Maybe.

> 01:61.000 = 02:01.000

It would be bad if we accepted this kind of mistake IMHO. Think e.g. about 01:5432153.000 - that's neither readable nor does it make much sense as a time format.


I'm in two minds about this.

On the one hand, allowing simpler time formats (such as just seconds.milliseconds) would be a nice simplification and would make it easier to convert from other formats that use them.

On the other hand, every simplification that we introduce into authoring makes the parsing much harder. With the fixed format that is currently given, implementing a parser is really trivial. Allowing for all the exceptions and authoring errors will give us all sorts of edge cases. For example, in SRT, 01:01.5 is actually interpreted by some players as 01:01.005 and by others as 01:01.500. We'd have to introduce rules on what these things actually mean and then implement more complex parsers.
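The point that the fixed format is trivial to implement can be seen in a sketch like this (the regex is my hypothetical approximation of the spec's grammar, not normative):

```python
import re

# Sketch of the strict fixed format the current spec expects: optional
# hours, exactly two-digit minutes and seconds in 00-59, a '.' separator,
# and exactly three fraction digits. Anything else fails, and the parser
# skips the cue.
STRICT = re.compile(r'(?:(\d{2,}):)?([0-5]\d):([0-5]\d)\.(\d{3})$')

def strict_timestamp(s):
    m = STRICT.match(s)
    if m is None:
        return None  # e.g. "01:01.5" or "02:61.000": the whole cue is dropped
    hours = int(m.group(1) or 0)
    return (hours * 3600 + int(m.group(2)) * 60 + int(m.group(3))
            + int(m.group(4)) / 1000.0)

assert strict_timestamp('01:02:03.500') == 3723.5
assert strict_timestamp('01:01.5') is None       # ambiguous form rejected
assert strict_timestamp('02:61.000') is None     # seconds out of range
```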
Comment 14 Philip Jägenstedt 2011-09-28 08:27:22 UTC
(In reply to comment #12)
> How many lines do you want to parse before expecting a line with "-->" in it?
> This has the potential danger that if somebody really screwed up their WEBVTT
> file and it's a very long file that we end up parsing the whole file before we
> notice that there aren't any cues. Is that desirable?

The current parser only ever aborts parsing in the header, never while parsing cues, so this is already the case. (Step 31 looks like it might, but it can only happen at EOS.) I think this is very much desirable, as the alternative is to either die on the first error or have an arbitrary limit for how many non-cue lines we tolerate before dying.
Comment 15 Philip Jägenstedt 2011-09-28 08:39:51 UTC
(In reply to comment #13)
> (In reply to comment #11)
> > I agree with Philip. The parser shouldn't draconianly drop cues for trivial
> > authoring mistakes. I don't know if we need to polish the parsing of the id
> > (though I don't mind that), but certainly the timestamp parsing needs
> > polishing. From what I remember when looking at SRT content, it's not uncommon
> > to have various mistakes in the timestamp. Usually you don't notice the error
> > (until you validate the file or check a browser's error console). The parsing
> > of timestamps should be DWIM (doesn't need to be compatible with SRT
> > implementations though).
> > 
> > 1:01.000 = 01:01.000
> > 01:1.000 = 01:01.000
> > 01:01,000 = 01:01.000
> > 01:01.5 = 01:01.500
> > 01:01.5000 = 01:01.500
> 
> Maybe.
> 
> > 01:61.000 = 02:01.000
> 
> It would be bad if we accepted this kind of mistake IMHO. Think e.g. about
> 01:5432153.000 - that's neither readable nor makes it much sense at all as a
> time format.

This example actually came from my OVC demo, where <http://people.opera.com/philipj/2011/09/ovc/demos/the_conceited_general.vtt> has this cue:

02:59.000 --> 02:61.000
<v General><c.sound>grunt

This was a mistake, but I didn't notice because I didn't watch the entire video after the final editing.

Whether or not we allow >2 digits is another matter, but 01:5432153.000 doesn't really seem more problematic than the hours, which are not limited.

> I'm in two minds about this.
> 
> On the one hand, allowing simpler time formats (such as just
> seconds.milliseconds) would be a nice simplification to allow and makes it
> easier to convert from other formats that use such a formats.

I'm not really opposed to making minutes optional, but that isn't what is being suggested here.

> On the other hand, every simplification that we introduce into authoring makes
> the parsing much harder. With the fixed format that is currently given,
> implementing a parser is really trivial. Allowing for all the exceptions and
> authoring errors will give us all sorts of edge cases. For example, in SRT
> 01:01.5 is actually interpreted by some players as 01:01.005 and by others as
> 01:01.500 . We'd have to introduce rules on what these things actually mean and
> then implement more complex parsers.

Sure, but as implementors we are quite willing to deal with this if it makes the format more usable.
Comment 16 Simon Pieters 2011-09-29 16:07:00 UTC
(In reply to comment #13)
> I'm in two minds about this.
> 
> On the one hand, allowing simpler time formats (such as just
> seconds.milliseconds) would be a nice simplification to allow and makes it
> easier to convert from other formats that use such a formats.
> 
> On the other hand, every simplification that we introduce into authoring makes
> the parsing much harder.

Not really. The parser now has to check for errors. If we make the parser more lenient, it would involve removing the error checks, which actually makes the parser *simpler*.

> With the fixed format that is currently given,
> implementing a parser is really trivial. Allowing for all the exceptions and
> authoring errors will give us all sorts of edge cases.

We already have to test for the same edge cases...

> For example, in SRT
> 01:01.5 is actually interpreted by some players as 01:01.005 and by others as
> 01:01.500 .

Yes. So?

> We'd have to introduce rules on what these things actually mean

Of course. We already do. Right now it means "skip the cue".

> and
> then implement more complex parsers.

I think it wouldn't be more complex.
Comment 17 Ian 'Hixie' Hickson 2011-10-01 00:11:45 UTC
(In reply to comment #11)
> I agree with Philip. The parser shouldn't draconianly drop cues for trivial
> authoring mistakes.

The error handling model here is the same as CSS's error handling model, which is the least draconian model on the Web platform, and the exact opposite of XML's, which people normally are referring to when they talk about draconian error handling. Draconian error handling is throwing the entire document away when you find one error. The handling we have here is forward-compatible dropping of the self-contained unit of markup that is syntactically incorrect.

I strongly disagree that there is value in making the language DWIMmy. So far on the Web, of the "bail on error", "ignore on error", and "try to work around the error" models, the first has been found to be impractical (XML), the last has been found to be unwieldy (HTML), and the second has been found to strike the perfect balance between ease of use, simplicity, ease of implementation, forward-compatibility, and consistency. We should not ignore the lessons we have learnt with HTML.

We can never reliably catch authoring mistakes ("02.000" vs "20.000" both look valid but one is wrong) nor reliably fix them (maybe "002.000" is a typo for "02.000", or maybe it's a typo for "00.200", or maybe it's a typo for "20.000"). We can detect syntactic errors, and we can report them.
Comment 18 Philip Jägenstedt 2011-10-04 08:04:24 UTC
I no longer have the data I used for <http://blog.foolip.org/2010/08/20/srt-research/> (or I would already have quoted it) but I've gotten a new batch of 65k SRT files from opensubtitles.org and zcorpan is going to do some analysis of authoring errors on that. We'll open new bugs for each specific issue.