22315 – [WebVTT] Should probably allow any HTML5 space character after signature, for consistency

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Status:	RESOLVED MOVED

Alias:	None

Product:	TextTracks CG
Classification:	Unclassified
Component:	WebVTT
Version:	unspecified
Hardware:	All All

Importance:	P2 normal
Target Milestone:	---
Assignee:	Philip Jägenstedt
QA Contact:	This bug has no owner yet - up for the taking

Whiteboard:	v1, see comment 3

Reported:	2013-06-09 16:09 UTC by Caitlin Potter (:caitp)
Modified:	2015-11-09 16:04 UTC
CC List:	6 users

Description Caitlin Potter (:caitp) 2013-06-09 16:09:39 UTC

"If line is more than six characters long but the first six characters do not exactly equal "WEBVTT", or the seventh character is neither a U+0020 SPACE character nor a U+0009 CHARACTER TABULATION (tab) character, then abort these steps. The file does not start with the correct WebVTT file signature and was therefore not successfully processed."

In other areas of the spec (wherever the "skip whitespace" step is mentioned), the FORM FEED character is one of the acceptable characters.

For consistency with the rest of the document, it might be a good idea to say "if the seventh character is not a space character (http://www.w3.org/html/wg/drafts/html/master/single-page.html#space-character), then abort these steps."
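
To make the difference concrete, here is a minimal Python sketch of the signature check with the set of accepted seventh characters as a parameter. The names (has_signature, CURRENT_SEPARATORS, HTML5_SPACE_CHARACTERS) are made up for illustration and do not come from the spec. The sketch also assumes that a line consisting of exactly "WEBVTT" is accepted, which the quoted step does not itself state.

```python
# Illustrative sketch only; none of these names come from the WebVTT spec.
SIGNATURE = "WEBVTT"

# Characters the quoted step currently accepts in the seventh position.
CURRENT_SEPARATORS = frozenset({"\u0020", "\u0009"})

# The HTML5 "space characters". LF and CR terminate the line, so they can
# never be its seventh character; in practice the proposal only adds
# U+000C FORM FEED.
HTML5_SPACE_CHARACTERS = frozenset(
    {"\u0020", "\u0009", "\u000A", "\u000C", "\u000D"}
)


def has_signature(line: str, separators=CURRENT_SEPARATORS) -> bool:
    """Return True if `line` starts with an acceptable WebVTT signature."""
    if not line.startswith(SIGNATURE):
        return False
    # Assumption: a line that is exactly "WEBVTT" is accepted as-is.
    if len(line) == len(SIGNATURE):
        return True
    # Otherwise the seventh character must be one of the separators.
    return line[len(SIGNATURE)] in separators
```

With the default set, has_signature("WEBVTT\x0ctitle") is False; passing HTML5_SPACE_CHARACTERS as the second argument makes it True, which is the whole behavioural change being proposed.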

Further, but sort of unrelated, it may be unwise to ask the parser to collect an entire line for this first step. If the document is not in fact a valid WebVTT document, gigabytes could potentially be collected before a byte 0x0A or 0x0D is encountered. For safety, clients therefore have to impose a maximum line length that is not defined in the standard, just to avoid going crazy trying to read a garbage document.

Basically, sniffing the document could be done somewhat more cleverly.

Comment 1 Silvia Pfeiffer 2013-06-11 05:04:55 UTC

(In reply to comment #0)
> For consistency with the rest of the document, it might be a good idea to
> say "if the seventh character is not a space character
> (http://www.w3.org/html/wg/drafts/html/master/single-page.html#space-character),
> then abort these steps."

Makes sense.


> Further, but sort of unrelated, it may be unwise to ask the parser to
> collect an entire line for this first step. If the document is not in fact
> a valid WebVTT document, gigabytes could potentially be collected before a
> byte 0x0A or 0x0D is encountered. For safety, clients therefore have to
> impose a maximum line length that is not defined in the standard, just to
> avoid going crazy trying to read a garbage document.

That's a quality of implementation issue. Just pick a limit in your implementation.
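
As a rough illustration of "pick a limit", the following Python sketch collects the first line from a binary stream and gives up once a cap is exceeded. The cap of 4096 bytes and the names MAX_LINE_BYTES, LineTooLong and read_first_line are arbitrary choices for this example; nothing here is defined by the standard.

```python
import io

# Arbitrary, implementation-chosen cap; the standard defines no such limit.
MAX_LINE_BYTES = 4096


class LineTooLong(Exception):
    """Raised when no line break is seen within the chosen limit."""


def read_first_line(stream, limit: int = MAX_LINE_BYTES) -> bytes:
    """Collect bytes up to the first LF or CR, or raise after `limit` bytes."""
    line = bytearray()
    while True:
        byte = stream.read(1)
        # End of input or a line terminator ends the line.
        if not byte or byte in (b"\n", b"\r"):
            return bytes(line)
        if len(line) >= limit:
            raise LineTooLong(f"no line break within {limit} bytes")
        line += byte


# Usage: the first line of a small in-memory document.
first = read_first_line(io.BytesIO(b"WEBVTT\n\n00:00.000 --> 00:01.000\nHi"))
```

The cost of this approach is that a client still buffers up to the limit before it can reject a document whose very first byte already rules out the signature.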

Comment 2 Caitlin Potter (:caitp) 2013-06-11 09:51:38 UTC

> That's a quality of implementation issue. Just pick a limit in your
> implementation.

I agree that this is a quality of implementation issue for most of the other areas where a line needs to be read before further processing.

But I think that for sniffing the file signature, it doesn't really make sense to wait for an entire line to be read -- because there is a hard limit on what the first legal characters may be (unless this changes at some point). I feel this sort of precludes the need for reading a full line first for the very beginning of the file.

Comment 3 Simon Pieters 2013-06-17 07:18:08 UTC

IIRC not including FF is on purpose to make it slightly simpler to sniff.

I agree that the algorithm doesn't need to wait for the full line.

Comment 4 Philip Jägenstedt 2014-01-30 19:55:45 UTC

(In reply to Simon Pieters from comment #3)
> IIRC not including FF is on purpose to make it slightly simpler to sniff.

Yeah, at the very least the signatures in the IANA section do not include form feed characters. I don't see a compelling reason to accept form feed characters here, but also don't think it would be harmful.

> I agree that the algorithm doesn't need to wait for the full line.

Both Presto and Blink actually do collect a whole line here, but it seems nice to fail fast. A fix would be to first check that the first 6 characters are "WEBVTT" and the 7th is a space or tab, then just consume (ignore) everything up to the first line break.

Fredrik, does that sound sensible, given that it breaks the line reader abstraction?

Comment 5 Fredrik S 2014-01-31 07:52:42 UTC

I guess it isn't completely nonsensible, even if it does complicate the implementation somewhat (it essentially needs an additional FSM for the "header").
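
For reference, the approach from comment 4 with a separate "header" state machine might look roughly like the Python sketch below. It is fed raw byte chunks and rejects on the first offending byte instead of collecting a line. The names HeaderSniffer and State are hypothetical. The sketch assumes any byte order mark has already been stripped, and that input ending right after the signature (or anywhere on the first line after a valid separator) counts as a match.

```python
from enum import Enum, auto

SIGNATURE = b"WEBVTT"
SEPARATORS = b" \t"     # U+0020 SPACE and U+0009 TAB
LINE_BREAKS = b"\n\r"   # U+000A LF and U+000D CR


class State(Enum):
    MATCHING = auto()   # comparing against the six signature bytes
    SEPARATOR = auto()  # inspecting the seventh byte
    SKIPPING = auto()   # ignoring the rest of the first line
    ACCEPTED = auto()
    REJECTED = auto()


class HeaderSniffer:
    """Hypothetical push-style sniffer; assumes a BOM was already removed."""

    def __init__(self) -> None:
        self.state = State.MATCHING
        self._matched = 0

    def feed(self, chunk: bytes) -> State:
        """Consume a chunk of bytes and return the resulting state."""
        for byte in chunk:
            if self.state is State.MATCHING:
                if byte != SIGNATURE[self._matched]:
                    self.state = State.REJECTED  # fail fast
                else:
                    self._matched += 1
                    if self._matched == len(SIGNATURE):
                        self.state = State.SEPARATOR
            elif self.state is State.SEPARATOR:
                if byte in LINE_BREAKS:
                    self.state = State.ACCEPTED
                elif byte in SEPARATORS:
                    self.state = State.SKIPPING
                else:
                    self.state = State.REJECTED
            elif self.state is State.SKIPPING:
                if byte in LINE_BREAKS:
                    self.state = State.ACCEPTED
            else:
                break  # ACCEPTED or REJECTED: nothing more to decide
        return self.state

    def finish(self) -> bool:
        """Call at end of input. Assumption: a bare or unterminated
        signature line still counts as a match."""
        return self.state in (State.SEPARATOR, State.SKIPPING, State.ACCEPTED)
```

Nothing is buffered beyond a counter and the current state, so a garbage document is rejected after at most seven bytes, and a long first line costs no memory.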

Comment 6 Simon Pieters 2015-11-09 16:04:55 UTC

https://github.com/w3c/webvtt/pull/247