This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 17839 - WebVTT: Update WebVTT to make use of the Encoding Standard
Summary: WebVTT: Update WebVTT to make use of the Encoding Standard
Alias: None
Product: TextTracks CG
Classification: Unclassified
Component: WebVTT
Version: unspecified
Hardware: Other other
Importance: P3 normal
Target Milestone: ---
Assignee: Silvia Pfeiffer
QA Contact: This bug has no owner yet - up for the taking
Depends on: 15332
Reported: 2012-07-18 07:00 UTC by contributor
Modified: 2013-07-12 02:05 UTC

Description contributor 2012-07-18 07:00:21 UTC
This was cloned from bug 16768 as part of operation convergence.
Originally filed: 2012-04-18 07:54:00 +0000
Original reporter: Anne <>

 #0   Anne                                            2012-04-18 07:54:33 +0000 
The IANA registry is unbounded, does not match implementations when it comes to encodings and their labels, does not detail extensions to encodings that need to be supported, and does not detail error handling for encodings; it is inadequate per today's standards. The Encoding Standard was written to solve this problem, and by using it in HTML we can simplify the following:

* Instead of "preferred MIME name" we can now talk about "name" of the "encoding".
* "ASCII-compatible character encoding" is no longer needed as only utf-16 and utf-16be are incompatible per the restricted list.
* The "decode a byte string as UTF-8, with error handling" algorithm can be removed in favor of using "utf-8 decode" which has the correct error handling (should be identical).
* For encoding (URLs and <form>) a custom "encoder error" needs to be defined, by returning from the decoder algorithm and feeding it the intended replacement characters. (You do not know in advance which code points cannot be encoded.)
* In the suggested default encoding list the encoding names can be updated to use the canonical name rather than a label.
* Misinterpreted for compatibility is no longer needed and the encoding overrides table can also be removed.
 #1   Jirka Kosek                                     2012-04-18 11:05:14 +0000 
Hi Anne,

thanks for draft.

Where are the labels coming from? I'm asking because if the aim of the spec is to handle legacy content then additional labels should be added. For example, windows-1250 was sometimes referred to as cp1250, and you will find plenty of such pages in the wild.

 #2   Anne                                            2012-04-18 11:31:55 +0000 
The current draft is indeed rather conservative when it comes to single-byte labels (IE is the only browser that does not recognize that label as far as I can tell). I filed bug 16773 to change that.
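A minimal sketch of what a label lookup of this kind amounts to, for readers following the cp1250 discussion. The table below is a hypothetical three-entry excerpt that assumes cp1250 is accepted as a label for windows-1250 (which is what bug 16773 asks for); the function name `get_encoding` is illustrative and not taken from any library.

```python
# Hypothetical excerpt of a label table. The authoritative table is the one
# in the Encoding Standard; this only illustrates the label -> name idea.
LABELS = {
    "windows-1250": "windows-1250",
    "cp1250": "windows-1250",
    "x-cp1250": "windows-1250",
}

ASCII_WHITESPACE = "\t\n\x0c\r "


def get_encoding(label):
    """Return the encoding name for a label, or None if the label is unknown.

    Leading and trailing ASCII whitespace is ignored and matching is
    case-insensitive.
    """
    return LABELS.get(label.strip(ASCII_WHITESPACE).lower())


print(get_encoding(" CP1250 "))
print(get_encoding("no-such-label"))
```

The point of a bounded table is that an unknown label fails outright instead of falling through to whatever the platform happens to support.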
 #3   Anne                                            2012-05-23 07:52:49 +0000 
*** Bug 17151 has been marked as a duplicate of this bug. ***
Comment 1 Ian 'Hixie' Hickson 2013-01-24 01:38:45 UTC
Do you flag people using bytes that aren't compatible between ISO-8859-1 and Win1252 as a conformance error anywhere, or are we just saying ISO-8859-1 is bogus and these are the new tables, end of story?
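To make the ISO-8859-1 versus Win1252 question concrete, here is a small sketch using Python's built-in `latin-1` and `cp1252` codecs as stand-ins. That substitution is an assumption: Python's `cp1252` is not identical to the windows-1252 index in the Encoding Standard (it rejects a few bytes such as 0x81), so the sample avoids those bytes.

```python
# Bytes 0x80-0x9F are where the two interpretations diverge; 0xE9 is the same
# in both. Python's codecs are used here only as an approximation.
data = bytes([0x80, 0x93, 0xE9])

as_latin1 = data.decode("latin-1")  # 0x80 and 0x93 become C1 control characters
as_cp1252 = data.decode("cp1252")   # 0x80 and 0x93 become printable punctuation

for byte, a, b in zip(data, as_latin1, as_cp1252):
    print(f"0x{byte:02X}: latin-1 U+{ord(a):04X}, cp1252 U+{ord(b):04X}")
```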

I've left references to "ASCII-compatible character encoding" for now; is it not still plausible that people are using EBCDIC mainframes and implementing HTML parsers for them?

The "utf-8 decode" and "decode" algorithms are too clever for HTML's use, so I just directly use the relevant decoder algorithms. "encode" doesn't seem to add anything useful vs "encoder", either.

> (You do not know in advance which code points cannot be encoded.)

Can you elaborate on this?

This patch is kinda long and I'm not at all sure I got it all right, so if you see anything I missed don't hesitate to let me know.
Comment 2 contributor 2013-01-24 01:39:06 UTC
Checked in as WHATWG revision r7647.
Check-in comment: Embrace the Encodings specification.
Comment 3 Anne 2013-02-07 13:32:24 UTC
Basically, HTML now outlaws EBCDIC so I don't think we should account for that possibility. Just like specifications leave non-8-bit byte architectures as an exercise for the reader.

> Can you elaborate on this?

What I meant is that knowing whether you can encode a given code point or decode a given byte requires running through an algorithm that effectively attempts that operation. There's no concept of X can encode/decode set Y.
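A rough illustration of that point, using Python's codec machinery as a stand-in for the spec's encoders. The helper names `can_encode` and `encode_with_ncr` are hypothetical, and the numeric character reference fallback is only one possible choice of replacement characters.

```python
def can_encode(ch, encoding):
    """Find out whether `ch` is encodable by actually trying to encode it."""
    try:
        ch.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


def encode_with_ncr(text, encoding):
    """Encode `text`, replacing unencodable characters with decimal
    numeric character references as the error handler encounters them."""
    return text.encode(encoding, errors="xmlcharrefreplace")


print(can_encode("\u00e9", "latin-1"))
print(can_encode("\u20ac", "latin-1"))
print(encode_with_ncr("\u00e9\u20ac", "latin-1"))
```

Note that the replacement is decided inside the error path, per character, rather than by consulting a precomputed set of encodable code points.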

As for utf-8 decode, I was hoping we could end up with all specifications using the same routine and the same algorithm in the backend. Having HTML use utf-8 decode (and similar) would encourage that and would make it completely obvious that it is in fact possible. (And if we then later need to tweak something there's only one place to do it, yadayadayada.)
Comment 4 Ian 'Hixie' Hickson 2013-02-08 03:16:17 UTC
I don't think having specs ignore real problems is a good policy. I'm not at all convinced that there are no EBCDIC systems out there connected to the Web. If it's true that EBCDIC is dead, then great, but if it's only almost dead like XML, then we should still cater for it (like we do with XML).

Leaving open to see if I can move the BOM handling more to the encoding spec.
Comment 5 Ian 'Hixie' Hickson 2013-03-29 18:45:22 UTC
I've attempted to fix this for HTML, but WebVTT still needs fixing.
Comment 6 contributor 2013-03-29 18:45:39 UTC
Checked in as WHATWG revision r7782.
Check-in comment: Strip a leading BOM from scripts in workers, if any. Also, use more of the encoding spec.
Comment 7 Silvia Pfeiffer 2013-04-11 04:35:53 UTC
I assume it should go into the new version of the WebVTT spec.
So, just checking what needs to be changed.

Replace basically "<span>decoded as UTF-8, with error handling</span>" with "decoded using the <span>UTF-8 decoder</span>"?
Comment 8 Silvia Pfeiffer 2013-04-11 04:40:23 UTC
Hmm also probably: remove
   <li><p>If the character indicated by <var title="">position</var>
   is a U+FEFF BYTE ORDER MARK (BOM) character, advance <var
   title="">position</var> to the next character in <var
Comment 9 Anne 2013-04-11 05:44:20 UTC
Sylvia, yes, but you want to use "utf-8 decode" rather than the utf-8 decoder. And then you can indeed remove the step about the BOM.
Comment 10 Silvia Pfeiffer 2013-04-11 05:48:59 UTC
(In reply to comment #9)
> Sylvia, yes, but you want to use "utf-8 decode"
> rather than the utf-8 decoder.
> And then you can indeed remove the step about the BOM.

Yes, that's what I meant. :-) Thanks!
Comment 11 Silvia Pfeiffer 2013-04-11 05:57:55 UTC
I am confused by the note in the WHATWG spec: "The UTF-8 decoder is distinct from the UTF-8 decode algorithm. The latter first strips a Byte Order Mark (BOM), if any, and then invokes the former."

If I remove the BOM paragraph, I should then reference the "UTF-8 decode algorithm", right?
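The distinction quoted from the note can be sketched as two small functions. This is an approximation under stated assumptions: Python's `utf-8` codec with `errors="replace"` stands in for the spec's UTF-8 decoder, and the function names `utf8_decoder` and `utf8_decode` are illustrative only.

```python
import codecs


def utf8_decoder(data):
    """Stand-in for the bare decoder: a leading BOM survives as U+FEFF and
    malformed input is replaced with U+FFFD."""
    return data.decode("utf-8", errors="replace")


def utf8_decode(data):
    """Stand-in for the decode algorithm: strip one leading BOM, if any,
    then hand the remaining bytes to the decoder."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return utf8_decoder(data)


raw = codecs.BOM_UTF8 + b"WEBVTT\n"
print(repr(utf8_decoder(raw)))
print(repr(utf8_decode(raw)))
```

With the second function, a consumer such as a WebVTT parser never sees the BOM, which is why a separate BOM-skipping step in the parser becomes redundant.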
Comment 12 Anne 2013-04-11 06:00:31 UTC
Comment 13 Silvia Pfeiffer 2013-04-11 06:03:48 UTC
Good. Here we go:
Comment 14 Silvia Pfeiffer 2013-07-12 02:05:23 UTC
Patch was applied as prepared.