This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Bug 19938 - Number of decoder errors emitted by the UTF-8 decoder for incomplete/invalid sequences
Summary: Number of decoder errors emitted by the UTF-8 decoder for incomplete/invalid sequences
Status: RESOLVED FIXED
Alias: None
Product: WHATWG
Classification: Unclassified
Component: Encoding
Version: unspecified
Hardware: All
OS: All
Importance: P2 normal
Target Milestone: Unsorted
Assignee: Anne
QA Contact: sideshowbarker+encodingspec
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-11-11 22:01 UTC by pub-w3
Modified: 2012-11-16 13:14 UTC
CC List: 2 users

See Also:



Description pub-w3 2012-11-11 22:01:22 UTC
In a UTF-8-based system, the UTF-8 decoder only has to detect invalid bytes and byte sequences.  A natural way of doing that would be to check that the current byte is the first byte of a valid sequence (as defined in Unicode 6.1, Table 3.7 [*]) and otherwise emit a decoder error and move on to the next byte.

Unfortunately, the specification currently seems to require that only one decoder error be emitted for incomplete sequences, overlong sequences, sequences encoding surrogate characters and 4-byte sequences corresponding to characters beyond U+10FFFF.

Since 5-byte and 6-byte sequences are no longer recognised at all and result in 5 or 6 decoder errors, perhaps it would make sense at least to allow UTF-8 decoders to generate 2, 3 or 4 decoder errors when invalid/incomplete sequences of 2, 3 or 4 bytes are seen.

In other words, it would be good if a UTF-8 decoder were allowed to follow the simple strategy of always emitting one decoder error for every byte that is not part of a valid and complete byte sequence.


[*] <http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf>, p. 95.
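
A rough sketch of that strategy, in Python rather than spec prose (the table constant and function name below are illustrative, not taken from any specification):

# One decoder error for every byte that does not start a valid, complete
# sequence from Unicode 6.1 Table 3-7.
TABLE_3_7 = [
    ((0x00, 0x7F), []),
    ((0xC2, 0xDF), [(0x80, 0xBF)]),
    ((0xE0, 0xE0), [(0xA0, 0xBF), (0x80, 0xBF)]),
    ((0xE1, 0xEC), [(0x80, 0xBF), (0x80, 0xBF)]),
    ((0xED, 0xED), [(0x80, 0x9F), (0x80, 0xBF)]),
    ((0xEE, 0xEF), [(0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF0, 0xF0), [(0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF1, 0xF3), [(0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF4, 0xF4), [(0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)]),
]

def count_errors_per_byte(data: bytes) -> int:
    """Count errors as 'one per byte that is not part of a valid sequence'."""
    i, errors = 0, 0
    while i < len(data):
        for (lo, hi), trails in TABLE_3_7:
            # Does a valid, complete sequence from the table start at position i?
            if lo <= data[i] <= hi and all(
                i + 1 + k < len(data) and t_lo <= data[i + 1 + k] <= t_hi
                for k, (t_lo, t_hi) in enumerate(trails)
            ):
                i += 1 + len(trails)   # valid sequence: consume it, no error
                break
        else:
            errors += 1                # invalid byte: one error, move on one byte
            i += 1
    return errors

# A truncated 3-byte sequence <E2 82> yields 2 errors under this strategy,
# and <F4 90 80 80> yields 4.
assert count_errors_per_byte(b"\xE2\x82") == 2
assert count_errors_per_byte(b"\xF4\x90\x80\x80") == 4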
Comment 1 Anne 2012-11-11 22:06:25 UTC
No, the whole point of this specification is to remove choice. And the constraints for the utf-8 decoder were carefully considered.

I guess it would make sense to point out that we subset Unicode here.
Comment 2 pub-w3 2012-11-12 20:08:32 UTC
One way of allowing it would be to mandate it, of course.  Unfortunately, it seems difficult to avoid the current UTF-16 bias without ending up at the other extreme, requiring UTF-16-based systems to do a little bit of extra work to get the right number of decoder errors in certain cases, even if the alternative solution may seem simpler in principle and more agnostic.

Within the current approach, I might have preferred 1 decoder error rather than 2 for 2-byte overlong sequences (0xC0 or 0xC1 followed by a continuation byte) for consistency with 3-byte and 4-byte ones, which would not add any complexity to the algorithm, but I do of course realise that there are reasonable arguments to be made on either side.

Feel free to bury the dead horse.
Comment 3 Anne 2012-11-12 22:00:56 UTC
I discussed this before, and the feeling was that the C0/C1 and 5/6-byte sequence lead bytes no longer exist (even though they were acknowledged to exist at some point), and that catering to them is therefore wrong.

Mandating one error per incorrect byte was not favored by implementors either.

Leaving open for the note.
Comment 4 Anne 2012-11-13 10:25:52 UTC
So I just read the Unicode standard. Am I correct in that the way the Encoding Standard is written matches what they call "Best Practices for Using U+FFFD"?
Comment 5 pub-w3 2012-11-14 18:54:33 UTC
Just for the record (and I was actually surprised by this), current implementations paint a rather different picture of implementers’ preferences:

Firefox follows Markus Kuhn’s traditional notion of ‘malformed sequence’ [*], treating a well-formed 5/6-byte sequence, or one starting with C0/C1, as a single malformed sequence and emitting a single error for it.

Safari, Chrome and IE all follow the alternative approach of ‘represent[ing] each individual byte of a malformed sequence by a replacement character’, ‘a perfectly acceptable (and in some situations even preferable) solution’ (ibid.).

Only Opera implements the more novel approach currently found in the Encoding Specification.

[*] <http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt>


(Incidentally, modified UTF-8, used in Java, encodes 0 as C0 80.)
Comment 6 Anne 2012-11-14 19:07:07 UTC
Yeah, the plan is to align with Opera/Gecko and have Gecko fix their decoder to no longer handle bytes deemed invalid long ago by Unicode.
Comment 7 pub-w3 2012-11-14 22:55:35 UTC
The plan sounds a bit Opera-centric.  ;-)

Unicode’s Best Practices for Using U+FFFD [*] appears to suggest a different algorithm:  for instance, the sequence F4 90 80 80 would give rise to three replacement characters (one for F4 90 and one for each 80) as opposed to one according to the current Encoding Standard draft.

[*] <http://www.unicode.org/versions/Unicode6.1.0/ch05.pdf>, §5.22 (pp. 185–186).
Comment 8 Masatoshi Kimura 2012-11-14 23:07:19 UTC
(In reply to comment #7)
> Unicode’s Best Practices for Using U+FFFD [*] appears to suggest a different
> algorithm:  for instance, the sequence F4 90 80 80 would give rise to three
> replacement characters (one for F4 90 and one for each 80) as opposed to one
> according to the current Encoding Standard draft.
<F4 90 80 80> should be converted into four replacement characters, not three. The maximal subpart is not <F4 90> but <F4>; <F4 90> cannot be an initial subsequence of any valid sequence.
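
To make that concrete: after a 0xF4 lead byte only continuation bytes 0x80 to 0x8F can follow without exceeding U+10FFFF, so the maximal subpart at <F4 90 80 80> is just <F4>. A tiny illustrative check in Python (the helper name is mine, not from the thread or the standard):

def valid_after_f4(byte: int) -> bool:
    # F4 8F BF BF encodes U+10FFFF; a first continuation byte above 0x8F
    # would push the code point past U+10FFFF.
    return 0x80 <= byte <= 0x8F

assert not valid_after_f4(0x90)
# Hence <F4 90 80 80> becomes four U+FFFD replacement characters under the
# Unicode "maximal subpart" best practice: one each for F4, 90, 80 and 80.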
Comment 9 pub-w3 2012-11-14 23:20:55 UTC
Yes, you are right.
Comment 10 Masatoshi Kimura 2012-11-14 23:27:07 UTC
"utf-8 bytes needed" will be usable to determine the number of decoder errors when code point is not in the range lower boundary to 0x10FFFF or is in the range 0xD800 to 0xDFFF.
Comment 11 Anne 2012-11-15 14:30:53 UTC
Oh I see. That seems quite ugly though. Do we really want that?
Comment 12 pub-w3 2012-11-15 20:31:36 UTC
The discussion is probably a bit too abstract and philosophical at the moment.  I am inclined to believe the Unicode Consortium’s assertion that the approach ‘sounds complicated, but [...] reflects the way optimized conversion processes are typically constructed’, and that it is possible to write an efficient and fairly neat algorithm that naturally produces the prescribed number of decoding errors, but I think someone will have to write that algorithm before we can assess its merits.
Comment 13 Masatoshi Kimura 2012-11-15 21:11:28 UTC
This change should work.

Replace the first sentence of the "utf-8" section with:
The utf-8 code point, utf-8 bytes seen, and utf-8 bytes needed concepts are all initially 0. The initial lower boundary is 0x80 and the initial upper boundary is 0xBF.

Replace step 5 with:
 5. If utf-8 bytes needed is 0, based on byte:
    0x00 to 0x7F
        Emit a code point whose value is byte. 
    0xC2 to 0xDF
        Set utf-8 bytes needed to 1 and utf-8 code point to byte − 0xC0. 
    0xE0 to 0xEF
     1. If byte is 0xE0, set lower boundary to 0xA0.
     2. If byte is 0xED, set upper boundary to 0x9F.
     3. Set utf-8 bytes needed to 2 and utf-8 code point to byte − 0xE0. 
    0xF0 to 0xF4
     1. If byte is 0xF0, set lower boundary to 0x90.
     2. If byte is 0xF4, set upper boundary to 0x8F.
     3. Set utf-8 bytes needed to 3 and utf-8 code point to byte − 0xF0. 
    Otherwise
        Emit a decoder error. 

    Then (byte is in the range 0xC2 to 0xF4) set utf-8 code point to utf-8 code point × 64^(utf-8 bytes needed) and continue.

Replace step 6 with:
 6. If byte is not in the range lower boundary to upper boundary, run these substeps:
     1. Set utf-8 code point, utf-8 bytes needed, and utf-8 bytes seen to 0. Set lower boundary to 0x80 and upper boundary to 0xBF.
     2. Decrease the byte pointer by one.
     3. Emit a decoder error. 

Add the following step after step 6 and shift the following step numbers by one.
 7. Set lower boundary to 0x80 and upper boundary to 0xBF. 

Replace step 10 (former step 9) with:
10. Let code point be utf-8 code point.

Replace step 11 (former step 10) with:
11. Set utf-8 code point, utf-8 bytes needed, and utf-8 bytes seen to 0.

Replace step 12 (former step 11) with:
12. Emit a code point whose value is code point.

Remove step 13 (former step 12).
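
For reference, a Python rendering of the decoder that results from these changes (my own sketch, not normative text; the function name and return shape are illustrative) emits one decoder error per maximal subpart of an ill-formed sequence, e.g. four errors for <F4 90 80 80>:

def utf8_decode(data: bytes):
    """Return (code points, error count); 0xFFFD stands in for each decoder error."""
    code_point = bytes_seen = bytes_needed = 0
    lower, upper = 0x80, 0xBF
    out, errors, i = [], 0, 0
    while True:
        if i == len(data):                        # end of stream
            if bytes_needed != 0:                 # incomplete sequence
                errors += 1
                out.append(0xFFFD)
            return out, errors
        byte = data[i]
        i += 1
        if bytes_needed == 0:
            if byte <= 0x7F:                      # ASCII
                out.append(byte)
            elif 0xC2 <= byte <= 0xDF:
                bytes_needed, code_point = 1, byte - 0xC0
            elif 0xE0 <= byte <= 0xEF:
                if byte == 0xE0: lower = 0xA0     # rejects overlong forms
                if byte == 0xED: upper = 0x9F     # rejects surrogates
                bytes_needed, code_point = 2, byte - 0xE0
            elif 0xF0 <= byte <= 0xF4:
                if byte == 0xF0: lower = 0x90     # rejects overlong forms
                if byte == 0xF4: upper = 0x8F     # rejects code points above U+10FFFF
                bytes_needed, code_point = 3, byte - 0xF0
            else:                                 # C0/C1, F5..FF or a lone continuation byte
                errors += 1
                out.append(0xFFFD)
            code_point *= 64 ** bytes_needed      # no-op unless a lead byte was just seen
            continue
        if not (lower <= byte <= upper):          # invalid continuation byte
            code_point = bytes_seen = bytes_needed = 0
            lower, upper = 0x80, 0xBF
            i -= 1                                # reprocess this byte as a potential lead byte
            errors += 1
            out.append(0xFFFD)
            continue
        lower, upper = 0x80, 0xBF
        bytes_seen += 1
        code_point += (byte - 0x80) * 64 ** (bytes_needed - bytes_seen)
        if bytes_seen != bytes_needed:
            continue
        out.append(code_point)                    # complete, well-formed sequence
        code_point = bytes_seen = bytes_needed = 0

assert utf8_decode(b"\xE2\x82\xAC") == ([0x20AC], 0)   # well-formed: U+20AC
assert utf8_decode(b"\xF4\x90\x80\x80")[1] == 4        # four errors, as in comment 8
assert utf8_decode(b"\xED\xA0\x80")[1] == 3            # surrogate: three errors
assert utf8_decode(b"\xC0\xAF")[1] == 2                # 2-byte overlong: two errors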