This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Bug 23927 - ASCII-incompatible encoder error handling
Summary: ASCII-incompatible encoder error handling
Status: RESOLVED FIXED
Alias: None
Product: WHATWG
Classification: Unclassified
Component: Encoding
Version: unspecified
Hardware: PC Linux
Importance: P2 normal
Target Milestone: Unsorted
Assignee: Anne
QA Contact: sideshowbarker+encodingspec
Duplicates: 23926
Depends on: 16688
Reported: 2013-11-26 15:45 UTC by Simon Sapin
Modified: 2014-03-26 18:36 UTC

Description Simon Sapin 2013-11-26 15:45:02 UTC
http://encoding.spec.whatwg.org/#encodings

[[
Otherwise, if encoder's error handling mode is URL, emit byte 0x3F.

Otherwise, emit the result of running utf-8 encode on U+0026, U+0023, followed by the shortest sequence of ASCII digits representing c in base ten, followed by U+003B.
]]

Is it intentional to emit bytes for the ASCII representation of `?` or `&#nnn;`, even if the encoding being used is not ASCII-compatible?

rust-encoding’s current implementation instead uses the current encoder to encode `?` or `&#nnn;` to bytes, and aborts if that fails (which I’m not convinced can ever happen, even in the weird non-web encodings that this implementation supports).

If this is intentional, I’ll file a bug on rust-encoding.
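
To make the difference concrete, here is a minimal sketch comparing the two behaviours for the non-URL error mode when the unencodable code point is U+D800. It uses Python's built-in `utf-16-le` codec as a stand-in for an ASCII-incompatible encoder; the function names are hypothetical and belong neither to the spec nor to rust-encoding.

```python
def numeric_reference(code_point: int) -> str:
    """Return the '&#nnn;' replacement text for an unencodable code point."""
    return "&#" + str(code_point) + ";"


def replacement_as_ascii_bytes(code_point: int) -> bytes:
    # Spec text as quoted above: run utf-8 encode on the replacement text,
    # regardless of which encoder hit the error.
    return numeric_reference(code_point).encode("utf-8")


def replacement_via_current_encoder(code_point: int, encoding: str) -> bytes:
    # Alternative described above: run the replacement text through the
    # encoder that is currently in use.
    return numeric_reference(code_point).encode(encoding)


ascii_bytes = replacement_as_ascii_bytes(0xD800)
utf16_bytes = replacement_via_current_encoder(0xD800, "utf-16-le")

print(ascii_bytes)        # one byte per character
print(len(utf16_bytes))   # two bytes per character
```

For an ASCII-compatible encoding the two functions return the same bytes, so the question only matters for encodings such as UTF-16.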
Comment 1 Simon Sapin 2013-11-26 15:46:52 UTC
*** Bug 23926 has been marked as a duplicate of this bug. ***
Comment 2 Anne 2013-11-26 15:57:15 UTC
Example? Is an encoding not switching modes correctly?
Comment 3 Simon Sapin 2013-11-26 16:14:04 UTC
Although this "should not happen", the UTF-16 encoder is specified to emit an error for surrogate code points in the input:

http://encoding.spec.whatwg.org/#utf-16-encoder
Comment 4 Addison Phillips 2013-11-26 16:33:39 UTC
If the mode is URL, emitting 0x3F might make some sense. Normally, though, a utf-16-encoder would emit U+FFFD when it errors in this way. I think I would prefer if the resulting UTF-16 actually had U+FFFD instead of 0x3F (and actually, if this is a UTF-16 *encoder*, emitting the single byte 0x3F would result in the string not being valid UTF-16).

Emitting an HTML entity makes sense when encoding HTML text (the resulting isolated surrogate code point still shows in the output, but the text is now validly UTF-16).
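
The point about a single 0x3F byte can be checked with a short sketch. It assumes little-endian UTF-16 and uses Python's `utf-16-le` codec as the validity check; `is_valid_utf16le` is a helper invented for this illustration.

```python
prefix = "a".encode("utf-16-le")   # b"a\x00"
suffix = "b".encode("utf-16-le")   # b"b\x00"

# URL-mode behaviour as currently written: a single 0x3F byte.
with_raw_byte = prefix + b"\x3f" + suffix

# Alternative suggested above: U+FFFD encoded as one UTF-16 code unit.
with_fffd = prefix + "\ufffd".encode("utf-16-le") + suffix


def is_valid_utf16le(data: bytes) -> bool:
    """Return True if data decodes as UTF-16LE without errors."""
    try:
        data.decode("utf-16-le")
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf16le(with_raw_byte))   # False
print(is_valid_utf16le(with_fffd))       # True

# The odd byte also shifts every following code unit by one byte,
# so the "b" after the error is no longer recoverable.
misaligned = with_raw_byte.decode("utf-16-le", errors="replace")
print("b" in misaligned)                 # False
```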
Comment 5 Anne 2013-11-26 17:47:35 UTC
I don't think you can even get to the utf-16 encoder from the web platform stack. You'll end up using utf-8 instead. And it's not entirely clear to me if the utf-16 encoder should deal with non-Unicode-scalar-value input.
Comment 6 Addison Phillips 2013-11-26 18:06:11 UTC
You're probably right about not being able to get to the UTF-16 encoder directly. I'm trying to think of cases and the only one that occurs to me offhand would be reading data into a JS string? Or maybe writing an XML document (**NOT** XHTML, please note).

A UTF-16 encoder should deal with non-Unicode-scalar-value input: that is one of its edge conditions. Bad data exists everywhere and the failure conditions should be well-described. It's easy enough to chop a UTF-16 buffer between two surrogate code points (if your code is surrogate stupid). Similarly someone might use it as a form of attack ("?" has a meaning in syntaxes such as URL but U+D800 might look like a tofu box and not arouse suspicion).

In any case, don't you agree that the "error" instructions are for ASCII-compatible encodings and, as written, aren't quite right for a UTF-16 encoder? If you changed the word "byte" to "code unit", that might fix it (at the cost of confusion for all other encodings).
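
As a rough illustration of how chopping a buffer between two surrogates produces the input in question, the sketch below splits the UTF-16LE bytes of one non-BMP character. It relies on Python's `surrogatepass` error handler only to get the lone surrogate into a string; `strict_encode_fails` is a helper made up for this example.

```python
text = "\U0001F600"                  # one code point outside the BMP
encoded = text.encode("utf-16-le")   # a surrogate pair, 4 bytes

# Cut the buffer after the first code unit, i.e. between the two surrogates.
first_half = encoded[:2]

# "surrogatepass" lets Python decode the lone surrogate instead of raising.
lone = first_half.decode("utf-16-le", errors="surrogatepass")
print(hex(ord(lone)))                # 0xd83d


def strict_encode_fails(s: str, encoding: str) -> bool:
    """Return True if encoding s with strict error handling raises."""
    try:
        s.encode(encoding)
    except UnicodeEncodeError:
        return True
    return False


# Python's strict encoders treat the lone surrogate as an error as well.
print(strict_encode_fails(lone, "utf-16-le"))   # True
print(strict_encode_fails(lone, "utf-8"))       # True
```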
Comment 7 Anne 2013-11-28 12:38:25 UTC
Well, the question is whether the encoder needs to deal with lone surrogates or whether lone surrogates need to be handled before the encoder is invoked. I guess I could see the former make sense, but that would mean we need some special rules for utf-8 and utf-16 as they should always emit the byte sequence for U+FFFD for lone surrogates and never anything else.
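
One way to picture the "handled before the encoder is invoked" option is a pre-pass that maps every lone surrogate to U+FFFD, so that utf-8 and utf-16 only ever see scalar values. This is a sketch under that assumption, not spec text; `encode_with_fffd` is a hypothetical name. In a Python `str`, any code point in the surrogate range is by definition unpaired, since non-BMP characters are stored as single code points.

```python
import re

# Matches any surrogate code point; in a Python str these are always lone.
_SURROGATE = re.compile("[\ud800-\udfff]")


def encode_with_fffd(text: str, encoding: str) -> bytes:
    """Replace lone surrogates with U+FFFD, then encode normally."""
    return _SURROGATE.sub("\ufffd", text).encode(encoding)


print(encode_with_fffd("a\ud800b", "utf-8"))      # b'a\xef\xbf\xbdb'
print(encode_with_fffd("a\ud800b", "utf-16-le"))  # b'a\x00\xfd\xffb\x00'
```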
Comment 8 Anne 2013-12-12 16:08:09 UTC
I think the correct fix here is for the encoder error algorithm to push code points on the stream that is being converted.

Currently state switching for iso-2022-jp and such does not happen correctly either.

Fixing bug 16688 would make this easier I think.
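
A minimal sketch of that idea, assuming the non-URL error mode: on an encoder error, the code points of `&#nnn;` are pushed onto the front of the input stream and go through the same encoder as everything else. The function name is hypothetical, and Python's codecs stand in for the spec's encoders. Encoding one code point at a time only works for stateless encodings such as `utf-16-le`; a stateful encoding like iso-2022-jp would need a single encoder object kept alive across iterations, which is not shown here.

```python
from collections import deque


def encode_pushing_replacements(text: str, encoding: str) -> bytes:
    """Encode text; on error, push '&#nnn;' back onto the input stream.

    Assumes the encoding can represent '&', '#', ';' and ASCII digits,
    otherwise the loop would not terminate.
    """
    stream = deque(text)   # code points still waiting to be encoded
    out = bytearray()
    while stream:
        ch = stream.popleft()
        try:
            out += ch.encode(encoding)
        except UnicodeEncodeError:
            # extendleft() prepends items one by one, so feed it the
            # replacement reversed to keep it in reading order.
            stream.extendleft(reversed("&#%d;" % ord(ch)))
    return bytes(out)


result = encode_pushing_replacements("a\ud800b", "utf-16-le")
print(result.decode("utf-16-le"))   # a&#55296;b
```

Because the replacement goes through the encoder itself, the output stays valid in the target encoding without any special case for ASCII-incompatible encodings.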
Comment 9 Anne 2014-03-26 18:36:13 UTC
Fixed as part of bug 16688.

https://github.com/whatwg/encoding/commit/dc8e4c10c9b4a91f188f3145c2e31ddec4d52a78

This is a massive change, review appreciated!