http://encoding.spec.whatwg.org/#encodings

[[ Otherwise, if encoder's error handling mode is URL, emit byte 0x3F. Otherwise, emit the result of running utf-8 encode on U+0026, U+0023, followed by the shortest sequence of ASCII digits representing c in base ten, followed by U+003B. ]]

Is it intentional to emit bytes for the ASCII representation of `?` or `&#nnn;` even if the encoding being used is not ASCII-compatible?

rust-encoding's current implementation instead uses the current encoder to encode `?` or `&#nnn;` to bytes, and aborts if that fails (though I'm not convinced that can ever happen, even with the weird non-web encodings this implementation supports). If this is intentional, I'll file a bug on rust-encoding.
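(For illustration, a minimal Rust sketch of the behavior described for rust-encoding: build the "&#nnn;" replacement, then run it through the current encoder, aborting if that fails. The Encoder trait and AsciiEncoder are hypothetical stand-ins, not the actual rust-encoding API.)

    // Hypothetical Encoder trait standing in for rust-encoding's real API.
    trait Encoder {
        // Returns None if the replacement itself cannot be encoded.
        fn encode(&mut self, s: &str) -> Option<Vec<u8>>;
    }

    // Encode "&#nnn;" through the current encoder rather than emitting raw
    // ASCII bytes, aborting (panicking here) if the replacement fails.
    fn html_fallback(enc: &mut dyn Encoder, c: char) -> Vec<u8> {
        let replacement = format!("&#{};", c as u32);
        enc.encode(&replacement).expect("replacement could not be encoded")
    }

    struct AsciiEncoder;
    impl Encoder for AsciiEncoder {
        fn encode(&mut self, s: &str) -> Option<Vec<u8>> {
            s.is_ascii().then(|| s.as_bytes().to_vec())
        }
    }

    fn main() {
        assert_eq!(html_fallback(&mut AsciiEncoder, '€'), b"&#8364;");
    }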
*** Bug 23926 has been marked as a duplicate of this bug. ***
Example? Is an encoding not switching modes correctly?
Although this "should not happen", the UTF-16 encoder is specified to emit an error for surrogate code points in its input: http://encoding.spec.whatwg.org/#utf-16-encoder
If the mode is URL, emitting 0x3F might make some sense. Normally, though, a UTF-16 encoder would emit U+FFFD when it errors this way. I would prefer the resulting UTF-16 to actually contain U+FFFD instead of 0x3F (and since this is a UTF-16 *encoder*, emitting the single byte 0x3F would result in the output not being valid UTF-16). Emitting an HTML entity makes sense when encoding HTML text: the isolated surrogate code point still shows up in the output, but the text is now valid UTF-16.
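(A quick Rust sketch of why that single byte breaks the output: UTF-16 code units are two bytes each, so appending one 0x3F byte leaves the stream with an odd length that no longer splits into code units.)

    fn main() {
        let mut out: Vec<u8> = Vec::new();
        for unit in "ab".encode_utf16() {
            out.extend_from_slice(&unit.to_le_bytes()); // UTF-16LE, 2 bytes per unit
        }
        out.push(0x3F); // single-byte '?'
        assert_eq!(out.len() % 2, 1); // odd length: no longer decodable as UTF-16
    }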
I don't think you can even get to the utf-16 encoder from the web platform stack. You'll end up using utf-8 instead. And it's not entirely clear to me if the utf-16 encoder should deal with non-Unicode-scalar-value input.
You're probably right about not being able to get to the UTF-16 encoder directly. The only cases that occur to me offhand would be reading data into a JS string, or maybe writing an XML document (**NOT** XHTML, please note).

A UTF-16 encoder should deal with non-Unicode-scalar-value input: that is one of its edge conditions. Bad data exists everywhere, and the failure conditions should be well described. It's easy enough to chop a UTF-16 buffer between the two halves of a surrogate pair (if your code is surrogate-stupid). Similarly, someone might use it as a form of attack: "?" has a meaning in syntaxes such as URLs, while U+D800 might just look like a tofu box and not arouse suspicion.

In any case, don't you agree that the "error" instructions are written for ASCII-compatible encodings and, as written, aren't quite right for a UTF-16 encoder? Changing the word "byte" to "code unit" might fix it (at the cost of confusion for every other encoding).
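(A minimal Rust sketch of the truncation case mentioned above: U+1F600 encodes as a surrogate pair, and chopping the buffer between its two halves leaves an invalid lone surrogate.)

    fn main() {
        let units: Vec<u16> = "😀".encode_utf16().collect();
        assert_eq!(units, [0xD83D, 0xDE00]); // high + low surrogate pair
        let truncated = &units[..1];         // chopped between the two halves
        assert!(String::from_utf16(truncated).is_err()); // lone surrogate is invalid
    }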
Well, the question is whether the encoder needs to deal with lone surrogates or whether lone surrogates need to be handled before the encoder is invoked. I guess I could see the former making sense, but that would mean we need some special rules for utf-8 and utf-16, as they should always emit the byte sequence for U+FFFD for lone surrogates and never anything else.
I think the correct fix here is for the encoder error algorithm to push code points onto the stream that is being converted. Currently, state switching for iso-2022-jp and the like does not happen correctly either. Fixing bug 16688 would make this easier, I think.
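(A rough Rust sketch of the push-back idea, assuming a simple VecDeque<char> stands in for the spec's stream: the error handler prepends the "&#nnn;" replacement to the input so the active encoder, with whatever state it is in, such as iso-2022-jp mode switching, encodes it itself. Names are illustrative, not spec text.)

    use std::collections::VecDeque;

    // Prepend "&#<decimal>;" to the stream so the encoder consumes it next,
    // in order, using its own state machine (e.g. iso-2022-jp mode switches).
    fn handle_unmappable(stream: &mut VecDeque<char>, c: char) {
        for r in format!("&#{};", c as u32).chars().rev() {
            stream.push_front(r);
        }
    }

    fn main() {
        let mut stream: VecDeque<char> = "rest".chars().collect();
        handle_unmappable(&mut stream, '€');
        assert_eq!(stream.iter().collect::<String>(), "&#8364;rest");
    }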
Fixed as part of bug 16688:

https://github.com/whatwg/encoding/commit/dc8e4c10c9b4a91f188f3145c2e31ddec4d52a78

This is a massive change, review appreciated!