23927 – ASCII-incompatible encoder error handling

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.


Status:	RESOLVED FIXED

Alias:	None

Product:	WHATWG
Classification:	Unclassified
Component:	Encoding
Version:	unspecified
Hardware:	PC Linux

Importance:	P2 normal
Target Milestone:	Unsorted
Assignee:	Anne
QA Contact:	sideshowbarker+encodingspec

Duplicates (1):	23926
Depends on:	16688

Reported:	2013-11-26 15:45 UTC by Simon Sapin
Modified:	2014-03-26 18:36 UTC
CC List:	3 users

Description Simon Sapin 2013-11-26 15:45:02 UTC

http://encoding.spec.whatwg.org/#encodings

[[
Otherwise, if encoder's error handling mode is URL, emit byte 0x3F.

Otherwise, emit the result of running utf-8 encode on U+0026, U+0023, followed by the shortest sequence of ASCII digits representing c in base ten, followed by U+003B.
]]

Is it intentional to emit bytes for the ASCII representation of `?` or `&#nnn;`, even if the encoding being used is not ASCII-compatible?

rust-encoding’s current implementation instead uses the current encoder to encode `?` or `&#nnn;` to bytes, and aborts if that fails (which I’m not convinced can ever happen, even in weird non-web encodings that this implementation supports.)

If this is intentional, I’ll file a bug on rust-encoding.
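To make the difference concrete, here is a minimal Python sketch of the two behaviours being compared. The function names are hypothetical, and Python's built-in codecs stand in for the encoders; this is neither the spec's algorithm nor rust-encoding's actual code.

```python
def replacement_text(code_point: int, url_mode: bool = False) -> str:
    """Text that stands in for a code point the encoder cannot represent."""
    return "?" if url_mode else f"&#{code_point};"


def replace_as_ascii_bytes(code_point: int, url_mode: bool = False) -> bytes:
    # Reading of the quoted spec text: emit ASCII bytes directly,
    # whatever the target encoding is.
    return replacement_text(code_point, url_mode).encode("ascii")


def replace_via_encoder(code_point: int, encoding: str, url_mode: bool = False) -> bytes:
    # Behaviour described above for rust-encoding: run the replacement
    # text through the encoder that is currently in use.
    return replacement_text(code_point, url_mode).encode(encoding)


# For an ASCII-incompatible encoding such as UTF-16LE the results differ.
ascii_form = replace_as_ascii_bytes(0xD800)
utf16_form = replace_via_encoder(0xD800, "utf-16-le")
```

For ASCII-compatible encodings both functions return the same bytes, which is why the question only matters for encodings such as UTF-16.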

Comment 1 Simon Sapin 2013-11-26 15:46:52 UTC

*** Bug 23926 has been marked as a duplicate of this bug. ***

Comment 2 Anne 2013-11-26 15:57:15 UTC

Example? Is an encoding not switching modes correctly?

Comment 3 Simon Sapin 2013-11-26 16:14:04 UTC

Although this "should not happen", the UTF-16 encoder is specified to emit an error for surrogate code points in the input:

http://encoding.spec.whatwg.org/#utf-16-encoder

Comment 4 Addison Phillips 2013-11-26 16:33:39 UTC

If the mode is URL, emitting 0x3F might make some sense. Normally, though, a utf-16-encoder would emit U+FFFD when it errors in this way. I think I would prefer if the resulting UTF-16 actually had U+FFFD instead of 0x3F (and actually, if this is a UTF-16 *encoder*, emitting the single byte 0x3F would result in the string not being valid UTF-16).

Emitting an HTML entity makes sense when encoding HTML text (the resulting isolated surrogate code point still shows in the output, but the text is now validly UTF-16).
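The validity concern can be illustrated with a small Python sketch. It assumes UTF-16LE without a BOM and uses Python's strict decoder as the validity check; `is_valid_utf16le` is a hypothetical helper name.

```python
def is_valid_utf16le(data: bytes) -> bool:
    """Return True if data decodes as UTF-16LE under strict error handling."""
    try:
        data.decode("utf-16-le")
        return True
    except UnicodeDecodeError:
        return False


before = "a".encode("utf-16-le")
after = "b".encode("utf-16-le")

# A single 0x3F byte spliced into the output misaligns every following code unit.
with_raw_byte = before + b"\x3f" + after

# U+FFFD encoded as a proper UTF-16LE code unit keeps the stream aligned.
with_fffd = before + "\ufffd".encode("utf-16-le") + after
```

Here `with_raw_byte` has an odd length, so the strict decoder rejects it, while `with_fffd` decodes to `a`, U+FFFD, `b`.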

Comment 5 Anne 2013-11-26 17:47:35 UTC

I don't think you can even get to the utf-16 encoder from the web platform stack. You'll end up using utf-8 instead. And it's not entirely clear to me if the utf-16 encoder should deal with non-Unicode-scalar-value input.

Comment 6 Addison Phillips 2013-11-26 18:06:11 UTC

You're probably right about not being able to get to the UTF-16 encoder directly. I'm trying to think of cases and the only one that occurs to me offhand would be reading data into a JS string? Or maybe writing an XML document (**NOT** XHTML, please note).

A UTF-16 encoder should deal with non-Unicode-scalar-value input: that is one of its edge conditions. Bad data exists everywhere and the failure conditions should be well-described. It's easy enough to chop a UTF-16 buffer between two surrogate code points (if your code is surrogate stupid). Similarly someone might use it as a form of attack ("?" has a meaning in syntaxes such as URL but U+D800 might look like a tofu box and not arouse suspicion).
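As a rough illustration of the chopped-buffer case, the Python sketch below splits a UTF-16LE buffer by code unit count and shows the lone surrogate that results. The helper names are hypothetical, and replacing the lone surrogate with U+FFFD is just one possible policy, not something the spec text above prescribes.

```python
from typing import Tuple


def split_code_units(text: str, n_units: int) -> Tuple[str, str]:
    """Split text after n_units UTF-16 code units, ignoring surrogate pairs."""
    data = text.encode("utf-16-le")
    head, tail = data[: 2 * n_units], data[2 * n_units :]
    # "surrogatepass" lets lone surrogates survive the round trip so we can inspect them.
    return (
        head.decode("utf-16-le", "surrogatepass"),
        tail.decode("utf-16-le", "surrogatepass"),
    )


def replace_lone_surrogates(text: str) -> str:
    """Replace surrogate code points with U+FFFD (in a Python str they are always lone)."""
    return "".join("\ufffd" if 0xD800 <= ord(ch) <= 0xDFFF else ch for ch in text)


# U+1F600 is the surrogate pair D83D DE00 in UTF-16; cutting after two
# code units separates the pair.
head, tail = split_code_units("a\U0001F600", 2)
safe_bytes = replace_lone_surrogates(head).encode("utf-16-le")
```

Python's own strict UTF-16 encoder raises `UnicodeEncodeError` on `head`, which is one example of an encoder treating non-scalar-value input as a well-defined failure condition.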

In any case, don't you agree that the "error" instructions are for ASCII-compatible encodings and, as written, aren't quite right for a UTF-16 encoder? If you changed the word "byte" to "code unit", that might fix it (at the cost of confusion for all other encodings).

Comment 7 Anne 2013-11-28 12:38:25 UTC

Well, the question is whether the encoder needs to deal with lone surrogates or whether lone surrogates need to be handled before the encoder is invoked. I guess I could see the former making sense, but that would mean we need some special rules for utf-8 and utf-16 as they should always emit the byte sequence for U+FFFD for lone surrogates and never anything else.

Comment 8 Anne 2013-12-12 16:08:09 UTC

I think the correct fix here is for the encoder error algorithm to push code points on the stream that is being converted.

Currently state switching for iso-2022-jp and such does not happen correctly either.

Fixing bug 16688 would make this easier I think.
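One way to read the proposal of pushing code points onto the stream being converted is sketched below in Python. It is a toy model under stated assumptions, not the spec's algorithm: `encode_stream` and `utf16le_encode_one` are hypothetical names, and the per-code-point handler returns `None` to signal an error.

```python
from collections import deque
from typing import Callable, Optional


def utf16le_encode_one(code_point: int) -> Optional[bytes]:
    """Toy handler: encode one code point as UTF-16LE, or None for a surrogate."""
    if 0xD800 <= code_point <= 0xDFFF:
        return None
    return chr(code_point).encode("utf-16-le")


def encode_stream(
    text: str,
    encode_one: Callable[[int], Optional[bytes]],
    url_mode: bool = False,
) -> bytes:
    stream = deque(ord(ch) for ch in text)
    out = bytearray()
    while stream:
        code_point = stream.popleft()
        encoded = encode_one(code_point)
        if encoded is None:
            replacement = "?" if url_mode else f"&#{code_point};"
            # Push the replacement code points back onto the front of the
            # stream so they pass through the same encoder as everything else.
            # This assumes the encoder can represent "?", "&", "#", ";" and
            # ASCII digits; otherwise the loop would never terminate.
            stream.extendleft(ord(ch) for ch in reversed(replacement))
            continue
        out += encoded
    return bytes(out)


html_result = encode_stream("a\ud800b", utf16le_encode_one)
url_result = encode_stream("a\ud800b", utf16le_encode_one, url_mode=True)
```

Because the replacement goes back through the handler instead of being written as raw bytes, a stateful handler would also get the chance to switch modes before emitting it, which is the iso-2022-jp concern raised in this comment.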

Comment 9 Anne 2014-03-26 18:36:13 UTC

Fixed as part of bug 16688.

https://github.com/whatwg/encoding/commit/dc8e4c10c9b4a91f188f3145c2e31ddec4d52a78

This is a massive change, review appreciated!