24104 – Clarify how encoders should deal with lone surrogates

This is an archived snapshot of W3C's public Bugzilla bug tracker, decommissioned in April 2019.

Bug 24104 - Clarify how encoders should deal with lone surrogates

Status:	RESOLVED FIXED

Alias:	None

Product:	WHATWG
Classification:	Unclassified
Component:	Encoding
Version:	unspecified
Hardware:	PC All

Importance:	P2 normal
Target Milestone:	Unsorted
Assignee:	Anne
QA Contact:	sideshowbarker+encodingspec


Reported:	2013-12-15 20:27 UTC by Mathias Bynens
Modified:	2014-04-11 11:58 UTC
CC List:	7 users

Description Mathias Bynens 2013-12-15 20:27:51 UTC

Apparently the intent is to allow only scalar values and error on lone surrogates:

http://krijnhoetmer.nl/irc-logs/whatwg/20131214#l-500
http://krijnhoetmer.nl/irc-logs/whatwg/20131215#l-221

Comment 1 Anne 2013-12-16 01:05:20 UTC

http://lists.w3.org/Archives/Public/public-whatwg-archive/2013Sep/0020.html

Comment 2 Anne 2014-03-28 11:52:34 UTC

I tested this:

<meta charset=windows-1252>
<form action=http://software.hixie.ch/utilities/cgi/test-tools/echo>
<input name=a> <script> document.querySelector("input").value = "\ud801" </script>
<input type=submit>
</form>

Gecko submits U+FFFD, while Chrome gives back U+D801 (each encoded as per <form> error mode, as windows-1252 can express neither).

Now if I set the encoding to utf-8, both Gecko and Chrome emit U+FFFD (as percent-encoded utf-8 bytes).

utf-16 gives the same result as utf-8, as expected.

So each encoder's handler either needs to catch the surrogate range and return an error with U+FFFD (Gecko) or not (Chrome). Gecko's behavior is slightly saner, I suspect. I'll fix utf-8 and utf-16 to do this right away. Not sure who to consult on how we should change the rest.

Comment 3 Anne 2014-03-28 12:01:05 UTC

I analyzed too quickly. In Gecko and Chrome, lone surrogates either never reach the utf-8 encoder (they are replaced by U+FFFD beforehand) or are replaced as part of the encoder. They do not result in an error, as that would cause something of the form &#...; to be emitted rather than a straight U+FFFD.

Boris, Henri, Simon, do you have any preferences on how we arrange the encoder setup? Should all encoders replace lone surrogates in the input stream with U+FFFD, or should we make encoders only take Unicode scalar values and let a layer before them handle the lone surrogates?

It seems more pragmatic to have encoders take code points. Maybe I should introduce a special lone surrogate error that does the replacement with U+FFFD?
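
For illustration, the utf-8 outcome described in this comment can be observed with the standard TextEncoder API, which takes its input as a sequence of scalar values, so a lone surrogate is turned into U+FFFD before any bytes are produced. This is a minimal sketch of the observable result only, not of what either browser did internally at the time; the helper name utf8Bytes is hypothetical.

```javascript
// Hypothetical helper: returns the utf-8 bytes of a string as a plain array.
// TextEncoder converts its argument to scalar values first, so a lone
// surrogate is encoded as U+FFFD (bytes EF BF BD) rather than raising an error.
function utf8Bytes(input) {
  return Array.from(new TextEncoder().encode(input));
}

// A lone high surrogate, as in the test case from comment 2.
const loneSurrogateBytes = utf8Bytes("\ud801");

// Percent-encode the bytes the way a form submission would show them.
const percentEncoded = loneSurrogateBytes
  .map((b) => "%" + b.toString(16).toUpperCase().padStart(2, "0"))
  .join("");
```

Here percentEncoded is the percent-encoded form of U+FFFD in utf-8, with no &#...; escape involved.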

Comment 4 Simon Pieters 2014-03-28 13:18:57 UTC

No opinion

Comment 5 Boris Zbarsky 2014-03-28 15:48:24 UTC

Are we talking about encoders generally or the specific case of form submission?

Comment 6 Anne 2014-03-28 15:58:56 UTC

Generally. But it affects form submission and URLs of course.

It seems Unicode defines the contract as a mapping from Unicode scalar values (code points minus surrogates) to bytes and vice versa. That seems reasonable to me, but it does mean that everyone using encoders/decoders has to convert their code point sequence to a Unicode scalar value sequence first.
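
As a rough sketch of what such a conversion layer could look like, the following turns a JavaScript string (a sequence of 16-bit code units) into a sequence of scalar values, combining valid surrogate pairs and replacing lone surrogates with U+FFFD. The function name toScalarValues is hypothetical and this is one possible arrangement, not the text that ended up in the specification.

```javascript
// Hypothetical pre-encoder layer: code units in, scalar values out.
function toScalarValues(input) {
  const out = [];
  for (let i = 0; i < input.length; i++) {
    const unit = input.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // High surrogate: valid only if immediately followed by a low surrogate.
      const next = i + 1 < input.length ? input.charCodeAt(i + 1) : 0;
      if (next >= 0xDC00 && next <= 0xDFFF) {
        out.push(0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00));
        i++; // consume the low surrogate as well
      } else {
        out.push(0xFFFD); // lone high surrogate
      }
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      out.push(0xFFFD); // lone low surrogate
    } else {
      out.push(unit); // already a scalar value
    }
  }
  return out;
}
```

With a layer like this in front, an encoder never sees a value in the range U+D800 to U+DFFF and needs no surrogate handling of its own.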

Comment 7 Boris Zbarsky 2014-03-28 16:33:47 UTC

So this is just about which exact layer does the lone surrogate replacement with U+FFFD; as a black box, the resulting behavior is the same?

Comment 8 Anne 2014-03-28 18:05:34 UTC

Well, as shown in comment 2, the behavior currently differs for encodings other than utf-8 and utf-16le/be. Chrome will emit lone surrogates escaped (meaning its encoders take code points), whereas Firefox emits lone surrogates as U+FFFD escaped.

Other than that it is mostly a layering and debugging question, I suppose, yes, but it also affects whether e.g. IDL needs [EnsureUTF16] or some such.
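
To make the difference concrete, here is a toy sketch of the two behaviors under the <form> error mode, where an unencodable code point is written as a decimal &#...; escape. The encoder below is a stand-in that treats only ASCII as encodable (an assumption made for brevity; real windows-1252 covers more), and the function name and the replaceSurrogatesFirst flag are hypothetical.

```javascript
// Toy single-byte encoder with a form-style error mode.
// ASSUMPTION: only ASCII is treated as encodable, unlike real windows-1252.
function encodeFormMode(codePoints, replaceSurrogatesFirst) {
  let out = "";
  for (let cp of codePoints) {
    // Behavior as reported for Gecko: lone surrogates become U+FFFD first.
    if (replaceSurrogatesFirst && cp >= 0xD800 && cp <= 0xDFFF) {
      cp = 0xFFFD;
    }
    if (cp <= 0x7F) {
      out += String.fromCharCode(cp);
    } else {
      // Error mode: emit the code point as a decimal character reference.
      out += "&#" + cp + ";";
    }
  }
  return out;
}

const chromeLike = encodeFormMode([0x61, 0xD801], false); // "a&#55297;"
const geckoLike = encodeFormMode([0x61, 0xD801], true);   // "a&#65533;"
```

Neither encoding can express U+D801 or U+FFFD, so both outputs are escapes; they differ only in which code point is escaped.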

Comment 9 Anne 2014-04-11 11:58:59 UTC

https://github.com/whatwg/encoding/commit/4abe74d1400c5ab8913c5f229b59b237ae5aac51