<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "https://www.w3.org/Bugs/Public/page.cgi?id=bugzilla.dtd">

<bugzilla version="5.0.4"
          urlbase="https://www.w3.org/Bugs/Public/"
          
          maintainer="sysbot+bugzilla@w3.org"
>

    <bug>
          <bug_id>24104</bug_id>
          
          <creation_ts>2013-12-15 20:27:51 +0000</creation_ts>
          <short_desc>Clarify how encoders should deal with lone surrogates</short_desc>
          <delta_ts>2014-04-11 11:58:59 +0000</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>1</classification_id>
          <classification>Unclassified</classification>
          <product>WHATWG</product>
          <component>Encoding</component>
          <version>unspecified</version>
          <rep_platform>PC</rep_platform>
          <op_sys>All</op_sys>
          <bug_status>RESOLVED</bug_status>
          <resolution>FIXED</resolution>
          
          
          <bug_file_loc></bug_file_loc>
          <status_whiteboard></status_whiteboard>
          <keywords></keywords>
          <priority>P2</priority>
          <bug_severity>normal</bug_severity>
          <target_milestone>Unsorted</target_milestone>
          
          
          <everconfirmed>1</everconfirmed>
          <reporter name="Mathias Bynens">mathias</reporter>
          <assigned_to name="Anne">annevk</assigned_to>
          <cc>bzbarsky</cc>
    
    <cc>hsivonen</cc>
    
    <cc>mathias</cc>
    
    <cc>mike</cc>
    
    <cc>simon.sapin</cc>
    
    <cc>www-international</cc>
    
    <cc>zcorpan</cc>
          
          <qa_contact>sideshowbarker+encodingspec</qa_contact>

      

      

      

          <comment_sort_order>oldest_to_newest</comment_sort_order>  
          <long_desc isprivate="0" >
    <commentid>97632</commentid>
    <comment_count>0</comment_count>
    <who name="Mathias Bynens">mathias</who>
    <bug_when>2013-12-15 20:27:51 +0000</bug_when>
    <thetext>Apparently the intent is to allow only scalar values and error on lone surrogates:

http://krijnhoetmer.nl/irc-logs/whatwg/20131214#l-500
http://krijnhoetmer.nl/irc-logs/whatwg/20131215#l-221</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>97633</commentid>
    <comment_count>1</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2013-12-16 01:05:20 +0000</bug_when>
    <thetext>http://lists.w3.org/Archives/Public/public-whatwg-archive/2013Sep/0020.html</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>103085</commentid>
    <comment_count>2</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2014-03-28 11:52:34 +0000</bug_when>
    <thetext>I tested this:

&lt;meta charset=windows-1252&gt;
&lt;form action=http://software.hixie.ch/utilities/cgi/test-tools/echo&gt;
&lt;input name=a&gt; &lt;script&gt; document.querySelector(&quot;input&quot;).value = &quot;\ud801&quot; &lt;/script&gt;
&lt;input type=submit&gt;
&lt;/form&gt;

Gecko does U+FFFD, Chrome gives back U+D801 (encoded as per &lt;form&gt; error mode as windows-1252 can express neither).
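
For illustration, the two windows-1252 outcomes can be approximated in Python. This is only a sketch under assumptions: `encode_form_value` and its `replace_surrogates` flag are made-up names, Python's `xmlcharrefreplace` error handler stands in for the &lt;form&gt; error mode, and percent-encoding of the result is left out.

```python
SURROGATES = range(0xD800, 0xE000)  # U+D800 through U+DFFF


def encode_form_value(value, replace_surrogates):
    """Encode a form value as windows-1252 with numeric character references.

    replace_surrogates=True approximates the Gecko result (lone surrogate
    becomes U+FFFD before the error mode runs); False approximates Chrome
    (the surrogate code point itself is escaped).
    """
    if replace_surrogates:
        value = "".join(
            "\ufffd" if ord(ch) in SURROGATES else ch for ch in value
        )
    # Anything windows-1252 cannot express is emitted as a decimal
    # numeric character reference, similar to the form error mode.
    return value.encode("windows-1252", errors="xmlcharrefreplace")


gecko_like = encode_form_value("\ud801", replace_surrogates=True)
chrome_like = encode_form_value("\ud801", replace_surrogates=False)
```

Here `gecko_like` carries the reference for 65533 (U+FFFD) and `chrome_like` the reference for 55297 (U+D801).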

Now if I set the encoding to utf-8, both Gecko and Chrome emit U+FFFD (as utf-8 bytes, percent-encoded).

utf-16 results in the same as utf-8 as expected.

So either each encoder&apos;s handler needs to catch the surrogate range and return error with U+FFFD (Gecko) or not (Chrome). Gecko&apos;s behavior is slightly saner, I suspect. I&apos;ll fix utf-8 and utf-16 to do this right away. Not sure who to consult about how we should change the rest.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>103086</commentid>
    <comment_count>3</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2014-03-28 12:01:05 +0000</bug_when>
    <thetext>I analyzed too quickly. In Gecko and Chrome, lone surrogates either never reach the utf-8 encoder (replaced by U+FFFD beforehand) or are replaced as part of the encoder. They do not result in an error, as that would cause something in the form of &amp;#...; to be emitted rather than a straight U+FFFD.

Boris, Henri, Simon, do you have any preferences on how we arrange the encoder setup? Should all encoders replace lone surrogates in the input stream with U+FFFD, or should we make encoders only take Unicode scalar values and let a layer before them handle the lone surrogates?
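
To make the second option concrete, here is a rough sketch of such a pre-encoder layer in Python. `to_scalar_values` is a hypothetical helper, not something taken from the Encoding Standard; it assumes the input is a list of 16-bit code units, as in a JavaScript string.

```python
REPLACEMENT = 0xFFFD
LEAD = range(0xD800, 0xDC00)   # lead (high) surrogates
TRAIL = range(0xDC00, 0xE000)  # trail (low) surrogates


def to_scalar_values(code_units):
    """Convert 16-bit code units to a list of Unicode scalar values.

    A lead surrogate followed by a trail surrogate is combined into one
    supplementary code point; any other surrogate is lone and becomes U+FFFD.
    """
    out = []
    i = 0
    n = len(code_units)
    while i != n:
        unit = code_units[i]
        has_next = i + 1 != n
        if unit in LEAD and has_next and code_units[i + 1] in TRAIL:
            trail = code_units[i + 1]
            out.append(0x10000 + (unit - 0xD800) * 0x400 + (trail - 0xDC00))
            i += 2
            continue
        # Lone surrogates never reach the encoder; everything else passes.
        out.append(REPLACEMENT if unit in LEAD or unit in TRAIL else unit)
        i += 1
    return out


scalars = to_scalar_values([0x61, 0xD801, 0xD83D, 0xDCA9])
```

With a layer like this in front, the encoders themselves would only ever see scalar values and would not need a surrogate check of their own.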

It seems more pragmatic to have encoders take code points. Maybe I should introduce a special lone surrogate error that does the replacing to U+FFFD?</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>103096</commentid>
    <comment_count>4</comment_count>
    <who name="Simon Pieters">zcorpan</who>
    <bug_when>2014-03-28 13:18:57 +0000</bug_when>
    <thetext>No opinion</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>103101</commentid>
    <comment_count>5</comment_count>
    <who name="Boris Zbarsky">bzbarsky</who>
    <bug_when>2014-03-28 15:48:24 +0000</bug_when>
    <thetext>Are we talking about encoders generally or the specific case of form submission?</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>103103</commentid>
    <comment_count>6</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2014-03-28 15:58:56 +0000</bug_when>
    <thetext>Generally. But it affects form submission and URLs of course.

It seems Unicode has the contract as a mapping of Unicode scalar values (code points minus surrogates) to bytes and vice versa. That seems reasonable to me but does mean that everyone using encoders/decoders has to convert their code point sequence to a Unicode scalar value sequence first.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>103105</commentid>
    <comment_count>7</comment_count>
    <who name="Boris Zbarsky">bzbarsky</who>
    <bug_when>2014-03-28 16:33:47 +0000</bug_when>
    <thetext>So this is just about which exact layer does the lone surrogate replacement with U+FFFD; black-box the resulting behavior is the same?</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>103106</commentid>
    <comment_count>8</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2014-03-28 18:05:34 +0000</bug_when>
    <thetext>Well, as shown in comment 2, the behavior currently differs for encodings other than utf-8 and utf-16le/be. Chrome will emit lone surrogates escaped (meaning its encoders take code points), whereas Firefox emits lone surrogates as U+FFFD escaped.

Other than that it is mostly a layer and debugging question I suppose, yes, but also affects whether e.g. IDL needs [EnsureUTF16] or some such or not.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>103736</commentid>
    <comment_count>9</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2014-04-11 11:58:59 +0000</bug_when>
    <thetext>https://github.com/whatwg/encoding/commit/4abe74d1400c5ab8913c5f229b59b237ae5aac51</thetext>
  </long_desc>
      
      

    </bug>

</bugzilla>