<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "https://www.w3.org/Bugs/Public/page.cgi?id=bugzilla.dtd">

<bugzilla version="5.0.4"
          urlbase="https://www.w3.org/Bugs/Public/"
          
          maintainer="sysbot+bugzilla@w3.org"
>

    <bug>
          <bug_id>23927</bug_id>
          
          <creation_ts>2013-11-26 15:45:02 +0000</creation_ts>
          <short_desc>ASCII-incompatible encoder error handling</short_desc>
          <delta_ts>2014-03-26 18:36:13 +0000</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>1</classification_id>
          <classification>Unclassified</classification>
          <product>WHATWG</product>
          <component>Encoding</component>
          <version>unspecified</version>
          <rep_platform>PC</rep_platform>
          <op_sys>Linux</op_sys>
          <bug_status>RESOLVED</bug_status>
          <resolution>FIXED</resolution>
          
          
          <bug_file_loc></bug_file_loc>
          <status_whiteboard></status_whiteboard>
          <keywords></keywords>
          <priority>P2</priority>
          <bug_severity>normal</bug_severity>
          <target_milestone>Unsorted</target_milestone>
          <dependson>16688</dependson>
          
          <everconfirmed>1</everconfirmed>
          <reporter name="Simon Sapin">simon.sapin</reporter>
          <assigned_to name="Anne">annevk</assigned_to>
          <cc>addison</cc>
    
    <cc>mike</cc>
    
    <cc>www-international</cc>
          
          <qa_contact>sideshowbarker+encodingspec</qa_contact>

      

      

      

          <comment_sort_order>oldest_to_newest</comment_sort_order>  
          <long_desc isprivate="0" >
    <commentid>96834</commentid>
    <comment_count>0</comment_count>
    <who name="Simon Sapin">simon.sapin</who>
    <bug_when>2013-11-26 15:45:02 +0000</bug_when>
    <thetext>http://encoding.spec.whatwg.org/#encodings

[[
Otherwise, if encoder&apos;s error handling mode is URL, emit byte 0x3F.

Otherwise, emit the result of running utf-8 encode on U+0026, U+0023, followed by the shortest sequence of ASCII digits representing c in base ten, followed by U+003B.
]]

Is it intentional to emit bytes for the ASCII representation of `?` or `&amp;#nnn;`, even if the encoding being used is not ASCII-compatible?

rust-encoding’s current implementation instead uses the current encoder to encode `?` or `&amp;#nnn;` to bytes, and aborts if that fails (which I’m not convinced can ever happen, even in weird non-web encodings that this implementation supports.)
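
To make the difference concrete, here is a rough Python sketch. It is an illustration only, taken neither from the spec nor from rust-encoding: the function names are made up, and Python's utf-16-le codec stands in for an ASCII-incompatible encoder.

```python
def numeric_reference(code_point):
    # U+0026, U+0023, the decimal digits of the code point, then U+003B.
    return chr(0x26) + chr(0x23) + str(code_point) + chr(0x3B)


def error_bytes_as_specified(code_point):
    # Reading of the quoted spec text: always emit the utf-8 (ASCII) bytes,
    # whatever the target encoding is.
    return numeric_reference(code_point).encode("utf-8")


def error_bytes_via_encoder(code_point, codec):
    # The alternative described above: run the reference through the
    # encoder that is currently in use (codec is a Python codec name).
    return numeric_reference(code_point).encode(codec)


lone_surrogate = 0xD800
as_specified = error_bytes_as_specified(lone_surrogate)
via_encoder = error_bytes_via_encoder(lone_surrogate, "utf-16-le")

# Decoding both results as utf-16-le shows which one round-trips.
print(as_specified.decode("utf-16-le") == numeric_reference(lone_surrogate))
print(via_encoder.decode("utf-16-le") == numeric_reference(lone_surrogate))
```

With a UTF-16 target, only the second variant decodes back to the reference; the ASCII bytes get reinterpreted as unrelated 16-bit code units.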

If this is intentional, I’ll file a bug on rust-encoding.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>96836</commentid>
    <comment_count>1</comment_count>
    <who name="Simon Sapin">simon.sapin</who>
    <bug_when>2013-11-26 15:46:52 +0000</bug_when>
    <thetext>*** Bug 23926 has been marked as a duplicate of this bug. ***</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>96837</commentid>
    <comment_count>2</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2013-11-26 15:57:15 +0000</bug_when>
    <thetext>Example? Is an encoding not switching modes correctly?</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>96839</commentid>
    <comment_count>3</comment_count>
    <who name="Simon Sapin">simon.sapin</who>
    <bug_when>2013-11-26 16:14:04 +0000</bug_when>
    <thetext>Although this &quot;should not happen&quot;, the UTF-16 encoder is specified to emit an error for surrogate code points in the input:

http://encoding.spec.whatwg.org/#utf-16-encoder</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>96840</commentid>
    <comment_count>4</comment_count>
    <who name="Addison Phillips">addison</who>
    <bug_when>2013-11-26 16:33:39 +0000</bug_when>
    <thetext>If the mode is URL, emitting 0x3F might make some sense. Normally, though, a utf-16 encoder would emit U+FFFD when it errors in this way. I think I would prefer it if the resulting UTF-16 actually had U+FFFD instead of 0x3F (and actually, if this is a UTF-16 *encoder*, emitting the single byte 0x3F would result in the string not being valid UTF-16).

Emitting an HTML entity makes sense when encoding HTML text (the resulting isolated surrogate code point still shows in the output, but the text is now validly UTF-16).</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>96843</commentid>
    <comment_count>5</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2013-11-26 17:47:35 +0000</bug_when>
    <thetext>I don&apos;t think you can even get to the utf-16 encoder from the web platform stack. You&apos;ll end up using utf-8 instead. And it&apos;s not entirely clear to me if the utf-16 encoder should deal with non-Unicode-scalar-value input.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>96845</commentid>
    <comment_count>6</comment_count>
    <who name="Addison Phillips">addison</who>
    <bug_when>2013-11-26 18:06:11 +0000</bug_when>
    <thetext>You&apos;re probably right about not being able to get to the UTF-16 encoder directly. I&apos;m trying to think of cases, and the only one that occurs to me offhand would be reading data into a JS string? Or maybe writing an XML document (**NOT** XHTML, please note).

A UTF-16 encoder should deal with non-Unicode-scalar-value input: that is one of its edge conditions. Bad data exists everywhere and the failure conditions should be well-described. It&apos;s easy enough to chop a UTF-16 buffer between two surrogate code points (if your code is surrogate stupid). Similarly someone might use it as a form of attack (&quot;?&quot; has a meaning in syntaxes such as URL but U+D800 might look like a tofu box and not arouse suspicion).

In any case, don&apos;t you agree that the &quot;error&quot; instructions are for ASCII-compatible encodings and, as written, aren&apos;t quite right for a UTF-16 encoder? If you changed the word &quot;byte&quot; to &quot;code unit&quot;, that might fix it (at the cost of confusion for all other encodings).</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>96910</commentid>
    <comment_count>7</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2013-11-28 12:38:25 +0000</bug_when>
    <thetext>Well, the question is whether the encoder needs to deal with lone surrogates or whether lone surrogates need to be handled before the encoder is invoked. I guess I could see the former making sense, but that would mean we need some special rules for utf-8 and utf-16, as they should always emit the byte sequence for U+FFFD for lone surrogates and never anything else.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>97516</commentid>
    <comment_count>8</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2013-12-12 16:08:09 +0000</bug_when>
    <thetext>I think the correct fix here is for the encoder error algorithm to push code points on the stream that is being converted.
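
Roughly, as a Python sketch. This is an illustration rather than spec text: the handler interface and all names are hypothetical, and it assumes the ASCII characters of the reference are always encodable (otherwise the loop would not terminate).

```python
from collections import deque


def encode_with_html_errors(text, encode_code_point):
    # encode_code_point is a hypothetical per-code-point handler that
    # returns bytes on success and None to signal an error.
    stream = deque(text)
    out = bytearray()
    while stream:
        ch = stream.popleft()
        result = encode_code_point(ch)
        if result is None:
            # Instead of emitting ASCII bytes directly, push the code points
            # of the numeric reference back onto the front of the stream so
            # that they pass through the same encoder.
            reference = chr(0x26) + chr(0x23) + str(ord(ch)) + chr(0x3B)
            stream.extendleft(reversed(reference))
            continue
        out += result
    return bytes(out)


def utf16le_handler(ch):
    # Stand-in for a utf-16 encoder handler: errors on surrogate code points.
    if ord(ch) in range(0xD800, 0xE000):
        return None
    return ch.encode("utf-16-le")


encoded = encode_with_html_errors("a\ud800b", utf16le_handler)
print(encoded.decode("utf-16-le"))
```

Because the replacement goes back through the handler, a stateful encoder would also get the chance to switch state before emitting it.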

Currently state switching for iso-2022-jp and such does not happen correctly either.

Fixing bug 16688 would make this easier I think.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>102938</commentid>
    <comment_count>9</comment_count>
    <who name="Anne">annevk</who>
    <bug_when>2014-03-26 18:36:13 +0000</bug_when>
    <thetext>Fixed as part of bug 16688.

https://github.com/whatwg/encoding/commit/dc8e4c10c9b4a91f188f3145c2e31ddec4d52a78

This is a massive change, review appreciated!</thetext>
  </long_desc>
      
      

    </bug>

</bugzilla>