<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "https://www.w3.org/Bugs/Public/page.cgi?id=bugzilla.dtd">

<bugzilla version="5.0.4"
          urlbase="https://www.w3.org/Bugs/Public/"
          
          maintainer="sysbot+bugzilla@w3.org"
>

    <bug>
          <bug_id>19102</bug_id>
          
          <creation_ts>2012-09-28 03:27:11 +0000</creation_ts>
          <short_desc>A semicolon-less named character reference in an attribute should also be treated as a parse error.</short_desc>
          <delta_ts>2013-01-31 00:37:03 +0000</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>1</classification_id>
          <classification>Unclassified</classification>
          <product>WHATWG</product>
          <component>HTML</component>
          <version>unspecified</version>
          <rep_platform>Other</rep_platform>
          <op_sys>other</op_sys>
          <bug_status>RESOLVED</bug_status>
          <resolution>FIXED</resolution>
          
          
          <bug_file_loc>http://www.whatwg.org/specs/web-apps/current-work/#tokenizing-character-references</bug_file_loc>
          <status_whiteboard></status_whiteboard>
          <keywords></keywords>
          <priority>P3</priority>
          <bug_severity>normal</bug_severity>
          <target_milestone>Unsorted</target_milestone>
          
          
          <everconfirmed>1</everconfirmed>
          <reporter>contributor</reporter>
          <assigned_to name="Ian &apos;Hixie&apos; Hickson">ian</assigned_to>
          <cc>hsivonen</cc>
    
    <cc>ian</cc>
    
    <cc>kennyluck</cc>
    
    <cc>mike</cc>
    
    <cc>naesten</cc>
    
    <cc>qjy1111</cc>
    
    <cc>zcorpan</cc>
          
          <qa_contact>contributor</qa_contact>

      

      

      

          <comment_sort_order>oldest_to_newest</comment_sort_order>  
          <long_desc isprivate="0" >
    <commentid>74708</commentid>
    <comment_count>0</comment_count>
    <who name="">contributor</who>
    <bug_when>2012-09-28 03:27:11 +0000</bug_when>
    <thetext>Specification: http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html
Multipage: http://www.whatwg.org/C#tokenizing-character-references
Complete: http://www.whatwg.org/c#tokenizing-character-references

Comment:
A semicolon-less named character reference in an attribute should also be
treated as a parse error.

Posted from: 119.161.158.96 by kennyluck@csail.mit.edu
User agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>74710</commentid>
    <comment_count>1</comment_count>
    <who name="Kang-Hao (Kenny) Lu">kennyluck</who>
    <bug_when>2012-09-28 03:34:31 +0000</bug_when>
    <thetext>I think the spec intentionally treats this case as not an error, but I propose we make it a parse error (or some other author conformance violation) *temporarily*, until the market share of IE9 and below drops below a certain level.

The reason is simple: this, being surprising, still traps people.

Also, validator.nu (and hence Firefox source view) treats that as an error, which I believe to be a good thing for authors. See, for example, http://validator.nu/?doc=data%3Atext%2Fhtml%2C%3C!DOCTYPE+html%3E%3Ctitle%3E%3C%2Ftitle%3E%3Ca+href%3D%22a%3D1%26reg%3D2%22%3E%3C%2Fa%3E</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>76335</commentid>
    <comment_count>2</comment_count>
    <who name="Samuel Bronson">naesten</who>
    <bug_when>2012-10-15 19:47:58 +0000</bug_when>
    <thetext>If this case is intentionally not a parse error, this should be made explicit, like in the case marked: &quot;Not a character reference. No characters are consumed, and nothing is returned. (This is not an error, either.)&quot;

But I don&apos;t see why this shouldn&apos;t be an error: that doesn&apos;t appear to mean anything more than &quot;documents may not do this, validators have to warn if they do, and it&apos;s okay for implementations to give up on seeing it.&quot;  Only the license to give up could actually lead to breakage, and the big browsers won&apos;t actually do this until it&apos;s about time to excise this case, will they?</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>76787</commentid>
    <comment_count>3</comment_count>
    <who name="Ian &apos;Hixie&apos; Hickson">ian</who>
    <bug_when>2012-10-19 22:58:48 +0000</bug_when>
    <thetext>I&apos;m confused. Can you give a precise example of what you think should be invalid but isn&apos;t?

If you mean this:

   &lt;a href=&quot;a=1&amp;reg=2&quot;&gt;&lt;/a&gt;

...then that is intentionally conforming, because it&apos;s very common and harmless.

(Looks like validator.nu isn&apos;t up to date on that.)</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>76802</commentid>
    <comment_count>4</comment_count>
    <who name="Kang-Hao (Kenny) Lu">kennyluck</who>
    <bug_when>2012-10-20 00:50:51 +0000</bug_when>
    <thetext>(In reply to comment #3)
&gt; I&apos;m confused. Can you give a precise example of what you think should be
&gt; invalid but isn&apos;t?

That example is one such case.

&gt; 
&gt; If you mean this:
&gt; 
&gt;    &lt;a href=&quot;a=1&amp;reg=2&quot;&gt;&lt;/a&gt;
&gt; 
&gt; ...then that is intentionally conforming, because it&apos;s very common and
&gt; harmless.

The harm is that it won&apos;t work in IE9 and below. Anyway, I am not at all clear about the value of author conformance, but it doesn&apos;t seem to be a bad thing for the validator to warn the user or raise an error whenever it sees this.

&gt; (Looks like validator.nu isn&apos;t up to date on that.)

Right.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>76823</commentid>
    <comment_count>5</comment_count>
    <who name="Ian &apos;Hixie&apos; Hickson">ian</who>
    <bug_when>2012-10-20 23:33:12 +0000</bug_when>
    <thetext>That IE works differently on this than other browsers is news to me. I thought we checked this really carefully before changing the parser on this.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>76824</commentid>
    <comment_count>6</comment_count>
    <who name="Kang-Hao (Kenny) Lu">kennyluck</who>
    <bug_when>2012-10-21 00:11:27 +0000</bug_when>
    <thetext>(In reply to comment #5)
&gt; That IE works differently on this than other browsers is news to me. I
&gt; thought we checked this really carefully before changing the parser on this.

I verified again that IE9, in all its modes, works differently. But this parsing has use cases anyway, and I was told that this is fixed in IE10.

Test case: http://software.hixie.ch/utilities/js/live-dom-viewer/saved/1857</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>77828</commentid>
    <comment_count>7</comment_count>
    <who name="Michael[tm] Smith">mike</who>
    <bug_when>2012-11-04 10:44:01 +0000</bug_when>
    <thetext>FWIW, regarding the validator behavior for cases such as &quot;&amp;reg&quot;: I&apos;ve written a patch to have the validator.nu HTML parser behave per the spec for the general case -- that is, to no longer report any error for semicolon-less strings after ampersands.

However even with that patch applied the validator currently will still report an error for the case of &quot;&amp;reg&quot; and some others that the named-character-references table lists both with and without the semicolon. The actual error it reports is &quot;Named character reference was not terminated by a semicolon&quot;. It only reports that for the special case of things such as &quot;&amp;reg&quot; and not for the case of some arbitrary string like &quot;&amp;foo&quot;.

If IE behavior for &quot;&amp;reg&quot; is as Kenny describes, it seems useful for the validator to emit an error for it -- and so for the spec to define semicolon-less named character references such as &quot;&amp;reg&quot; as a parse error.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>77859</commentid>
    <comment_count>8</comment_count>
    <who name="Henri Sivonen">hsivonen</who>
    <bug_when>2012-11-05 10:26:13 +0000</bug_when>
    <thetext>See bug 19718. I think it was a mistake to make unescaped ampersands be non-errors. If there is an ampersand and it is not followed by a semicolon-terminated entity name, it may well be a typo. It seems better aligned with the purpose of validation to catch those typos than to make it so that www.google.com as it exists has no parse errors (IIRC, the case Sam presented for the trickery under discussion).</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>77865</commentid>
    <comment_count>9</comment_count>
    <who name="Kang-Hao (Kenny) Lu">kennyluck</who>
    <bug_when>2012-11-05 11:36:48 +0000</bug_when>
    <thetext>(In reply to comment #8)
&gt; See bug 19718. I think it was a mistake to make unescaped ampersands be
&gt; non-errors. 

To be precise, it&apos;s currently an error when an unescaped ampersand is parsed into the character it represents:

  # Otherwise, a character reference is parsed. If the last character matched 
  # is not a U+003B SEMICOLON character (;), there is a parse error.

&gt; If there is an ampersand and it is not followed by a
&gt; semicolon-terminated entity name, it may well be a typo.

Right. But in a case like &lt;a href=&quot;a=1&amp;reg=2&quot;&gt;, it&apos;s not likely that it&apos;s a typo and the spec makes sense here. This bug is raised because IE9 and below haven&apos;t supported this and the validator should still raise an error for this case *for now*, not after IE10 takes over the world.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>77866</commentid>
    <comment_count>10</comment_count>
    <who name="Henri Sivonen">hsivonen</who>
    <bug_when>2012-11-05 12:12:48 +0000</bug_when>
    <thetext>(In reply to comment #9)
&gt; But in a case like &lt;a href=&quot;a=1&amp;reg=2&quot;&gt;, it&apos;s not likely that it&apos;s a
&gt; typo and the spec makes sense here. This bug is raised because IE9 and below
&gt; haven&apos;t supported this and the validator should still raise an error for
&gt; this case *for now*, not after IE10 takes over the world.

OK, in that case it’s not a typo, but then compatibility should weigh more than the ability to declare www.google.com free of parse errors.

(In reply to comment #7)
&gt; If IE behavior for &quot;&amp;reg&quot; is as Kenny describes

It indeed is. How did we end up with legacy-IE-incompatible tokenization for this case in the first place?</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>77867</commentid>
    <comment_count>11</comment_count>
    <who name="Henri Sivonen">hsivonen</who>
    <bug_when>2012-11-05 12:18:40 +0000</bug_when>
    <thetext>(In reply to comment #10)
&gt; (In reply to comment #7)
&gt; &gt; If IE behavior for &quot;&amp;reg&quot; is as Kenny describes
&gt; 
&gt; It indeed is. How did we end up with legacy-IE-incompatible tokenization for
&gt; this case in the first place?

FWIW, IE8 also turns &lt;a href=&quot;&amp;reg=2&quot;&gt; into a ®=2.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>77871</commentid>
    <comment_count>12</comment_count>
    <who name="Simon Pieters">zcorpan</who>
    <bug_when>2012-11-05 13:15:05 +0000</bug_when>
    <thetext>(In reply to comment #10)
&gt; (In reply to comment #7)
&gt; &gt; If IE behavior for &quot;&amp;reg&quot; is as Kenny describes
&gt; 
&gt; It indeed is. How did we end up with legacy-IE-incompatible tokenization for
&gt; this case in the first place?

We made &lt;a href=&quot;&amp;reg=2&quot;&gt; parse differently from legacy IE on the basis that the legacy IE behavior was not what people expect and there were few enough pages relying on it that we could change it. I had to argue the case to convince Hixie to change it.

http://lists.w3.org/Archives/Public/public-html/2009Jul/0417.html

(I can&apos;t find the email where it was accepted or the actual spec change right now.)</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>77885</commentid>
    <comment_count>13</comment_count>
    <who name="Kang-Hao (Kenny) Lu">kennyluck</who>
    <bug_when>2012-11-05 14:23:02 +0000</bug_when>
    <thetext>(In reply to comment #12)
&gt; We made &lt;a href=&quot;&amp;reg=2&quot;&gt; parse different from legacy IE on the basis that
&gt; legacy IE was not what people expect and there were few enough pages relying
&gt; on this that we could change it. I had to argue the case to convince Hixie
&gt; to change it.
&gt; 
&gt; http://lists.w3.org/Archives/Public/public-html/2009Jul/0417.html
&gt; 
&gt; (I can&apos;t find the email where it was accepted or the actual spec change
&gt; right now.)

Only after reading this thread do I realize IE9 and below treat &lt;a href=&quot;&amp;copypasta&quot;&gt; according to the spec. So, as the reporter of this bug, my request is changed to the following, to be precise:

add

  | However, if the next character is a a U+003D EQUALS SIGN character (=), 
  | this is a parse error.

after

  # If the character reference is being consumed as part of an attribute, and the
  # last character matched is not a U+003B SEMICOLON character (;), and the next
  # character is either a U+003D EQUALS SIGN character (=) or an alphanumeric
  # ASCII character, then, for historical reasons, all the characters that were
  # matched after the U+0026 AMPERSAND character (&amp;) must be unconsumed, and
  # nothing is returned.

(For what it&apos;s worth, &quot;reg&quot; was a real case.) I also encourage validator.nu to do this.
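
To make the request concrete, here is a minimal Python sketch of the quoted rule together with the proposed addition. It is only an illustration, not spec text: the function names, the three-entry table and the shape of the returned tuple are assumptions made up for this comment, and the real named-character-references table is much larger. The ampersand is written as an escape so that the snippet contains no raw markup characters.

```python
# Hypothetical illustration only; every name below is an assumption.
AMP = "\u0026"  # U+0026 AMPERSAND

# Tiny assumed subset of the names that the table also lists without a semicolon.
LEGACY_NAMES = {"amp": AMP, "reg": "\u00ae", "copy": "\u00a9"}


def consume_in_attribute(value, pos):
    """Consume a reference that starts at value[pos] (a U+0026) in an attribute.

    Returns a tuple (text, new_pos, parse_error).
    """
    assert value[pos] == AMP
    # Try longer names first so the longest match wins.
    for name in sorted(LEGACY_NAMES, key=len, reverse=True):
        if not value.startswith(name, pos + 1):
            continue
        end = pos + 1 + len(name)
        nxt = value[end:end + 1]  # empty string at end of input
        if nxt == ";":
            # Properly terminated reference: decode it, no error.
            return LEGACY_NAMES[name], end + 1, False
        if nxt == "=" or (nxt.isascii() and nxt.isalnum()):
            # Historical rule: unconsume the name and return nothing.
            # Proposed addition: an equals sign here is a parse error.
            return AMP, pos + 1, nxt == "="
        # Semicolon-less reference that is still decoded: parse error.
        return LEGACY_NAMES[name], end, True
    # Not a character reference at all: keep the ampersand, no error.
    return AMP, pos + 1, False


def decode_attribute(value):
    """Decode a whole attribute value; returns (decoded_text, error_count)."""
    out, errors, pos = [], 0, 0
    while pos != len(value):
        if value[pos] == AMP:
            text, pos, err = consume_in_attribute(value, pos)
            out.append(text)
            errors += err
        else:
            out.append(value[pos])
            pos += 1
    return "".join(out), errors
```

Under these assumptions, decode_attribute on the href value from comment 3 leaves the text unchanged but counts one parse error, while the copypasta value is left alone with no error at all.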

Thanks for the link!</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>77903</commentid>
    <comment_count>14</comment_count>
    <who name="Ian &apos;Hixie&apos; Hickson">ian</who>
    <bug_when>2012-11-05 20:01:34 +0000</bug_when>
    <thetext>&gt; is a a 

Was that intended to be &quot;is a&quot; or &quot;is not a&quot;? (Just making sure we&apos;re on the same page... I think either way makes sense. In particular, having this temporarily be a parse error for the near term while IE9 and below are still in wide use seems rather reasonable.)</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>77906</commentid>
    <comment_count>15</comment_count>
    <who name="Kang-Hao (Kenny) Lu">kennyluck</who>
    <bug_when>2012-11-05 20:08:39 +0000</bug_when>
    <thetext>(In reply to comment #14)
&gt; &gt; is a a 
&gt; 
&gt; Was that intended to be &quot;is a&quot; or &quot;is not a&quot;? 

&quot;is a&quot;</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>82384</commentid>
    <comment_count>16</comment_count>
    <who name="Ian &apos;Hixie&apos; Hickson">ian</who>
    <bug_when>2013-01-31 00:35:18 +0000</bug_when>
    <thetext>Ok, I&apos;ve made this a conformance error for the time being.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>82385</commentid>
    <comment_count>17</comment_count>
    <who name="">contributor</who>
    <bug_when>2013-01-31 00:37:03 +0000</bug_when>
    <thetext>Checked in as WHATWG revision r7679.
Check-in comment: Make &lt;a href=&apos;?guitar=2&amp;amp=1&amp;pedal=6&apos;&gt; a parse error since IE9 misparses it &apos;?guitar=2&amp;=1&amp;pedal=6&apos; apparently.
http://html5.org/tools/web-apps-tracker?from=7678&amp;to=7679</thetext>
  </long_desc>
      
      

    </bug>

</bugzilla>