<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "https://www.w3.org/Bugs/Public/page.cgi?id=bugzilla.dtd">

<bugzilla version="5.0.4"
          urlbase="https://www.w3.org/Bugs/Public/"
          
          maintainer="sysbot+bugzilla@w3.org"
>

    <bug>
          <bug_id>9351</bug_id>
          
          <creation_ts>2010-03-27 22:41:00 +0000</creation_ts>
          <short_desc>Do not interpret &amp; followed by an entity name followed by = as an entity reference in attribute values (maybe in text content too)</short_desc>
          <delta_ts>2010-10-04 13:55:41 +0000</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>1</classification_id>
          <classification>Unclassified</classification>
          <product>HTML WG</product>
          <component>pre-LC1 HTML5 spec (editor: Ian Hickson)</component>
          <version>unspecified</version>
          <rep_platform>PC</rep_platform>
          <op_sys>All</op_sys>
          <bug_status>CLOSED</bug_status>
          <resolution>DUPLICATE</resolution>
          <dup_id>9207</dup_id>
          
          <bug_file_loc></bug_file_loc>
          <status_whiteboard></status_whiteboard>
          <keywords></keywords>
          <priority>P2</priority>
          <bug_severity>normal</bug_severity>
          <target_milestone>---</target_milestone>
          
          
          <everconfirmed>1</everconfirmed>
          <reporter name="Maciej Stachowiak">mjs</reporter>
          <assigned_to name="Ian &apos;Hixie&apos; Hickson">ian</assigned_to>
          <cc>ayg</cc>
    
    <cc>hsivonen</cc>
    
    <cc>ian</cc>
    
    <cc>mike</cc>
    
    <cc>public-html-admin</cc>
    
    <cc>public-html-wg-issue-tracking</cc>
    
    <cc>rubys</cc>
          
          <qa_contact name="HTML WG Bugzilla archive list">public-html-bugzilla</qa_contact>

      

      

      

          <comment_sort_order>oldest_to_newest</comment_sort_order>  
          <long_desc isprivate="0" >
    <commentid>33971</commentid>
    <comment_count>0</comment_count>
    <who name="Maciej Stachowiak">mjs</who>
    <bug_when>2010-03-27 22:41:00 +0000</bug_when>
    <thetext>It&apos;s been suggested that an unterminated entity (one not followed by a semicolon) that is followed by an equal sign in an attribute value should not be treated as an entity reference.

It seems that rather few pages overall would be affected by changing this; one study found 50 occurrences out of approximately 425k pages:
http://lists.w3.org/Archives/Public/public-html/2009Jun/0463.html

It was also reported that most of these occurrences appeared to be cases where the author did not expect their text to be interpreted as an entity reference, and review of these 50 instances seems to confirm that impression.
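
For illustration only, a scan for this pattern could look like the sketch below. It is not the methodology of the linked study: the entity list is a small assumed subset of the names accepted without a semicolon, and the function name is invented here. The ampersand is written as the escape \u0026 so the snippet can be pasted into markup unchanged.

```python
import re

AMP = "\u0026"  # U+0026 AMPERSAND, written as an escape on purpose

# Assumption: a tiny subset of the entity names that are recognised
# without a trailing semicolon; the real list is much longer.
LEGACY_NAMES = ("amp", "copy", "gt", "lt", "quot", "reg")

# Ampersand, then one of the names, then an equals sign.
PATTERN = re.compile(re.escape(AMP) + "(" + "|".join(LEGACY_NAMES) + ")=")


def find_unterminated_before_equals(text):
    """Return the entity names that occur as AMP + name + '=' in text."""
    return [m.group(1) for m in PATTERN.finditer(text)]


# A URL-like value: only the "copy" parameter collides with an entity name.
sample = "page?x=1" + AMP + "copy=2" + AMP + "amp;y=3" + AMP + "lang=en"
print(find_unterminated_before_equals(sample))  # prints ['copy']
```

The properly terminated reference and the parameter named lang are not reported, because neither is an entity name followed directly by an equals sign.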

It seems like there is at least some content that would be broken by changing the interpretation:
http://lists.w3.org/Archives/Public/public-html/2009Jul/0421.html

Specifically, it seems that &amp;amp= may occasionally be intended as &quot;&amp;amp;&quot; rather than as &quot;&amp;amp;amp=&quot; or &quot;&amp;amp;=&quot;.

On the whole, it still seems like the proposed change would fix more content than it breaks.

One possible variation is to exclude &amp;amp= from this change. However, it seems that if authors write that when they mean &quot;&amp;amp;&quot;, then their content will not work as intended under either the existing parsing rule or the proposed new one, whereas content that writes &quot;&amp;amp=&quot; when it means &quot;&amp;amp;amp=&quot; would be fixed. There were instances of both in the data set.

In addition to the direct benefits of fixing content, this change would also make it safer to change parsing rules to allow unescaped &amp; in attributes.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>34071</commentid>
    <comment_count>1</comment_count>
    <who name="Henri Sivonen">hsivonen</who>
    <bug_when>2010-03-29 07:55:26 +0000</bug_when>
    <thetext>What&apos;s the rationale of restricting this change request to attribute values? URLs also get copied and pasted to element content occasionally and making them tokenize differently there would violate the principle of least surprise.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>34072</commentid>
    <comment_count>2</comment_count>
    <who name="Maciej Stachowiak">mjs</who>
    <bug_when>2010-03-29 08:17:30 +0000</bug_when>
    <thetext>(In reply to comment #1)
&gt; What&apos;s the rationale of restricting this change request to attribute values?
&gt; URLs also get copied and pasted to element content occasionally and making them
&gt; tokenize differently there would violate the principle of least surprise.
&gt; 

After some consideration, I think this change should probably be applied in text content as well, though I would also be satisfied by a version limited only to attribute values. Updating title accordingly.

The reasons I originally asked only for attribute values (which I no longer think are valid):

1) I thought the study evaluating occurrence of this pattern was limited to attribute values, so there was no information about the risk of the greater change. But that&apos;s not right - it included cases in text content and in script as well.

2) I did not think there was any benefit in text content, but I failed to think about the case of a URL in text content, or copying between attributes and text content. It&apos;s likely that the benefit is lower in text content, but the pattern seems much less likely to occur there in the first place.

Title amended.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>34073</commentid>
    <comment_count>3</comment_count>
    <who name="Maciej Stachowiak">mjs</who>
    <bug_when>2010-03-29 08:35:42 +0000</bug_when>
    <thetext>One interesting side note: processing of entities (aka &quot;named character references&quot; apparently) is just about the only thing that is already different between quoted attribute values and text content, as far as the tokenizer is concerned:

http://dev.w3.org/html5/spec/Overview.html#attribute-value-double-quoted-state
http://dev.w3.org/html5/spec/Overview.html#data-state

Making = get treated as an extra &quot;additional allowed character&quot; would in fact have the desired effect suggested by this bug.

That being said, I think it might still be a good idea to do the change everywhere.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>34074</commentid>
    <comment_count>4</comment_count>
    <who name="Maciej Stachowiak">mjs</who>
    <bug_when>2010-03-29 08:47:35 +0000</bug_when>
    <thetext>Further correction: the &quot;additional allowed character&quot; is a pretty minor difference, because it applies only right after the ampersand. The actual key difference is:

&quot;If the character reference is being consumed as part of an attribute, and the last character matched is not a U+003B SEMICOLON character (;), and the next character is in the range U+0030 DIGIT ZERO (0) to U+0039 DIGIT NINE (9), U+0041 LATIN CAPITAL LETTER A to U+005A LATIN CAPITAL LETTER Z, or U+0061 LATIN SMALL LETTER A to U+007A LATIN SMALL LETTER Z, then, for historical reasons, all the characters that were matched after the U+0026 AMPERSAND character (&amp;) must be unconsumed, and nothing is returned.&quot;
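
As a rough sketch of what adding = to that list would mean, the simplified decoder below applies the quoted rule to an attribute value. It is not the spec algorithm: the entity table is a small assumed subset, numeric references are ignored, and the names decode_attribute and treat_equals_like_alnum are invented for this example. The ampersand is written as the escape \u0026 so the snippet can be pasted into markup unchanged.

```python
AMP = "\u0026"  # U+0026 AMPERSAND, written as an escape on purpose

# Assumption: a small illustrative subset of the entities that can be
# matched without a trailing semicolon, mapped to their replacements.
LEGACY = {"amp": AMP, "copy": "\u00a9", "reg": "\u00ae", "quot": '"'}


def decode_attribute(value, treat_equals_like_alnum):
    """Decode named references in an attribute value (simplified sketch)."""
    out = []
    i = 0
    while i != len(value):
        if value[i] != AMP:
            out.append(value[i])
            i += 1
            continue
        # Longest-first match against the hypothetical entity subset.
        names = sorted(LEGACY, key=len, reverse=True)
        name = next((n for n in names if value.startswith(n, i + 1)), None)
        if name is None:
            out.append(AMP)
            i += 1
            continue
        end = i + 1 + len(name)
        nxt = value[end:end + 1]  # empty string at end of input
        # Quoted rule: an ASCII letter or digit after an unterminated match.
        unconsume = nxt.isascii() and nxt.isalnum()
        if treat_equals_like_alnum and nxt == "=":
            unconsume = True  # the change proposed in this bug
        if nxt == ";":
            out.append(LEGACY[name])
            i = end + 1
        elif unconsume:
            out.append(AMP)  # keep the ampersand, re-read the name as text
            i += 1
        else:
            out.append(LEGACY[name])
            i = end
    return "".join(out)


href = "?a=1" + AMP + "copy=2"
print(decode_attribute(href, False) == "?a=1\u00a9=2")  # True
print(decode_attribute(href, True) == href)  # True
```

With the flag off, the copy parameter is turned into a copyright sign; with the flag on, the value is left as the author wrote it, while properly terminated references decode the same way in both modes.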

Adding = to that list would effect the change proposed here, but only for attribute values, not for text. However, since processing of entities is already quite a bit different between attribute values and text content, I am no longer so sure that making this change for text content too is warranted. Updating title to bias it a bit the other way.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>34441</commentid>
    <comment_count>5</comment_count>
    <who name="Ian &apos;Hixie&apos; Hickson">ian</who>
    <bug_when>2010-04-02 21:44:14 +0000</bug_when>
    <thetext>see also bug 9207</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>34444</commentid>
    <comment_count>6</comment_count>
    <who name="Ian &apos;Hixie&apos; Hickson">ian</who>
    <bug_when>2010-04-02 22:39:54 +0000</bug_when>
    <thetext>

*** This bug has been marked as a duplicate of bug 9207 ***</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>34446</commentid>
    <comment_count>7</comment_count>
    <who name="Maciej Stachowiak">mjs</who>
    <bug_when>2010-04-02 22:40:27 +0000</bug_when>
    <thetext>Agree that this is a duplicate given the resolution of 9207</thetext>
  </long_desc>
      
      

    </bug>

</bugzilla>