<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "https://www.w3.org/Bugs/Public/page.cgi?id=bugzilla.dtd">

<bugzilla version="5.0.4"
          urlbase="https://www.w3.org/Bugs/Public/"
          
          maintainer="sysbot+bugzilla@w3.org"
>

    <bug>
          <bug_id>9207</bug_id>
          
          <creation_ts>2010-03-08 01:25:39 +0000</creation_ts>
          <short_desc>Anything else:  This part of the spec is problematic; for example, a query string variable &amp;lang_id=1 as part of an attribute of, say, an img tag will get converted into a character token when it shouldn&apos;t be.  Why is the set of characters a-z, A-Z, 0-</short_desc>
          <delta_ts>2010-10-04 13:57:28 +0000</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>1</classification_id>
          <classification>Unclassified</classification>
          <product>HTML WG</product>
          <component>pre-LC1 HTML5 spec (editor: Ian Hickson)</component>
          <version>unspecified</version>
          <rep_platform>Other</rep_platform>
          <op_sys>other</op_sys>
          <bug_status>RESOLVED</bug_status>
          <resolution>FIXED</resolution>
          
          
          <bug_file_loc>http://www.whatwg.org/specs/web-apps/current-work/#tokenizing-character-references</bug_file_loc>
          <status_whiteboard></status_whiteboard>
          <keywords></keywords>
          <priority>P3</priority>
          <bug_severity>normal</bug_severity>
          <target_milestone>LC</target_milestone>
          
          <blocked>9352</blocked>
          <everconfirmed>1</everconfirmed>
          <reporter>contributor</reporter>
          <assigned_to name="Ian &apos;Hixie&apos; Hickson">ian</assigned_to>
          <cc>ayg</cc>
    
    <cc>ian</cc>
    
    <cc>julian.reschke</cc>
    
    <cc>mike</cc>
    
    <cc>mirthy</cc>
    
    <cc>mjs</cc>
    
    <cc>Ms2ger</cc>
    
    <cc>public-html-admin</cc>
    
    <cc>public-html-wg-issue-tracking</cc>
          
          <qa_contact name="HTML WG Bugzilla archive list">public-html-bugzilla</qa_contact>

      

      

      

          <comment_sort_order>oldest_to_newest</comment_sort_order>  
          <long_desc isprivate="0" >
    <commentid>32840</commentid>
    <comment_count>0</comment_count>
    <who name="">contributor</who>
    <bug_when>2010-03-08 01:25:39 +0000</bug_when>
    <thetext>Section: http://www.whatwg.org/specs/web-apps/current-work/#tokenizing-character-references

Comment:
Anything else:	This part of the spec is problematic. For example, a query
string variable &amp;lang_id=1 as part of an attribute of, say, an img tag will
get converted into a character token when it shouldn&apos;t be.  Why is the set of
characters a-z, A-Z, 0-9?  This poses a unique problem for any entities that
aren&apos;t closed properly.

Posted from: 146.115.114.89</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>32841</commentid>
    <comment_count>1</comment_count>
    <who name="Jeff">mirthy</who>
    <bug_when>2010-03-08 01:28:58 +0000</bug_when>
    <thetext>Discovered as part of a webkit bug that uses an HTML5 spec tokenizer:
https://bugs.webkit.org/show_bug.cgi?id=35831

Examples where the tokenizer will mangle the URL:
&lt;img src=&quot;http://www.webkit.org/getImage.aspx?id=12345&amp;lang_id=1&quot;/&gt;
&amp;amp_energy=100
&amp;lt-now=10
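
A minimal JavaScript sketch of how these cases play out, assuming a small hypothetical table of semicolon-less names (LEGACY) and a simplified stand-in for the attribute-value rule (decodeAttr); neither is the spec text nor WebKit code. A reference with no trailing semicolon is left as literal text only when the next character matches the blocking pattern: ALNUM models the a-z, A-Z, 0-9 set questioned in this report, and ALNUM_EQ additionally blocks on the equals sign.

```javascript
// Sketch only: LEGACY is a hypothetical subset, not the full entity table.
// The ampersand and less-than sign are spelled as \u escapes so the
// snippet needs no markup escaping.
const AMP = '\u0026';
const LEGACY = { amp: AMP, lt: '\u003C', gt: '>', copy: '\u00A9', not: '\u00AC' };

// Characters that, right after a semicolon-less name, suppress decoding.
const ALNUM = /[A-Za-z0-9]/;     // the set questioned in this report
const ALNUM_EQ = /[A-Za-z0-9=]/; // the same set plus the equals sign

// Decode named references in an attribute value under a blocking pattern.
function decodeAttr(value, blockers) {
  // Try longer names first so a longer name wins over a shorter prefix.
  const names = Object.keys(LEGACY).sort((a, b) => b.length - a.length);
  let out = '';
  let i = 0;
  while (value.length > i) {
    const rest = value.slice(i + 1);
    const name = value[i] === AMP ? names.find((n) => rest.startsWith(n)) : undefined;
    if (name === undefined) {
      out += value[i]; // ordinary character, or an ampersand with no known name
      i += 1;
      continue;
    }
    const next = rest.charAt(name.length);
    if (next === ';') {
      out += LEGACY[name]; // properly terminated reference
      i += name.length + 2;
    } else if (blockers.test(next)) {
      out += AMP + name;   // left as literal text
      i += name.length + 1;
    } else {
      out += LEGACY[name]; // decoded even without a semicolon
      i += name.length + 1;
    }
  }
  return out;
}

console.log(decodeAttr(AMP + 'amp_energy=100', ALNUM_EQ)); // amp is decoded: '_' does not block
console.log(decodeAttr(AMP + 'lt-now=10', ALNUM_EQ));      // lt is decoded: '-' does not block
console.log(decodeAttr('?a=1' + AMP + 'copy=2', ALNUM));    // copy is decoded
console.log(decodeAttr('?a=1' + AMP + 'copy=2', ALNUM_EQ)); // copy is kept as text
```

Under either pattern the two query-string examples above are still decoded, because neither the underscore nor the hyphen is in the blocking set; only a name followed directly by an equals sign behaves differently between the two patterns.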


(In reply to comment #0)
&gt; Section:
&gt; http://www.whatwg.org/specs/web-apps/current-work/#tokenizing-character-references
&gt; 
&gt; Comment:
&gt; Anything else:  This part of the spec is problematic. For example, a query
&gt; string variable &amp;lang_id=1 as part of an attribute of, say, an img tag will
&gt; get converted into a character token when it shouldn&apos;t be.  Why is the set of
&gt; characters a-z, A-Z, 0-9?  This poses a unique problem for any entities that
&gt; aren&apos;t closed properly.
&gt; 
&gt; Posted from: 146.115.114.89
&gt; 

</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>32842</commentid>
    <comment_count>2</comment_count>
    <who name="Simon Pieters">zcorpan</who>
    <bug_when>2010-03-08 08:35:31 +0000</bug_when>
    <thetext>It&apos;s for IE compat.

Personally, I don&apos;t see much of a problem with extending the list to include underscore, equals sign, and other characters that would improve or at least not hurt Web compat, based on research. (I researched the equals sign and concluded that it would be reasonably safe to change, but Hixie rejected the proposal. http://lists.w3.org/Archives/Public/public-html/2009Jul/0421.html )</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>33615</commentid>
    <comment_count>3</comment_count>
    <who name="Simon Pieters">zcorpan</who>
    <bug_when>2010-03-16 13:00:18 +0000</bug_when>
    <thetext>Some raw data at: http://philip.html5.org/data/entities-in-attribute-with-no-semicolon-or-alphanumeric.txt

# [13:54] &lt;Philip`&gt; zcorpan: I recently remembered that I forgot to do http://philip.html5.org/data/entities-in-attribute-with-no-semicolon-or-alphanumeric.txt  
# [13:55] &lt;Philip`&gt; I changed your regexp a bit to exclude &gt; from attribute &apos;values&apos;, because otherwise it seemed to pick up a lot of entities that were in content after an element with attributes  
# [13:56] &lt;Philip`&gt; zcorpan: Also I think the file is probably truncated, because it seemed to get stuck at some point (I guess a page hit a worst-complexity case in the pattern) so I killed it after it had stopped for a few minutes  

-- http://krijnhoetmer.nl/irc-logs/whatwg/20100316#l-254</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>33909</commentid>
    <comment_count>4</comment_count>
    <who name="Simon Pieters">zcorpan</who>
    <bug_when>2010-03-25 15:13:58 +0000</bug_when>
    <thetext>http://simon.html5.org/dump/entities-in-attribute-with-no-semicolon-or-alphanumeric.xml counts the occurrences for each character appearing after the entity:

/ 7
, 29
&quot; 546
- 16
  185
&amp; 108
% 499
&gt; 24
&apos; 37
_ 10
: 62
= 36
Â 1
å 4
. 30
Ð 4
Å 2
ç 2
# 1
&lt; 1

The &quot; and &apos; items are probably mostly the attribute value&apos;s end quote.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>34075</commentid>
    <comment_count>5</comment_count>
    <who name="Simon Pieters">zcorpan</who>
    <bug_when>2010-03-29 08:48:13 +0000</bug_when>
    <thetext>I&apos;ve analyzed the individual cases:

/ 7 should be replaced
, 29 should be replaced (sometimes , is a typoed ; but not always)
&quot; 546 should be replaced (end of attribute value)
- 16 should most often be replaced. exception: &lt;a href=&quot;http://www.promedia-med.com/index.php?shp=3&amp;cl=details&amp;cnid=0bd466d6983607786.88175745&amp;anid=0bd466d69855b9c84.32939037&amp;micro-Pipettierhelfer&amp;&quot; class=&quot;extrabold link1&quot;&gt;micro-Pipettierhelfer&lt;/a&gt;
  185 should be replaced
&amp; 108 should be replaced
% 499 should be replaced and following %3b should be consumed also? &lt;a href=&quot;/View/Dealer/Boat-City/TA3803.aspx?Ne=23&amp;N=29&amp;amp%3bsid=1184817ADB85&amp;amp%3bN=0&quot; target=&quot;_blank&quot;&gt;
&gt; 24 should be replaced (end of unquoted attribute value)
&apos; 37 should be replaced (end of single quoted attribute value or nested js string)
_ 10 should not, or should. &lt;a href=&quot;http://del.icio.us/post?tittle=&amp;url=http://mujer.terra.es/muj/articulo/articulo.cfm?id=mu214379&amp;not_estatica=1&amp;gen=1&quot;&gt; vs &lt;a href=&quot;http://www.libridvd.it/prezzo_libro-autodaf&amp;egrave_leuropa_gli_ebrei_e_lantisemitismo-9788871806068.html&quot;&gt;&lt;img src=&quot;http://www.libridvd.it/immagini/scheda.gif&quot; alt=&quot;Scheda libro Autodaf&amp;egrave;. L&amp;#39;Europa, gli ebrei e l&amp;#39;antisemitismo&quot; border=&quot;0&quot;&gt;&lt;/a&gt;
: 62 should be replaced and the : should be consumed also (it is typoed ; )
= 36 should mostly not be replaced, but some cases unclear e.g. &amp;amp=
Â 1 n/a (encoding error)
å 4 n/a (encoding error)
. 30 should be replaced
Ð 4 n/a (encoding error)
Å 2 n/a (encoding error?)
ç 2 n/a (encoding error)
# 1 n/a (double escaped NCR + encoding error)
&lt; 1 n/a?</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>34076</commentid>
    <comment_count>6</comment_count>
    <who name="Simon Pieters">zcorpan</who>
    <bug_when>2010-03-29 09:06:24 +0000</bug_when>
    <thetext>If anyone wants to look at the individual cases, I modified the script in http://simon.html5.org/dump/entities-in-attribute-with-no-semicolon-or-alphanumeric.xml as follows to show only &apos;/&apos;:

&lt;script&gt;&lt;![CDATA[
onload=function(){
  var pre = document.getElementsByTagName(&apos;pre&apos;)[1];
  var lines = pre.textContent.split(&apos;\n&apos;);
  var data = [];
  var chars = {};
  var tmp;
  var tmp2;
  for (var i = 0; i &lt; lines.length; ++i) {
    tmp = /^([^\t]+)\t(.+)$/.exec(lines[i]);
    if (!tmp) continue; // skip lines that are not a tab-separated pair (e.g. a trailing blank line)
    tmp2 = /&amp;(AElig|AMP|Aacute|Acirc|Agrave|Aring|Atilde|Auml|COPY|Ccedil|ETH|Eacute|Ecirc|Egrave|Euml|GT|Iacute|Icirc|Igrave|Iuml|LT|Ntilde|Oacute|Ocirc|Ograve|Oslash|Otilde|Ouml|QUOT|REG|THORN|Uacute|Ucirc|Ugrave|Uuml|Yacute|aacute|acirc|acute|aelig|agrave|amp|aring|atilde|auml|brvbar|ccedil|cedil|cent|copy|curren|deg|divide|eacute|ecirc|egrave|eth|euml|frac12|frac14|frac34|gt|iacute|iuml|laquo|lt|macr|micro|middot|nbsp|not|ntilde|oacute|ocirc|ograve|ordf|ordm|oslash|otilde|ouml|para|plusmn|pound|quot|raquo|reg|sect|shy|sup1|sup2|sup3|szlig|thorn|times|uacute|ucirc|ugrave|uml|uuml|yacute|yen|yuml)([^;a-zA-Z0-9])/.exec(tmp[2]);
    if (!tmp2) continue; // no semicolon-less entity followed by a non-alphanumeric character
    for (var j = 2; j &lt; tmp2.length; ++j) {

      // CHANGE HERE:
      if (tmp2[j] == &apos;/&apos;)
        data.push([tmp[1], tmp[2]]);
    }
  }
  for (i = 0; i &lt; data.length; ++i) {
    document.getElementsByTagName(&apos;pre&apos;)[0].textContent += data[i][1] + &apos;\n\n&apos;;
  }
}
]]&gt;&lt;/script&gt;</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>34440</commentid>
    <comment_count>7</comment_count>
    <who name="Ian &apos;Hixie&apos; Hickson">ian</who>
    <bug_when>2010-04-02 21:44:14 +0000</bug_when>
    <thetext>see also bug 9351</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>34443</commentid>
    <comment_count>8</comment_count>
    <who name="Ian &apos;Hixie&apos; Hickson">ian</who>
    <bug_when>2010-04-02 22:39:23 +0000</bug_when>
    <thetext>EDITOR&apos;S RESPONSE: This is an Editor&apos;s Response to your comment. If you are satisfied with this response, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document:
   http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Partially Accepted
Change Description: see diff given below
Rationale: Based on the data, I&apos;ve only changed this for &apos;=&apos;.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>34445</commentid>
    <comment_count>9</comment_count>
    <who name="Ian &apos;Hixie&apos; Hickson">ian</who>
    <bug_when>2010-04-02 22:39:54 +0000</bug_when>
    <thetext>*** Bug 9351 has been marked as a duplicate of this bug. ***</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>34447</commentid>
    <comment_count>10</comment_count>
    <who name="Ian &apos;Hixie&apos; Hickson">ian</who>
    <bug_when>2010-04-02 22:41:03 +0000</bug_when>
    <thetext>http://html5.org/tools/web-apps-tracker?from=4958&amp;to=4959</thetext>
  </long_desc>
      
      

    </bug>

</bugzilla>