<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "https://www.w3.org/Bugs/Public/page.cgi?id=bugzilla.dtd">

<bugzilla version="5.0.4"
          urlbase="https://www.w3.org/Bugs/Public/"
          
          maintainer="sysbot+bugzilla@w3.org"
>

    <bug>
          <bug_id>12576</bug_id>
          
          <creation_ts>2011-04-30 13:53:00 +0000</creation_ts>
          <short_desc>Need clarification on tokenization of HTML5 documents</short_desc>
          <delta_ts>2011-09-04 17:53:55 +0000</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>1</classification_id>
          <classification>Unclassified</classification>
          <product>HTML WG</product>
          <component>LC1 HTML5 spec</component>
          <version>unspecified</version>
          <rep_platform>All</rep_platform>
          <op_sys>All</op_sys>
          <bug_status>RESOLVED</bug_status>
          <resolution>WONTFIX</resolution>
          
          
          <bug_file_loc></bug_file_loc>
          <status_whiteboard></status_whiteboard>
          <keywords></keywords>
          <priority>P2</priority>
          <bug_severity>normal</bug_severity>
          <target_milestone>---</target_milestone>
          
          
          <everconfirmed>1</everconfirmed>
          <reporter name="Mridul">mridul</reporter>
          <assigned_to name="Ian &apos;Hixie&apos; Hickson">ian</assigned_to>
          <cc>bzbarsky</cc>
    
    <cc>ian</cc>
    
    <cc>mike</cc>
    
    <cc>mridul</cc>
    
    <cc>Ms2ger</cc>
    
    <cc>public-html-admin</cc>
    
    <cc>public-html-wg-issue-tracking</cc>
          
          <qa_contact name="HTML WG Bugzilla archive list">public-html-bugzilla</qa_contact>

      

      

      

          <comment_sort_order>oldest_to_newest</comment_sort_order>  
          <long_desc isprivate="0" >
    <commentid>47829</commentid>
    <comment_count>0</comment_count>
    <who name="Mridul">mridul</who>
    <bug_when>2011-04-30 13:53:00 +0000</bug_when>
    <thetext>I was going over the tokenization sections of the HTML5 spec at http://dev.w3.org/html5/spec/Overview.html (the version as of today).
I have the following questions and comments; clarification on them would be great.


1) In &quot;Before attribute name state&quot; (section 8.2.4.34 right now), on encountering &apos;&lt;&apos;, a new attribute is started with &apos;&lt;&apos; as its first character.
Shouldn&apos;t this instead start a new element, while still reporting a parse error?


2) In &quot;Data state&quot; (section 8.2.4.1 right now), on encountering U+0000, the current input character is emitted. Everywhere else, it is replaced with U+FFFD. Is this on purpose, or a typo?


3) In &quot;Bogus comment state&quot; (section 8.2.4.44 right now), it would be good if it could be reworded for clarity. As stated, it requires very careful reading to decipher its meaning.


4) In &quot;Bogus comment state&quot; (section 8.2.4.44 right now), if we encounter an EOF, is it not a parse error? (It delegates to the data state, where EOF is not a parse error, IIRC.)


5) Comment (1), if valid, affects pre-parser logic too (to find encoding).


6) In &quot;Determining the character encoding&quot; (section 8.2.2.1 right now), under step 5 (the algorithm that finds the encoding from the HTML content):
Under sub-step 1, case &apos;&lt;meta&apos;, point 12 currently says:
&quot;If mode is true but got pragma is false, then jump to the second step of the overall &quot;two step&quot; algorithm.&quot;
Here, &apos;mode&apos; is undefined as far as I can see; I assume it is supposed to be &apos;need pragma&apos;?

6.1) In point 13 of the same snippet as in (6) above, we have:
&quot;If charset is a UTF-16 encoding, change the value of charset to UTF-8.&quot;
What if it is explicitly set to UTF-16LE or UTF-16BE? Should it be changed too, or only for &apos;utf-16&apos;?


7) In &quot;get an attribute&quot; (#concept-get-attributes-when-sniffing, the algorithm in main step 5 of section 8.2.2.1): currently a value can end on whitespace or &apos;&gt;&apos;. What about &apos;/&apos;? Currently, the &apos;/&apos; gets added to the value. This applies in two places in that algorithm: step 10 and step 11.
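
To make point 7 concrete, here is a rough sketch of how I read the unquoted-value part of that algorithm (steps 10 and 11, simplified). The function name and constants are made up for illustration, quoted values are not handled, and this is not spec text:

```python
# Hypothetical helper: a simplified reading of the unquoted-value part of
# the get-an-attribute sniffing steps. An assumption, not spec text.
WHITESPACE = {0x09, 0x0A, 0x0C, 0x0D, 0x20}
TAG_CLOSE = 0x3E  # the greater-than byte that closes a tag


def read_unquoted_value(data, position):
    # Collect bytes until whitespace or the tag-closing byte is seen.
    value = bytearray()
    while position != len(data):
        byte = data[position]
        if byte in WHITESPACE or byte == TAG_CLOSE:
            break
        if byte in range(0x41, 0x5B):
            byte += 0x20  # lowercase ASCII A-Z, as the sniffing steps do
        value.append(byte)  # note: 0x2F (slash) is appended like any byte
        position += 1
    return bytes(value), position
```

Given the bytes utf-8/ immediately followed by the closing &apos;&gt;&apos;, this sketch stops at the &apos;&gt;&apos; and returns utf-8/ as the value, so the slash ends up inside the value.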


Thanks,
Mridul</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>47833</commentid>
    <comment_count>1</comment_count>
    <who name="Boris Zbarsky">bzbarsky</who>
    <bug_when>2011-05-01 16:52:04 +0000</bug_when>
    <thetext>For #1, that behavior is actually purposeful: that&apos;s what IE does in that situation, for example.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>48416</commentid>
    <comment_count>2</comment_count>
    <who name="Mridul">mridul</who>
    <bug_when>2011-05-10 10:44:09 +0000</bug_when>
    <thetext>Any comments or updates on this?</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>51593</commentid>
    <comment_count>3</comment_count>
    <who name="Ian &apos;Hixie&apos; Hickson">ian</who>
    <bug_when>2011-07-28 00:48:40 +0000</bug_when>
    <thetext>For #6, see diff below.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>51594</commentid>
    <comment_count>4</comment_count>
    <who name="">contributor</who>
    <bug_when>2011-07-28 00:49:04 +0000</bug_when>
    <thetext>Checked in as WHATWG revision r6334.
Check-in comment: forgot to fix this one in r5764
http://html5.org/tools/web-apps-tracker?from=6333&amp;to=6334</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>51596</commentid>
    <comment_count>5</comment_count>
    <who name="Ian &apos;Hixie&apos; Hickson">ian</who>
    <bug_when>2011-07-28 00:50:40 +0000</bug_when>
    <thetext>For #7, could you elaborate on what difference it could make?</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>51730</commentid>
    <comment_count>6</comment_count>
    <who name="Mridul">mridul</who>
    <bug_when>2011-07-29 06:03:32 +0000</bug_when>
    <thetext>The difference is whether the value contains the &apos;/&apos; or not,
and whether the &apos;/&apos; also causes the start tag to end (the &apos;/&gt;&apos; case).</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>51825</commentid>
    <comment_count>7</comment_count>
    <who name="Ms2ger">Ms2ger</who>
    <bug_when>2011-07-30 19:44:48 +0000</bug_when>
    <thetext>In the case of &lt;a href=http://www.example.com/&gt;Example&lt;/a&gt;, the spec does make sense, IMO.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>53962</commentid>
    <comment_count>8</comment_count>
    <who name="Michael[tm] Smith">mike</who>
    <bug_when>2011-08-04 05:34:53 +0000</bug_when>
    <thetext>mass-move component to LC1</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>55360</commentid>
    <comment_count>9</comment_count>
    <who name="Ian &apos;Hixie&apos; Hickson">ian</who>
    <bug_when>2011-08-17 22:19:14 +0000</bug_when>
    <thetext>As a general rule, in the future, please file one issue per bug.

&gt; 1) In &quot;Before attribute name state&quot; (section 8.2.4.34 right now), on
&gt; encountering &apos;&lt;&apos;, a new attribute is started with &apos;&lt;&apos; as first character.
&gt; Shouldn&apos;t this not trigger a new element while reporting a parse error ?
&gt; 5) Comment (1), if valid, affects pre-parser logic too (to find encoding).

It&apos;s an error, so exactly what happens doesn&apos;t matter so much. I think the current behaviour is more consistent with widespread legacy implementations and is mildly more secure when it comes to XSS attacks.


&gt; 2) In &quot;Data state&quot; (section 8.2.4.1 right now), on encountering &apos;U+0000&apos;, the
&gt; current input character is emitted. Everywhere else, it is replaced with
&gt; U+FFFD. Is this on purpose ? Or a typo ?

It&apos;s on purpose; the tree construction stage takes care of it in those cases.


&gt; 3) In &quot;Bogus comment state&quot; (section 8.2.4.44 right now), it would be good if
&gt; it could be reworded for clarity. As stated, it requires very careful reading
&gt; to decipher its meaning.

Please file a separate bug for this with more detail about exactly what needs clarifying. In general, very careful reading is to be encouraged. ;-)


&gt; 4) In &quot;Bogus comment state&quot; (section 8.2.4.44 right now), if we encounter an
&gt; EOF, is it not a parse error ? (it delegates to DATA state, where it is not a
&gt; parse error iirc).

Once you hit the bogus comment state, you&apos;ve already hit a parse error, so it doesn&apos;t matter.


&gt; 6) In &quot;Determining the character encoding&quot; (section 8.2.2.1 right now), under
&gt; step 5 (the algo to find encoding from html content) :
&gt; Under sub-step 1, case &apos;&lt;meta&apos;, point 12 which currently says -
&gt; &quot;If mode is true but got pragma is false, then jump to the second step of the
&gt; overall &quot;two step&quot; algorithm.&quot;
&gt; Here, &apos;mode&apos; is undefined from what I saw : I assume it is supposed to be &apos;need
&gt; pragma&apos; ?

Fixed; see comment 4.


&gt; 6.1) In point 13 from same snippet from (6) above, we have : 
&gt; &quot;If charset is a UTF-16 encoding, change the value of charset to UTF-8.&quot;
&gt; What if it is explicitly set to utf-16LE or utf-16BE ? Should it be changed too
&gt; ? Or only for &apos;utf-16&apos; ?

UTF-16LE and UTF-16BE are both UTF-16 encodings.
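
As a rough illustration of that rule (a sketch with made-up names, not spec text, and assuming the sniffed label has already been resolved to one of the three names listed):

```python
# Hypothetical sketch of the quoted sniffing rule; all names are assumptions.
UTF16_ENCODINGS = {"utf-16", "utf-16le", "utf-16be"}


def adjust_sniffed_charset(charset):
    # Any UTF-16 encoding found while sniffing is replaced by UTF-8.
    if charset.lower() in UTF16_ENCODINGS:
        return "utf-8"
    return charset
```

In this sketch, &quot;UTF-16LE&quot; and &quot;utf-16be&quot; are changed to &quot;utf-8&quot; in exactly the same way as plain &quot;utf-16&quot;, while any other label is returned unchanged.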


&gt; 7) In &quot;get an attribute&quot; (#concept-get-attributes-when-sniffing : section
&gt; 8.2.2.1 algo in main step 5) : currently a value can end on a whitespace or
&gt; &apos;&gt;&apos;. What about &apos;/&apos; ? Currently, the &apos;/&apos; will get added to the value ... This
&gt; is applicable in two places in that algo : step 10 and step 11.

Could you show a concrete example of a Web page that would be processed differently based on this difference? I don&apos;t fully understand the implications here.



I&apos;m leaving this bug open for point 7. Please open separate bugs for the other points if the above is not sufficient resolution.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>56297</commentid>
    <comment_count>10</comment_count>
    <who name="Ian &apos;Hixie&apos; Hickson">ian</who>
    <bug_when>2011-09-04 17:53:55 +0000</bug_when>
    <thetext>EDITOR&apos;S RESPONSE: This is an Editor&apos;s Response to your comment. If you are satisfied with this response, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document:
   http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Rejected
Change Description: no spec change
Rationale: As far as I can tell, the algorithm referenced in comment 7 matches the HTML parsing algorithm for the &quot;/&quot; case, so no change is needed here.</thetext>
  </long_desc>
      
      

    </bug>

</bugzilla>