12576 2011-04-30 13:53:00 +0000 Need clarification on tokenization of html 5 doc. 2011-09-04 17:53:55 +0000 1 1 1 Unclassified HTML WG LC1 HTML5 spec unspecified All All RESOLVED WONTFIX P2 normal --- 1 mridul ian bzbarsky ian mike mridul Ms2ger public-html-admin public-html-wg-issue-tracking public-html-bugzilla oldest_to_newest

47829 0 mridul 2011-04-30 13:53:00 +0000

I was going over sections related to tokenization on html5 spec at http://dev.w3.org/html5/spec/Overview.html (the version as of today).
Have the following queries/comments ... and clarity on the following would be great.

1) In "Before attribute name state" (section 8.2.4.34 right now), on encountering '<', a new attribute is started with '<' as first character.
Shouldn't this not trigger a new element while reporting a parse error ?

2) In "Data state" (section 8.2.4.1 right now), on encountering 'U+0000', the current input character is emitted. Everywhere else, it is replaced with U+FFFD. Is this on purpose ? Or a typo ?

3) In "Bogus comment state" (section 8.2.4.44 right now), it would be good if it could be reworded for clarity. As stated, it requires very careful reading to decipher its meaning.

4) In "Bogus comment state" (section 8.2.4.44 right now), if we encounter an EOF, is it not a parse error ? (it delegates to DATA state, where it is not a parse error iirc).

5) Comment (1), if valid, affects pre-parser logic too (to find encoding).

6) In "Determining the character encoding" (section 8.2.2.1 right now), under step 5 (the algo to find encoding from html content) :
Under sub-step 1, case '<meta', point 12 which currently says -
"If mode is true but got pragma is false, then jump to the second step of the overall "two step" algorithm."
Here, 'mode' is undefined from what I saw : I assume it is supposed to be 'need pragma' ?

6.1) In point 13 from same snippet from (6) above, we have :
"If charset is a UTF-16 encoding, change the value of charset to UTF-8."
What if it is explicitly set to utf-16LE or utf-16BE ? Should it be changed too ? Or only for 'utf-16' ?

7) In "get an attribute" (#concept-get-attributes-when-sniffing : section 8.2.2.1 algo in main step 5) : currently a value can end on a whitespace or '>'. What about '/' ? Currently, the '/' will get added to the value ... This is applicable in two places in that algo : step 10 and step 11.

Thanks,
Mridul

47833 1 bzbarsky 2011-05-01 16:52:04 +0000

For #1, that behavior is actually purposeful: that's what IE does in that situation, for example.

48416 2 mridul 2011-05-10 10:44:09 +0000

Any comments or updates on this ?

51593 3 ian 2011-07-28 00:48:40 +0000

For #6, see diff below.

51594 4 contributor 2011-07-28 00:49:04 +0000

Checked in as WHATWG revision r6334.
Check-in comment: forgot to fix this one in r5764
http://html5.org/tools/web-apps-tracker?from=6333&to=6334

51596 5 ian 2011-07-28 00:50:40 +0000

For #7, could you elaborate on what difference it could make?

51730 6 mridul 2011-07-29 06:03:32 +0000

The difference would be whether the value will contain the / or not.
Not to mention, whether it causes the start tag to also end or not ( /> case).

51825 7 Ms2ger 2011-07-30 19:44:48 +0000

In the case of <a href=http://www.example.com/>Example</a>, the spec does make sense, IMO.

53962 8 mike 2011-08-04 05:34:53 +0000

mass-move component to LC1

55360 9 ian 2011-08-17 22:19:14 +0000

As a general rule, in the future, please file one issue per bug.

> 1) In "Before attribute name state" (section 8.2.4.34 right now), on
> encountering '<', a new attribute is started with '<' as first character.
> Shouldn't this not trigger a new element while reporting a parse error ?
> 5) Comment (1), if valid, affects pre-parser logic too (to find encoding).

It's an error, exactly what happens doesn't matter so much. I think the current behaviour is more consistent with widespread legacy implementations and is mildly more secure when it comes to XSS attacks.

> 2) In "Data state" (section 8.2.4.1 right now), on encountering 'U+0000', the
> current input character is emitted. Everywhere else, it is replaced with
> U+FFFD. Is this on purpose ? Or a typo ?

It's on purpose; the tree construction takes care of it for those cases.

> 3) In "Bogus comment state" (section 8.2.4.44 right now), it would be good if
> it could be reworded for clarity. As stated, it requires very careful reading
> to decipher its meaning.

Please file a separate bug for this with more detail about exactly what needs clarifying. In general, very careful reading is to be encouraged. ;-)

> 4) In "Bogus comment state" (section 8.2.4.44 right now), if we encounter an
> EOF, is it not a parse error ? (it delegates to DATA state, where it is not a
> parse error iirc).

Once you hit the bogus comment state you've already hit a parse error so it doesn't matter.

> 6) In "Determining the character encoding" (section 8.2.2.1 right now), under
> step 5 (the algo to find encoding from html content) :
> Under sub-step 1, case '<meta', point 12 which currently says -
> "If mode is true but got pragma is false, then jump to the second step of the
> overall "two step" algorithm."
> Here, 'mode' is undefined from what I saw : I assume it is supposed to be 'need
> pragma' ?

Fixed; see comment 4.

> 6.1) In point 13 from same snippet from (6) above, we have :
> "If charset is a UTF-16 encoding, change the value of charset to UTF-8."
> What if it is explicitly set to utf-16LE or utf-16BE ? Should it be changed too
> ? Or only for 'utf-16' ?

UTF-16LE and UTF-16BE are both UTF-16 encodings.

> 7) In "get an attribute" (#concept-get-attributes-when-sniffing : section
> 8.2.2.1 algo in main step 5) : currently a value can end on a whitespace or
> '>'. What about '/' ? Currently, the '/' will get added to the value ... This
> is applicable in two places in that algo : step 10 and step 11.

Could you show a concrete example of a Web page that would be processed differently based on this difference? I don't fully understand the implications here.

I'm leaving this bug open for point 7. Please open separate bugs for the other points if the above is not sufficient resolution.

56297 10 ian 2011-09-04 17:53:55 +0000

EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are satisfied with this response, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document:
http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Rejected
Change Description: no spec change
Rationale: As far as I can tell, the algorithm referenced in comment 7 matches the HTML parsing algorithm for the "/" case, so no change is needed here.
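
For readers following point 7, here is a minimal sketch of the behaviour under discussion, using the example from comment 7. It is not the spec algorithm: `read_unquoted_value` is a hypothetical helper, and it assumes only what point 7 states, namely that an unquoted value ends at whitespace or '>' and that '/' is therefore kept as part of the value.

```python
from typing import Tuple

# Characters assumed to end an unquoted value, per point 7:
# whitespace (tab, LF, FF, CR, space) or '>'. Note that '/' is NOT in this set.
WHITESPACE = "\t\n\x0c\r "


def read_unquoted_value(text: str, start: int) -> Tuple[str, int]:
    """Hypothetical helper: return (value, index of the terminating character)."""
    pos = start
    while pos < len(text) and text[pos] not in WHITESPACE and text[pos] != ">":
        pos += 1
    return text[start:pos], pos


# Example markup taken from comment 7.
markup = "<a href=http://www.example.com/>Example</a>"
start = markup.index("href=") + len("href=")
value, end = read_unquoted_value(markup, start)

print(value)        # http://www.example.com/
print(markup[end])  # >
```

Under these assumptions the trailing '/' stays in the href value and the '>' closes the start tag, which is the outcome comment 7 describes as sensible for this markup.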