This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Bug 12576 - Need clarification on tokenization of html 5 doc.
Summary: Need clarification on tokenization of html 5 doc.
Status: RESOLVED WONTFIX
Alias: None
Product: HTML WG
Classification: Unclassified
Component: LC1 HTML5 spec
Version: unspecified
Hardware: All All
Importance: P2 normal
Target Milestone: ---
Assignee: Ian 'Hixie' Hickson
QA Contact: HTML WG Bugzilla archive list
Reported: 2011-04-30 13:53 UTC by Mridul
Modified: 2011-09-04 17:53 UTC

Description Mridul 2011-04-30 13:53:00 UTC
I was going over the tokenization sections of the HTML5 spec at http://dev.w3.org/html5/spec/Overview.html (the version as of today).
I have the following queries and comments; clarification on them would be great.


1) In "Before attribute name state" (section 8.2.4.34 right now), on encountering '<', a new attribute is started with '<' as first character.
Shouldn't this not trigger a new element while reporting a parse error ?


2) In "Data state" (section 8.2.4.1 right now), on encountering 'U+0000', the current input character is emitted. Everywhere else, it is replaced with U+FFFD. Is this on purpose ? Or a typo ?


3) In "Bogus comment state" (section 8.2.4.44 right now), it would be good if it could be reworded for clarity. As stated, it requires very careful reading to decipher its meaning.


4) In "Bogus comment state" (section 8.2.4.44 right now), if we encounter an EOF, is it not a parse error ? (it delegates to DATA state, where it is not a parse error iirc).


5) Comment (1), if valid, affects pre-parser logic too (to find encoding).


6) In "Determining the character encoding" (section 8.2.2.1 right now), under step 5 (the algo to find encoding from html content) :
Under sub-step 1, case '<meta', point 12 which currently says -
"If mode is true but got pragma is false, then jump to the second step of the overall "two step" algorithm."
Here, 'mode' is undefined from what I saw : I assume it is supposed to be 'need pragma' ?

6.1) In point 13 from same snippet from (6) above, we have : 
"If charset is a UTF-16 encoding, change the value of charset to UTF-8."
What if it is explicitly set to utf-16LE or utf-16BE ? Should it be changed too ? Or only for 'utf-16' ?


7) In "get an attribute" (#concept-get-attributes-when-sniffing : section 8.2.2.1 algo in main step 5) : currently a value can end on a whitespace or '>'. What about '/' ? Currently, the '/' will get added to the value ... This is applicable in two places in that algo : step 10 and step 11.


Thanks,
Mridul
Comment 1 Boris Zbarsky 2011-05-01 16:52:04 UTC
For #1, that behavior is actually purposeful: that's what IE does in that situation, for example.
Comment 2 Mridul 2011-05-10 10:44:09 UTC
Any comments or updates on this ?
Comment 3 Ian 'Hixie' Hickson 2011-07-28 00:48:40 UTC
For #6, see diff below.
Comment 4 contributor 2011-07-28 00:49:04 UTC
Checked in as WHATWG revision r6334.
Check-in comment: forgot to fix this one in r5764
http://html5.org/tools/web-apps-tracker?from=6333&to=6334
Comment 5 Ian 'Hixie' Hickson 2011-07-28 00:50:40 UTC
For #7, could you elaborate on what difference it could make?
Comment 6 Mridul 2011-07-29 06:03:32 UTC
The difference would be whether the value will contain the / or not.
Not to mention whether it also causes the start tag to end or not (the /> case).
Comment 7 Ms2ger 2011-07-30 19:44:48 UTC
In the case of <a href=http://www.example.com/>Example</a>, the spec does make sense, IMO.
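A minimal sketch of the behaviour under discussion, assuming the rule as described in point 7: an unquoted value ends only at whitespace or '>'. The function name scan_unquoted_value and the WHITESPACE constant are hypothetical, and quoted values and case folding are deliberately left out, so this is an illustration rather than the spec's algorithm.

```python
from typing import Tuple

# Bytes treated as whitespace in this sketch: tab, LF, FF, CR, space.
WHITESPACE = b"\t\n\x0c\r "


def scan_unquoted_value(data: bytes, pos: int) -> Tuple[str, int]:
    """Read an unquoted attribute value starting at index pos.

    Stops at whitespace or '>' only, so a '/' is kept as part of the value.
    Returns the value and the index of the byte that ended it.
    """
    start = pos
    while pos < len(data) and data[pos] not in WHITESPACE and data[pos] != ord(">"):
        pos += 1
    return data[start:pos].decode("ascii", "replace"), pos


markup = b"<a href=http://www.example.com/>Example</a>"
value, end = scan_unquoted_value(markup, markup.index(b"=") + 1)
print(value)  # the trailing '/' stays in the value
```

With the markup from this comment, the trailing '/' ends up in the href value and the scan stops at the '>' that closes the start tag, so the tag is not treated as self-closing.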
Comment 8 Michael[tm] Smith 2011-08-04 05:34:53 UTC
mass-move component to LC1
Comment 9 Ian 'Hixie' Hickson 2011-08-17 22:19:14 UTC
As a general rule, in the future, please file one issue per bug.

> 1) In "Before attribute name state" (section 8.2.4.34 right now), on
> encountering '<', a new attribute is started with '<' as first character.
> Shouldn't this not trigger a new element while reporting a parse error ?
> 5) Comment (1), if valid, affects pre-parser logic too (to find encoding).

It's an error, so exactly what happens doesn't matter so much. I think the current behaviour is more consistent with widespread legacy implementations and is mildly more secure when it comes to XSS attacks.


> 2) In "Data state" (section 8.2.4.1 right now), on encountering 'U+0000', the
> current input character is emitted. Everywhere else, it is replaced with
> U+FFFD. Is this on purpose ? Or a typo ?

It's on purpose; the tree construction stage takes care of it for those cases.


> 3) In "Bogus comment state" (section 8.2.4.44 right now), it would be good if
> it could be reworded for clarity. As stated, it requires very careful reading
> to decipher its meaning.

Please file a separate bug for this with more detail about exactly what needs clarifying. In general, very careful reading is to be encouraged. ;-)


> 4) In "Bogus comment state" (section 8.2.4.44 right now), if we encounter an
> EOF, is it not a parse error ? (it delegates to DATA state, where it is not a
> parse error iirc).

Once you hit the bogus comment state, you've already hit a parse error, so it doesn't matter.


> 6) In "Determining the character encoding" (section 8.2.2.1 right now), under
> step 5 (the algo to find encoding from html content) :
> Under sub-step 1, case '<meta', point 12 which currently says -
> "If mode is true but got pragma is false, then jump to the second step of the
> overall "two step" algorithm."
> Here, 'mode' is undefined from what I saw : I assume it is supposed to be 'need
> pragma' ?

Fixed; see comment 4.


> 6.1) In point 13 from same snippet from (6) above, we have : 
> "If charset is a UTF-16 encoding, change the value of charset to UTF-8."
> What if it is explicitly set to utf-16LE or utf-16BE ? Should it be changed too
> ? Or only for 'utf-16' ?

UTF-16LE and UTF-16BE are both UTF-16 encodings.
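As a rough illustration of that answer, the quoted step could be sketched as below. The function name adjust_meta_charset and the three-entry label set are assumptions made for this example; a real implementation would resolve labels through its full encoding-label table.

```python
# Hypothetical label set for illustration only; not the spec's label table.
UTF16_LABELS = {"utf-16", "utf-16le", "utf-16be"}


def adjust_meta_charset(label: str) -> str:
    """Return the charset to use for a label found in a <meta> declaration.

    Any UTF-16 encoding, including the explicit LE and BE variants,
    is changed to UTF-8; other labels are returned normalized.
    """
    normalized = label.strip().lower()
    if normalized in UTF16_LABELS:
        return "utf-8"
    return normalized


for label in ("utf-16", "UTF-16LE", "utf-16be", "ISO-8859-1"):
    print(label, "->", adjust_meta_charset(label))
```

Under these assumptions, utf-16, utf-16LE and utf-16BE all map to utf-8, while an unrelated label passes through unchanged.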


> 7) In "get an attribute" (#concept-get-attributes-when-sniffing : section
> 8.2.2.1 algo in main step 5) : currently a value can end on a whitespace or
> '>'. What about '/' ? Currently, the '/' will get added to the value ... This
> is applicable in two places in that algo : step 10 and step 11.

Could you show a concrete example of a Web page that would be processed differently based on this difference? I don't fully understand the implications here.



I'm leaving this bug open for point 7. Please open separate bugs for the other points if the above is not sufficient resolution.
Comment 10 Ian 'Hixie' Hickson 2011-09-04 17:53:55 UTC
EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are satisfied with this response, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document:
   http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Rejected
Change Description: no spec change
Rationale: As far as I can tell, the algorithm referenced in comment 7 matches the HTML parsing algorithm for the "/" case, so no change is needed here.