14284 – Need HTML parser algorithm options

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Summary: Need HTML parser algorithm options

Status:	CLOSED FIXED

Alias:	None

Product:	HTML WG
Classification:	Unclassified
Component:	HTML5 spec
Version:	unspecified
Hardware:	PC All

Importance:	P2 normal
Target Milestone:	---
Assignee:	Ian 'Hixie' Hickson
QA Contact:	HTML WG Bugzilla archive list

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2011-09-25 16:48 UTC by Anne
Modified:	2012-02-16 10:55 UTC
CC List:	7 users

See Also:

Attachments

Description Anne 2011-09-25 16:48:31 UTC

For XMLHttpRequest we want to use an instance of the HTML parser that does not require reparsing of content and does less of the legacy encoding stuff.

See http://lists.w3.org/Archives/Public/public-webapps/2011JulSep/1562.html

It would be nice if "parse HTML" had some flags we could set for this. Henri can probably provide more details if needed.

Comment 1 Henri Sivonen 2011-09-26 05:22:36 UTC

XHR2 could do this without HTML providing flags by saying:
If final MIME type is text/html let document be a Document object that represents the response entity body parsed following the rules set forth in the HTML specification for an HTML parser with scripting disabled, ignoring internal encoding declarations discovered by the tree builder (i.e. only honoring the <meta> prescan for the first 1024 bytes for internal encoding declarations), without applying frequency analysis and using UTF-8 as the _default encoding_ in the last step of the algorithm for determining the character encoding.

Comment 2 Anne 2011-09-26 05:24:51 UTC

I can do that too, but I would prefer an explicit interface.

Comment 3 Ian 'Hixie' Hickson 2011-09-30 18:21:26 UTC

http://www.w3.org/mid/Pine.LNX.4.64.1109301802140.29849@ps20323.dreamhostps.com

Comment 4 Ian 'Hixie' Hickson 2011-10-04 00:02:19 UTC

http://lists.w3.org/Archives/Public/public-webapps/2011OctDec/0023.html

As far as I can tell, what would be needed is to say that the "encoding sniffing algorithm" should use exactly 1024 bytes (or all of them, if that's less), stalling without exception until either an encoding is found or 1024 bytes are processed, bypassing the optional heuristics and always defaulting to UTF-8.

I suppose I could add a flag to that algorithm to enable that, though it would make it even more complicated which isn't great.

I'm still not sure relying on this algorithm is a good idea at all, though.
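
The fixed window described in this comment (exactly 1024 bytes, or the whole body if it is shorter) can be sketched as below. This is only an illustration under assumptions: read_prescan_window and PRESCAN_LIMIT are hypothetical names, the stream is assumed to be a blocking binary stream, and the sketch does not stop early when an encoding is found.

```python
import io

PRESCAN_LIMIT = 1024  # byte count taken from the comment above


def read_prescan_window(stream, limit=PRESCAN_LIMIT):
    """Hypothetical helper: wait for `limit` bytes or the end of the stream.

    There is no timeout; the loop only ends when enough bytes have
    arrived or the stream reports that no more data exists.
    """
    chunks = []
    remaining = limit
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:  # end of stream: the body is shorter than `limit`
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


# Usage: a short body is returned whole, a long one is cut at 1024 bytes.
window = read_prescan_window(io.BytesIO(b"<!DOCTYPE html><p>short body</p>"))
```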

Comment 5 Ian 'Hixie' Hickson 2011-10-21 21:40:41 UTC

So what's the story here? Do you still need this? I'm really not sure how to do this sanely in the HTML spec, given how confusing it would make the algorithm.

Comment 6 Anne 2011-10-27 09:17:51 UTC

Waiting for Gecko implementation experience.

Comment 7 Ian 'Hixie' Hickson 2011-11-11 20:02:08 UTC

Please reassign to me if you do still want a change.

Comment 8 Henri Sivonen 2011-11-24 15:06:06 UTC

Gecko does the following for HTML in XHR:
 * If there's an HTTP-level charset, use that.
 * Otherwise, if there's a BOM, use that.
 * Otherwise, run the prescan algorithm over the first 1024 bytes and use the result if there is one.
 * Otherwise, use UTF-8.
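
A rough sketch of that decision order, not Gecko's actual code. The function name sniff_xhr_html_encoding is hypothetical, the BOM table covers only UTF-8 and UTF-16, and the regular expression is a crude stand-in for the HTML spec's prescan (which is a byte-level algorithm with more rules than shown here).

```python
import codecs
import re

# BOMs considered by this sketch (an assumption; not an exhaustive list).
BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16le"),
    (codecs.BOM_UTF16_BE, "utf-16be"),
)

# Simplified stand-in for the <meta> prescan; it only looks for a
# charset=... token inside a <meta ...> tag.
META_CHARSET = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE
)


def sniff_xhr_html_encoding(body, http_charset=None):
    """Return an encoding label following the four steps listed above."""
    if http_charset:                        # 1. HTTP-level charset wins
        return http_charset.lower()
    for bom, name in BOMS:                  # 2. otherwise a BOM
        if body.startswith(bom):
            return name
    match = META_CHARSET.search(body[:1024])  # 3. prescan, first 1024 bytes only
    if match:
        return match.group(1).decode("ascii").lower()
    return "utf-8"                          # 4. last resort is always UTF-8
```

Because only body[:1024] is searched, a <meta> declaration that starts after the first 1024 bytes has no effect in this sketch.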

So the spec options that the XHR spec needs to flip in the HTML spec are:
 * Use the prescan up to exactly 1024 bytes without a timeout. (FWIW, I think this should always be done.)
 * Turn off the honoring of tree builder-discovered metas.
 * Turn off heuristic detection.
 * Clamp the last resort encoding to UTF-8 instead of a user-defined encoding.
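
If these four switches were exposed as explicit options, they might look something like the following. This is purely illustrative: EncodingSniffOptions and its field names are invented here and do not come from the HTML or XHR specs, and the defaults are only a guess at "normal browsing" behaviour.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EncodingSniffOptions:
    """Hypothetical option set mirroring the four bullets above."""
    prescan_bytes: int = 1024
    prescan_may_time_out: bool = True      # False = always wait for the full window
    honor_tree_builder_meta: bool = True   # late <meta> may change the encoding
    heuristic_detection: bool = True       # e.g. frequency analysis
    fallback_encoding: Optional[str] = None  # None = user-defined default


# The configuration XHR would ask for, per the list above.
XHR_OPTIONS = EncodingSniffOptions(
    prescan_may_time_out=False,
    honor_tree_builder_meta=False,
    heuristic_detection=False,
    fallback_encoding="utf-8",
)
```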

Comment 9 Anne 2011-11-24 15:11:14 UTC

XMLHttpRequest also disables scripting support in the parser. If we are going to have explicit options that should be exposed too.

Comment 10 Ian 'Hixie' Hickson 2011-11-24 22:41:11 UTC

I might just create a new algorithm for this case and factor out the common bits, rather than putting if/else statements all over the algorithm.

Comment 11 Anne 2011-11-25 10:55:14 UTC

That sounds fine. I think eventually we'll have a better overview of which parts of the platform hook into the parser and what kind of granularity they need.

Comment 12 Anne 2012-02-06 21:25:09 UTC

The output should also include the determined encoding somehow so I can assign that to the Document object. Or the input should take a Document object whose concept-document-encoding is modified by the algorithm.

Comment 13 Ms2ger 2012-02-13 20:21:46 UTC

DOMParser <http://html5.org/specs/dom-parsing.html#dom-domparser-parsefromstring> and createContextualFragment <http://html5.org/specs/dom-parsing.html#dom-range-createcontextualfragment> need something like this as well.

Comment 14 contributor 2012-02-13 21:07:16 UTC

Checked in as WHATWG revision r6990.
Check-in comment: Factor out the prescan algorithm for reuse in other specs.
http://html5.org/tools/web-apps-tracker?from=6989&to=6990

Comment 15 Ian 'Hixie' Hickson 2012-02-13 21:13:11 UTC

Ok, the XHR spec can now get what is described in comment 8 by creating a new algorithm that is just steps 2, 4, and 5 (with a different end condition) from the "encoding sniffing algorithm" in the HTML spec (these are all pretty short steps), followed by a final step to default to UTF-8, and by making sure the confidence is always marked "certain" (that overrides the <meta> stuff in the tree builder).

Is that OK, or do you want me to define that algorithm in HTML? I mean, I can, but it'd be a bit weird since nothing in the HTML spec uses it. Ms2ger, do you need the same algorithm? If two specs need it, it makes more sense for me to provide it.
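
Why marking the confidence "certain" overrides the tree builder's <meta> handling can be sketched as follows. This is a simplified model, not spec text: EncodingState and on_tree_builder_meta are hypothetical names, and the real "change the encoding" steps have more cases than the ones shown.

```python
from dataclasses import dataclass


@dataclass
class EncodingState:
    encoding: str
    confidence: str  # "tentative" or "certain"


def on_tree_builder_meta(state, declared):
    """Return True if a late <meta> declaration would change the encoding."""
    if state.confidence == "certain":
        return False                      # declaration is ignored entirely
    declared = declared.lower()
    if declared == state.encoding.lower():
        state.confidence = "certain"      # declaration agrees; nothing to redo
        return False
    state.encoding = declared             # tentative guess was wrong
    state.confidence = "certain"
    return True


# With XHR-style sniffing the state always starts out "certain",
# so a <meta> found by the tree builder never triggers a change.
xhr_state = EncodingState(encoding="utf-8", confidence="certain")
changed = on_tree_builder_meta(xhr_state, "windows-1252")
```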

Comment 16 Anne 2012-02-13 21:21:52 UTC

That works for XHR.

Comment 17 Ms2ger 2012-02-13 21:28:55 UTC

Please define it in HTML, to make it clear it's the same algorithm. (I guess I could technically reference XHR, but that seems silly.)

Comment 18 Ian 'Hixie' Hickson 2012-02-13 22:00:21 UTC

I don't understand what DOMParser needs. It doesn't have a byte stream at the APIs you mention. As far as I can tell, it doesn't need this.

So since what the spec now has is sufficient for XHR, I'm closing this bug. Please don't hesitate to reopen it or (better) file a new one if you need anything else or if I misunderstand DOMParser.


EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are satisfied with this response, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document:
   http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Accepted
Change Description: see diff given below
Rationale: Concurred with reporter's comments.

Comment 19 Anne 2012-02-16 10:55:34 UTC

There we go: http://dvcs.w3.org/hg/xhr/rev/479e69c40ab2

See: http://dvcs.w3.org/hg/xhr/raw-file/tip/Overview.html#document-response-entity-body