This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.
For XMLHttpRequest we want to use an instance of the HTML parser that does not require reparsing of content and does less of the legacy encoding stuff. See http://lists.w3.org/Archives/Public/public-webapps/2011JulSep/1562.html It would be nice if "parse HTML" had some flags we could set for this. Henri can probably provide more details if needed.
XHR2 could do this without HTML providing flags by saying: If final MIME type is text/html let document be Document object that represents the response entity body parsed following the rules set forth in the HTML specification for an HTML parser with scripting disabled, ignoring internal encoding declarations discovered by the tree builder (i.e. only honoring the <meta> prescan for the first 1024 bytes for internal encoding declarations), without applying frequency analysis and using UTF-8 as the _default encoding_ in the last step of the algorithm for determining the character encoding.
I can do that too, but I would prefer an explicit interface.
http://www.w3.org/mid/Pine.LNX.4.64.1109301802140.29849@ps20323.dreamhostps.com
http://lists.w3.org/Archives/Public/public-webapps/2011OctDec/0023.html As far as I can tell, what would be needed is to say that the "encoding sniffing algorithm" should use exactly 1024 bytes (or all of them, if that's less), stalling without exception until either an encoding is found or 1024 bytes are processed, bypassing the optional heuristics and always defaulting to UTF-8. I suppose I could add a flag to that algorithm to enable that, though it would make it even more complicated which isn't great. I'm still not sure relying on this algorithm is a good idea at all, though.
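To make the "exactly 1024 bytes (or all of them, if that's less)" stalling behaviour concrete, here is a minimal sketch of the buffering side only. The class name PrescanBuffer and its methods are hypothetical and are not taken from the HTML or XHR specs. The sketch does not model the early exit when an encoding is found before the window fills, and a real parser would of course keep the full byte stream for parsing as well.

```python
from typing import Optional

PRESCAN_LIMIT = 1024  # byte count named in the comment above


class PrescanBuffer:
    """Hypothetical helper: collects bytes until the prescan window is
    full or the stream has ended, and reports nothing before that."""

    def __init__(self, limit: int = PRESCAN_LIMIT) -> None:
        self.limit = limit
        self._buf = bytearray()
        self._closed = False

    def feed(self, chunk: bytes) -> None:
        # Keep only as many bytes as the prescan window still has room for.
        room = self.limit - len(self._buf)
        if room > 0:
            self._buf += chunk[:room]

    def close(self) -> None:
        # End of stream: whatever has arrived is all there will be.
        self._closed = True

    def window(self) -> Optional[bytes]:
        """Return the prescan window, or None while the caller must keep
        waiting (no timeout, matching the 'stalling' wording above)."""
        if len(self._buf) >= self.limit or self._closed:
            return bytes(self._buf)
        return None
```

With this shape, a short response becomes scannable only once close() is called, and a long one as soon as 1024 bytes have arrived, regardless of how the network splits the chunks.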
So what's the story here? Do you still need this? I'm really not sure how to do this sanely in the HTML spec, given how confusing it would make the algorithm.
Waiting for Gecko implementation experience.
Please reassign to me if you do still want a change.
Gecko does the following for HTML in XHR:
* If there's an HTTP-level charset, use that.
* Otherwise, if there's a BOM, use that.
* Otherwise, run the prescan algorithm over the first 1024 bytes and use the result if there is one.
* Otherwise, use UTF-8.

So the spec options that the XHR spec needs to flip in the HTML spec are:
* Use the prescan up to exactly 1024 bytes without a timeout. (FWIW, I think this should always be done.)
* Turn off the honoring of tree builder-discovered metas.
* Turn off heuristic detection.
* Clamp the last resort encoding to UTF-8 instead of a user-defined encoding.
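The precedence order described above can be sketched roughly as follows. This is an illustration only, not Gecko's code: the function name pick_encoding is made up, and the regular expression is a deliberately simplified stand-in for the real HTML prescan (which also skips comments, validates encoding labels, and has further special cases).

```python
import re
from typing import Optional

# Byte order marks checked in the second step.
BOMS = (
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xff\xfe", "utf-16le"),
    (b"\xfe\xff", "utf-16be"),
)

# Simplified stand-in for the prescan: finds a charset value inside a
# <meta> tag. The real algorithm in the HTML spec is considerably stricter.
META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE
)


def pick_encoding(http_charset: Optional[str], body: bytes) -> str:
    """Hypothetical helper following the four steps listed above."""
    # 1. An HTTP-level charset wins outright.
    if http_charset:
        return http_charset.lower()
    # 2. Otherwise a BOM decides.
    for bom, name in BOMS:
        if body.startswith(bom):
            return name
    # 3. Otherwise prescan only the first 1024 bytes.
    match = META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    # 4. Last resort is always UTF-8: no heuristics, no user default.
    return "utf-8"
```

Note that a <meta> declaration located past the first 1024 bytes is never seen by step 3, so such a document falls through to UTF-8.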
XMLHttpRequest also disables scripting support in the parser. If we are going to have explicit options that should be exposed too.
I might just create a new algorithm for this case and factor out the common bits, rather than putting if/else statements all over the algorithm.
That sounds fine. I think eventually we'll have a better overview of which parts of the platform hook into the parser and what kind of granularity they need.
The output should also include the determined encoding somehow so I can assign that to the Document object. Or the input should take a Document object whose concept-document-encoding is modified by the algorithm.
DOMParser <http://html5.org/specs/dom-parsing.html#dom-domparser-parsefromstring> and createContextualFragment <http://html5.org/specs/dom-parsing.html#dom-range-createcontextualfragment> need something like this as well.
Checked in as WHATWG revision r6990. Check-in comment: Factor out the prescan algorithm for reuse in other specs. http://html5.org/tools/web-apps-tracker?from=6989&to=6990
Ok, the XHR spec can now get what is described in comment 8 by creating a new algorithm that is just steps 2, 4, and 5 (with a different end condition) from the "encoding sniffing algorithm" in the HTML spec (these are all pretty short steps), followed by a final step to default to UTF-8, and by making sure the confidence is always marked "certain" (that overrides the <meta> stuff in the tree builder). Is that OK, or do you want me to define that algorithm in HTML? I mean, I can, but it'd be a bit weird since nothing in the HTML spec uses it. Ms2ger, do you need the same algorithm? If two specs need it, it makes more sense for me to provide it.
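As an illustration of why marking the confidence "certain" is enough to override the <meta> handling in the tree builder, here is a small model. The names EncodingState and on_tree_builder_meta are hypothetical, and the "tentative" branch is a simplification of the HTML spec's encoding-change behaviour (it leaves out reparsing entirely).

```python
from dataclasses import dataclass


@dataclass
class EncodingState:
    encoding: str
    confidence: str  # "tentative" or "certain"


def on_tree_builder_meta(state: EncodingState, declared: str) -> EncodingState:
    """Model of the tree builder meeting a <meta> charset declaration
    after the encoding has already been chosen."""
    if state.confidence == "tentative":
        # Ordinary page loads: the parser may switch to the declared
        # encoding (possibly reparsing), after which it is settled.
        return EncodingState(declared.lower(), "certain")
    # Confidence already "certain": the late declaration is ignored,
    # which is the behaviour XHR wants.
    return state
```

For example, a state of EncodingState("utf-8", "certain") is returned unchanged no matter what a later <meta> declares, so an algorithm that always ends with "certain" never triggers a reparse.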
That works for XHR.
Please define it in HTML, to make it clear it's the same algorithm. (I guess I could technically reference XHR, but that seems silly.)
I don't understand what DOMParser needs. It doesn't have a byte stream at the APIs you mention. As far as I can tell, it doesn't need this. So since what the spec now has is sufficient for XHR, I'm closing this bug. Please don't hesitate to reopen it or (better) file a new one if you need anything else or if I misunderstand DOMParser.

EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are satisfied with this response, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document: http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Accepted
Change Description: see diff given below
Rationale: Concurred with reporter's comments.
There we go: http://dvcs.w3.org/hg/xhr/rev/479e69c40ab2 See: http://dvcs.w3.org/hg/xhr/raw-file/tip/Overview.html#document-response-entity-body