This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Bug 8268 - XMLHttpRequest fails for documents with named entities due to doctype
Status: CLOSED WORKSFORME
Alias: None
Product: HTML WG
Classification: Unclassified
Component: pre-LC1 HTML5 spec (editor: Ian Hickson)
Version: unspecified
Hardware: All All
Importance: P3 normal
Target Milestone: ---
Assignee: Ian 'Hixie' Hickson
QA Contact: HTML WG Bugzilla archive list
Keywords: NE
 
Reported: 2009-11-12 03:55 UTC by Aryeh Gregor
Modified: 2010-10-04 14:29 UTC

Description Aryeh Gregor 2009-11-12 03:55:06 UTC
Wikipedia just experimented with switching to an HTML5 doctype.  A lot of user tools broke, and after two hours of investigation, we determined that the problem is intractable and switched back to XHTML 1.0 Transitional.

XMLHttpRequest was historically intended only for XML, and lots of scripts rely on the responseXML property being set to a Document.  In current browsers, this only happens when the document is actually well-formed XML.  But named entities are treated differently based on the doctype.  Consider this document:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html><head>
<title>Hello</title>
</head>
<body>
<p>&nbsp;</p>
</body>
</html>

This works just fine in all browsers I tested in (latestish versions of Firefox, Chrome, Opera).  However, if you serve the exact same document but replace the doctype with <!DOCTYPE html>, all of them throw a syntax error on &nbsp;.
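
The doctype dependence described above can be reproduced outside a browser. The following is a minimal Python sketch, not XHR itself: it assumes Python's expat-based xml.etree.ElementTree as a stand-in for a browser's XML parser, and it assumes that predefining nbsp on the parser roughly mimics a browser that already knows the XHTML entity set. The name parse_like_xhr and the constants are hypothetical.

```python
import xml.etree.ElementTree as ET

# The two doctypes compared in this report.
XHTML_TRANSITIONAL = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">'
)
HTML5 = "<!DOCTYPE html>"

# Same body as the test document, without the line breaks.
BODY = "<html><head><title>Hello</title></head><body><p>&nbsp;</p></body></html>"


def parse_like_xhr(doctype):
    """Return the parsed root element, or None if the text is not well-formed XML.

    Assumption: a parser that "knows" nbsp stands in for a browser that
    recognises the XHTML entity set. The entity is predefined for both
    doctypes, so any difference in outcome comes from the doctype alone.
    """
    parser = ET.XMLParser()
    parser.entity["nbsp"] = "\u00a0"
    try:
        return ET.fromstring(doctype + "\n" + BODY, parser=parser)
    except ET.ParseError:
        # With no external subset declared, an undeclared entity is a
        # well-formedness error, so the predefined value is never consulted.
        return None
```

With the Transitional doctype the call returns a tree whose paragraph contains a no-break space. With <!DOCTYPE html> it returns None, because that doctype declares no external subset and the parser must treat &nbsp; as undefined.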

Practically speaking, this means that any site that wants to serve content compatible with XHR cannot use either of the two doctypes that the spec recommends for authors.  There are a variety of widely-used scripts on Wikipedia that rely on XHR, so this is currently a blocker for us.  It's very unlikely that we'll deploy HTML5 in the foreseeable future if it means our users have to rewrite all their scripts.  I'm pretty sure that XHR is used for screen-scraping beyond Wikipedia, too, so this will probably crop up elsewhere too.

I don't know what the extent of the magic is that causes this problem.  Could some reasonably minimal, distinctive doctype be invented that would avoid the problem but not make the document look to humans and validators like it thinks it's some old version of XHTML?  If an existing XHTML doctype must be reused, should validators continue to raise warnings as they do now, or should an XHTML doctype be promoted from "obsolete permitted DOCTYPE" to a fully permitted doctype?

Also, is this a wider problem?  Are there any other tools besides browsers that might be magically allowing named entities for some doctypes only?
Comment 1 Michael[tm] Smith 2009-11-12 04:00:21 UTC
Aryeh Gregor:
> Wikipedia just experimented with switching to an HTML5 doctype.  A lot of user
> tools broke, and after two hours of investigation, we determined that the
> problem is intractable and switched back to XHTML 1.0 Transitional.
> 
> XMLHttpRequest was historically intended only for XML, and lots of scripts rely
> on the responseXML property being set to a Document.  In current browsers, this
> only happens when the document is actually well-formed XML.  But named entities
> are treated differently based on the doctype.  Consider this document:
> 
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
> <html><head>
> <title>Hello</title>
> </head>
> <body>
> <p>&nbsp;</p>
> </body>
> </html>
> 
> This works just fine in all browsers I tested in (latestish versions of
> Firefox, Chrome, Opera).  However, if you serve the exact same document but
> replace the doctype with <!DOCTYPE html>, all of them throw a syntax error on
> &nbsp;.
> 
> Practically speaking, this means that any site that wants to serve content
> compatible with XHR cannot use either of the two doctypes that the spec
> recommends for authors.  There are a variety of widely-used scripts on
> Wikipedia that rely on XHR, so this is currently a blocker for us.  It's very
> unlikely that we'll deploy HTML5 in the foreseeable future if it means our
> users have to rewrite all their scripts.  I'm pretty sure that XHR is used for
> screen-scraping beyond Wikipedia, too, so this will probably crop up elsewhere
> too.
> 
> I don't know what the extent of the magic is that causes this problem.  Could
> some reasonably minimal, distinctive doctype be invented that would avoid the
> problem but not make the document look to humans and validators like it thinks
> it's some old version of XHTML?  If an existing XHTML doctype must be reused,
> should validators continue to raise warnings as they do now, or should an XHTML
> doctype be promoted from "obsolete permitted DOCTYPE" to a fully permitted
> doctype?
> 
> Also, is this a wider problem?  Are there any other tools besides browsers that
> might be magically allowing named entities for some doctypes only?
> 

[no comment, just repeating the problem description for purposes of echoing it to public-html]
Comment 2 Henri Sivonen 2009-11-12 07:58:21 UTC
(In reply to comment #0)
> It's very unlikely that we'll deploy HTML5 in the foreseeable future if it means our
> users have to rewrite all their scripts.

The XHTML 1.0 doctypes are conforming in HTML5 (even in text/html), so you could use the XHTML 1.0 Strict or XHTML 1.0 Transitional doctype and still validate stuff like <video> as HTML5.

> I'm pretty sure that XHR is used for
> screen-scraping beyond Wikipedia, too, so this will probably crop up elsewhere
> too.

Even though Mediawiki is hugely popular, there probably aren't too many code bases around that both serve polyglotish content as text/html now and that have an XML consumer ecosystem.

> Could
> some reasonably minimal, distinctive doctype be invented that would avoid the
> problem but not make the document look to humans and validators like it thinks
> it's some old version of XHTML?

No, since the list of doctypes that make &nbsp; work in already-shipped browsers is what it is. If you want compat with already shipped browsers, you have to use a doctype that is on the magic list in Gecko, WebKit and Opera.

OTOH, if you are OK with browsers changing, you could wait for XHR2 to add HTML parsing to XHR.

> Also, is this a wider problem?  Are there any other tools besides browsers that
> might be magically allowing named entities for some doctypes only?

It's very likely that there are tools that only support a closed catalog and, for security or performance reasons, refuse to fetch arbitrary DTDs.
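
As an illustration of what such a closed catalog might look like, here is a minimal Python sketch. It is an assumption-laden model, not the actual logic of Gecko, WebKit, Opera, or any named tool: the set MAGIC_PUBLIC_IDS is illustrative only, the function parse_with_catalog is hypothetical, and the HTML 4 entity table from Python's html.entities is used in place of the real XHTML DTD entity sets. No DTD is ever fetched.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# Illustrative catalog only; real browsers ship their own lists.
MAGIC_PUBLIC_IDS = {
    "-//W3C//DTD XHTML 1.0 Strict//EN",
    "-//W3C//DTD XHTML 1.0 Transitional//EN",
}

# Pulls the public identifier out of a <!DOCTYPE ... PUBLIC "..."> declaration.
PUBLIC_ID_RE = re.compile(r'<!DOCTYPE\s+\S+\s+PUBLIC\s+"([^"]*)"')


def parse_with_catalog(markup):
    """Parse markup as XML, predefining entities only for catalogued doctypes.

    Raises ET.ParseError when a named entity is used under a doctype that
    is not in the catalog (or under a doctype with no public identifier).
    """
    parser = ET.XMLParser()
    match = PUBLIC_ID_RE.search(markup)
    if match and match.group(1) in MAGIC_PUBLIC_IDS:
        # Known doctype: supply the entity table locally instead of
        # fetching the DTD named by the system identifier.
        parser.entity.update(
            {name: chr(codepoint) for name, codepoint in name2codepoint.items()}
        )
    return ET.fromstring(markup, parser=parser)
```

Under this model, a document with the XHTML 1.0 Strict doctype parses with &nbsp; intact, while the same document under an uncatalogued public identifier or under <!DOCTYPE html> raises a parse error.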

Considering that the XHTML 1.0 doctypes are already valid, I'm not convinced that removing the advertisement of the shorter doctype is the right thing on balance, since for your situation to occur, you must have mostly successfully served polyglotish content to begin with, and that's so hard that it's unlikely that many other code bases have succeeded in it.
Comment 3 Aryeh Gregor 2009-11-12 15:13:47 UTC
(In reply to comment #2)
> The XHTML 1.0 doctypes are conforming in HTML5 (even in text/html), so you
> could use the XHTML 1.0 Strict or XHTML 1.0 Transitional doctype and still
> validate stuff like <video> as HTML5.

AFAICT only Strict is conforming, not Transitional, right?  I assume the idea is to trigger full standards mode.

http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#obsolete-permitted-doctype

> Considering that the XHTML 1.0 doctypes are already valid, I'm not convinced
> that removing the advertisement of the shorter doctype is the right thing on
> balance, since for your situation to occur, you must have mostly successfully
> served polyglotish content to begin with, and that's so hard that it's
> unlikely that many other code bases have succeeded in it.

Apparently this is harder than I thought, yes.  So the resolution is that any HTML5 document that wants to work with XHR either has to raise a validator warning (due to obsolete but conforming doctype) or not use named entities?  Or would it be a good idea to make XHTML1 Strict (say) conforming and not obsolete, but say authors shouldn't use it unless they want to be compatible with named entities in XML?  Presumably XHR with named entities can't be more marginal a use-case than "HTML generators that cannot output HTML markup with the short DOCTYPE '<!DOCTYPE HTML>'".
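
The second option, not using named entities, could be approached by rewriting them as numeric character references before serving. The following Python sketch shows the idea under stated assumptions: to_numeric_references is a hypothetical helper, the entity table is the HTML 4 set from Python's html.entities rather than anything MediaWiki uses, and the regular expression is a simplification that does not skip comments, CDATA sections, or script content.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# The five entities every XML parser predefines; these can stay as they are.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def to_numeric_references(markup):
    """Replace HTML named entities with decimal character references.

    Unknown names and the XML-predefined entities are left untouched.
    """
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)
        return "&#%d;" % name2codepoint[name]

    return ENTITY_RE.sub(replace, markup)


def parse_short_doctype(fragment):
    """Wrap a converted fragment in a <!DOCTYPE html> document and parse it as XML."""
    document = "<!DOCTYPE html>\n<html><body>%s</body></html>" % (
        to_numeric_references(fragment)
    )
    return ET.fromstring(document)
```

Because numeric references need no entity declarations, the converted markup is well-formed XML under the short doctype, at the cost of less readable source.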
Comment 4 Henri Sivonen 2009-11-12 15:35:20 UTC
(In reply to comment #3)
> (In reply to comment #2)
> > The XHTML 1.0 doctypes are conforming in HTML5 (even in text/html), so you
> > could use the XHTML 1.0 Strict or XHTML 1.0 Transitional doctype and still
> > validate stuff like <video> as HTML5.
> 
> AFAICT only Strict is conforming, not Transitional, right?  I assume the idea
> is to trigger full standards mode.
> 
> http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#obsolete-permitted-doctype

Oops. You are right.

> > Considering that the XHTML 1.0 doctypes are already valid, I'm not convinced
> > that removing the advertisement of the shorter doctype is the right thing on
> > balance, since for your situation to occur, you must have mostly successfully
> > served polyglotish content to begin with, and that's so hard that it's
> > unlikely that many other code bases have succeeded in it.
> 
> Apparently this is harder than I thought, yes.  So the resolution is that any
> HTML5 document that wants to work with XHR either has to raise a validator
> warning (due to obsolete but conforming doctype) or not use named entities? 

Those options seem sensible.

> Or
> would it be a good idea to make XHTML1 Strict (say) conforming and not
> obsolete, but say authors shouldn't use it unless they want to be compatible
> with named entities in XML? 

How would you feel if Validator.nu issued a discretionary warning (warning put in at the discretion of the validator developer as opposed to the spec) saying that the doctype shouldn't be used unless you want entity compat with XHR?

Mediawiki is doing something unusual here. I don't see why you should have zero warnings when you are doing something unusual, even if conforming, if the warnings might be educational to someone else (specifically, authors who are unaware that there's now a memorable doctype, so they can stop copying and pasting lengthy incantations).

> Presumably XHR with named entities can't be more
> marginal a use-case than "HTML generators that cannot output HTML markup with
> the short DOCTYPE '<!DOCTYPE HTML>'".

Replacing about:legacy-compat with the XHTML 1.0 Strict doctype would fail to highlight that doctype is permitted only for legacy tools.
Comment 5 Henri Sivonen 2009-11-12 15:42:30 UTC
(In reply to comment #4)
> Replacing about:legacy-compat with the XHTML 1.0 Strict doctype would fail to
> highlight that doctype is permitted only for legacy tools.

That was worded badly. Trying again:
Replacing about:legacy-compat with the XHTML 1.0 Strict doctype would fail to highlight that the cruftier doctype exists only for the benefit of legacy tools. (The point is to make the alternative doctype unshiny so that pundits don't start preaching it for general usage.)
Comment 6 Aryeh Gregor 2009-11-12 15:51:44 UTC
(In reply to comment #4)
> How would you feel if Validator.nu issued a discretionary warning (warning put
> in at the discretion of the validator developer as opposed to the spec) saying
> that the doctype shouldn't be used unless you want entity compat with XHR?
> 
> Mediawiki is doing something unusual here. I don't see why you should have zero
> warnings if you are doing something unusual even if conforming if the warnings
> might be educational to someone else (specifically, authors who are unaware
> that there's now a doctype that's memorable so they can stop copying and
> pasting lengthy incantations).

I agree that use of the doctype should be discouraged.  On consideration, a warning does seem appropriate, as long as it's clear that it might not be an actual problem with this particular page.  Currently, validator.nu raises an error, which is wrong per the spec.

At this point I'm not sure any spec change is needed.
Comment 7 Henri Sivonen 2009-11-12 15:57:50 UTC
(In reply to comment #6)
> Currently, validator.nu raises an error, which is wrong per the spec.

I pasted the following in the text area on http://html5.validator.nu/ and I got a warning only.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
<head>
<title></title>
</head>
<body>
<p></p>
</body>
</html>

I get an error if I try the Transitional doctype, though, as I should per spec.
Comment 8 Aryeh Gregor 2009-11-12 16:17:23 UTC
(In reply to comment #7)
> (In reply to comment #6)
> > Currently, validator.nu raises an error, which is wrong per the spec.
> 
> I pasted the following in the text area on http://html5.validator.nu/ and I got
> a warning only.
> 
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
> "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
> <html>
> <head>
> <title></title>
> </head>
> <body>
> <p></p>
> </body>
> </html>
> 
> I get an error if I try the Transitional doctype, though, as I should per spec.

Okay, it seems like http://validator.nu/ raises an error (followed by lots more errors because it interprets it as XHTML 1.0 Strict), while http://html5.validator.nu/ raises only a warning.  So that's okay, but the warning message could maybe be clearer about why you might want to do this.
Comment 9 Aryeh Gregor 2009-12-16 12:06:31 UTC
Note that the solution Wikipedia is using is just to use an XHTML Strict doctype.  The spec permits this and there are no other practical solutions the spec could possibly permit, AFAIK, so at this point I think it would be fine if this were just closed.
Comment 10 Ian 'Hixie' Hickson 2010-01-10 11:01:14 UTC
EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are satisfied with this response, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document:
   http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Accepted
Change Description: no spec change
Rationale: Closing per previous comment.