Bug 17798 - Defining Entity references for characters in XHTML.
Status: RESOLVED WONTFIX
Product: WHATWG
Classification: Unclassified
Component: HTML
unspecified
Other other
: P3 normal
: Unsorted
Assigned To: Ian 'Hixie' Hickson
Reported: 2012-07-18 04:36 UTC by contributor
Modified: 2012-07-22 22:53 UTC (History)
Description contributor 2012-07-18 04:36:59 UTC
This was cloned from bug 13409 as part of operation convergence.
Originally filed: 2011-07-28 15:47:00 +0000
Original reporter: David Carlisle <davidc@nag.co.uk>

================================================================================
 #0   David Carlisle                                  2011-07-28 15:47:01 +0000 
--------------------------------------------------------------------------------
http://www.w3.org/TR/2011/WD-html5-20110525/the-xhtml-syntax.html#parsing-xhtml-documents

Parsing XHTML documents

Defines a list of 9 (fairly obsolete) DTD public identifiers for which
the XML parser should fetch a predefined set of (html5) character
entity definitions.

This list encourages the use of non-conforming (for XHTML5) DTDs. It
would be preferable if the html5 entity definitions were also loaded
for the standard HTML5 doctype declaration <!DOCTYPE html>, or for all
doctypes, thus removing the need for this list. (XML parsers not in a
browser could in most cases be configured to behave this way using a
suitable catalog, but as in the current text this can be a "should"
requirement, to allow XML parsers that do not read the definitions to be
conforming.)
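
The catalog idea for non-browser parsers can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's bundled expat bindings, the two-entry ENTITY_DECLS string is a hypothetical stand-in for the full HTML entity set, and the CATALOG mapping is invented for the example rather than taken from any real catalog file.

```python
from xml.parsers import expat

# Hypothetical, tiny stand-in for the HTML entity definitions.
ENTITY_DECLS = '<!ENTITY times "&#xD7;"><!ENTITY nbsp "&#xA0;">'

# Hypothetical catalog: public identifier -> entity declarations to load.
CATALOG = {
    "-//W3C//DTD XHTML 1.0 Strict//EN": ENTITY_DECLS,
}


def parse_text(document: str) -> str:
    """Return the character data of `document`, resolving catalog entities."""
    chunks = []
    parser = expat.ParserCreate()
    # Ask expat to report the external DTD subset instead of skipping it.
    parser.SetParamEntityParsing(expat.XML_PARAM_ENTITY_PARSING_ALWAYS)

    def resolve(context, base, system_id, public_id):
        decls = CATALOG.get(public_id)
        if decls is not None:
            # Parse the catalog's declarations in place of the referenced DTD.
            sub = parser.ExternalEntityParserCreate(context)
            sub.Parse(decls, True)
        return 1  # tell expat the external reference was handled

    parser.ExternalEntityRefHandler = resolve
    parser.CharacterDataHandler = chunks.append
    parser.Parse(document, True)
    return "".join(chunks)


DOC = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>2 &times; 3</body></html>'
)
```

Here the DTD named in the doctype is never fetched; the parser substitutes the catalog's declarations whenever it sees a known public identifier.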
================================================================================
 #1   Henri Sivonen                                   2011-07-29 14:58:26 +0000 
--------------------------------------------------------------------------------
Status: Rejected
Change Description: no spec change
Rationale: Using entity references in an XML document bearing a doctype whose public id is not on the list of special public ids currently listed in the spec wouldn't be compatible with deployed browsers. In other words, such content wouldn't Degrade Gracefully. The allegation that the list encourages the use of a non-conforming (for XHTML5) DTD is incorrect. DTDs are part of the XML syntax and, thus, are part of the syntax layer that XHTML5 is built on top of. Since XML doesn't provide an abstraction for specs layered on top of XML to place constraints on the XML layer, XHTML5 has no jurisdiction to place constraints on the XML layer (even though it suggests a particular entity resolver configuration). XHTML5 conformance requirement cannot restrict the use of a DTD, so any DTD is conforming for XHTML5. Thus, the spec isn't encouraging the use of DTDs that are non-conforming for XHTML5.
================================================================================
 #3   David Carlisle                                  2011-08-08 22:32:37 +0000 
--------------------------------------------------------------------------------
(In reply to comment #1)
> EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are
> satisfied with this response, please change the state of this bug to CLOSED. If
> you have additional information and would like the editor to reconsider, 

Reopening as I note in the decision process this has to be done within 2 weeks, although I'm traveling so this response is less complete than otherwise might have been the case, will respond more fully if necessary later.

> Status: Rejected
> Change Description: no spec change
> Rationale: Using entity references in an XML document bearing a doctype whose
> public id is not on the list of special public ids currently listed in the spec
> wouldn't be compatible with deployed browsers.

I do not believe that browser behaviour was particularly consistent here, so the blanket statement that not using the exact list specified is incompatible with deployed browsers is misleading. IE, for example, fetched the external DTD by default when parsing XML, so it would define any entities defined in any specified DTD.

The XML version of the MathML2 spec, for example, does not use any of these public identifiers (it uses SYSTEM "mathml.dtd") and uses entity references, which did get resolved in IE and Firefox at least. (The MathML3 version doesn't use any entity references for characters.) Firefox nightly (and I assume any other browser using the html(5) spec) gives a fatal parse error on

http://www.w3.org/TR/MathML2/chapter3.xml

which did work in all versions up to firefox 3

and still works in IE9/MathPlayer

> In other words, such content
> wouldn't Degrade Gracefully. The allegation that the list encourages the use of
> a non-conforming (for XHTML5) DTD is incorrect.

It encourages the use of DTD that specify a syntax for previous versions of (x)html which is going to be pretty confusing for users especially in toolchains where the presence of a doctype triggers validation.

> DTDs are part of the XML syntax
> and, thus, are part of the syntax layer that XHTML5 is built on top of. Since
> XML doesn't provide an abstraction for specs layered on top of XML to place
> constraints on the XML layer, XHTML5 has no jurisdiction to place constraints
> on the XML layer (even though it suggests a particular entity resolver
> configuration). XHTML5 conformance requirement cannot restrict the use of a
> DTD, so any DTD is conforming for XHTML5.

I don't know why you say xhtml5 can not restrict the dtd, all previous versions have restricted the dtd that are considered valid (or conforming to use the current terminology) xhtml. 

> Thus, the spec isn't encouraging the
> use of DTDs that are non-conforming for XHTML5.

That's playing with words. As noted above, it encourages the use of DTDs that are aimed at older versions of html, and it removes the feature (which is cosmetic, but popular) of using named character references from any document that uses DTDs targeted at current versions of html or that uses the recommended versionless doctype <!DOCTYPE html>.
================================================================================
 #4   Henri Sivonen                                   2011-08-31 12:29:59 +0000 
--------------------------------------------------------------------------------
(In reply to comment #3)
> It encourages the use of DTD that specify a syntax for previous versions of
> (x)html which is going to be pretty confusing for users especially in
> toolchains where the presence of a doctype triggers validation.

It's confusing, sure, but the Degrade Gracefully principle limits what the magic can look like.

> I don't know why you say xhtml5 can not restrict the dtd, all previous versions
> have restricted the dtd that are considered valid (or conforming to use the
> current terminology) xhtml. 

The previous specs had layering violations.
================================================================================
 #5   David Carlisle                                  2011-08-31 12:53:17 +0000 
--------------------------------------------------------------------------------
Since there was (and is currently) very little interoperability in how browsers treated external dtd references in this situation, there is the possibility of defining at least one form that allows the entities to be defined but doesn't reference an obsolete version of html or mathml.

The current spec (and its implementation in Firefox 4+) are not compatible with existing content, for example

http://www.w3.org/TR/MathML2/chapter3.xml


which as noted in comment #3 was compatible with Firefox [123] and is compatible with IE9 (and doesn't give a fatal parse error in Opera 11.5 or 12.00 pre-alpha, although entities are rendered as & n b s p ; for both opera versions)

Given that the existing implementations are not compatible with each other, or the current wording, it is hard to argue that interoperability or even the degrade gracefully principle would be greatly harmed by _allowing_ the entities to be defined for <!DOCTYPE html>.

Conversely, in many XML toolchains (including IE's xml parser) specifying an old dtd as required by the current wording will essentially corrupt the document as the system will use that definition and so replace entity references by their mathml2/xhtml1 definitions which are in a few cases (mainly relating to unicode normalization or later additions to Unicode) different from the definitions in html(5)/mathml3.
================================================================================
 #6   Ian 'Hixie' Hickson                             2011-12-02 17:18:17 +0000 
--------------------------------------------------------------------------------
Status: Rejected
Change Description: no spec change
Rationale: If the argument is that we should add URLs to the list of magic DTD URLs in the spec, then I'm happy to do that; all you have to do is show that the URL you want to add already works in the majority of deployed browsers.

Other than that, the premise of this bug seems incorrect, as Henri has noted.
================================================================================
 #7   David Carlisle                                  2011-12-22 19:08:27 +0000 
--------------------------------------------------------------------------------
(In reply to comment #6)

> Status: Rejected
> Change Description: no spec change
> Rationale: If the argument is that we should add URLs to the list of magic DTD
> URLs in the spec, then I'm happy to do that; all you have to do is show that
> the URL you want to add already works in the majority of deployed browsers.
> 
> Other than that, the premise of this bug seems incorrect, as Henri has noted.


Reopening. Holding off from the somewhat adversarial escalation process for now, but may choose to do that later.

As noted in earlier comments, this change broke existing pages for those implementations that previously supported MathML entities, such as Firefox (prior to 4) and IE/MathPlayer.

For example

http://www.w3.org/TR/2010/PR-MathML3-20100810/chapter2.xml

rendered in all relevant browsers at the time it was published but fails with a fatal parse error if processed according to the current html spec.

The XHTML version of the final MathML3 rec broke in the same way, but we edited that one in place to work around this unresolved bug, removing the entity references.

The current spec makes it impossible to use the doctype to specify processing relevant to current documents for the xml toolchain, while still allowing the file to be parsed by an xhtml system.

Any fixed list would have similar defects although of course extending it would patch individual cases. 

My preference would be to specify that the data URL with the entity definitions be loaded for any DOCTYPE with a PUBLIC id, for a page served as application/xhtml+xml.

This would allow easy use of catalogs to gain the same behaviour in standard xml toolchains, and would still allow an easy way for page authors to avoid the cost of entity processing, by not using a public ID or by using (say) application/xml.
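
The difference between the current rule and the preference stated above can be sketched as two predicates. This is an illustration only: the function names are hypothetical, LISTED_PUBLIC_IDS is a partial subset of the spec's list chosen for the example, and the media type comparison is a simplifying assumption that ignores parameters.

```python
from typing import Optional

# Illustrative subset of the public identifiers the spec lists; not the full list.
LISTED_PUBLIC_IDS = frozenset({
    "-//W3C//DTD XHTML 1.0 Strict//EN",
    "-//W3C//DTD XHTML 1.1 plus MathML 2.0//EN",
    "-//W3C//DTD MathML 2.0//EN",
})

XHTML_MEDIA_TYPE = "application/xhtml+xml"


def loads_entities_current(public_id: Optional[str]) -> bool:
    """Current rule: only doctypes whose public id is on the fixed list."""
    return public_id in LISTED_PUBLIC_IDS


def loads_entities_proposed(public_id: Optional[str], media_type: str) -> bool:
    """Proposed rule: any public id, but only for pages served as XHTML."""
    return public_id is not None and media_type.strip().lower() == XHTML_MEDIA_TYPE
```

Under the proposed rule, an author opts out of entity processing either by omitting the public id or by serving the page as, say, application/xml.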
================================================================================
 #8   John Thomas                                     2011-12-28 08:46:40 +0000 
--------------------------------------------------------------------------------
Why not allow <!DOCTYPE html PUBLIC "about:any public id" SYSTEM "any arbitrary url"> - browsers can skip to the ">" after hitting "about:" (I am likely terribly naive about how browsers work)? I'd prefer a solution that works for documents that don't serve with application/xhtml+xml as well as ones that do. Besides sloth (which in my defense is the greatest engine of efficiency, at least according to Cheaper by the Dozen), it would be nice if all basic parsing information for a document was in the document itself, but that's just my personal musings on things.

- John Thomas
================================================================================
 #9   Henri Sivonen                                   2012-01-09 08:58:56 +0000 
--------------------------------------------------------------------------------
That wouldn't Degrade Gracefully in existing browsers that hard-code a handful of public ids (and that set of ids doesn't contain "about:any public id").
================================================================================
 #10  David Carlisle                                  2012-01-09 09:20:04 +0000 
--------------------------------------------------------------------------------
The situation has always been (because of the freedom offered in the XML spec) that some browsers give fatal errors on documents that other browsers parse correctly. So "Degrade Gracefully" is going to be hard to achieve in general.
It is therefore a good thing that the html spec specifies how the resolution of external DTD references should work. However, the current specification makes no sense, as it states that to include the correct set of entity references you may _not_ refer to those but must instead refer to various different, incompatible, sets. At an absolute minimum there must be a way to refer to the current entity set in a way that does not use the FPI of an older, incompatible set. Without this it is more or less inevitable that documents will be corrupted while being parsed by an XML system outside the browser.

More important than preserving the behaviour of legacy browsers is preserving existing content, and having a mechanism going forward that allows the specification of the correct set of entities in a way that works in browsers.

The current spec does not preserve the existing content. (The XHTML versions of the MathML2 and MathML3 specs, which worked in IE and Firefox 3, break, as both of those browsers supported SYSTEM ids of "mathml.dtd".) Because of the variability in legacy behaviour noted above, though, in this case I'd accept breaking existing content if there was some acceptable thing to change it to.
================================================================================
 #11  Ian 'Hixie' Hickson                             2012-01-13 19:14:07 +0000 
--------------------------------------------------------------------------------
I'm happy to add new DTDs to the list; all you have to do is show that the URL you want to add already works in the majority of deployed browsers. To do that, please simply provide trivial test cases that use the DTD you want to have supported. It's not clear to me from this bug so far what exactly it is you want changed in the spec.
================================================================================
 #12  Julian Reschke                                  2012-01-13 19:23:23 +0000 
--------------------------------------------------------------------------------
Since when is "support by the majority of browsers" a requirement for something to be added to the spec?
================================================================================
 #13  David Carlisle                                  2012-01-13 23:46:54 +0000 
--------------------------------------------------------------------------------
(In reply to comment #11)
> I'm happy to add new DTDs to the list; all you have to do is show that the URL
> you want to add already works in the majority of deployed browsers.


Hixie, that's not really a reasonable precondition:-)

Deployed "current" browsers are converging on what it says in the spec, so finding one, let alone a "majority", that does something different is hard (all pre-html5 browsers did things differently from the spec and differently from each other, so "majority" usage doesn't really apply there either). Conversely, _all_ non-browser xml parsers will, if you specify one of the DTDs listed in the HTML spec, load the DTDs so specified, which means that they load different, incompatible definitions for the entities; thus silent data corruption can occur.

It is almost inevitable that xhtml documents will also be parsed by non-browser xml tools (otherwise you'd just use html) thus it's imperative that there be a way of specifying a document that is parsed with the current entity set in both browser and non browser use.
 
>  To do that,
> please simply provide trivial test cases that use the DTD you want to have
> supported.

The example of the MathML spec has been given several times already; it parsed interoperably in IE and Firefox (and Amaya and Netscape), thus in all MathML-aware browsers of its era. It's notable that this does not have a PUBLIC ID at all (so it is hard to fit into your current static list of PUBLIC IDs).
It had a SYSTEM id of "mathml.dtd"; this worked in IE, as it loads external DTDs, and worked in Netscape/Firefox, as they recognised SYSTEM URLs ending in mathml, in addition to certain magic public IDs, to load their MathML support.

> It's not clear to me from this bug so far what exactly it is you
> want changed in the spec.

As stated earlier, what I think would be best is if the fixed list of PUBLIC IDs were removed from the spec and the spec instead stated that _any_ external DTD be resolved to the data URL of the entity definitions.
I have asked for that before and you seem to keep pushing back on it. At a minimum, however, there has to be some way of specifying an HTML-compatible entity set so that the file is not corrupted when parsed by an off-browser XML parser. Thus the simplest minimal (but far from optimal) change that could be made to the spec would be to add the PUBLIC id for the html/mathml entity set,
i.e. the one at

http://www.w3.org/2003/entities/2007/htmlmathml-f.ent

which claims it has the PUBLIC ID

    Public identifier: -//W3C//ENTITIES HTML MathML Set//EN//XML
================================================================================
 #14  Ian 'Hixie' Hickson                             2012-02-01 00:44:08 +0000 
--------------------------------------------------------------------------------
Adding new DTDs is not backwards-compatible and provides no new abilities, hence the requirement that they be already-supported DTDs.
================================================================================
 #15  David Carlisle                                  2012-02-01 09:31:02 +0000 
--------------------------------------------------------------------------------
I'm sorry Ian, but that description just doesn't match the reality of the situation.

The description of any DTD in that list being "new" is highly mozilla-specific, and doesn't even match mozilla's behaviour.

Pre-HTML5, only Mozilla-based browsers special-cased the MathML DTD PUBLIC IDs in that way; IE (in its XML parsing mode) fetched any specified DTD, and others (as far as I know) didn't fetch the MathML ones.

Mozilla's list as implemented included useful version-agnostic dtd support in addition to the list of PUBLIC ids, as it special cased any dtd that had "mathml" in its SYSTEM ID. This is why the MathML 2 and MathML3 specs used to render perfectly OK, but now give fatal parse errors.  I don't see how you could cause that to happen, then claim "backward compatibility" as an excuse for not fixing it.
================================================================================
 #16  webreac                                         2012-05-04 13:22:02 +0000 
--------------------------------------------------------------------------------
Please stop your cracks !
The following page fails with the error "Entity 'times' not defined" on Google chrome when served as application/xhtml+xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html>
<html lang="en-us">
<head>
<meta charset="UTF-8" />
</head>
<body>Hello World &times;</body>
</html>

People need to use the standard "HTML5 doctype declaration" AND "application/xhtml+xml". Because of this buggy specification, html5 entities do not work anymore.

This is not serious! The HTML5 specification needs to be consistent! You must choose between the only two possible positions:
- Either the "standard HTML5 doctype declaration" implies the loading of "html5 entities"
- Or you declare that the "standard HTML5 doctype declaration" should be different when using xhtml!
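
The reported failure is not specific to one browser and can be reproduced with a non-browser XML parser. The sketch below is an illustration under stated assumptions: it uses Python's standard ElementTree parser rather than Chrome, the page is trimmed to the parts that matter for entity handling, and the numeric character reference variant is just one way of removing the entity reference, not something proposed in this bug.

```python
import xml.etree.ElementTree as ET

# The reported page, trimmed to the parts that matter for entity handling.
WITH_ENTITY = (
    '<!DOCTYPE html>'
    '<html lang="en-us"><body>Hello World &times;</body></html>'
)

# Same page with the entity reference replaced by a numeric character reference.
WITH_NUMERIC_REF = (
    '<!DOCTYPE html>'
    '<html lang="en-us"><body>Hello World &#xD7;</body></html>'
)


def body_text(document: str):
    """Return the text of <body>, or None if the document is not well-formed."""
    try:
        root = ET.fromstring(document)
    except ET.ParseError:
        # <!DOCTYPE html> declares no entities, so &times; is undefined here.
        return None
    return root.find("body").text
```

With these definitions, body_text(WITH_ENTITY) yields None because parsing fails, while body_text(WITH_NUMERIC_REF) yields the text with the multiplication sign.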
================================================================================
Comment 1 Ian 'Hixie' Hickson 2012-07-19 23:17:59 UTC
Re point 15; point 14 stands. I am unmoved by the points made in point 15.

Re point 16, the HTML syntax is irrelevant to the XML mode.