Copyright © 2013 W3C® (MIT, ERCIM, Keio, Beihang), All Rights Reserved. W3C liability, trademark and document use rules apply.
This document adds an additional public identifier that should be recognised by XHTML user agents and cause the HTML character entity definitions to be loaded. Unlike the identifiers already listed by the [HTML5] specification, the identifier added by this extension references the set of definitions that is used by HTML.
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This document was published by the HTML Working Group as a First Public Working Draft. If you are not an HTML Working Group member and wish to make comments regarding this document, please send them to public-html-comments@w3.org. If you are an HTML Working Group member and wish to make comments regarding this document, please send them to public-html@w3.org. All feedback is welcome.
Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
1 Introduction
2 Addition to the HTML specification
3 Normative references
4 Background and Rationale (Non-Normative)
This specification is an extension to the HTML5 specification [HTML5]. It defines an additional Public Identifier that a conforming User Agent should recognise and that triggers the loading of the entities defined in [HTML5]. Unlike the Identifiers listed in the current HTML specification, the Identifier added here references the set of definitions that is used by HTML.
The key words must, must not, required, should, should not, recommended, may, and optional in this specification are to be interpreted as described in [RFC2119].
[HTML5] Section 9.2 currently defines the following list of public identifiers that cause a pre-defined list of entity definitions to be loaded:
-//W3C//DTD XHTML 1.0 Transitional//EN
-//W3C//DTD XHTML 1.1//EN
-//W3C//DTD XHTML 1.0 Strict//EN
-//W3C//DTD XHTML 1.0 Frameset//EN
-//W3C//DTD XHTML Basic 1.0//EN
-//W3C//DTD XHTML 1.1 plus MathML 2.0//EN
-//W3C//DTD XHTML 1.1 plus MathML 2.0 plus SVG 1.1//EN
-//W3C//DTD MathML 2.0//EN
-//WAPFORUM//DTD XHTML Mobile 1.0//EN
The only change proposed by this extension is to add a further identifier to the list such that it reads:
-//W3C//DTD XHTML 1.0 Transitional//EN
-//W3C//DTD XHTML 1.1//EN
-//W3C//DTD XHTML 1.0 Strict//EN
-//W3C//DTD XHTML 1.0 Frameset//EN
-//W3C//DTD XHTML Basic 1.0//EN
-//W3C//DTD XHTML 1.1 plus MathML 2.0//EN
-//W3C//DTD XHTML 1.1 plus MathML 2.0 plus SVG 1.1//EN
-//W3C//DTD MathML 2.0//EN
-//WAPFORUM//DTD XHTML Mobile 1.0//EN
-//W3C//ENTITIES HTML MathML Set//EN//XML
Note that the above list includes one other editorial change from the current HTML specification. The white space in each of the Identifiers should be the space character U+0020. The current HTML specification, however, uses the non-breaking space U+00A0, as shown in the first list above, and this does not work when an Identifier is cut and pasted into a document.
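The effect of the white space difference can be sketched in a few lines of Python. This is only an illustration, assuming that a processor compares the public identifier to the list by exact string match; the names RECOGNISED_PUBLIC_IDS, loads_html_entities and find_non_breaking_spaces are hypothetical and do not come from [HTML5].

```python
# Illustrative sketch only: exact-match lookup of a public identifier.
# All names here are hypothetical and are not defined by [HTML5].
RECOGNISED_PUBLIC_IDS = frozenset({
    "-//W3C//DTD XHTML 1.0 Transitional//EN",
    "-//W3C//DTD XHTML 1.1//EN",
    "-//W3C//DTD XHTML 1.0 Strict//EN",
    "-//W3C//DTD XHTML 1.0 Frameset//EN",
    "-//W3C//DTD XHTML Basic 1.0//EN",
    "-//W3C//DTD XHTML 1.1 plus MathML 2.0//EN",
    "-//W3C//DTD XHTML 1.1 plus MathML 2.0 plus SVG 1.1//EN",
    "-//W3C//DTD MathML 2.0//EN",
    "-//WAPFORUM//DTD XHTML Mobile 1.0//EN",
    # The identifier added by this extension.
    "-//W3C//ENTITIES HTML MathML Set//EN//XML",
})


def loads_html_entities(public_id: str) -> bool:
    """Return True if the identifier matches the extended list exactly."""
    return public_id in RECOGNISED_PUBLIC_IDS


def find_non_breaking_spaces(public_id: str) -> list[int]:
    """Return the positions of any U+00A0 characters in the identifier."""
    return [i for i, ch in enumerate(public_id) if ch == "\u00a0"]


pasted = "-//W3C//ENTITIES\u00a0HTML MathML Set//EN//XML"
print(loads_html_entities(pasted))       # exact match fails
print(find_non_breaking_spaces(pasted))  # position of the stray U+00A0
```

An identifier pasted with a non-breaking space looks identical on screen but is a different string, which is why the list above uses U+0020 throughout.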
A major reason to use the XHTML rather than the HTML syntax for a document is to process the document using an XML-based toolchain, as well as in a browser. All the existing public identifiers listed identify DTDs that contain incompatible entity sets. Unless such a document is processed using a non-standard configuration or catalogue that overrides the DTD, data corruption or fatal errors due to undefined entities will occur.
The Identifier
-//W3C//ENTITIES HTML MathML Set//EN//XML
is the Public Identifier of the entity declaration file
http://www.w3.org/2003/entities/2007/htmlmathml-f.ent
defined by the XML Entity Definitions for Characters Recommendation [XMLEntities].
The entity definition file used by HTML is extracted from the same source (unicode.xml), so using this file ensures that XML and HTML processing of entity references is as similar as possible.
Authors (and authoring systems of XHTML documents) may prefer to use no DTD and just use numeric character references rather than entity references for characters, for instance, caf&#233; or caf&#xE9; rather than caf&eacute;.
If they wish to use the caf&eacute; form then they should use the form
<!DOCTYPE html PUBLIC "-//W3C//ENTITIES HTML MathML Set//EN//XML" "some/path/htmlmathml-f.ent" >
where some/path/htmlmathml-f.ent is a URL to a copy of http://www.w3.org/2003/entities/2007/htmlmathml-f.ent.
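For authoring systems that take the first route (no DTD, numeric character references only), the conversion might be sketched as follows. This is a minimal illustration, not part of this specification: it assumes the HTML5 table shipped in Python's standard html.entities module as the source of expansions, and the function name to_numeric_references is hypothetical. It works on raw text with a regular expression, so it does not distinguish comments or CDATA sections.

```python
import re
from html.entities import html5  # HTML5 named character reference table

# The five entities predefined by XML itself need no DTD and are left alone.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

_ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def to_numeric_references(markup: str) -> str:
    """Rewrite named entity references as hexadecimal character references.

    Hypothetical helper: unknown names are left untouched so that a later
    XML parse still reports them.
    """
    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name in XML_PREDEFINED:
            return match.group(0)
        expansion = html5.get(name + ";")
        if expansion is None:
            return match.group(0)
        # Some names expand to more than one code point.
        return "".join(f"&#x{ord(ch):X};" for ch in expansion)

    return _ENTITY_REF.sub(replace, markup)


print(to_numeric_references("caf&eacute; &amp; th&eacute;"))
```

The output contains only numeric references and XML's predefined entities, so it parses the same way with or without a DOCTYPE.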
If a user agent supports the extended list specified in this document then authors may use entity references such as &lang; or &loang; and the document will be processed using the same characters (⟨, U+27E8, and ⟬, U+27EC) in XHTML agents and general XML processing tools.
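The behaviour of a general XML tool once the declarations are in scope can be approximated with Python's standard parser. As an assumption made purely to keep the sketch self-contained, the two declarations are copied into an internal subset instead of being loaded from htmlmathml-f.ent; the function name paragraph_text is hypothetical.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Assumption: the two declarations below stand in for the external entity
# file, so that the example runs without any network or file access.
DOCUMENT = (
    "<!DOCTYPE html [\n"
    '  <!ENTITY lang "&#x27E8;">\n'
    '  <!ENTITY loang "&#x27EC;">\n'
    "]>\n"
    f'<html xmlns="{XHTML_NS}"><body><p>&lang;x&loang;</p></body></html>'
)


def paragraph_text(document: str) -> str:
    """Parse the document and return the text of its first p element."""
    root = ET.fromstring(document)
    paragraph = root.find(f".//{{{XHTML_NS}}}p")
    return paragraph.text


text = paragraph_text(DOCUMENT)
print([f"U+{ord(ch):04X}" for ch in text])
```

The parser expands both references to the characters that an XHTML user agent would use, so no divergence arises between the two processing paths.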
By contrast, if the document uses one of the Identifiers in the existing list, the results differ between XHTML User Agents and general XML tools.
If the document uses
-//W3C//DTD XHTML 1.1//EN
then, in an XHTML User Agent, &lang; and &loang; are recognised as ⟨ (U+27E8) and ⟬ (U+27EC). However, if the document is processed by a standard XML parser that references an XHTML 1 DTD, &lang; would be expanded to &#x2329;, which is not in Unicode NFC form. If normalised, data corruption would occur, as this character is normalised to the Asian punctuation character &#x3008;. For a document which references the XHTML 1 DTD and uses &loang;, the entity is undefined, so this would be a fatal XML parsing error and the entire document would be rejected if processed with standard XML tools.
If the document uses
-//W3C//DTD XHTML 1.1 plus MathML 2.0//EN
then, in an XHTML User Agent, &lang; and &loang; are recognised as above. However, if the document is processed by a standard XML parser that references the XHTML 1.1 plus MathML 2 DTD and uses &loang;, the entity would expand to the Asian punctuation character 〘 (U+3018), so silent data modification would occur. (The character U+27EC was added to Unicode 5.1 after the HTML 1.1 Recommendation was published precisely to avoid this problem.)
There are similar problems with all the Identifiers in the list currently specified by HTML. Appendix C of [XMLEntities] lists all the changes between the HTML5 Entity set and the earlier XHTML 1 and MathML2 sets.
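Both failure modes described above can be reproduced with Python's standard library. This is only a sketch: the helper names nfc_changes and is_well_formed are hypothetical, and the undefined-entity case is shown with no DTD at all, which a standard parser treats in the same way as a DTD that lacks the declaration.

```python
import unicodedata
import xml.etree.ElementTree as ET


def nfc_changes(character: str) -> bool:
    """Return True if NFC normalisation replaces the character."""
    return unicodedata.normalize("NFC", character) != character


def is_well_formed(document: str) -> bool:
    """Return True if a standard XML parser accepts the document."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


# U+2329 (the XHTML 1 expansion of lang) is replaced by U+3008 under NFC,
# while U+27E8 (the HTML expansion) is left unchanged.
print(nfc_changes("\u2329"), nfc_changes("\u27e8"))
print(f"U+{ord(unicodedata.normalize('NFC', chr(0x2329))):04X}")

# A reference to an entity with no declaration in scope is a fatal error;
# the equivalent numeric character reference is always accepted.
print(is_well_formed("<p>&loang;</p>"), is_well_formed("<p>&#x27EC;</p>"))
```

The first case is the more dangerous of the two, since normalisation changes the data without raising any error.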
The main risk of adding a new recognised Identifier is that older HTML User Agents, and other User Agents not conforming to this specification, will not recognise the Identifier. This is unavoidable, but is no worse than the current situation where, as noted in the previous section, the identifiers that are recognised can still lead to undefined entities and, worse, silent data corruption. The list (even as amended by this extension specification) is not compatible with the behaviour of all legacy user agents. A good example of this is the XHTML version of Chapter 2 of the MathML2 Recommendation. This was processed by all relevant user agents at the time of its publication but is rejected as not well formed by user agents conforming to HTML. It would still be rejected by User Agents conforming to this extension. However, this extension makes it possible to change the document to use a declaration that will work for current XHTML and XML parsers, a possibility that is not available with HTML as currently published.
It could be noted that a form compatible with this extension specification, and at least some older User Agents such as Firefox 1–3, or Internet Explorer 6+, is:
<!DOCTYPE html PUBLIC "-//W3C//ENTITIES HTML MathML Set//EN//XML" "mathml.dtd" >
where mathml.dtd is the copy of htmlmathml-f.ent as recommended above.
Early Firefox (and Mozilla/Netscape) would use their bundled entity declarations for this usage, and Internet Explorer would fetch the specified DTD. Browser behaviour has changed only since the publication of the early drafts of HTML5, so in Firefox 4+ and browsers of a similar era it is not possible to reference an HTML5 compatible set of entity declarations in a way that causes those entities to be defined in an XHTML5 user agent. This was raised as [Bug13409] on the HTML5 specification.