Public Identifiers for entity resolution in XHTML

W3C Working Draft 28 February 2013

This version:
http://www.w3.org/TR/2013/WD-xhtml-pubid-20130228/
Latest version:
http://www.w3.org/TR/xhtml-pubid/
Latest editor's draft:
http://www.w3.org/2003/entities/2007doc/xhtml-pubid.html
Editor:
David Carlisle, NAG

Abstract

This document defines an additional public identifier that XHTML user agents should recognise and that should cause the HTML character entity definitions to be loaded. Unlike the identifiers already listed by the [HTML5] specification, the identifier added by this extension references the set of definitions that is used by HTML.

Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This document was published by the HTML Working Group as a First Public Working Draft. If you are not an HTML Working Group member and wish to comment on this document, please send your comments to public-html-comments@w3.org (subscribe, archives). If you are an HTML Working Group member and wish to comment on this document, please send your comments to public-html@w3.org (subscribe, archives). All feedback is welcome.

Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

Table of Contents

1 Introduction
2 Addition to the HTML specification
3 Normative references
4 Background and Rationale (Non-Normative)


1 Introduction

This specification is an extension to the HTML5 specification [HTML5]. It defines an additional public identifier that a conforming user agent should recognise as a trigger for loading the entities defined in [HTML5]. Unlike the identifiers listed in the current HTML specification, this identifier references the set of definitions that is used by HTML.

The key words must, must not, required, should, should not, recommended, may, and optional in this specification are to be interpreted as described in [RFC2119].

2 Addition to the HTML specification

[HTML5] Section 9.2 currently defines the following list of public identifiers that cause a pre-defined list of entity definitions to be loaded:

The only change proposed by this extension is to add a further identifier to the list such that it reads:

Note that the above list includes one other editorial change from the current HTML specification. The white space in each of the identifiers should be the space character U+0020; however, the current HTML specification uses the non-breaking space U+00A0, as shown in the first list above, so an identifier that is cut and pasted from it into a document does not work.
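
As a minimal sketch of why the white space matters, the following Python fragment compares an identifier written with U+00A0 against the same identifier written with U+0020. The helper name normalise_public_id and the choice of the XHTML 1.0 Strict identifier as sample data are assumptions made for illustration only; they are not part of this specification.

```python
NBSP = "\u00a0"   # non-breaking space
SPACE = "\u0020"  # ordinary space character


def normalise_public_id(public_id):
    """Replace every non-breaking space with an ordinary space (hypothetical helper)."""
    return public_id.replace(NBSP, SPACE)


# Sample identifier as it might arrive after cutting and pasting text that uses U+00A0.
pasted = "-//W3C//DTD\u00a0XHTML\u00a01.0\u00a0Strict//EN"
# The same identifier written with U+0020.
expected = "-//W3C//DTD XHTML 1.0 Strict//EN"

matches_as_pasted = pasted == expected
matches_after_fix = normalise_public_id(pasted) == expected
```

The two strings look the same when displayed, but an exact string comparison treats them as different identifiers until the non-breaking spaces are replaced.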

3 Normative references

[HTML5]
Robin Berjon, Travis Leithead, Erika Doyle Navara, Edward O'Connor, Silvia Pfeiffer. HTML 5. W3C Candidate Recommendation, 08 November 2012. URL: http://www.w3.org/html/wg/drafts/html/CR/Overview.html
[RFC2119]
S. Bradner. Key words for use in RFCs to Indicate Requirement Levels. March 1997. Internet RFC 2119. URL: http://www.ietf.org/rfc/rfc2119.txt
[XMLEntities]
David Carlisle, Patrick Ion. XML Entity Definitions for Characters. W3C Recommendation, 01 April 2010. URL: http://www.w3.org/TR/xml-entity-names/
[Bug13409]
David Carlisle. Bug 13409: Defining Entity references for characters in XHTML. URL: https://www.w3.org/Bugs/Public/show_bug.cgi?id=13409

4 Background and Rationale (Non-Normative)

A major reason to use the XHTML rather than the HTML syntax for a document is to process the document using an XML-based toolchain as well as in a browser. All the existing public identifiers listed identify DTDs that contain incompatible entity sets. Unless the document is processed using a non-standard configuration or a catalogue that overrides the DTD, data corruption or fatal errors due to undefined entities will occur.

The identifier -//W3C//ENTITIES HTML MathML Set//EN//XML is the public identifier of the entity declaration file http://www.w3.org/2003/entities/2007/htmlmathml-f.ent defined by the XML Entity Definitions for Characters Recommendation [XMLEntities]. The entity definition file used by HTML is extracted from the same source (unicode.xml), so using this file ensures that XML and HTML processing of entity references is as similar as possible.

Authors of XHTML documents (and authoring systems) may prefer to use no DTD and just use numeric character references rather than entity references for characters, for instance, caf&#233; or caf&#xE9; rather than caf&eacute;. If they wish to use the caf&eacute; form, they should use the following declaration:

 <!DOCTYPE html PUBLIC
         "-//W3C//ENTITIES HTML MathML Set//EN//XML"
         "some/path/htmlmathml-f.ent"
       >

where some/path/htmlmathml-f.ent is a URL to a copy of http://www.w3.org/2003/entities/2007/htmlmathml-f.ent.
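
The difference between the two forms can be sketched with a general XML parser. The example below uses Python's xml.etree.ElementTree, which does not load an external DTD, as a stand-in for "general XML processing tools"; the helper name parse_text is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET


def parse_text(xml_source):
    """Return the text of the root element, or None if the parser rejects the input."""
    try:
        return ET.fromstring(xml_source).text
    except ET.ParseError:
        # With no declaration available, a named entity reference is undefined.
        return None


# Numeric character references need no DTD.
decimal_ref = parse_text("<p>caf&#233;</p>")
hex_ref = parse_text("<p>caf&#xE9;</p>")

# A named entity reference with no declaration is a fatal error.
named_ref = parse_text("<p>caf&eacute;</p>")
```

Both numeric forms yield the same text, while the named form is rejected because nothing declares eacute. This is the situation the declaration above is intended to avoid.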

Benefits

If a user agent supports the extended list specified in this document then authors may use entity references such as &langle; or &loang; and the document will be processed using the same characters (&#x27E8; and &#x27EC;) in XHTML agents and general XML processing tools.
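
A minimal sketch of this agreement, under stated assumptions: a two-entity internal subset stands in for the full htmlmathml-f.ent file, Python's xml.etree.ElementTree stands in for a general XML processing tool, and html.unescape (which uses the HTML5 named character reference table) stands in for HTML-side processing. None of these choices is mandated by this specification.

```python
import html
import xml.etree.ElementTree as ET

# Assumption: this internal subset declares only the two entities discussed
# above, with the code points quoted in the text, in place of the full file.
XHTML_SOURCE = (
    "<!DOCTYPE p ["
    '<!ENTITY langle "&#x27E8;">'
    '<!ENTITY loang "&#x27EC;">'
    "]>"
    "<p>&langle;&loang;</p>"
)

# XML-side processing: the parser expands the declared entities.
xml_text = ET.fromstring(XHTML_SOURCE).text

# HTML-side processing: the same names are resolved from the HTML5 table.
html_text = html.unescape("&langle;&loang;")

same_characters = xml_text == html_text
```

When the declarations seen by the XML parser match the HTML definitions, both routes produce the same two characters.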

There are similar problems with all the Identifiers in the list currently specified by HTML. Appendix C of [XMLEntities] lists all the changes between the HTML5 Entity set and the earlier XHTML 1 and MathML2 sets.
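
The kind of difference involved can be illustrated with the entity tables in Python's standard library. This is only a sketch: html.entities.name2codepoint holds the HTML 4 entity set, used here as an assumed stand-in for the XHTML 1 entity sets, and html.entities.html5 holds the HTML5 set. The helper names and the four sample names are chosen for illustration; Appendix C of [XMLEntities] remains the authoritative list.

```python
from html.entities import html5, name2codepoint


def html4_char(name):
    """Character for an HTML 4 entity name, or None if the name is undefined there."""
    code_point = name2codepoint.get(name)
    return None if code_point is None else chr(code_point)


def html5_char(name):
    """Character(s) for an HTML5 entity name, or None if the name is undefined there."""
    return html5.get(name + ";")


def differing_names(names):
    """Names whose definition differs, or is missing, between the two tables."""
    return [name for name in names if html4_char(name) != html5_char(name)]


sample = ["eacute", "lang", "langle", "loang"]
changed = differing_names(sample)
```

In these tables eacute agrees in both sets, lang maps to U+2329 in the HTML 4 table but to U+27E8 in the HTML5 table (a silent change of character), and langle and loang are defined only in the HTML5 table (an undefined entity under the older set).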

Risks

The main risk of adding a new recognised identifier is that older HTML user agents, and other user agents not conforming to this specification, will not recognise it. This is unavoidable, but it is no worse than the current situation in which, as noted in the previous section, the identifiers that are recognised can still lead to undefined entities and, worse, silent data corruption. The list (even as amended by this extension specification) is not compatible with the behaviour of all legacy user agents. A good example of this is the XHTML version of Chapter 2 of the MathML2 Recommendation. This was processed by all relevant user agents at the time of its publication but is rejected as not well formed by user agents conforming to HTML. It would still be rejected by user agents conforming to this extension. However, this extension makes it possible to change the document to use a declaration that works for current XHTML and XML parsers, a possibility that does not exist with HTML as currently published.

It could be noted that a form compatible with this extension specification, and at least some older User Agents such as Firefox 1–3, or Internet Explorer 6+, is:

<!DOCTYPE html PUBLIC
         "-//W3C//ENTITIES HTML MathML Set//EN//XML"
         "mathml.dtd"
       >
where mathml.dtd is a copy of htmlmathml-f.ent as recommended above. Early Firefox (and Mozilla/Netscape) would use their bundled entity declarations for this usage, and Internet Explorer would fetch the specified DTD. Browser behaviour has changed only since the publication of the early drafts of HTML5, so in Firefox 4+ and browsers of a similar era it is not possible to reference an HTML5-compatible set of entity declarations in a way that causes those entities to be defined in an XHTML5 user agent. This was raised as [Bug13409] on the HTML5 specification.