Character Entities: An XML Core WG View

A consensus statement from the XML Core WG as of 2002 October 23

Status

This public document describes the XML Core Working Group consensus opinion on the given topic. Other than agreement within the Working Group, this document has undergone no other review and is not an official W3C publication of any kind.

Abstract

"Character entities" is an informal name for XML internal general entities (whether internally or externally declared) that provide a name for a single Unicode character. Character references, whether decimal or hexadecimal, offer the same power as character entities, but not the same ease of use. Therefore the ability to use character entities is recognized as important. However, there is absolutely no need to introduce a new mechanism into XML to declare them.

Introduction

"Character entities" is an informal name for XML internal general entities (whether internally or externally declared) that provide a name for a single Unicode character. (They may involve more than one Unicode codepoint, as in the case of an unusual combination of base character and combining character such as LATIN CAPITAL LETTER G with COMBINING GRAVE.) They are a valuable feature of hand-created XML. They allow characters that cannot easily be typed to be entered into documents, and proofread and modified as well.

Character references no alternative

Character references, whether decimal or hexadecimal, offer the same power as character entities, but not the same ease of use. The numbers are harder to remember and harder still to proofread. If several character references appear close together, it can be quite hard to tell which one needs modification. A mistyped character entity name causes a well-formedness or validity error, whereas a mistyped character reference typically just introduces the wrong character into the document without any diagnostic.
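The difference in error behavior can be sketched in a few lines of Python. The DTD, the entity name `eacute`, the helper `parse_text`, and the two deliberate typos are all hypothetical. The snippet only illustrates how one non-validating parser (the one behind `xml.etree.ElementTree`) reacts.

```python
import xml.etree.ElementTree as ET

# Hypothetical internal subset declaring one character entity.
DTD = '<!DOCTYPE p [<!ENTITY eacute "&#xE9;">]>'

def parse_text(body):
    """Parse a <p> element using the DTD above and return its text content."""
    return ET.fromstring(f"{DTD}<p>{body}</p>").text

# A mistyped entity name ("eacut") refers to an undeclared entity,
# so the parser rejects the document.
try:
    parse_text("caf&eacut;")
    typo_detected = False
except ET.ParseError:
    typo_detected = True

# A mistyped character reference (E8 instead of E9) is still a legal
# reference, so the document parses and silently contains the wrong letter.
silent_result = parse_text("caf&#xE8;")

print(typo_detected, silent_result)
```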

Existing methods suffice

For these reasons character entities are important. However, there is absolutely no need to introduce a new mechanism into XML to declare them. The existing mechanism, DTDs, is entirely adequate to the purpose. Although some subsets of XML have outlawed DTDs in the name of interoperability, all conforming XML processors (parsers) must be able to recognize at least some DTD information, specifically including the declaration of character entities in the internal subset.

In addition, all but the most limited XML processors can also process the external DTD subset at least to the extent of being able to recognize and act on character entity declarations. At worst, then, the character entities actually used in a given document (generally a small subset of those available) can be declared in the internal subset, where they are fully interoperable across processors.

Scoping, namespacing not required

It is true that entity declarations do not provide for anything analogous to namespaces: they exist in a single global space, and are scoped over the entire document. However, different XML applications such as XHTML and MathML do not need to declare differently named entities for the same characters. Most character names have already been standardized by ISO, and these names should be and are used wherever possible.

Placing lists of character entity declarations in separate files, and then referencing them from the internal subset as external parameter entities, is the appropriate way to specify multiple sets of character entities.
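The arrangement described above might be sketched as follows, using Python's `xml.parsers.expat` module. The file name `isolat1.ent`, the parameter entity name `ISOlat1`, the in-memory dictionary standing in for the file system, and the helper `parse` are assumptions for illustration only. A real setup would read the entity files from disk or resolve them through a catalog, and the result depends on the processor being configured to read external parameter entities.

```python
import xml.parsers.expat as expat

# Hypothetical entity file, kept in memory so the sketch is self-contained.
ENTITY_FILES = {"isolat1.ent": b'<!ENTITY eacute "&#xE9;">'}

DOC = b"""<!DOCTYPE p [
  <!ENTITY % ISOlat1 SYSTEM "isolat1.ent">
  %ISOlat1;
]>
<p>caf&eacute;</p>"""

def parse(doc):
    """Return the character data of doc, loading external parameter entities."""
    chunks = []
    parser = expat.ParserCreate()
    # Ask expat to process parameter entities, including external ones.
    parser.SetParamEntityParsing(expat.XML_PARAM_ENTITY_PARSING_ALWAYS)

    def external_ref(context, base, system_id, public_id):
        # Parse the referenced file with a sub-parser that shares the DTD.
        sub = parser.ExternalEntityParserCreate(context)
        sub.Parse(ENTITY_FILES[system_id], True)
        return 1  # a nonzero result tells expat to continue parsing

    parser.ExternalEntityRefHandler = external_ref
    parser.CharacterDataHandler = chunks.append
    parser.Parse(doc, True)
    return "".join(chunks)

result = parse(DOC)
```

Several such files can be referenced from one internal subset in the same way, one parameter entity per file.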

Local character entities

When it is necessary, local character entity declarations can be devised, and have the appropriate single-document scope. The rules for multiple entity declarations in DTDs (earlier beats later, internal subset beats external subset) suffice for the needs of local names: standard names can be redefined if necessary, and novel names for currently unnamed characters can be added.
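A short sketch of the "earlier beats later" rule, again using Python's `xml.etree.ElementTree`. The entity name `bull` and both replacement characters are hypothetical choices. The second declaration stands in for one that would normally arrive later from a shared entity file.

```python
import xml.etree.ElementTree as ET

DOC = """<!DOCTYPE p [
  <!ENTITY bull "&#x25CF;">  <!-- local declaration, seen first -->
  <!ENTITY bull "&#x2022;">  <!-- later declaration of the same name -->
]>
<p>&bull;</p>"""

# The first declaration is binding; the later one is ignored.
text = ET.fromstring(DOC).text
print(f"U+{ord(text):04X}")  # U+25CF
```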

General character naming

People have sometimes asked for a more general character naming mechanism, equivalent in power to SGML SDATA declarations, allowing for the use of characters that are not encoded in Unicode (either by policy or because the encoding effort has not yet reached them). There is no need for such a facility, because of the Unicode Private Use Area (PUA). This provides a supply of 6400 + 65534 * 2 = 137,468 code points, far more than any application will need (even Egyptian hieroglyphs have only 7000-8000 characters in all).
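The count can be checked directly from the three private-use ranges defined by Unicode (one in the Basic Multilingual Plane, plus planes 15 and 16). The variable names below are illustrative only.

```python
import unicodedata

# The three Unicode private-use ranges, as inclusive (first, last) pairs.
PUA_RANGES = [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]

sizes = [last - first + 1 for first, last in PUA_RANGES]
total = sum(sizes)
print(sizes, total)  # [6400, 65534, 65534] 137468

# The endpoints of every range carry the general category Co (private use).
all_private = all(
    unicodedata.category(chr(cp)) == "Co"
    for first, last in PUA_RANGES
    for cp in (first, last)
)
```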

The appropriate way to make use of such characters is to define character entities which give human-readable names to the PUA characters for use in authoring. The question of supplying machine-understandable semantics for PUA characters is an open research question: the Unicode defaults treat PUA characters like ideographs, since the most frequent use for them is to represent unusual ideographs for use in personal names.
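A hedged sketch of this approach: the entity name `rare-ideograph`, the element name `name`, and the assignment of U+E000 are invented for illustration. Any real assignment of PUA code points is a private agreement between the parties exchanging the document.

```python
import unicodedata
import xml.etree.ElementTree as ET

# Hypothetical document giving a readable name to one PUA code point.
DOC = """<!DOCTYPE name [
  <!ENTITY rare-ideograph "&#xE000;">
]>
<name>&rare-ideograph;</name>"""

# Authors type the readable name; the parsed document carries the code point.
pua_text = ET.fromstring(DOC).text
pua_category = unicodedata.category(pua_text)
print(f"U+{ord(pua_text):04X}", pua_category)  # U+E000 Co
```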

$Revision: 1.1 $ by $Author: PaulGrosso $
$Date: 2002/10/23 18:04:29 $