W3C

XML 1.0 Fourth Edition Specification Errata

Abstract

This document records all known errors in the Fourth Edition of the Extensible Markup Language (XML) 1.0 Specification; for updates see the latest version.

The errata are numbered, classified as Substantive or Editorial, and listed in reverse chronological order of their date of publication in each category. Changes to the text of the spec are indicated thus: deleted text, new text, modified text. Substantive corrections are proposed by the XML Core Working Group, which has consensus that they are appropriate; they are not to be considered normative until approved by a Call for Review of Proposed Corrections or a Call for Review of an Edited Recommendation.

Please email error reports to xml-editor@w3.org.

Substantive errata

Errata as of 2008-01-18

E11

Section 2.2 Characters

In the paragraph following production [2], change the reference to [Unicode3] to point to [Unicode]:

The mechanism for encoding character code points into bit patterns may vary from entity to entity. All XML processors MUST accept the UTF-8 and UTF-16 encodings of Unicode 3.1 [Unicode]; the mechanisms for signaling which of the two is in use, or for bringing other encodings into play, are discussed later, in 4.3.3 Character Encoding in Entities.

Amend the first paragraph of the following Note to read:

Note:

Document authors are encouraged to avoid "compatibility characters", as defined in section 2.3 of [Unicode]. The characters defined in the following ranges are also discouraged. They are either control characters or permanently undefined Unicode characters:

Section 4.2.2 External Entities

In the first item of the numbered list, change the reference to [Unicode3] to point to [Unicode]:

  1. Each character to be escaped is represented in UTF-8 [Unicode] as one or more bytes.

Section 4.3.3 Character Encoding in Entities

Amend the second paragraph to read:

Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY begin with the Byte Order Mark described by Annex H of [ISO10646-2000], section 16.8 of [Unicode] (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF). This is an encoding signature, not part of either the markup or the character data of the XML document. XML processors MUST be able to use this character to differentiate between UTF-8 and UTF-16 encoded documents.

Amend the next-to-last paragraph to read:

It is a fatal error when an XML processor encounters an entity with an encoding that it is unable to process. It is a fatal error if an XML entity is determined (via default, encoding declaration, or higher-level protocol) to be in a certain encoding but contains byte sequences that are not legal in that encoding. Specifically, it is a fatal error if an entity encoded in UTF-8 contains any ill-formed code unit sequences, as defined in section 3.9 of Unicode 3.1 [Unicode]. Unless an encoding is determined by a higher-level protocol, it is also a fatal error if an XML entity contains no encoding declaration and its content is not legal UTF-8 or UTF-16.

Section A.1 Normative References

Amend the [Unicode] entry to read:

Unicode
The Unicode Consortium. The Unicode Standard, Version 5.0.0, defined by: The Unicode Standard, Version 5.0 (Boston, MA, Addison-Wesley, 2007. ISBN 0-321-48091-0).

Remove the [Unicode3] entry.

Unicode3

Rationale
These changes bring XML in line with the most up-to-date version of the Unicode specification.
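
The fatal-error rule for ill-formed UTF-8 in the amended 4.3.3 text can be illustrated with a minimal Python sketch. It assumes that Python's strict `utf-8` codec is an acceptable stand-in for the well-formedness rules of [Unicode]; the function names `decode_utf8_entity` and `is_well_formed_utf8` are hypothetical and do not come from the specification.

```python
def decode_utf8_entity(data: bytes) -> str:
    """Decode entity bytes as UTF-8, treating any ill-formed sequence as fatal.

    Assumption: the strict codec rejects overlong forms, encoded surrogates,
    truncated sequences, and values above #x10FFFF.
    """
    try:
        return data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        # A conforming processor must stop here rather than attempt recovery.
        raise ValueError(
            f"fatal error: ill-formed UTF-8 at byte offset {exc.start}"
        ) from exc


def is_well_formed_utf8(data: bytes) -> bool:
    """Return True when the bytes decode cleanly as UTF-8."""
    try:
        decode_utf8_entity(data)
    except ValueError:
        return False
    return True
```

For instance, the overlong two-byte form `b"\xc0\x80"` and the encoded surrogate `b"\xed\xa0\x80"` are both rejected by this check, while `b"\xc3\xa9"` (the letter é) is accepted.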

E10

Section 2.8 Prolog and Document Type Declaration

Alter production [26] so that it reads:

[26]   VersionNum   ::=   '1.' [0-9]+

Add a new paragraph immediately after production [27] as follows:

Even though the VersionNum production matches any version number of the form '1.x', XML 1.0 documents SHOULD NOT specify a version number other than '1.0'.

Note:

When an XML 1.0 processor encounters a document that specifies a 1.x version number other than '1.0', it will process it as a 1.0 document. This means that an XML 1.0 processor will accept 1.x documents provided they do not use any non-1.0 features.

Rationale
This change effectively reverses a change introduced by erratum E38 to the second edition, in order to allow parsers to attempt forward-compatible processing of post-1.0 documents.
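
As a rough illustration of erratum E10, the following Python sketch matches a declared version number against production [26] and then treats any accepted value as 1.0, as the Note describes. The function name `effective_version` and the use of `ValueError` to signal a rejected version are assumptions made for this example only.

```python
import re

# Production [26]: VersionNum ::= '1.' [0-9]+
# [0-9] is written out explicitly because \d would also match non-ASCII digits.
VERSION_NUM = re.compile(r"1\.[0-9]+")


def effective_version(declared: str) -> str:
    """Return the version an XML 1.0 processor would process the document as."""
    if VERSION_NUM.fullmatch(declared) is None:
        raise ValueError(f"version number {declared!r} does not match VersionNum")
    # Any 1.x document is processed as a 1.0 document.
    return "1.0"
```

Under these assumptions, `effective_version("1.7")` returns `"1.0"`, whereas `effective_version("2.0")` raises an error because the string does not match the production.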

E09

Section 2.3 Common Syntactic Constructs

Delete the following paragraph:

Characters are classified for convenience as letters, digits, or other characters. A letter consists of an alphabetic or syllabic base character or an ideographic character. Full definitions of the specific characters in each class are given in B Character Classes.

Restructure the introduction of Name and Nmtoken to read as follows:

An Nmtoken (name token) is any mixture of name characters.

[Definition: A Name is an Nmtoken with a restricted set of initial characters.] Disallowed initial characters for Names include digits, diacritics, the full stop and the hyphen.

Names beginning with the string "xml", or with any string which would match (('X'|'x') ('M'|'m') ('L'|'l')), are reserved for standardization in this or future versions of this specification.

Note:

The Namespaces in XML Recommendation [XML Names] assigns a meaning to names containing colon characters. Therefore, authors should not use the colon in XML names except for namespace purposes, but XML processors must accept the colon as a name character.

Replace the group of productions [4] to [8], including the "Names and Tokens" heading, with the following:

The first character of a Name MUST be a NameStartChar, and any other characters MUST be NameChars; this mechanism is used to prevent names from beginning with European (ASCII) digits or with basic combining characters. Almost all characters are permitted in names, except those which either are or reasonably could be used as delimiters. The intention is to be inclusive rather than exclusive, so that writing systems not yet encoded in Unicode can be used in XML names. See J Suggestions for XML Names for suggestions on the creation of names.

Document authors are encouraged to use names which are meaningful words or combinations of words in natural languages, and to avoid symbolic or white space characters in names. Note that COLON, HYPHEN-MINUS, FULL STOP (period), LOW LINE (underscore), and MIDDLE DOT are explicitly permitted.

The ASCII symbols and punctuation marks, along with a fairly large group of Unicode symbol characters, are excluded from names because they are more useful as delimiters in contexts where XML names are used outside XML documents; providing this group gives those contexts hard guarantees about what cannot be part of an XML name. The character #x037E, GREEK QUESTION MARK, is excluded because when normalized it becomes a semicolon, which could change the meaning of entity references.

Names and Tokens
[4]   NameStartChar   ::=   ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
[4a]   NameChar   ::=    NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
[5]   Name   ::=    NameStartChar (NameChar)*
[6]   Names   ::=    Name (#x20 Name)*
[7]   Nmtoken   ::=   (NameChar)+
[8]   Nmtokens   ::=    Nmtoken (#x20 Nmtoken)*
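
The productions above translate fairly directly into regular expressions. The following Python sketch transcribes productions [4], [4a], [5], and [7]; the helper names `is_name` and `is_nmtoken` are hypothetical, and the sketch checks only the character inventory, not the reservation of names beginning with "xml".

```python
import re

# Character-class body for NameStartChar, transcribed from production [4].
NAME_START_CHAR = (
    r":A-Z_a-z"
    r"\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF"
    r"\u0370-\u037D\u037F-\u1FFF"
    r"\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF"
    r"\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD"
    r"\U00010000-\U000EFFFF"
)

# NameChar, production [4a]: NameStartChar plus "-", ".", digits, and a few others.
NAME_CHAR = NAME_START_CHAR + r"\-.0-9\u00B7\u0300-\u036F\u203F-\u2040"

NAME = re.compile(f"[{NAME_START_CHAR}][{NAME_CHAR}]*")  # production [5]
NMTOKEN = re.compile(f"[{NAME_CHAR}]+")                  # production [7]


def is_name(text: str) -> bool:
    """True when the whole string matches the Name production."""
    return NAME.fullmatch(text) is not None


def is_nmtoken(text: str) -> bool:
    """True when the whole string matches the Nmtoken production."""
    return NMTOKEN.fullmatch(text) is not None
```

With this sketch, a string such as `"1foo"` is an Nmtoken but not a Name, and a name containing #x037E (GREEK QUESTION MARK) is rejected because that code point falls in the gap between [#x370-#x37D] and [#x37F-#x1FFF].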

Section B Character Classes

Add a new paragraph at the beginning of this Appendix, as follows:

Because of changes to productions [4] and [5], the productions in this Appendix are now orphaned and not used anymore in determining name characters. This Appendix may be removed in a future edition of this specification; other specifications that wish to refer to the productions herein should do so by means of a reference to the relevant production(s) in the Fourth Edition of this specification.


Add a new Appendix as follows:

J Suggestions for XML Names (Non-Normative)

The following suggestions define what is believed to be best practice in the construction of XML names used as element names, attribute names, processing instruction targets, entity names, notation names, and the values of attributes of type ID, and are intended as guidance for document authors and schema designers. All references to Unicode are understood with respect to a particular version of the Unicode Standard greater than or equal to 5.0; which version should be used is left to the discretion of the document author or schema designer.

The first two suggestions are directly derived from the rules given for identifiers in Standard Annex #31 (UAX #31) of the Unicode Standard, version 5.0, and exclude all control characters, enclosing nonspacing marks, non-decimal numbers, private-use characters, punctuation characters (with the noted exceptions), symbol characters, unassigned codepoints, and white space characters. The other suggestions are mostly derived from Appendix B in previous editions of this specification.

  1. The first character of any name SHOULD have a Unicode property of ID_Start, or else be '_' #x5F.

  2. Characters other than the first SHOULD have a Unicode property of ID_Continue, or be one of the characters listed in the table entitled "Characters for Natural Language Identifiers" in UAX #31, with the exception of "'" #x27 and "’" #x2019.

  3. Ideographic characters which have a canonical decomposition (including those in the ranges [#xF900-#xFAFF] and [#x2F800-#x2FFFD], with 12 exceptions) should not be used in names.

  4. Characters which have a compatibility decomposition (those with a "compatibility formatting tag" in field 5 of the Unicode Character Database -- marked by field 5 beginning with a "<") should not be used in names. This suggestion does not apply to #x0E33 THAI CHARACTER SARA AM or #x0EB3 LAO CHARACTER AM, which despite their compatibility decompositions are in regular use in those scripts.

  5. Combining characters meant for use with symbols only (including those in the ranges [#x20D0-#x20EF] and [#x1D165-#x1D1AD]) should not be used in names.

  6. The interlinear annotation characters ([#xFFF9-#xFFFB]) should not be used in names.

  7. Variation selector characters should not be used in names.

  8. Names which are nonsensical, unpronounceable, hard to read, or easily confusable with other names should not be employed.

Rationale

Since XML 1.1 became a W3C Recommendation in August 2006, there has been a substantial uptake of it as a peer of XML 1.0 in new and ongoing W3C work. This is appropriate, as XML 1.1 was explicitly not designed to replace XML 1.0, but to supplement it for the benefit of various groups against which XML 1.0 had unjustly, but unintentionally, discriminated.

However, there are very few XML 1.1 documents in the wild. The XML Core WG believes this to be the result of a vicious circle, in which widely distributed XML parsers do not support 1.1 because the parser authors believe that few document authors will use it. This becomes a self-fulfilling prophecy, as those who would benefit from XML 1.1 are rightfully concerned that documents written in it will not be widely acceptable.

After considering various other means by which to achieve the main goal of XML 1.1, that is, to deliver on XML's original promise of universality across all the world's languages, the XML Core WG has drafted this erratum to change XML 1.0 to relax the restrictions on names, thereby providing in XML 1.0 the major end user benefit currently achievable only by using XML 1.1, and completing the decoupling of XML from specific versions of Unicode.

To quote the XML 1.1 Recommendation:

The W3C's XML 1.0 Recommendation was first issued in 1998, and despite the issuance of many errata culminating in a Third Edition of 2004, has remained (by intention) unchanged with respect to what is well-formed XML and what is not. This stability has been extremely useful for interoperability. However, the Unicode Standard on which XML 1.0 relies for character specifications has not remained static, evolving from version 2.0 to version 4.0 and beyond. Characters not present in Unicode 2.0 may already be used in XML 1.0 character data. However, they are not allowed in XML names such as element type names, attribute names, enumerated attribute values, processing instruction targets, and so on. In addition, some characters that should have been permitted in XML names were not, due to oversights and inconsistencies in Unicode 2.0.
The overall philosophy of names has changed since XML 1.0. Whereas XML 1.0 provided a rigid definition of names, wherein everything that was not permitted was forbidden, XML 1.1 names are designed so that everything that is not forbidden (for a specific reason) is permitted. Since Unicode will continue to grow past version 4.0, further changes to XML can be avoided by allowing almost any character, including those not yet assigned, in names.

Since then, Unicode has continued its efforts to add scripts and characters in order to improve or add support for the world's languages and writing systems. This effort is by no means complete. The changes since the XML 1.0 name character inventory was fixed encompass a variety of additions to the Unicode standard, and include support for:

  • additional scripts, including Ethiopic, Cherokee, Canadian Syllabics, Khmer, Mongolian, Yi, Philippine, New Tai Lue, Buginese, Syloti Nagri, N'Ko, and Tifinagh
  • many additional Han ideographs (used predominantly for Chinese)
  • additional characters for scripts that were incompletely understood at the time 2.0 was released, notably scripts native to South Asia

This change to XML 1.0 relaxes the restrictions on names, used not only for element and attribute names but also identifiers and enumerated attribute values. Those who prefer to retain the constraints on names from the previous version of XML 1.0 in their documents will be free to do so, but those who wish to use names that incorporate these additional characters will be able to do so.

Errata as of 2007-12-05

E06

Section 4.3.3 Character Encoding in Entities

Add a new paragraph following the second paragraph, to read:

If the replacement text of an external entity is to begin with the character U+FEFF, and no text declaration is present, then a Byte Order Mark MUST be present, whether the entity is encoded in UTF-8 or UTF-16.

Rationale
Provide unambiguous behavior by working around the inherent ambiguity of the BOM in Unicode.
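
The interaction between the Byte Order Mark and entity content that E06 addresses can be sketched in Python. This is a simplified illustration only: it ignores text declarations, UCS-4, and the other autodetection cases, and the function names `sniff_bom` and `entity_text` are hypothetical.

```python
import codecs


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) when no BOM is found."""
    for bom, name in (
        (codecs.BOM_UTF8, "utf-8"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
    ):
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def entity_text(data: bytes) -> str:
    """Strip at most one leading BOM as the encoding signature, then decode.

    Assumption: an entity with no BOM and no other information is UTF-8.
    """
    encoding, skip = sniff_bom(data)
    return data[skip:].decode(encoding or "utf-8")
```

Because only the first U+FEFF is consumed as the signature, an entity whose text is meant to begin with U+FEFF has to carry a BOM in front of it: in this sketch, `entity_text(codecs.BOM_UTF8 + codecs.BOM_UTF8 + b"abc")` keeps one U+FEFF as content, while `entity_text(codecs.BOM_UTF8 + b"abc")` does not.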

Errata as of 2007-08-15

E01

Section 1.1 Origin and Goals

Amend the first paragraph after the list of goals, so that it reads:

This specification, together with associated standards (Unicode [Unicode] and ISO/IEC 10646 [ISO10646] for characters, Internet RFC 4646 [RFC1766] and the Language Subtag Registry [IANA-LANGCODES] for language identification tags, ISO 639 [ISO639] for language name codes, and ISO 3166 [ISO3166] for country name codes), provides all the information necessary to understand XML Version 1.0 and construct computer programs to process it.

Section A.1 Normative References

Change the [IETF RFC 3066] entry so that it points to IETF RFC 4646.

Section A.2 Other References

Change the [IANA-LANGCODES] entry so that it points to the new registry at http://www.iana.org/assignments/language-subtag-registry.

Rationale
RFC 3066 has been replaced by RFC 4646. The old registry pointed to by the IANA-LANGCODES entry is now stale and closed. With the new registry, reference to ISO 639 and ISO 3166 is no longer necessary (and may even be harmful in the future, because of stability concerns).

E02

Section 2.2 Characters

Amend the [#xFDD0-#xFDDF] range in the list of discouraged characters to read [#xFDD0-#xFDEF].

Rationale
"#xFDDF" was a typo, as can be ascertained by consulting the Unicode standard in which the whole [#xFDD0-#xFDEF] range was introduced as a block in one version (3.1) to serve as "non-characters" for internal use.
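
A small Python check makes the size of the correction in E02 concrete. The constant and function names below are hypothetical; the sketch covers only this one discouraged range, not the full list in section 2.2.

```python
# Corrected range from erratum E02; the misprinted upper bound was #xFDDF.
BLOCK_FIRST = 0xFDD0
BLOCK_LAST = 0xFDEF
MISPRINTED_LAST = 0xFDDF


def in_discouraged_block(ch: str) -> bool:
    """True when a single character lies in [#xFDD0-#xFDEF]."""
    return BLOCK_FIRST <= ord(ch) <= BLOCK_LAST


# The corrected range spans 32 code points; the misprinted one covered only 16.
CORRECTED_SIZE = BLOCK_LAST - BLOCK_FIRST + 1
MISPRINTED_SIZE = MISPRINTED_LAST - BLOCK_FIRST + 1
```

A character such as #xFDE0 is therefore inside the corrected range but was outside the range as originally printed.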

Editorial errata

Errata as of 2008-01-18

E08

Section 4.1 Character and Entity References

Change the first sentence of the text of the Entity Declared VC as follows:

In a document with an external subset or parameter entity references, if the document is not standalone (either "standalone='no'" is specified or there is no standalone declaration), then the Name given in the entity reference MUST match that in an entity declaration.

Rationale
The existing wording was ambiguous and did not explicitly address the case of an absent standalone declaration.

Errata as of 2007-12-05

E07

Section D Expansion of Entity and Character References

Append the following at the end of the Appendix:

In the following example

<!DOCTYPE foo [ 
<!ENTITY x "&lt;"> 
]> 
<foo attr="&x;"/>

the replacement text of x is the four characters "&lt;" because references to general entities in entity values are bypassed. The replacement text of lt is a character reference to the less-than character, for example the five characters "&#60;" (see 4.6 Predefined Entities). Since neither of these contains a less-than character the result is well-formed.

If the definition of x had been

<!ENTITY x "&#60;">

then the document would not have been well-formed, because the replacement text of x would be the single character "<" which is not permitted in attribute values (see WFC: No < in Attribute Values).

Rationale
This is an editorial clarification of a case that remained confusing.
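
The two cases in E07 can be tried against a real parser. The sketch below uses Python's standard `xml.etree.ElementTree` module, which is backed by expat; the expectation that it accepts the first document and rejects the second follows from the text above and is an assumption about that parser, not a statement from the specification. The names `OK_DOC`, `BAD_DOC`, `attr_value`, and `is_well_formed` are hypothetical.

```python
import xml.etree.ElementTree as ET

# First case: the replacement text of x is the four characters "&lt;".
OK_DOC = '<!DOCTYPE foo [\n<!ENTITY x "&lt;">\n]>\n<foo attr="&x;"/>'

# Second case: the replacement text of x is the single character "<".
BAD_DOC = '<!DOCTYPE foo [\n<!ENTITY x "&#60;">\n]>\n<foo attr="&x;"/>'


def attr_value(doc: str) -> str:
    """Parse the document and return the normalized value of attr."""
    return ET.fromstring(doc).get("attr")


def is_well_formed(doc: str) -> bool:
    """True when the parser accepts the document without a ParseError."""
    try:
        ET.fromstring(doc)
    except ET.ParseError:
        return False
    return True
```

In the first case the attribute value seen by the application should be a less-than character, since `&lt;` is expanded only when the attribute value is normalized, after the well-formedness check on the replacement text of `x`.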

E05

Section 4.3.3 Character Encoding in Entities

Change the second sentence of the first paragraph to read:

The terms "UTF-8" and "UTF-16" in this specification do not apply to related character encodings, including but not limited to UTF-16BE, UTF-16LE, or CESU-8.

Section F Autodetection of Character Encodings

Change the last sentence of the first paragraph to read:

We will consider these cases in turn.

Rationale
The former reading in 4.3.3 still caused some confusion, especially with respect to UTF-16BE and UTF-16LE.

E04

Section 5.2 Using XML Processors

Amend the last sentence of the second item of the bulleted list so that it reads:

For example, a non-validating processor may fail to normalize attribute values, include the replacement text of internal entities, or supply default attribute values, where doing so depends on having read declarations in external or parameter entities, or in the internal subset after an unread parameter entity reference.

Rationale
Improve the informativeness of the example sentence.

Errata as of 2007-09-25

E03

Section 4.1 Character and Entity References

Remove a duplicate "to" from the last paragraph of the description of the "Entity Declared" WFC, so that it reads:

Note that non-validating processors are not obligated to read and process entity declarations occurring in parameter entities or in the external subset; for such documents, the rule that an entity must be declared is a well-formedness constraint only if standalone='yes'.

Rationale
This was a typo.

Last updated $Date: 2008/11/18 16:33:50 $ by $Author: ht $

xml-editor