W3C

XML 1.0 Fourth Edition Specification Errata

Abstract

This document records all known errors in the Fourth Edition of the Extensible Markup Language (XML) 1.0 Specification; for updates, see the latest version.

The errata are numbered, classified as Substantive or Editorial, and listed in reverse chronological order of their date of publication in each category. Changes to the text of the spec are indicated thus: deleted text, new text, modified text. Substantive corrections are proposed by the XML Core Working Group, which has consensus that they are appropriate; they are not to be considered normative until approved by a Call for Review of Proposed Corrections or a Call for Review of an Edited Recommendation.

Please email error reports to xml-editor@w3.org.

Substantive errata

Errata as of 2008-01-18

E11

Section 2.2 Characters

In the paragraph following production [2], change the reference to [Unicode3] to point to [Unicode]:

The mechanism for encoding character code points into bit patterns may vary from entity to entity. All XML processors MUST accept the UTF-8 and UTF-16 encodings of Unicode 3.1 [Unicode]; the mechanisms for signaling which of the two is in use, or for bringing other encodings into play, are discussed later, in 4.3.3 Character Encoding in Entities.

Amend the first paragraph of the following Note to read:

Note:

Document authors are encouraged to avoid "compatibility characters", as defined in section 2.3 of [Unicode] (see also D21 in section 3.6 of [Unicode3]). The characters defined in the following ranges are also discouraged. They are either control characters or permanently undefined Unicode characters:

Section 4.2.2 External Entities

In the first item of the numbered list, change the reference to [Unicode3] to point to [Unicode]:

  1. Each character to be escaped is represented in UTF-8 [Unicode] as one or more bytes.

Section 4.3.3 Character Encoding in Entities

Amend the second paragraph to read:

Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY begin with the Byte Order Mark described by Annex H of [ISO10646-2000], section 16.8 of [Unicode], and section 2.7 of [Unicode3] (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF). This is an encoding signature, not part of either the markup or the character data of the XML document. XML processors MUST be able to use this character to differentiate between UTF-8 and UTF-16 encoded documents.
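
As an illustration of the last sentence, the following is a minimal Python sketch of how a processor might use a leading Byte Order Mark to tell UTF-8 from UTF-16. The function name sniff_bom is hypothetical and not part of any specification; the sketch ignores encoding declarations and higher-level protocols, which a real processor would also have to consult.

```python
import codecs
from typing import Optional


def sniff_bom(data: bytes) -> Optional[str]:
    """Return "utf-8" or "utf-16" if the entity starts with a BOM, else None.

    Hypothetical helper: it only looks at the encoding signature and does
    not examine any encoding declaration.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    # Either byte order of the UTF-16 signature identifies UTF-16.
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "utf-16"
    return None
```

An entity without a signature yields None, in which case the processor would fall back on the other mechanisms described in this section.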

Amend the next-to-last paragraph to read:

It is a fatal error when an XML processor encounters an entity with an encoding that it is unable to process. It is a fatal error if an XML entity is determined (via default, encoding declaration, or higher-level protocol) to be in a certain encoding but contains byte sequences that are not legal in that encoding. Specifically, it is a fatal error if an entity encoded in UTF-8 contains any ill-formed code unit sequences, as defined in section 3.9 of Unicode 3.1 [Unicode]. Unless an encoding is determined by a higher-level protocol, it is also a fatal error if an XML entity contains no encoding declaration and its content is not legal UTF-8 or UTF-16.
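
The UTF-8 rule in this paragraph can be sketched in Python with the built-in strict UTF-8 decoder, which rejects ill-formed sequences such as overlong forms and encoded surrogates. The function name check_utf8 and the use of ValueError to stand in for a fatal error are assumptions made for illustration only.

```python
def check_utf8(data: bytes) -> str:
    """Decode an entity as UTF-8, treating ill-formed input as a fatal error.

    Hypothetical helper: a real processor would report the error through
    its own error-handling mechanism rather than ValueError.
    """
    try:
        return data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        raise ValueError(
            f"fatal error: ill-formed UTF-8 at byte offset {exc.start}"
        ) from exc
```

The point of the sketch is that the processor stops instead of substituting replacement characters, which is what errors="strict" expresses.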

Section A.1 Normative References

Amend the [Unicode] entry to read:

Unicode
The Unicode Consortium. The Unicode Standard, Version 5.0.0, defined by: The Unicode Standard, Version 5.0 (Boston, MA, Addison-Wesley, 2007. ISBN 0-321-48091-0).

Remove the [Unicode3] entry.

Unicode3

E10

Section 2.8 Prolog and Document Type Declaration

Alter production [26] so that it reads:

[26]   VersionNum   ::=   '1.' [0-9]+

Add a new paragraph immediately after production [27] as follows:

Even though the VersionNum production matches any version number of the form '1.x', XML 1.0 documents SHOULD NOT specify a version number other than '1.0'.

Note:

When an XML 1.0 processor encounters a document that specifies a 1.x version number other than '1.0', it will process it as a 1.0 document. This means that an XML 1.0 processor will accept 1.x documents provided they do not use any non-1.0 features.
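
The amended production [26] and the behaviour described in the Note can be sketched as follows. This is a minimal Python illustration; the names is_version_num and effective_version are hypothetical.

```python
import re

# Production [26]: VersionNum ::= '1.' [0-9]+
# [0-9] is spelled out because \d would also match non-ASCII digits.
VERSION_NUM = re.compile(r"1\.[0-9]+")


def is_version_num(value: str) -> bool:
    """True if value matches the VersionNum production."""
    return VERSION_NUM.fullmatch(value) is not None


def effective_version(value: str) -> str:
    """Version an XML 1.0 processor processes the document as.

    Any 1.x version number is accepted and treated as 1.0; anything else
    does not match the production and is rejected here with ValueError.
    """
    if not is_version_num(value):
        raise ValueError(f"not a VersionNum: {value!r}")
    return "1.0"
```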


E09

Section 2.3 Common Syntactic Constructs

Delete the following paragraph:

Characters are classified for convenience as letters, digits, or other characters. A letter consists of an alphabetic or syllabic base character or an ideographic character. Full definitions of the specific characters in each class are given in B Character Classes.

Replace the group of productions [4] to [8], including the "Names and Tokens" heading, with the following:

The first character of a Name MUST be a NameStartChar, and any other characters MUST be NameChars; this mechanism is used to prevent names from beginning with European (ASCII) digits or with basic combining characters. Almost all characters are permitted in names, except those which either are or reasonably could be used as delimiters. The intention is to be inclusive rather than exclusive, so that writing systems not yet encoded in Unicode can be used in XML names. See J Suggestions for XML Names for suggestions on the creation of names.

Document authors are encouraged to use names which are meaningful words or combinations of words in natural languages, and to avoid symbolic or white space characters in names. Note that COLON, HYPHEN-MINUS, FULL STOP (period), LOW LINE (underscore), and MIDDLE DOT are explicitly permitted.

The ASCII symbols and punctuation marks, along with a fairly large group of Unicode symbol characters, are excluded from names because they are more useful as delimiters in contexts where XML names are used outside XML documents; providing this group gives those contexts hard guarantees about what cannot be part of an XML name. The character #x037E, GREEK QUESTION MARK, is excluded because when normalized it becomes a semicolon, which could change the meaning of entity references.

Names and Tokens
[4]   NameStartChar   ::=   ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
[4a]   NameChar   ::=    NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
[5]   Name   ::=    NameStartChar (NameChar)*
[6]   Names   ::=    Name (#x20 Name)*
[7]   Nmtoken   ::=   (NameChar)+
[8]   Nmtokens   ::=    Nmtoken (#x20 Nmtoken)*
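
Productions [4], [4a], [5], and [7] translate fairly directly into regular expressions. The following Python sketch copies the code point ranges from the productions above; the helper names (_char_class, is_name, is_nmtoken) are hypothetical and chosen only for illustration.

```python
import re

# Code point ranges from production [4] (NameStartChar).
NAME_START_RANGES = [
    (0x3A, 0x3A), (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A),
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF), (0x370, 0x37D),
    (0x37F, 0x1FFF), (0x200C, 0x200D), (0x2070, 0x218F),
    (0x2C00, 0x2FEF), (0x3001, 0xD7FF), (0xF900, 0xFDCF),
    (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]
# Ranges that production [4a] (NameChar) adds to NameStartChar:
# "-" and ".", [0-9], #xB7, [#x0300-#x036F], [#x203F-#x2040].
NAME_EXTRA_RANGES = [
    (0x2D, 0x2E), (0x30, 0x39), (0xB7, 0xB7),
    (0x300, 0x36F), (0x203F, 0x2040),
]


def _char_class(ranges):
    """Build a regex character class from (low, high) code point pairs."""
    body = "".join(f"\\U{lo:08X}-\\U{hi:08X}" for lo, hi in ranges)
    return f"[{body}]"


_START = _char_class(NAME_START_RANGES)
_CHAR = _char_class(NAME_START_RANGES + NAME_EXTRA_RANGES)

NAME = re.compile(f"{_START}{_CHAR}*")   # production [5]
NMTOKEN = re.compile(f"{_CHAR}+")        # production [7]


def is_name(value: str) -> bool:
    return NAME.fullmatch(value) is not None


def is_nmtoken(value: str) -> bool:
    return NMTOKEN.fullmatch(value) is not None
```

For example, is_name("1abc") is False because a digit is not a NameStartChar, while is_nmtoken("1abc") is True.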

Section B Character Classes

Replace the entire Appendix with the following


Add a new Appendix as follows:

J Suggestions for XML Names (Non-Normative)

The following suggestions define what is believed to be best practice in the construction of XML names used as element names, attribute names, processing instruction targets, entity names, notation names, and the values of attributes of type ID, and are intended as guidance for document authors and schema designers. All references to Unicode are understood with respect to a particular version of the Unicode Standard greater than or equal to 5.0; which version should be used is left to the discretion of the document author or schema designer.

The first two suggestions are directly derived from the rules given for identifiers in Standard Annex #31 (UAX #31) of the Unicode Standard, version 5.0, and exclude all control characters, enclosing nonspacing marks, non-decimal numbers, private-use characters, punctuation characters (with the noted exceptions), symbol characters, unassigned codepoints, and white space characters. The other suggestions are mostly derived from Appendix B in previous editions of this specification.

  1. The first character of any name SHOULD have a Unicode property of ID_Start, or else be '_' #x5F.

  2. Characters other than the first SHOULD have a Unicode property of ID_Continue, or be one of the characters listed in the table entitled "Characters for Natural Language Identifiers" in UAX #31, with the exception of "'" #x27 and "’" #x2019.

  3. Ideographic characters which have a canonical decomposition (including those in the ranges [#xF900-#xFAFF] and [#x2F800-#x2FFFD], with 12 exceptions) should not be used in names.

  4. Characters which have a compatibility decomposition (those with a "compatibility formatting tag" in field 5 of the Unicode Character Database -- marked by field 5 beginning with a "<") should not be used in names. This suggestion does not apply to #x0E33 THAI CHARACTER SARA AM or #x0EB3 LAO CHARACTER AM, which despite their compatibility decompositions are in regular use in those scripts.

  5. Combining characters meant for use with symbols only (including those in the ranges [#x20D0-#x20EF] and [#x1D165-#x1D1AD]) should not be used in names.

  6. The interlinear annotation characters ([#xFFF9-#xFFFB]) should not be used in names.

  7. Variation selector characters should not be used in names.

  8. Names which are nonsensical, unpronounceable, hard to read, or easily confusable with other names should not be employed.
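
Some of these suggestions can be checked mechanically. The following Python sketch covers only suggestions 4, 5, and 6, and for suggestion 5 only the two ranges listed explicitly above, which are not exhaustive. It relies on the standard unicodedata module, whose data reflects whichever Unicode version ships with the Python interpreter in use; the function name discouraged is hypothetical.

```python
import unicodedata

# Suggestion 4 exceptions: THAI CHARACTER SARA AM and LAO CHARACTER AM.
ALLOWED_COMPAT = {0x0E33, 0x0EB3}
# Suggestion 5: only the ranges named in the text (not exhaustive).
SYMBOL_COMBINING = [(0x20D0, 0x20EF), (0x1D165, 0x1D1AD)]
# Suggestion 6: interlinear annotation characters.
INTERLINEAR = (0xFFF9, 0xFFFB)


def discouraged(name: str):
    """Return (character, suggestion number) pairs for flagged characters."""
    findings = []
    for ch in name:
        cp = ord(ch)
        # A compatibility decomposition starts with a "<tag>" in field 5.
        if unicodedata.decomposition(ch).startswith("<") and cp not in ALLOWED_COMPAT:
            findings.append((ch, 4))
        elif any(lo <= cp <= hi for lo, hi in SYMBOL_COMBINING):
            findings.append((ch, 5))
        elif INTERLINEAR[0] <= cp <= INTERLINEAR[1]:
            findings.append((ch, 6))
    return findings
```

Suggestions 1 and 2 need the ID_Start and ID_Continue properties, which unicodedata does not expose directly, so they are left out of this sketch.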

Errata as of 2007-12-05

E06

Section 4.3.3 Character Encoding in Entities

Add a new paragraph following the second paragraph, to read:

If the replacement text of an external entity is to begin with the character U+FEFF, and no text declaration is present, then a Byte Order Mark MUST be present, whether the entity is encoded in UTF-8 or UTF-16.

Rationale
Provide unambiguous behavior by working around the inherent ambiguity of the BOM in Unicode.
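
The situation E06 addresses can be illustrated in Python with the standard utf-8-sig codec, which writes one signature when encoding and strips at most one when decoding. This is only a sketch for the UTF-8 case with no text declaration; the function names are hypothetical.

```python
def serialize_entity(replacement_text: str) -> bytes:
    """Encode an external entity as UTF-8 with a leading Byte Order Mark."""
    return replacement_text.encode("utf-8-sig")


def parse_entity(data: bytes) -> str:
    """Decode a UTF-8 entity, removing a single leading BOM if present."""
    return data.decode("utf-8-sig")


# Replacement text that itself begins with U+FEFF.
text = "\ufeffabc"
encoded = serialize_entity(text)
# The first EF BB BF is the encoding signature, the second is the
# U+FEFF character that belongs to the replacement text.
assert encoded == b"\xef\xbb\xbf\xef\xbb\xbfabc"
assert parse_entity(encoded) == text
```

Without the explicit signature, the leading U+FEFF would itself be read as a Byte Order Mark and dropped from the replacement text, which is the ambiguity the rationale refers to.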

Errata as of 2007-08-15

E01

Section 1.1 Origin and Goals

Amend the first paragraph after the list of goals, so that it reads:

This specification, together with associated standards (Unicode [Unicode] and ISO/IEC 10646 [ISO10646] for characters, Internet RFC 4646 [RFC1766] and the Language Subtag Registry [IANA-LANGCODES] for language identification tags, ISO 639 [ISO639] for language name codes, and ISO 3166 [ISO3166] for country name codes), provides all the information necessary to understand XML Version 1.0 and construct computer programs to process it.

Section A.1 Normative References

Change the [IETF RFC 3066] entry so that it points to IETF RFC 4646.

Section A.2 Other References

Change the [IANA-LANGCODES] entry so that it points to the new registry at http://www.iana.org/assignments/language-subtag-registry.

Rationale
RFC 3066 has been replaced by RFC 4646. The old registry pointed to by the IANA-LANGCODES entry is now stale and closed. With the new registry, reference to ISO 639 and ISO 3166 is no longer necessary (and may even be harmful in the future, because of stability concerns).

E02

Section 2.2 Characters

Amend the [#xFDD0-#xFDDF] range in the list of discouraged characters to read [#xFDD0-#xFDEF].

Rationale
"#xFDDF" was a typo, as can be ascertained by consulting the Unicode standard in which the whole [#xFDD0-#xFDEF] range was introduced as a block in one version (3.1) to serve as "non-characters" for internal use.

Editorial errata

Errata as of 2008-01-18

E08

Section 4.1 Character and Entity References

Change the first sentence of the text of the Entity Declared VC as follows:

In a document with an external subset or parameter entity references, if the document is not standalone (either "standalone='no'" is specified or there is no standalone declaration), then the Name given in the entity reference MUST match that in an entity declaration.

Rationale
The existing wording was ambiguous and did not explicitly address the case of an absent standalone declaration.

Errata as of 2007-12-05

E07

Section D Expansion of Entity and Character References

Append the following at the end of the Appendix:

In the following example

<!DOCTYPE foo [ 
<!ENTITY x "&lt;"> 
]> 
<foo attr="&x;"/>

the replacement text of x is the four characters "&lt;" because references to general entities in entity values are bypassed. The replacement text of lt is a character reference to the less-than character, for example the five characters "&#60;" (see 4.6 Predefined Entities). Since neither of these contains a less-than character the result is well-formed.

If the definition of x had been

<!ENTITY x "&#60;">

then the document would not have been well-formed, because the replacement text of x would be the single character "<" which is not permitted in attribute values (see WFC: No < in Attribute Values).

Rationale
This is an editorial clarification of a case that remained confusing.
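
The two cases can be tried against an existing parser. The sketch below uses Python's standard xml.etree.ElementTree module, which is backed by the Expat parser; it assumes that parser expands internal entities in attribute values as described above. The helper name is_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET

# First example: x is defined via the predefined entity reference.
DOC_WITH_LT = '<!DOCTYPE foo [\n<!ENTITY x "&lt;">\n]>\n<foo attr="&x;"/>'

# Second example: x is defined via a character reference, so its
# replacement text is a literal "<".
DOC_WITH_CHAR_REF = '<!DOCTYPE foo [\n<!ENTITY x "&#60;">\n]>\n<foo attr="&x;"/>'


def is_well_formed(document: str) -> bool:
    """True if the parser accepts the document, False on a parse error."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False
```

Under that assumption, the first document should be accepted, with the attribute value becoming the single character "<", and the second should be rejected.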

E05

Section 4.3.3 Character Encoding in Entities

Change the second sentence of the first paragraph to read:

The terms "UTF-8" and "UTF-16" in this specification do not apply to related character encodings, including but not limited to UTF-16BE, UTF-16LE, or CESU-8.

Section F Autodetection of Character Encodings

Change the last sentence of the first paragraph to read:

We will consider these cases in turn.

Rationale
The former reading in 4.3.3 still caused some confusion, especially with respect to UTF-16BE and UTF-16LE.

E04

Section 5.2 Using XML Processors

Amend the last sentence of the second item of the bulleted list so that it reads:

For example, a non-validating processor may fail to normalize attribute values, include the replacement text of internal entities, or supply default attribute values, where doing so depends on having read declarations in external or parameter entities, or in the internal subset after an unread parameter entity reference.

Rationale
Improve the informativeness of the example sentence.

Errata as of 2007-09-25

E03

Section 4.1 Character and Entity References

Remove a duplicate "to" from the last paragraph of the description of the "Entity Declared" WFC, so that it reads:

Note that non-validating processors are not obligated to read and process entity declarations occurring in parameter entities or in the external subset; for such documents, the rule that an entity must be declared is a well-formedness constraint only if standalone='yes'.

Rationale
This was a typo.

Last updated $Date: 2008/01/18 18:25:05 $ by $Author: jigsaw $

xml-editor