This document records all known errors in the Second Edition of the Extensible Markup Language (XML) 1.0 Specification; for updates see the latest version.
The errata are numbered, classified as Substantive, Editorial or Clarification and listed in reverse chronological order of their date of publication. Changes to the text of the spec are indicated thus: deleted text, new text, modified text.
Please email error reports to xml-editor@w3.org.
Change the fourth paragraph so that it reads:
In the content of elements, character data is any string of characters which does not contain the start-delimiter of any markup and does not include the CDATA-section-close delimiter, "]]>". In a CDATA section, character data is any string of characters not including the CDATA-section-close delimiter, "]]>".
"]]>" in the content of elements. The clarification fixes what is believed to be an oversight in the fourth paragraph.
Augment the last sentence of the last paragraph before production [30] so that it reads:
However, portions of the contents of the external subset or of these external parameter entities MAY conditionally be ignored by using the conditional section construct; this is not allowed in the internal subset but is allowed in external parameter entities referenced in the internal subset.
Change the first paragraph to read:
[Definition: Conditional sections are portions of the document type declaration external subset or of external parameter entities which are included in, or excluded from, the logical structure of the DTD based on the keyword which governs them.]
Add the following note immediately after production [3]:
Note: The presence of #xD in the above production is maintained purely for backward compatibility with the First Edition. As explained in 2.11 End-of-Line Handling, all #xD characters literally present in an XML document are either removed or replaced by #xA characters before any other processing is done. The only way to get a #xD character to match this production is to use a character reference in an entity value literal.
Add the following paragraph at the end of the section:
Note that when processing invalid documents with a non-validating processor the application may not be presented with consistent information. For example, several requirements for uniqueness within the document may not be met, including more than one element with the same id, duplicate declarations of elements or notations with the same name, etc. In these cases the behavior of the parser with respect to reporting such information to the application is undefined.
Amend the first paragraph after the example declarations so that it reads:
The value "default" signals that applications' default white-space processing modes are acceptable for this element; the value "preserve" indicates the intent that applications preserve all the white space. This declared intent is considered to apply to all elements within the content of the element where it is specified, unless [E13] overridden with another instance of the xml:space attribute. This specification does not give meaning to any value of xml:space other than "default" and "preserve". It is an error for other values to be specified; the XML processor MAY report the error or MAY recover by ignoring the attribute specification or by reporting the (erroneous) value to the application. Applications may ignore or reject erroneous values.
Modify the last sentence of the paragraph immediately before the first note, so that it reads:
Names beginning with the string "xml", or with any string which would match (('X'|'x') ('M'|'m') ('L'|'l')), are reserved for standardization in this or future versions of this specification.
In the table of required processor behavior, change the entry for "Reference in EntityValue" to an "Unparsed" entity from "Forbidden" to "Error".
Augment the first item in the bullet list in section 4.4.4 to read:
Add a new subsection as follows:
4.4.9 Error
It is an error for a reference to an unparsed entity to appear in the EntityValue in an entity declaration.
Augment the "Unicode3" entry so that it reads:
Remove the "Berners-Lee et al." entry.
Expand the [ISO/IEC 10646] bibliographic entry so that it reads:
Change the URL for the [IANA-CHARSETS] entry to http://www.iana.org/assignments/character-sets.
Change the URL for the [IANA-LANGCODES] entry to http://www.iana.org/assignments/language-tags.
Change the second paragraph to read:
To simplify the tasks of applications, the XML processor must behave as if it normalized all line breaks in external parsed entities (including the document entity) on input, before parsing, by translating both the two-character sequence #xD #xA and any #xD that is not followed by #xA to a single #xA character.
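The normalization rule above can be sketched in a few lines. This is an illustrative sketch only, not a normative algorithm; the function name `normalize_line_breaks` is an assumption made for this example.

```python
def normalize_line_breaks(text: str) -> str:
    """Translate each #xD #xA pair and each lone #xD to a single #xA."""
    # Handle the two-character sequence first; otherwise its #xD would be
    # converted on its own and the pair would become two #xA characters.
    return text.replace("\r\n", "\n").replace("\r", "\n")


print(repr(normalize_line_breaks("a\r\nb\rc\nd")))
```

A processor only has to behave as if this translation happened before parsing; it need not be a separate pass over the input.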
Add a note after the first paragraph following the first example:
Note:
Language information may also be provided by external transport protocols (e.g. HTTP or MIME). When available, this information may be used by XML applications, but the more local information provided by xml:lang should be considered to override it.
Change item #4 in the numbered list at the end of the section to read:
The declaration matches ANY, and the content (after replacing any entity references with their replacement text) consists of character data and child elements whose types have been declared.
Augment the last sentence of the last paragraph so that it reads:
Except when standalone="yes", they must not process entity declarations or attribute-list declarations encountered after a reference to a parameter entity that is not read, since the entity may have contained overriding declarations; when standalone="yes", processors must process these declarations.
Delete the last sentence of the first paragraph (the sentence starting 'The use of "compatibility characters",...').
At the end of the section, add the following:
Note:
Document authors are encouraged to avoid "compatibility characters", as defined in section 6.8 of [Unicode] (see also D21 in section 3.6 of [Unicode3]). The characters defined in the following ranges are also discouraged. They are either control characters or permanently undefined Unicode characters:
[#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDDF], [#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF], [#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF], [#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF], [#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF], [#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF], [#x10FFFE-#x10FFFF].
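As an illustration, the discouraged ranges listed in the note could be checked as follows. This is a minimal sketch: the names `DISCOURAGED_RANGES` and `is_discouraged` are assumptions for this example, and the check covers only the explicit ranges, not the "compatibility characters" the note also mentions.

```python
# Explicit ranges from the note: control characters and #xFDD0-#xFDDF.
DISCOURAGED_RANGES = [(0x7F, 0x84), (0x86, 0x9F), (0xFDD0, 0xFDDF)]
# The last two code points of planes 1 through 16
# (#x1FFFE-#x1FFFF up to #x10FFFE-#x10FFFF).
DISCOURAGED_RANGES += [
    ((plane << 16) | 0xFFFE, (plane << 16) | 0xFFFF) for plane in range(1, 0x11)
]


def is_discouraged(ch: str) -> bool:
    """Return True if the single character ch falls in a listed range."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in DISCOURAGED_RANGES)


print(is_discouraged("\x7f"), is_discouraged("\x85"), is_discouraged("A"))
```

Note that #x85 is deliberately absent from the ranges, so the sketch reports it as not discouraged.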
Modify the paragraph introduced by E10 so that it reads:
It is an error if an attribute value contains a reference to an entity for which no declaration has been read. This can happen only when a non-validating processor is being used.
Change the last sentence of the first paragraph so that it reads:
The validity constraints noted in the grammar are applied after the attribute value has been normalized as described in 3.3.3 Attribute-Value Normalization.
Rewrite the paragraph beginning "[Definition: The SystemLiteral is called the entity's system identifier.", the following paragraph and the following numbered list, so that they read:
[Definition: The SystemLiteral is called the entity's system identifier. It is meant to be converted to a URI reference (as defined in [IETF RFC 2396], updated by [IETF RFC 2732]), as part of the process of dereferencing it to obtain input for the XML processor to construct the entity's replacement text.] It is an error for a fragment identifier (beginning with a # character) to be part of a system identifier. Unless otherwise provided by information outside the scope of this specification (e.g. a special XML element type defined by a particular DTD, or a processing instruction defined by a particular application specification), relative URIs are relative to the location of the resource within which the entity declaration occurs. This is defined to be the external entity containing the '<' which starts the declaration, at the point when it is parsed as a declaration. A URI might thus be relative to the document entity, to the entity containing the external DTD subset, or to some other external parameter entity. Attempts to retrieve the resource identified by a URI may be redirected at the parser level (for example, in an entity resolver) or below (at the protocol level, for example, via an HTTP Location: header). In the absence of additional information outside the scope of this specification within the resource, the base URI of a resource is always the URI of the actual resource returned. In other words, it is the URI of the resource retrieved after all redirection has occurred.
System identifiers (and other XML strings meant to be used as URI references) may contain characters that, according to [IETF RFC 2396] and [IETF RFC 2732], must be escaped before a URI can be used to retrieve the referenced resource. The characters to be escaped are the control characters #x0 to #x1F and #x7F (most of which cannot appear in XML), space #x20, the delimiters '<' #x3C, '>' #x3E and '"' #x22, the unwise characters '{' #x7B, '}' #x7D, '|' #x7C, '\' #x5C, '^' #x5E and '`' #x60, as well as all characters above #x7F. Since escaping is not always a fully reversible process, it must be performed only when absolutely necessary and as late as possible in a processing chain. In particular, neither the process of converting a relative URI to an absolute one nor the process of passing a URI reference to a process or software component responsible for dereferencing it should trigger escaping. When escaping does occur, it must be performed as follows:
Each disallowed character to be escaped is represented in UTF-8 [Unicode3] as one or more bytes.
The resulting bytes are escaped with the URI escaping mechanism (that is, converted to %HH, where HH is the hexadecimal notation of the byte value).
The original character is replaced by the resulting character sequence.
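The three escaping steps above can be sketched as follows. This is an illustrative sketch, not a reference implementation: the function name `escape_system_identifier` is hypothetical, and the use of upper-case hexadecimal digits is an assumption, since the text does not specify the case.

```python
# ASCII characters the text names as requiring escaping, in addition to
# the control characters and everything above #x7F.
_ALWAYS_ESCAPE = set(' <>"{}|\\^`')


def escape_system_identifier(sysid: str) -> str:
    """Percent-escape only the disallowed characters of a system identifier."""
    out = []
    for ch in sysid:
        cp = ord(ch)
        if cp <= 0x1F or cp >= 0x7F or ch in _ALWAYS_ESCAPE:
            # Step 1: represent the character as UTF-8 bytes.
            # Step 2: convert each byte to %HH.
            # Step 3: the result replaces the original character.
            out.append("".join(f"%{b:02X}" for b in ch.encode("utf-8")))
        else:
            out.append(ch)
    return "".join(out)


print(escape_system_identifier("my file.dtd"))
```

The '%' character itself is left untouched, so escapes already present in the identifier are not escaped a second time. In keeping with the text, such a function would be applied as late as possible, immediately before the URI is dereferenced.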
In the first paragraph, change "...should react..." to "...is to react...".
Change the last sentence of the second paragraph to read:
When an XML processor encounters an element without a specification for an attribute for which it has read a default value declaration, it must report the attribute with the declared default value to the application.
Modify the last sentence of the first paragraph so that it reads:
The values of the attribute are language identifiers as defined by [IETF RFC 3066], Tags for the Identification of Languages, or its successor; in addition, the empty string is allowed.
Append the following to the paragraph immediately following the first example:
In particular, the empty value of xml:lang is used on an element B to override a specification of xml:lang on an enclosing element A, without specifying another language. Within B, it is considered that there is no language information available, just as if xml:lang had not been specified on B or any of its ancestors.
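The override behaviour described above can be illustrated with a short sketch that computes the language in effect for each element. The function name `effective_languages` and the sample document are assumptions for this example; the result is keyed by element name only because the sample uses each name once.

```python
import xml.etree.ElementTree as ET

# ElementTree reports the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def effective_languages(elem, inherited=None, out=None):
    """Map each element name to the language in effect (None = no information)."""
    out = {} if out is None else out
    lang = elem.get(XML_LANG, inherited)
    # An empty xml:lang overrides the enclosing value without naming a language.
    if lang == "":
        lang = None
    out[elem.tag] = lang
    for child in elem:
        effective_languages(child, lang, out)
    return out


doc = ET.fromstring('<A xml:lang="en"><B xml:lang=""><C/></B><D/></A>')
print(effective_languages(doc))
```

In the sample, B and its child C end up with no language information, while D still inherits "en" from A.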
Change the sample declaration of xml:lang to:
xml:lang CDATA #IMPLIED
Change the last set of examples to read:
<!ATTLIST poem xml:lang CDATA 'fr'>
<!ATTLIST gloss xml:lang CDATA 'en'>
<!ATTLIST note xml:lang CDATA 'en'>
Amend the last sentence of the last paragraph to read:
Applications which require DTD facilities not related to validation (such as the declaration of default attributes and internal entities) that are or may be specified in external entities should use validating XML processors.
Remove the whole paragraph after the second example. This paragraph reads:
The version number "1.0" should be used to indicate conformance to this version of this specification; it is an error for a document to use the value "1.0" if it does not conform to this version of this specification. It is the intent of the XML working group to give later versions of this specification numbers other than "1.0", but this intent does not indicate a commitment to produce any future versions of XML, nor if any are produced, to use any particular numbering scheme. Since future versions are not ruled out, this construct is provided as a means to allow the possibility of automatic version recognition, should it become necessary. Processors may signal an error if they receive documents labeled with versions they do not support.
Change production [26] VersionNum to read:
[26] VersionNum ::= '1.0'
Change the definition for "#xN" to read:
where N is a hexadecimal integer, the expression matches the character whose number (code point) in ISO/IEC 10646 is N. The number of leading zeros in the #xN form is insignificant.
Change the third item of the bullet list of conditions for the "Standalone Document Declaration" VC to:
Change the first sentence of the 4th paragraph to read:
The function of the markup in an XML document is to describe its storage and logical structure and to associate attribute name-value pairs with its logical structures.
Change the next to last sentence of the paragraph immediately preceding the "Proper Group/PE Nesting" VC to read:
For compatibility, it is an error if the content model allows an element to match more than one occurrence of an element type in the content model.
Restore linebreaks in the first and next-to-last examples that were lost between the 1st and 2nd edition:
<?xml version="1.0"?>
<greeting>Hello, world!</greeting>
<?xml version="1.0"?>
<!DOCTYPE greeting SYSTEM "hello.dtd">
<greeting>Hello, world!</greeting>
Remove the last 5 words from the last sentence of the first paragraph, so that it reads:
The values of the attribute are language identifiers as defined by [IETF RFC 3066], Tags for the Identification of Languages, or its successor.
Remove the entire Note following the first paragraph (already amended by E11):
Note:
[IETF RFC 3066] tags are constructed from two-letter language codes as defined by [ISO 639], from two-letter country codes as defined by [ISO 3166], or from language identifiers registered with the Internet Assigned Numbers Authority [IANA-LANGCODES].
Last paragraph: add a new 3rd sentence:
"Specifically, it is a fatal error if an entity encoded in UTF-8 contains any irregular code unit sequences, as defined in Unicode 3.1."
with a reference to Unicode 3.1.
Change the [Unicode3] entry (leaving the anchor name unchanged) to read:
The Unicode Consortium. The Unicode Standard, Version 3.1, defined by: The Unicode Standard, Version 3.0 (Reading, MA, Addison-Wesley, 2000. ISBN 0-201-61633-5), as amended by the Unicode Standard Annex #27: Unicode 3.1 (http://www.unicode.org/reports/tr27).
Rewrite the paragraph beginning "[Definition: The SystemLiteral is called the entity's system identifier.", the following paragraph and the following numbered list, so that they read:
[Definition: The SystemLiteral is called the entity's system identifier. It is meant to be converted to a URI reference (as defined in [IETF RFC 2396], updated by [IETF RFC 2732]), as part of the process of dereferencing it to obtain input for the XML processor to construct the entity's replacement text.] It is an error for a fragment identifier (beginning with a # character) to be part of a system identifier. Unless otherwise provided by information outside the scope of this specification (e.g. a special XML element type defined by a particular DTD, or a processing instruction defined by a particular application specification), relative URIs are relative to the location of the resource within which the entity declaration occurs. A URI might thus be relative to the document entity, to the entity containing the external DTD subset, or to some other external parameter entity.
System identifiers (and other XML strings meant to be used as URI references) may contain characters that, according to [IETF RFC 2396] and [IETF RFC 2732], must be escaped before a URI can be used to retrieve the referenced resource. The characters to be escaped are the control characters #x0 to #x1F and #x7F (most of which cannot appear in XML), space #x20, the delimiters '<' #x3C, '>' #x3E and '"' #x22, the unwise characters '{' #x7B, '}' #x7D, '|' #x7C, '\' #x5C, '^' #x5E and '`' #x60, as well as all characters above #x7F. Since escaping is not always a fully reversible process, it must be performed only when absolutely necessary and as late as possible in a processing chain. In particular, neither the process of converting a relative URI to an absolute one nor the process of passing a URI reference to a process or software component responsible for dereferencing it should trigger escaping. When escaping does occur, it must be performed as follows:
Each disallowed character to be escaped is represented in UTF-8 [IETF RFC 2279] as one or more bytes.
The resulting bytes are escaped with the URI escaping mechanism (that is, converted to %HH, where HH is the hexadecimal notation of the byte value).
The original character is replaced by the resulting character sequence.
Amend the second sentence of the next-to-last paragraph to read:
An XML processor attempting to retrieve the entity's content may use any combination of the public and system identifiers as well as additional information outside the scope of this specification to try to generate an alternative URI reference.
Change the last sentence of the third paragraph to read:
The right angle bracket (>) may be represented using the string "&gt;", and must, for compatibility, be escaped using either "&gt;" or a character reference when it appears in the string "]]>" in content, when that string is not marking the end of a CDATA section.
Amend the last sentence of the next-to-last paragraph to read:
Unless an encoding is determined by a higher-level protocol, it is also a fatal error if an XML entity contains no encoding declaration and its content is not legal UTF-8 or UTF-16.
Amend the second paragraph to read:
Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY begin with the Byte Order Mark described by Annex F of [ISO/IEC 10646], Annex H of [ISO/IEC 10646-2000], section 2.4 of [Unicode], and section 2.7 of [Unicode3] (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF). This is an encoding signature, not part of either the markup or the character data of the XML document. XML processors must be able to use this character to differentiate between UTF-8 and UTF-16 encoded documents.
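A minimal sketch of using the Byte Order Mark as an encoding signature follows. The function name `sniff_bom` is hypothetical, and the sketch deliberately ignores other encodings (for example UTF-32, whose little-endian signature begins with the same two bytes as UTF-16LE), so it is not a complete encoding-detection routine.

```python
import codecs


def sniff_bom(data: bytes):
    """Return (encoding, signature_length), or (None, 0) if no BOM is present."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8", len(codecs.BOM_UTF8)
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be", len(codecs.BOM_UTF16_BE)
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le", len(codecs.BOM_UTF16_LE)
    return None, 0


encoding, skip = sniff_bom(b"\xef\xbb\xbf<greeting/>")
print(encoding, skip)
```

The returned length lets the caller skip the signature, since it is not part of the markup or the character data of the document.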
Add a new production [28b] and modify production [28] to refer to it:
[28] doctypedecl ::= '<!DOCTYPE' S Name (S ExternalID)? S? ('[' intSubset ']' S?)? '>' [VC: Root Element Type] [WFC: External Subset]
[28a] DeclSep ::= PEReference | S [WFC: PE Between Declarations]
[28b] intSubset ::= (markupdecl | DeclSep)*
[29] markupdecl ::= elementdecl | AttlistDecl | EntityDecl | NotationDecl | PI | Comment [VC: Proper Declaration/PE Nesting] [WFC: PEs in Internal Subset]
Change productions [6] Names and [8] Nmtokens to use #x20 (a single space character) instead of S:
[6] Names ::= Name (#x20 Name)*
[8] Nmtokens ::= Nmtoken (#x20 Nmtoken)*
Add a note after production 8:
Note: The Names and Nmtokens productions are used to define the validity of tokenized attribute values after normalization (see 3.3.1 Attribute Types).
This restores first edition erratum E62, which was rescinded by E108. It seems likely that when E108 was adopted the productions were incorrectly thought to apply to unnormalized attribute values, which would have prevented the use of non-#x20 whitespace (tabs and newlines) as separators in tokenized attribute values. In fact, it only prohibits the use of character references to these characters.
This change restores SGML compatibility (cf. the "name list" and "name token list" productions in SGML).
Modify the third sentence of the second paragraph, so that it reads:
The actual replacement text that is included (or included in literal) as described above must contain the replacement text of any parameter entities referred to, and must contain the character referred to, in place of any character references in the literal entity value; however, general-entity references must be left as-is, unexpanded.
To the sentence:
Unless otherwise provided by information outside the scope of this specification (e.g. a special XML element type defined by a particular DTD, or a processing instruction defined by a particular application specification), relative URIs are relative to the location of the resource within which the entity declaration occurs.
(inside the paragraph following the Notation declared VC), append the following:
This is defined to be the external entity containing the '<' which starts the declaration, at the point when it is parsed as a declaration.
This clarifies exactly where a declaration occurs, for purposes of determining the base for relative URIs. Given the example:
example.xml:
<!DOCTYPE foo [
<!ENTITY % pe SYSTEM "subdir1/pe">
%pe;
%intpe;
]>
<foo>&ent;</foo>

subdir1/pe:
<!ENTITY % extpe SYSTEM "../subdir2/extpe">
<!ENTITY % intpe "%extpe;">

subdir2/extpe:
<!ENTITY ent SYSTEM 'entfile'>
Though the characters making up the declaration of ent appear in subdir2/extpe, they are not parsed as a declaration there. They are just treated as characters making up the replacement text of intpe. They are not parsed as a declaration until intpe is parsed, at which point the containing external entity is the document entity, so the relevant base URI is that of example.xml.
The fact that it is the containing external entity that is used may be summed up by saying that internal entities do not carry any base URI with them; indeed, they consist only of their replacement text.
If example.xml contained %extpe; instead of %intpe; the situation would be different: the contents of subdir2/extpe would be parsed as a declaration, and the relevant base URI would be that of subdir2
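The two cases can be illustrated with ordinary relative-URI resolution. This is a sketch under stated assumptions: the location http://example.org/docs/example.xml is hypothetical and chosen only so that the relative references in the example have something to resolve against.

```python
from urllib.parse import urljoin

# Hypothetical location of the document entity.
DOC = "http://example.org/docs/example.xml"

# The external parameter entities are located relative to the entity
# in which each of them is declared.
pe = urljoin(DOC, "subdir1/pe")
extpe = urljoin(pe, "../subdir2/extpe")

# %intpe; case: the declaration of ent is parsed in the document entity,
# so 'entfile' resolves against example.xml.
ent_via_intpe = urljoin(DOC, "entfile")

# %extpe; case: the declaration of ent is parsed inside subdir2/extpe,
# so 'entfile' resolves against that entity instead.
ent_via_extpe = urljoin(extpe, "entfile")

print(ent_via_intpe)
print(ent_via_extpe)
```

The same system identifier 'entfile' therefore names two different resources, depending on which external entity contains the declaration at the point where it is parsed.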
From the definition for "A | B", delete "but not both":
Move the entries for [IETF RFC 2396] and [IETF RFC 2732] from A.2 (informative) to A.1 (normative).
Rewrite the Element valid VC as follows:
Validity constraint: Element Valid
An element is valid if there is a declaration matching elementdecl where the Name matches the element type, and one of the following holds:
The declaration matches EMPTY and the element has no content (not even entity references, comments, PIs or white space).
The declaration matches children and the sequence of child elements (after replacing any entity references with their replacement text) belongs to the language generated by the regular expression in the content model, with optional white space (characters matching the nonterminal S), comments and PIs (i.e. markup matching production [27] Misc) between the start-tag and the first child element, between child elements, or between the last child element and the end-tag. Note that a CDATA section containing only white space or a reference to an entity whose replacement text is character references expanding to white space do not match the nonterminal S, and hence cannot appear in these positions; however, a reference to an internal entity with a literal value consisting of character references expanding to white space does match S, since its replacement text is the white space resulting from expansion of the character references.
The declaration matches Mixed and the content (after replacing any entity references with their replacement text) consists of character data, comments, PIs and child elements whose types match names in the content model.
The declaration matches ANY, and the types of any child elements (after replacing any entity references with their replacement text) have been declared.
In the paragraph just after production [43] content, amend the definition of empty element so that the word "content" within the definition is a link to production [43].
Amend the last paragraph so that it reads:
A consequence of well-formedness in general entities is that the logical and physical structures in an XML document are properly nested; no start-tag, end-tag, empty-element tag, element, comment, processing instruction, character reference, or entity reference can begin in one entity and end in another.
"General" is added because:
This clarifies that the following from the OASIS test suite:
xmltest/invalid/001.xml:
<!DOCTYPE doc SYSTEM "001.ent">
<doc></doc>

with 001.ent:
<!ELEMENT doc EMPTY>
<!ENTITY % e "<!--">
%e; -->
is well-formed but violates a validity constraint.
In the first paragraph after the example, replace "overriden" with "overridden" (two d's) in the sentence "This declared intent is considered to apply to all elements within the content of the element where it is specified, unless overridden with another instance of the xml:space attribute."
Change the [IETF RFC 2376] reference to [IETF RFC 3023] (keeping the same #RFC2376 fragment identifier in order not to break existing links).
Change the IETF RFC 2376 entry to:
Amend the next to last paragraph so that it reads:
This specification, together with associated standards (Unicode and ISO/IEC 10646 for characters, Internet RFC 3066 for language identification tags, ISO 639 for language name codes, and ISO 3166 for country name codes), provides all the information necessary to understand XML Version 1.0 and construct computer programs to process it.
[The only change is that "RFC 1766" becomes "RFC 3066".]
Change all [IETF RFC 1766] references to [IETF RFC 3066] (keeping the same #RFC1766 fragment identifier in order not to break existing links).
Remove the last sentence of the Note: "It is expected that the successor to [IETF RFC 1766] will introduce three-letter language codes for languages not presently covered by [ISO 639]."
Change the IETF RFC 1766 entry to:
Just after the paragraph beginning "All attributes for which no declaration has been read..." (just before the examples), append the following paragraph:
It is an error if an attribute refers to an entity when there is a declaration for that entity which the processor has not read. This can happen only when a non-validating processor is being used.
Change the title and the text of Attribute Default Legal Validity Constraint to:
Validity Constraint: Attribute Default Value Syntactically Correct
The declared default value must meet the syntactic constraints of the declared attribute type.
Note that only the syntactic constraints of the type are required here; other constraints (e.g. that the value be the name of a declared unparsed entity, for an attribute of type ENTITY) may come into play if the declared default value is actually used (an element without a specification for this attribute occurs).
Change the first sentence of the second paragraph of the Entity Declared WFC (not the VC of the same name) to read:
Note that non-validating processors are not obligated to read and process entity declarations occurring in parameter entities or in the external subset.
Remove the word "internal" from the title of the section.
Change the first paragraph, in particular removing the word "internal", so that it reads:
In discussing the treatment of internal entities, it is useful to distinguish two forms of the entity's value. [Definition: For an internal entity, the literal entity value is the quoted string actually present in the entity declaration, corresponding to the non-terminal EntityValue.] [Definition: For an external entity, the literal entity value is the exact text contained in the entity.] [Definition: For an internal entity, the replacement text is the content of the entity, after replacement of character references and parameter-entity references.] [Definition: For an external entity, the replacement text is the content of the entity, after stripping the text declaration (leaving any surrounding whitespace) if there is one but without any replacement of character references or parameter-entity references.]
Modify the second example in the table at the end of the section to read as follows (add a &#x20; in the middle):
| | A #x20 B | #x20 #x20 A #x20 #x20 #x20 B #x20 #x20 |
Replace the last sentence of the paragraph beginning with "URI references require encoding and escaping of certain characters." with the following:
The XML processor must escape disallowed characters as follows:
After the sentence reading "A URI might thus be relative to the document entity, to the entity containing the external DTD subset, or to some other external parameter entity.", which follows the definition of SystemLiteral, add the following:
Attempts to retrieve the resource identified by a URI may be redirected at the parser level (for example, in an entity resolver) or below (at the protocol level, for example, via an HTTP Location: header). In the absence of additional information outside the scope of this specification within the resource, the base URI of a resource is always the URI of the actual resource returned. In other words, it is the URI of the resource retrieved after all redirection has occurred.
Add a validity constraint applying to productions [58] NotationType and [59] Enumeration as follows:
Validity constraint: No duplicate tokens
The notation names in a single NotationType attribute declaration, as well as the NmTokens in a single Enumeration attribute declaration, must all be distinct.
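A validating processor might check this constraint along the following lines. This is only a sketch: the function name `duplicate_tokens` is hypothetical, and the tokens are assumed to have been extracted from the declaration already.

```python
def duplicate_tokens(tokens):
    """Return the tokens that occur more than once, in order of first repeat."""
    seen, dups = set(), []
    for token in tokens:
        if token in seen and token not in dups:
            dups.append(token)
        seen.add(token)
    return dups


# Tokens of a hypothetical enumeration (left | right | left).
print(duplicate_tokens(["left", "right", "left"]))
```

An empty result means the declaration satisfies the constraint.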