XML processor bugs

This is a non-exhaustive list of typical XML processor bugs that cause processors to fail with modular DTDs.

Bug1: Unicode characters beyond Basic Multilingual Plane

Some XML processors incorrectly reject an XML document that uses characters beyond the Basic Multilingual Plane (BMP). Such XML processors fail to work with the MathML 2.0 DTD, and of course with any XML instance that uses characters beyond the BMP.

Test case:

Test your XML processor

bug1.xml

<!DOCTYPE Plane1Char[
<!ELEMENT Plane1Char (#PCDATA) >
<!-- MATHEMATICAL DOUBLE-STRUCK SMALL A -->
<!ENTITY aopf "&#x1D552;" ><!-- it's not an invalid Unicode character! -->
]>
<Plane1Char>
&aopf;
&#x1D552;
</Plane1Char>

"Char" production in XML 1.0 is defined as follows:

[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
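
The production translates directly into a range check on code points. The following is a minimal sketch in Python; the function name is_xml_char is a hypothetical helper chosen for illustration, not part of any XML library.

```python
def is_xml_char(code_point: int) -> bool:
    """Return True if code_point matches the XML 1.0 Char production [2]."""
    return (
        code_point in (0x9, 0xA, 0xD)
        or 0x20 <= code_point <= 0xD7FF
        or 0xE000 <= code_point <= 0xFFFD
        or 0x10000 <= code_point <= 0x10FFFF  # planes 1 to 16, beyond the BMP
    )


# U+1D552 MATHEMATICAL DOUBLE-STRUCK SMALL A, used in bug1.xml, is a legal Char.
print(is_xml_char(0x1D552))  # True
# Surrogate code points and U+FFFE are excluded by the production.
print(is_xml_char(0xD800))   # False
print(is_xml_char(0xFFFE))   # False
```

A processor that stops the check at #xFFFD, or that treats the UTF-16 surrogate pair for such a character as two illegal characters, would show the bug described above.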

Bug2: Reference to an undeclared parameter entity inside an ignored conditional section

Some XML processors incorrectly try to recognize a parameter entity reference even when it is enclosed in an IGNOREd conditional section. Such XML processors fail to work with the XHTML 1.1 and SVG 1.1 DTDs.

Test case:

Test your XML processor

bug2.dtd

<![IGNORE[
%this_parameter_entity_reference_shall_not_be_recognized;
]]>
<!ELEMENT foo (#PCDATA) >

bug2.xml

<!DOCTYPE foo SYSTEM "bug2.dtd">
<foo>Can you see this?</foo>

Section 3.4, Conditional Sections, of XML 1.0 says the following (emphasis added for clarity):

If the keyword of the conditional section is INCLUDE, then the contents of the conditional section MUST be considered part of the DTD. If the keyword of the conditional section is IGNORE, then the contents of the conditional section MUST be considered as not logically part of the DTD. If a conditional section with a keyword of INCLUDE occurs within a larger conditional section with a keyword of IGNORE, both the outer and the inner conditional sections MUST be ignored. The contents of an ignored conditional section MUST be parsed by ignoring all characters after the "[" following the keyword, except conditional section starts "<![" and ends "]]>", until the matching conditional section end is found. Parameter entity references MUST NOT be recognized in this process.
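
The scanning rule quoted above can be sketched as a small loop that tracks only conditional section starts and ends. This is an illustrative sketch in Python, not the code of any real processor; the function name skip_ignore_section and its calling convention are assumptions made for the example.

```python
def skip_ignore_section(dtd: str, start: int) -> int:
    """Return the index just past the "]]>" that closes an ignored section.

    `start` is the index of the first character after the "[" that follows
    the IGNORE keyword. Only "<![" and "]]>" are significant; a "%name;"
    sequence is plain text here and is never looked up.
    """
    depth = 1
    i = start
    while i < len(dtd):
        if dtd.startswith("<![", i):
            depth += 1  # nested conditional section start
            i += 3
        elif dtd.startswith("]]>", i):
            depth -= 1  # conditional section end
            i += 3
            if depth == 0:
                return i
        else:
            i += 1  # any other character, including "%", is skipped
    raise ValueError("unterminated conditional section")


# The content of bug2.dtd from the test case above.
dtd = (
    "<![IGNORE[\n"
    "%this_parameter_entity_reference_shall_not_be_recognized;\n"
    "]]>\n"
    "<!ELEMENT foo (#PCDATA) >\n"
)
end = skip_ignore_section(dtd, len("<![IGNORE["))
print(dtd[end:].strip())  # <!ELEMENT foo (#PCDATA) >
```

Because the loop never inspects "%", the undeclared parameter entity in bug2.dtd cannot trigger an error, which is the behavior the specification requires.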

Bug3: More than one attribute definition for a given attribute name in an attribute-list declaration

Some XML processors incorrectly report an error (rather than a warning) when they encounter more than one attribute definition for a given attribute name in an attribute-list declaration. Such XML processors fail to work with the SVG 1.1 DTD.

Test case:

Test your XML processor

bug3.xml

<!DOCTYPE AttList[
<!ELEMENT AttList EMPTY >
<!ATTLIST AttList
   attribute CDATA #IMPLIED
   attribute CDATA #FIXED 'This declaration must be ignored.'
>
]>
<AttList attribute="The first declaration is binding."/>

Section 3.3, Attribute-List Declarations, of XML 1.0 says the following (emphasis added for clarity):

When more than one AttlistDecl is provided for a given element type, the contents of all those provided are merged. When more than one definition is provided for the same attribute of a given element type, the first declaration is binding and later declarations are ignored. For interoperability, writers of DTDs MAY choose to provide at most one attribute-list declaration for a given element type, at most one attribute definition for a given attribute name in an attribute-list declaration, and at least one attribute definition in each attribute-list declaration. For interoperability, an XML processor MAY at user option issue a warning when more than one attribute-list declaration is provided for a given element type, or more than one attribute definition is provided for a given attribute, but this is not an error.
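
The "first declaration is binding" rule amounts to a merge that keeps the first definition seen for each attribute name and, at most, records a warning for the others. A minimal sketch in Python, assuming a simplified representation of an attribute definition as a (name, type, default) tuple; merge_att_defs and AttDef are hypothetical names used only for this example.

```python
from typing import Dict, List, Tuple

AttDef = Tuple[str, str, str]  # (attribute name, type, default declaration)


def merge_att_defs(att_defs: List[AttDef]) -> Tuple[Dict[str, AttDef], List[str]]:
    """Keep the first definition of each attribute; collect warnings for the rest."""
    binding: Dict[str, AttDef] = {}
    warnings: List[str] = []
    for att_def in att_defs:
        name = att_def[0]
        if name in binding:
            # Later definitions are ignored. This is at most a warning, not an error.
            warnings.append(f"duplicate definition of attribute '{name}' ignored")
        else:
            binding[name] = att_def
    return binding, warnings


# The two definitions from bug3.xml, in document order.
binding, warnings = merge_att_defs([
    ("attribute", "CDATA", "#IMPLIED"),
    ("attribute", "CDATA", "#FIXED 'This declaration must be ignored.'"),
])
print(binding["attribute"][2])  # #IMPLIED
print(len(warnings))            # 1
```

With this merge, the #FIXED definition in bug3.xml never takes effect, so the attribute value in the document instance is accepted.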

Bug4: Unread entities in external DTD subset

Some non-validating XML processors incorrectly report an error when they encounter an "unrecognized" entity that is actually declared in an external DTD subset they did not read.

Test case:

Test your XML processor

bug4.dtd

<!ELEMENT rant (#PCDATA) >
<!ENTITY foo "I am declared!" >

bug4.xml

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE rant SYSTEM "bug4.dtd">
<rant>The entity "&foo;" is declared in the external DTD subset!</rant>

The "Entity Declared" well-formedness constraint of XML 1.0 says the following (emphasis added for clarity):

Well-formedness constraint: Entity Declared

In a document without any DTD, a document with only an internal DTD subset which contains no parameter entity references, or a document with "standalone='yes'", for an entity reference that does not occur within the external subset or a parameter entity, the Name given in the entity reference MUST match that in an entity declaration that does not occur within the external subset or a parameter entity, except that well-formed documents need not declare any of the following entities: amp, lt, gt, apos, quot. The declaration of a general entity MUST precede any reference to it which appears in a default value in an attribute-list declaration.

Note that non-validating processors are not obligated to read and process entity declarations occurring in parameter entities or in the external subset; for such documents, the rule that an entity must be declared is a well-formedness constraint only if standalone='yes'.
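
The conditions under which "Entity Declared" applies as a well-formedness constraint can be written as a small predicate. This is a sketch in Python of one reading of the text quoted above; the function name and its four boolean parameters are assumptions made for illustration.

```python
def entity_declared_is_wfc(has_dtd: bool,
                           has_external_subset: bool,
                           internal_subset_has_pe_refs: bool,
                           standalone_yes: bool) -> bool:
    """Return True if an undeclared entity reference is a well-formedness error.

    When this returns False, the rule is only a validity constraint, so a
    non-validating processor that skipped the external subset must not
    report a fatal error for a reference it cannot resolve.
    """
    if not has_dtd:
        return True   # no DTD at all: nothing can be declared elsewhere
    if standalone_yes:
        return True   # standalone='yes': only the internal subset counts
    # Only an internal subset, with no parameter entity references in it.
    return not has_external_subset and not internal_subset_has_pe_refs


# bug4.xml: external DTD subset, standalone="no".
print(entity_declared_is_wfc(
    has_dtd=True,
    has_external_subset=True,
    internal_subset_has_pe_refs=False,
    standalone_yes=False,
))  # False
```

For bug4.xml the predicate is False, so a non-validating processor that did not read bug4.dtd has no basis for rejecting the reference to foo as a well-formedness error.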

"Reports From the W3C SGML ERB to the SGML WG And from the W3C XML ERB to the XML SIG" explains why this was decided to be a Validity Constraint rather than a Well-formedness Constraint for documents with standalone='no' (emphasis added for clarity):

S.40 Should Entity Declared be a VC or a WFC?

Decision: In a standalone document (one without a DTD, one with only an internal subset and no references to external parameter entities, or one with "standalone='yes'"), this constraint should be treated as a WFC: i.e. it must be checked by all conforming processors. In a document with a DTD and "standalone='no'", it should be treated as a VC.

Unanimous (MMal and EM abstaining).

Rationale: it cannot be a WFC without serious injury to the notion of Draconian error handling. As the current draft (97-11-17) makes explicit, a non-validating processor cannot be expected to know whether an entity declaration for an entity being referred to does or does not occur in some external parameter entity or external DTD subset. But if the constraint is a well-formedness constraint, even a non-validating processor should catch the error. So for "standalone='no'", it should be a VC -- a constraint enforceable only if one reads the entire DTD.

For documents without a DTD, however, or with "standalone='yes'", a non-validating processor can in fact be expected to know what entities have been declared (if there is no DTD, none have been; if standalone='yes', only those declared in the internal subset) and to tell whether an entity referred to in the document is one of them. Failure to require all processors to detect this as a fatal error would lead to all sorts of ad hoc bad practice. Some WG members speculated that failing to make this a WFC would mean the entity 'today' would immediately be declared to mean 'the date given by the system clock' and so on -- an entire API could be defined as a set of magic entity names, with parts of the name separated by dots or hyphens being treated as parameters ... A brief consideration of this possibility quickly led all doubters to agree that undeclared entities should be a WFC in all cases where a non-validating processor can be expected to detect them.


Last updated on $Date: 2004/08/24 07:45:38 $

Masayasu Ishikawa