This document is an internal working document of the XML Core WG. It maintains a running log of potential errata to the XML 1.0 spec, 4th edition (dated 2006-08-16), to the XML 1.1 spec, 2nd edition (dated 2006-08-16), and to the XML 1.0 spec, 5th edition (dated 2008-11-26). It is therefore the successor to the Running log of potential errata for XML 1.0 3rd edition and XML 1.1 1st edition. It is meant to be a living document, frequently updated as new errata are discovered and as they are disposed of by the WG.
When a potential erratum is resolved, its entry in this document is moved to the Resolved cases section and, if appropriate (it is a real erratum, not a false alarm or a request for enhancement that cannot be resolved by an erratum), the official XML 1.0 4th edition errata page, the official XML 1.1 2nd edition errata page, or the official XML 1.0 5th edition errata page is updated.
PE165 | Add note on Unicode normalization | |||
---|---|---|---|---|
| ||||
Problem statement |
From Addison Phillips: Dear XML Core WG, I am writing on behalf of both the Internationalization Core WG and the HTML Coordination Group (HCG). Recently there has been an extensive discussion of normalization in W3C specifications, mainly related to handling of element and attribute names and values (as in CSS3 Selectors). Some of this discussion revolves around how Unicode normalization should work with XML and XML-derived specifications, hence I was actioned by HCG [0] to contact you folks. I produced a general summary of the Unicode normalization problem at [1] for the HCG. Those unfamiliar with Unicode normalization may wish to review that message. The basic question is whether XML can (or should?) take a clearer stance on Unicode normalization. At present, XML 1.0 5e, like its predecessors, does not require any particular normalization form; it says nothing about whether canonical equivalents in Unicode are "equal" from an XML point of view; and thus implies that Unicode canonical equivalence does *not* apply when considering an XML document's formation. The recommendations in Appendix J (which does include normalization among its suggestions) further suggest that this is true. On the other hand, it seems reasonable to suppose that Unicode canonical equivalence might apply to XML. Processes such as transcoding legacy charsets to Unicode might result in canonically-equivalent-but-unequal code point sequences, for example. In a survey done at I18N's behest, our Unicode liaison (Mark Davis) produced a survey of content of the Web, as well as a summary on performance [2], which found that 99.98% of Web HTML content was, in fact, in Unicode form NFC. It seems reasonable to suppose that XML content and documents would follow a similar pattern. Our questions to XML Core WG, thus, are: What, precisely, should XML say with regard to Unicode canonical equivalence? 
Would it be possible to require or allow canonical equivalents to be treated as identical directly in XML (and not merely as a side effect of other specifications)? Is there a problem if XML permits/requires canonically-equivalent-yet-different sequences to be treated as distinct if other specifications require/allow canonical equivalence to be recognized? The Internationalization Core WG would be happy to work with you on these thorny issues. Please advise if you need more information, consultation, participation, or just need to vent :-). Kind Regards, Addison (for I18N/HCG) | |||
Proposed resolution |
|
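To make the PE165 question concrete, the following Python sketch shows two canonically equivalent spellings of the same string that are unequal when compared code point by code point. The sample string and the helper name `canonically_equal` are illustrative assumptions and do not come from the message above.

```python
import unicodedata

# Two spellings of the same text: precomposed U+00E9 versus
# "e" followed by the combining acute accent U+0301.
precomposed = "caf\u00e9"
decomposed = "cafe\u0301"


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings under Unicode canonical equivalence (via NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Plain code point comparison: the two sequences differ.
print(precomposed == decomposed)
# Comparison after normalizing both sides to NFC: they match.
print(canonically_equal(precomposed, decomposed))
```

Whether an XML processor should ever apply the second kind of comparison to names or values is exactly what the message asks the WG to decide.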
PE122 | Revisiting E15 (from second edition errata) | |||
---|---|---|---|---|
| ||||
Problem statement |
From Jonathan Marsh: I've been looking into E15 more fully. Besides the backward compatibility cost, there appears to be an implementability issue. Microsoft parsers allow empty and element-only content to contain entity references as long as those references expand to whitespace or to nothing. To do otherwise involves a substantial reworking of the parser implementation strategy in use, making such a change very expensive, as well as breaking any documents previously relying on this behavior (though the number of such documents is likely to be small). Our implementation difficulties are surfaced in the spec through an obvious inconsistency in the spec. Validation of attributes is done after entity expansion (according to E20), but prior to character entity expansion in elements (according to E15). There appears to be no clear reason why these contexts must differ. Microsoft parsers accept all documents conformant to this erratum, but may also accept some documents (which are unlikely to occur in the wild) which do not conform to the constraints of this erratum. In particular, we fail the following test cases (by parsing each document without error): E15a.xml: <!DOCTYPE foo [ <!ELEMENT foo EMPTY> <!ENTITY empty ""> ]> <foo>&empty;</foo> E15g.xml: <!DOCTYPE foo [ <!ELEMENT foo (foo*)> ]> <foo><foo/>&#32;<foo/></foo> E15h.xml <!DOCTYPE foo [ <!ELEMENT foo (foo*)> <!ENTITY space "&#38;#32;"> ]> <foo><foo/>&space;<foo/></foo> | |||
Discussion |
From Jonathan Marsh: > 3) Microsoft is problematic because unclear, Microsoft needs to tell us > what issue exactly they have with E15. A description of the problem with E15 is at http://lists.w3.org/Archives/Member/w3c-xml-core-wg/2003OctDec/0227.html. The main objections are: 1) difficulty of implementation 2) entity expansion in attributes and elements is treated inconsistently (E20 vs. E15) 3) not backward compatible with deployed versions of MSXML I note that failure to support E15 does not affect the infoset of the parsed document. I would like to propose a solution, but I'm actually having trouble understanding the erratum and how it led to the test cases we fail. Perhaps somebody could help me understand the test cases better. In http://www.w3.org/TR/2003/PER-xml-20031030/PER-xml-20031030-review.html#elementvalid, I find: "... however, a reference to an internal entity with a literal value consisting of character references expanding to white space does match S, since its replacement text is the white space resulting from expansion of the character references." I think this specifically makes the following test case valid: <!DOCTYPE foo [ <!ELEMENT foo (foo*)> <!ENTITY space "&#32;"> ]> <foo><foo/>&space;<foo/></foo> If that is the case, it is hard to see why the test cases we have problems with are not valid: E15g.xml: <!DOCTYPE foo [ <!ELEMENT foo (foo*)> ]> <foo><foo/>&#32;<foo/></foo> E15h.xml <!DOCTYPE foo [ <!ELEMENT foo (foo*)> <!ENTITY space "&#38;#32;"> ]> <foo><foo/>&space;<foo/></foo> And for consistency the similar situation for EMPTY content: E15a.xml: <!DOCTYPE foo [ <!ELEMENT foo EMPTY> <!ENTITY empty ""> ]> <foo>&empty;</foo> From Richard Tobin: I think the idea is that all the pointless possibilities that can be ruled out, are ruled out. > I think this specifically makes the following test case valid: > > <!DOCTYPE foo [ > <!ELEMENT foo (foo*)> > <!ENTITY space "&#32;"> > ]> > <foo><foo/>&space;<foo/></foo> Yes. 
Entity references to space-separated sequences of elements have to be valid, and this is just an empty sequence. The character reference can't be ruled out because it's gone by the time you know what the entity is being used for. > If that is the case, it is hard to see why the test cases we have > problems with are not valid: > > E15g.xml: > <!DOCTYPE foo [ > <!ELEMENT foo (foo*)> > ]> > <foo><foo/>&#32;<foo/></foo> > > E15h.xml > <!DOCTYPE foo [ > <!ELEMENT foo (foo*)> > <!ENTITY space "&#38;#32;"> > ]> > <foo><foo/>&space;<foo/></foo> Well, if you regard it as a deficiency that the first case can't be ruled out, then it's a deficiency that does not apply to these cases. > And for consistency the similar situation for EMPTY content: > > E15a.xml: > <!DOCTYPE foo [ > <!ELEMENT foo EMPTY> > <!ENTITY empty ""> > ]> > <foo>&empty;</foo> There is no good use for an entity reference in an EMPTY element, so it can be ruled out. | |||
Resolution | There is nothing we can do about that. |
PE139 | Changing XML spec to use IRIs for system IDs | |||
---|---|---|---|---|
| ||||
Problem statement |
From Richard Tobin: In 4.2.2, replace the paragraph beginning "System identifiers (and other ..." and the following 3-item list with: System identifiers (and other XML strings meant to be used as URI references) are converted to URI references as described in [IRIs RFC 3987]. They MAY contain characters that, according to [new URIs RFC3986], must be escaped before a URI can be used to retrieve the referenced resource. XML processors MUST escape them as described in section 3.1 of [IRIs RFC 3987]. We may want to include this existing text: Since escaping is not always a fully reversible process, it MUST be performed only when absolutely necessary and as late as possible in a processing chain. In particular, neither the process of converting a relative URI to an absolute one nor the process of passing a URI reference to a process or software component responsible for dereferencing it SHOULD trigger escaping. but I am uncertain as to whether [Base URI] in the infoset should have them escaped. | |||
Discussion |
From the minutes of 2005-04-20: But [Richard] points out that he doesn't suggest we make this change until and unless we change the references to 2396 to 3986. Richard suggests we defer this erratum for now. CONSENSUS to defer this erratum for now. | |||
Resolution | 2007-12-05: Superseded by PE161. |
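For readers unfamiliar with the escaping that PE139 refers to, here is a minimal Python sketch of the core step: each non-ASCII character is converted to its UTF-8 bytes and each byte is written as `%HH`. It is a simplification under stated assumptions. It escapes every non-ASCII character rather than only the character classes named in RFC 3987, it leaves all ASCII characters untouched, and it does not treat host names specially. The function name `iri_to_uri` and the sample system identifier are hypothetical.

```python
def iri_to_uri(iri: str) -> str:
    """Percent-encode every non-ASCII character as its UTF-8 bytes."""
    out = []
    for ch in iri:
        if ord(ch) < 0x80:
            # ASCII characters are passed through unchanged in this sketch.
            out.append(ch)
        else:
            # One %HH triplet per UTF-8 byte of the character.
            out.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(out)


system_id = "http://example.org/r\u00e9sum\u00e9.dtd"
print(iri_to_uri(system_id))
```

Because the mapping is not fully reversible, the sketch is best applied at the last moment before dereferencing, in line with the "as late as possible" wording quoted in the problem statement.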
PE150 | ISO 639 and 3166 | |||
---|---|---|---|---|
| ||||
Problem statement |
From Addison Phillips: 1. Section 1.1 contains this paragraph: -- This specification, together with associated standards (Unicode [Unicode] and ISO/IEC 10646 [ISO/IEC 10646] for characters, Internet RFC 3066 [IETF RFC 3066] for language identification tags, ISO 639 [ISO 639] for language name codes, and ISO 3166 [ISO 3166] for country name codes), provides all the information necessary to understand XML Version 1.0 and construct computer programs to process it. -- I think the references to ISO 639 and ISO 3166 should be replaced with a reference to the IANA Language Subtag Registry. I note that this is the only place that these references appear. 2. I recognize that there is difficulty in replacing the reference to RFC 3066 currently, even though that document is now obsolete, since 3066bis has not been published by the RFC Editor. 3. Section A2 contains a reference [IANA-LANGCODES] which is not referenced anywhere. Furthermore, this reference is to the now obsolete and closed Language Tag registry. Changing it to the Language Subtag registry would be more appropriate. The new URL is: http://www.iana.org/assignments/language-subtag-registry | |||
Discussion |
The successor to RFC 3066 has now been published as RFC 4646. | |||
Resolution |
|
PE151 | missing 16 non-characters | |||
---|---|---|---|---|
| ||||
Problem statement |
From Frank Ellermann: Hi, the page http://www.w3.org/TR/REC-xml/#charsets mentions in a note, that some characters are discouraged, in essence all C1 controls excl. NEL and the 66 non-characters. The 66 non-characters consist of 17 planes * 2 (??FFFE, ??FFFF) and 32 u+FDD0 up to u+FDEF. This XML 1.0 document says u+FDDF instead of u+FDEF. The page http://www.unicode.org/Public/UNIDATA/DerivedAge.txt claims that these 32 non-characters were introduced together in Unicode version 3.1. The normative [Unicode3] reference in XML 1.0 4th ed. is based on Unicode version 3.2. | |||
Resolution |
|
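The arithmetic in PE151 can be checked with a short Python sketch. It counts the non-characters exactly as the message describes them (the last two code points of each of the 17 planes, plus U+FDD0 through U+FDEF) and then compares that block with the shorter range ending at U+FDDF. The variable names are illustrative.

```python
# Last two code points of each of the 17 planes: U+xxFFFE and U+xxFFFF.
plane_ends = [
    (plane << 16) | low
    for plane in range(17)
    for low in (0xFFFE, 0xFFFF)
]

# The contiguous block U+FDD0..U+FDEF, inclusive.
block = list(range(0xFDD0, 0xFDEF + 1))

noncharacters = plane_ends + block
print(len(plane_ends), len(block), len(noncharacters))

# The range as the message says the spec prints it, ending at U+FDDF.
short_block = list(range(0xFDD0, 0xFDDF + 1))
print(len(block) - len(short_block))
```

The difference between the two ranges is the 16 code points referred to in the title of this entry.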
PE152 | not obligated to to | |||
---|---|---|---|---|
| ||||
Problem statement |
From Dieter Köhler: Section 4.1, WFC: Entity Declared, last paragraph: "... non-validating processors are not obligated to to read ..." should be changed to "... non-validating processors are not obligated to read ..." | |||
Resolution |
|
PE153 | facilities /not/ related to validation | |||
---|---|---|---|---|
| ||||
Problem statement |
From Dieter Köhler: In section 5.2, last paragraph: "Applications which require DTD facilities not related to validation (such as the declaration of default attributes and internal entities that are or may be specified in external entities ) SHOULD use validating XML processors." the first "not" seems to be wrong and the sentence contains a superfluous space before the closing bracket. It should be changed to: "Applications which require DTD facilities related to validation (such as the declaration of default attributes and internal entities that are or may be specified in external entities) SHOULD use validating XML processors." | |||
Discussion |
The "not" is actually intended and should not be removed. The superfluous space is present only in 1.1 and should be removed.
Resolution | In the 1.1 spec only:
|
PE154 | linking to WFCs in prod 60 | |||
---|---|---|---|---|
| ||||
Problem statement |
From Dieter Köhler: It also appears to me inconvenient that in section 3.3.2, prod. 60 the "WFC: No < in Attribute Values" and "WFC: No External Entity References" have no visible clue, where the text of these WFCs can be found. The WFCs are linked correctly, but no reference is available for the reader of a print-out of the spec. Therefore I suggest amending the text of the links with "see prod. 41". For consistency, I would also like to suggest changing the order of the WFCs to match those of prod. 41 and moving them to the end of the list. The list of VCs and WFCs of prod. 60 should look like: [VC: Required Attribute] [VC: Attribute Default Value Syntactically Correct] [VC: Fixed Attribute Default] [WFC: No External Entity References, see prod. 41] [WFC: No < in Attribute Values, see prod. 41] | |||
Resolution | Nice suggestion, but it turns out to be too difficult to implement as the link text is generated by the stylesheet and there is no place to indicate the desired target (prod. 41 here). |
PE155 | prod. 68, VC Entity Declared | |||
---|---|---|---|---|
| ||||
Problem statement |
From Dieter Köhler: Section 4.1, prod. 68, VC Entity Declared: >>In a document with an external subset or parameter entity references with " standalone='no' ", the Name ...<< Here the scope of the condition >>with " standalone='no' "<< is ambiguous. In order to be consistent with the WFC Entity Declared the condition must apply to both, "external subset" and "parameter entity references", because in a document with an external subset and standalone='yes' a missing entity declaration is a well-formedness error. However the wording allows two options: "In a document with (A or B) with C" or "In a document with A or (B with C)". Of course one can rule out the second option as false on carefully comparing the wording of the VC Entity Declared with that of the WFC Entity Declared. But it is not easy to figure it out. However, there is a second problem: The condition of "standalone='no'" is equivalent to the condition that no standalone declaration exists, which can be inferred from the rule in section 2.9: "If there are external markup declarations but there is no standalone document declaration, the value 'no' is assumed." For clarification it would be good to remind the reader of this rule, in particular because the Courier type face of the words "standalone='no'" puts an emphasis on an explicit standalone declaration which is not intended. To summarize my suggestion, I would recommend that the sentence >>In a document with an external subset or parameter entity references with " standalone='no' ", the Name ...<< should be changed to something like >>For a document with "standalone = 'no'" or no standalone declaration, if this document has a DTD with an external subset or parameter entity references in its internal subset, the Name ...<< | |||
Resolution |
|
PE156 | Inclusion of external entities | |||
---|---|---|---|---|
| ||||
Problem statement |
From Dieter Köhler: Section 4.4.3: "If the entity is external, and the processor is not attempting to validate the XML document, the processor MAY, but need not, include the entity's replacement text." Should not the same apply if the entity is internal, but declared in the internal subset of a DTD after a reference to a parameter entity that the processor did not read? (See also 4.4.2 and the WFC Entity Declared of prod. 68.) | |||
Resolution |
|
PE157 | UTF-16 and Byte Order Mark | |||
---|---|---|---|---|
| ||||
Problem statement |
From Dieter Köhler: Appendix F.1 of the XML specs presents examples about how to automatically detect the encoding of an entity from the first characters of an XML encoding declaration without a byte order mark. These examples include UTF-16BE and UTF-16LE. However, section 4.3.3 says that entities encoded in UTF-16 MUST begin with a byte order mark. In the light of the examples it seems that the intention of the specs is to demand a UTF-16 byte order mark only when no XML declaration is used. Is this interpretation of the specs correct? If the answer is "yes", I would suggest to start the second paragraph of sect. 4.4.3 with: "In the absence of a text declaration (or an XML declaration respectively) entities encoded in UTF-16 MUST ..." If the answer is "no", I would suggest to remove the two incriminated examples from Appendix F.1 and to add an appropriate warning. | |||
Resolution |
|
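The tension described in PE157 is easier to see with the byte patterns written out. The following Python sketch covers only the UTF-16 cases: the two byte order marks, and the two BOM-less patterns that `<?` produces in UTF-16BE and UTF-16LE. The function name `sniff_utf16` and its return labels are hypothetical, and the sketch deliberately ignores UCS-4, UTF-8, EBCDIC and every other case a real detector must consider.

```python
import codecs


def sniff_utf16(data: bytes) -> str:
    """Classify the start of an entity, looking at UTF-16 cases only."""
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "UTF-16BE with BOM"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "UTF-16LE with BOM"
    if data.startswith(b"\x00<\x00?"):         # 00 3C 00 3F
        return "UTF-16BE without BOM"
    if data.startswith(b"<\x00?\x00"):         # 3C 00 3F 00
        return "UTF-16LE without BOM"
    return "not recognised by this sketch"


decl = '<?xml version="1.0" encoding="UTF-16"?><doc/>'
print(sniff_utf16(decl.encode("utf-16-be")))
print(sniff_utf16(codecs.BOM_UTF16_BE + decl.encode("utf-16-be")))
```

The two BOM-less branches are the examples the message questions: they can only match entities that section 4.3.3, read literally, does not permit.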
PE158 | UTF-8 BOM | |||
---|---|---|---|---|
| ||||
Problem statement |
From John Cowan: I took up the question of the UTF-8 BOM with the Unicode Technical Committee after carefully reading what the Unicode Standard versions 4.0 and 5.0 have to say on the subject, thus: > > Am I correct in thinking that a conformant process that reads <EF > > BB BF> from the beginning of a byte stream that purports to be in > > the UTF-8 encoding scheme has the choice of discarding it as a BOM > > or accepting it as a ZWNBSP? I did not request a formal interpretative ruling, but Ken Whistler, one of the leading lights of the UTC, replied as follows: > I think in isolation, the answer to that would have to be > formally, yes, because <EF BB BF> at the start of a UTF-8 > byte stream is ambiguous. > > In a more complex context, where you could specify a conversion > going on between UTF-8 and one or more UTF-16 or UTF-32-based > encoding schemes, you could specify some instances where either > operation (discarding and not interpreting, or retaining and > interpreting as ZWNBSP) could be conformant or non-conformant. > It would depend on whether the operation willy-nilly changed > an intended BOM into a ZWNBSP (or vice versa), or retained the > intended meaning. (Note that the Unicode term "encoding scheme" corresponds to the IETF/W3C term "encoding".) I understand this to mean that if we wish to *require* <EF BB BF> to be interpreted as a BOM in a UTF-8 document (as I think we clearly do) we must spell the requirement out in the XML Recommendations and cannot rely on inheriting it from Unicode. In the case of a document entity, there is no ambiguity: U+FEFF cannot appear at the beginning. For an external entity, however, U+FEFF *can* appear at the beginning. | |||
Resolution |
|
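The ambiguity discussed in PE158 shows up directly in common decoders. As an illustration, Python's standard codecs offer both readings of a leading `<EF BB BF>`: the `utf-8` codec keeps it as U+FEFF, while `utf-8-sig` discards it as a signature. The sample bytes are an assumption chosen for the example.

```python
import codecs

# A UTF-8 byte stream that begins with EF BB BF.
data = codecs.BOM_UTF8 + b"<doc/>"

# Reading 1: retain the bytes and interpret them as U+FEFF (ZWNBSP).
kept = data.decode("utf-8")

# Reading 2: treat the bytes as a BOM and discard them.
dropped = data.decode("utf-8-sig")

print(repr(kept))
print(repr(dropped))
```

An XML processor that inherited the first reading would see a character before the document element's start tag, which is why the message argues that the requirement has to be spelled out in the XML Recommendations themselves.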
PE159 | No < in Attribute Values | |||
---|---|---|---|---|
| ||||
Problem statement |
From Norm Walsh: In reviewing a test case, I discovered there was some confusion about this WFC: Well-formedness constraint: No < in Attribute Values The replacement text of any entity referred to directly or indirectly in an attribute value MUST NOT contain a <. The person who wrote the test case concluded that this WFC made the following document not well-formed: <!DOCTYPE foo [ <!ENTITY x "&lt;"> <foo attr="&x;"/> I wonder if there's somewhere else in the spec that makes this clear, or if we want to consider an editorial clarification. | |||
Resolution | Append the following at the end of Appendix D (in XML 1.0) or Appendix C (in XML 1.1):
|
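As an illustration of the PE159 question (not part of the appended text), the Python sketch below feeds two variants of the test document to the expat-based parser in the standard library. It assumes the entity in the test case was declared with a reference to the predefined entity `lt`, and contrasts that with a declaration containing a literal `<`. The helper name `attr_value` and both document strings are hypothetical, and the behaviour shown is that of one parser, not a ruling on the WFC.

```python
import xml.etree.ElementTree as ET


def attr_value(doc: str):
    """Return the root element's `attr` value, or None if parsing fails."""
    try:
        return ET.fromstring(doc).get("attr")
    except ET.ParseError:
        return None


# Entity value written as a reference to the predefined entity lt.
via_lt = '<!DOCTYPE foo [<!ENTITY x "&lt;">]><foo attr="&x;"/>'

# Entity value containing a literal "<".
literal = '<!DOCTYPE foo [<!ENTITY x "<">]><foo attr="&x;"/>'

print(attr_value(via_lt))
print(attr_value(literal))
```

The first variant is the one where the replacement text of `x` holds a reference rather than a `<` character, which is the distinction the test case author appears to have missed.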
PE160 | Relax XML 1.0 rules for names | ||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| |||||||||||||||||||||
Problem statement |
The idea is to change the rules for element and attribute names of XML 1.0 to match those of XML 1.1. | ||||||||||||||||||||
Resolution |
Add a new Appendix as follows: J Suggestions for XML Names (Non-Normative) The following suggestions define what is believed to be best practice in the construction of XML names used as element names, attribute names, processing instruction targets, entity names, notation names, and the values of attributes of type ID, and are intended as guidance for document authors and schema designers. All references to Unicode are understood with respect to a particular version of the Unicode Standard greater than or equal to 5.0; which version should be used is left to the discretion of the document author or schema designer. The first two suggestions are directly derived from the rules given for identifiers in Standard Annex #31 (UAX #31) of the Unicode Standard, version 5.0, and exclude all control characters, enclosing nonspacing marks, non-decimal numbers, private-use characters, punctuation characters (with the noted exceptions), symbol characters, unassigned codepoints, and white space characters. The other suggestions are mostly derived from Appendix B in previous editions of this specification.
|
PE161 | LEIRIs | |||
---|---|---|---|---|
| ||||
Problem statement |
The idea is to push the specification of how to process non-ASCII URIs in public identifiers out of the XML spec, to the upcoming revision of the IRI RFC. | |||
Proposed resolution |
|
PE162 | XML 1.1 version numbers | ||||
---|---|---|---|---|---|
| |||||
Problem statement |
XML 1.1 processors should attempt to parse documents with version numbers of the form "1.x", for any x > 0. | ||||
Discussion | It was decided 2008-01-02 to drop this PE. | ||||
Resolution |
|
PE163 | XML 1.0 version numbers | ||||
---|---|---|---|---|---|
| |||||
Problem statement |
XML 1.0 processors should attempt to parse documents with version numbers of the form "1.x", for any x ≥ 0. | ||||
Resolution |
|
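One way to read the PE163 problem statement is as a pattern check on the version number before parsing proceeds. The Python sketch below assumes that "1.x" means the literal `1.` followed by one or more decimal digits. That reading, the pattern, and the function name `is_1x_version` are assumptions made for illustration and are not wording from this log.

```python
import re

# Assumed reading of "1.x": "1." followed by one or more decimal digits.
VERSION_1X = re.compile(r"1\.[0-9]+")


def is_1x_version(version: str) -> bool:
    """Return True if the whole string has the form 1.x under that reading."""
    return VERSION_1X.fullmatch(version) is not None


for candidate in ("1.0", "1.1", "1.17", "2.0", "1.", "1.x"):
    print(candidate, is_1x_version(candidate))
```

A processor following this approach would attempt to parse any document whose version number passes the check, and would treat everything else as it does today.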
PE164 | New Appendix J: Suggestions for XML Names | |||
---|---|---|---|---|
| ||||
Problem statement |
From John Cowan: The currently proposed Appendix J consists of suggestions for sensible XML 1.0 5th Edition names, and is directly cloned from XML 1.1. This text needs revision to bring it up to speed with Unicode. We are now in Unicode 5.0 rather than 3.0; 3.0 has been obsolete since March 2002. Unicode 5.0 also has a different way of recommending default identifiers which I propose we adopt: the basic idea "Use common sense" is still the same. Change the reference from Unicode 3.0 to Unicode 5.0. Change suggestion 1 to read: The first character of any name SHOULD have a Unicode property of ID_Start, or be one of the characters listed in the table entitled "Characters for Natural Language Identifiers" in UAX #31, an integral part of the Unicode Standard that is published separately. Change suggestion 2 to read: Characters other than the first SHOULD have the Unicode property ID_Continue, or be one of the characters listed in the table entitled "Characters for Natural Language Identifiers" in UAX #31, an integral part of the Unicode Standard that is published separately. The table in question includes hyphen, period, colon, and middle dot, as well as various script-specific characters with similar significance. The normative references in Section A.1 need some adjustments as well. Add: The Unicode Consortium. The Unicode Standard, Version 5.0.0, defined by: The Unicode Standard, Version 5.0 (Boston, MA, Addison-Wesley, 2007. ISBN 0-321-48091-0) The obsolete references to Unicode 2.0 and 3.2 don't do anything useful now that we no longer care about the Unicode 2.0 repertoire, so all references to Unicode throughout the Recommendation should be consolidated on this version. | |||
Resolution | The part about Appendix J is integrated in PE160. The Unicode references are addressed here.
Remove the [Unicode3] entry. |
Last updated $Date: 2009/09/15 18:26:26 $ by $Author: fyergeau $