W3C Architecture Domain XML

Running log of potential errata for XML 1.0 4th edition, XML 1.1 2nd edition and XML 1.0 5th edition

This document is an internal working document of the XML Core WG. It maintains a running log of potential errata to the XML 1.0 spec, 4th edition (dated 2006-08-16), to the XML 1.1 spec, 2nd edition (dated 2006-08-16) and to the XML 1.0 spec, 5th edition (dated 2008-11-26). It is therefore the successor to the Running log of potential errata for XML 1.0 3rd edition and XML 1.1 1st edition. It is meant to be a living document, frequently updated as new errata are discovered and as they are disposed of by the WG.

When a potential erratum is resolved, its entry in this document is moved to the Resolved cases section and, if appropriate (it is a real erratum, not a false alarm or a request for enhancement that cannot be resolved by an erratum), the official XML 1.0 4th edition errata page, the official XML 1.1 2nd edition errata page or the official XML 1.0 5th edition errata page is updated.

Potential errata

PE165 Add note on Unicode normalization
Status: To be examined
Category: Editorial
Impacts: 1.0
Problem statement

From Addison Phillips:

Dear XML Core WG,
          
I am writing on behalf of both the Internationalization Core WG and the HTML Coordination Group (HCG).

Recently there has been an extensive discussion of normalization in W3C specifications, mainly related
to handling of element and attribute names and values (as in CSS3 Selectors). Some of this discussion
revolves around how Unicode normalization should work with XML and XML-derived specifications,
hence I was actioned by HCG [0] to contact you folks.

I produced a general summary of the Unicode normalization problem at [1] for the HCG. Those unfamiliar
with Unicode normalization may wish to review that message.

The basic question is whether XML can (or should?) take a clearer stance on Unicode normalization. At
present, XML 1.0 5e, like its predecessors, does not require any particular normalization form; it says
nothing about whether canonical equivalents in Unicode are "equal" from an XML point of view; and thus
implies that Unicode canonical equivalence does *not* apply when considering an XML document's
formation. The recommendations in Appendix J (which does include normalization among its
suggestions) further suggest that this is true.

On the other hand, it seems reasonable to suppose that Unicode canonical equivalence might apply to
XML. Processes such as transcoding legacy charsets to Unicode might result in
canonically-equivalent-but-unequal code point sequences, for example. 

In a survey done at I18N's behest, our Unicode liaison (Mark Davis) produced a survey of content of the
Web, as well as a summary on performance [2], which found that 99.98% of Web HTML content was,
in fact, in Unicode form NFC. It seems reasonable to suppose that XML content and documents would
follow a similar pattern. 

Our questions to XML Core WG, thus, are:

What, precisely, should XML say with regard to Unicode canonical equivalence?

Would it be possible to require or allow canonical equivalents to be treated as identical directly in XML
(and not merely as a side effect of other specifications)?

Is there a problem if XML permits/requires canonically-equivalent-yet-different sequences to be treated
as distinct if other specifications require/allow canonical equivalence to be recognized?

The Internationalization Core WG would be happy to work with you on these thorny issues. Please advise
if you need more information, consultation, participation, or just need to vent :-).

Kind Regards,

Addison (for I18N/HCG)
        
Proposed resolution
Section 2.2 Characters

Add a Note at the very end of the section as follows:

Note:

[Unicode] (conformance clause C06) says that canonically equivalent sequences of characters ought to be treated as identical. However, XML parsed entities (including document entities) that are canonically equivalent according to Unicode but which use distinct code point (character) sequences are considered distinct by XML processors. Therefore, all XML parsed entities SHOULD be created in a "fully normalized" form per [CharMod-Norm]. Otherwise the user might unknowingly create canonically equivalent but unequal sequences that appear identical to the user but which are treated as distinct by XML processors.

A document can still be well-formed, even if it is not in a normalized form. XML processors MAY verify that the document being processed is in a fully-normalized form and report to the application whether it is or not.
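
As an illustration only (not part of the proposed note), the following Python sketch shows two canonically equivalent spellings of the same text surviving parsing as distinct code point sequences, and an optional check of the kind the note allows a processor to make. It assumes Python's standard unicodedata and xml.etree.ElementTree modules. The helper name is_nfc is hypothetical, and an NFC test is only an approximation of the "fully normalized" condition defined in [CharMod-Norm].

```python
import unicodedata
import xml.etree.ElementTree as ET

composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# Both spellings are legal character data; the parser reports them unchanged.
root = ET.fromstring(f"<r><a>{composed}</a><b>{decomposed}</b></r>")
a_text, b_text = root[0].text, root[1].text

# Distinct as code point sequences, so a plain string comparison fails.
distinct_code_points = a_text != b_text

# Canonically equivalent: both have the same NFC form.
same_after_nfc = (
    unicodedata.normalize("NFC", a_text) == unicodedata.normalize("NFC", b_text)
)


def is_nfc(text: str) -> bool:
    """Hypothetical helper: True if text is already in Unicode NFC."""
    return unicodedata.normalize("NFC", text) == text


# A processor could report this to the application without rejecting the document.
report = {"a": is_nfc(a_text), "b": is_nfc(b_text)}
```

Under these assumptions the document stays well-formed either way; only the report to the application differs.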

Section A.2 Other References

Add a reference to CharMod-Norm:

CharMod-Norm
W3C Working Draft. Character Model for the World Wide Web 1.0: Normalization. François Yergeau, Martin J. Dürst, Richard Ishida, Addison Phillips, Misha Wolf, Tex Texin. (See http://www.w3.org/TR/charmod-norm/.)
Rationale

 

Resolved and (when appropriate) published

PE122 Revisiting E15 (from second edition errata)
Status: Resolved, not an erratum
Category: Substantive
Impacts: 1.0 1.1
Problem statement

From Jonathan Marsh:

I've been looking into E15 more fully.  Besides the backward
compatibility cost, there appears to be an implementability issue.

Microsoft parsers allow empty and element-only content to contain entity
references as long as those references expand to whitespace or to
nothing.  To do otherwise involves a substantial reworking of the parser
implementation strategy in use, making such a change very expensive, as
well as breaking any documents previously relying on this behavior
(though the number of such documents is likely to be small).

Our implementation difficulties are surfaced in the spec through an
obvious inconsistency in the spec.  Validation of attributes is done
after entity expansion (according to E20), but prior to character entity
expansion in elements (according to E15).  There appears to be no clear
reason why these contexts must differ.

Microsoft parsers accept all documents conformant to this erratum, but
may also accept some documents (which are unlikely to occur in the wild)
which do not conform to the constraints of this erratum.  In particular,
we fail the following test cases (by parsing each document without
error):

E15a.xml:

<!DOCTYPE foo [
<!ELEMENT foo EMPTY>
<!ENTITY empty "">
]>
<foo>&empty;</foo>

E15g.xml:

<!DOCTYPE foo [
<!ELEMENT foo (foo*)>
]>
<foo><foo/>&#32;<foo/></foo>

E15h.xml

<!DOCTYPE foo [
<!ELEMENT foo (foo*)>
<!ENTITY space "&#38;#32;">
]>
<foo><foo/>&space;<foo/></foo>
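
As an illustration only, the three test documents can be fed to a non-validating parser to see what the application receives. The sketch below assumes Python's standard xml.etree.ElementTree module (built on Expat), which does not validate, so it says nothing about whether E15 makes the documents invalid; it only shows that all three are well-formed and what content is reported. The helper name summary is hypothetical.

```python
import xml.etree.ElementTree as ET

E15A = '''<!DOCTYPE foo [
<!ELEMENT foo EMPTY>
<!ENTITY empty "">
]>
<foo>&empty;</foo>'''

E15G = '''<!DOCTYPE foo [
<!ELEMENT foo (foo*)>
]>
<foo><foo/>&#32;<foo/></foo>'''

E15H = '''<!DOCTYPE foo [
<!ELEMENT foo (foo*)>
<!ENTITY space "&#38;#32;">
]>
<foo><foo/>&space;<foo/></foo>'''

# A non-validating parser accepts all three: they are well-formed.
root_a = ET.fromstring(E15A)
root_g = ET.fromstring(E15G)
root_h = ET.fromstring(E15H)


def summary(root):
    """Hypothetical helper: (child count, text after each child element)."""
    return len(root), [child.tail for child in root]


# E15g and E15h report the same content: two children separated by one space.
summary_g = summary(root_g)
summary_h = summary(root_h)
```

Whether the space arrives through a character reference or through an entity whose replacement text is a character reference makes no difference to the content reported here; the E15 question is purely one of validity.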
Discussion

From Jonathan Marsh:

> 3) Microsoft is problematic because unclear, Microsoft needs to tell us
> what issue exactly they have with E15.

A description of the problem with E15 is at
http://lists.w3.org/Archives/Member/w3c-xml-core-wg/2003OctDec/0227.html.

The main objections are:
1) difficulty of implementation
2) entity expansion in attributes and elements is treated inconsistently (E20 vs. E15)
3) not backward compatible with deployed versions of MSXML

I note that failure to support E15 does not affect the infoset of the parsed document.


I would like to propose a solution, but I'm actually having trouble understanding
the erratum and how it led to the test cases we fail.  Perhaps somebody could help
me understand the test cases better.

In http://www.w3.org/TR/2003/PER-xml-20031030/PER-xml-20031030-review.html#elementvalid,
I find:

"... however, a reference to an internal entity with a literal value consisting
of character references expanding to white space does match S, since its
replacement text is the white space resulting from expansion of the character
references."

I think this specifically makes the following test case valid:

  <!DOCTYPE foo [
  <!ELEMENT foo (foo*)>
  <!ENTITY space "&#32;">
  ]>
  <foo><foo/>&space;<foo/></foo>

If that is the case, it is hard to see why the test cases we have problems
with are not valid:

E15g.xml:
  <!DOCTYPE foo [
  <!ELEMENT foo (foo*)>
  ]>
  <foo><foo/>&#32;<foo/></foo>

E15h.xml
  <!DOCTYPE foo [
  <!ELEMENT foo (foo*)>
  <!ENTITY space "&#38;#32;">
  ]>
  <foo><foo/>&space;<foo/></foo>

And for consistency the similar situation for EMPTY content:

E15a.xml:
  <!DOCTYPE foo [
  <!ELEMENT foo EMPTY>
  <!ENTITY empty "">
  ]>
  <foo>&empty;</foo>

From Richard Tobin:

I think the idea is that all the pointless possibilities that can be ruled
out, are ruled out.

> I think this specifically makes the following test case valid:
>
>   <!DOCTYPE foo [
>   <!ELEMENT foo (foo*)>
>   <!ENTITY space "&#32;">
>   ]>
>   <foo><foo/>&space;<foo/></foo>

Yes.  Entity references to space-separated sequences of elements have to be
valid, and this is just an empty sequence.  The character reference can't
be ruled out because it's gone by the time you know what the entity is
being used for.

> If that is the case, it is hard to see why the test cases we have
> problems with are not valid:
>
> E15g.xml:
>   <!DOCTYPE foo [
>   <!ELEMENT foo (foo*)>
>   ]>
>   <foo><foo/>&#32;<foo/></foo>
>
> E15h.xml
>   <!DOCTYPE foo [
>   <!ELEMENT foo (foo*)>
>   <!ENTITY space "&#38;#32;">
>   ]>
>   <foo><foo/>&space;<foo/></foo>

Well, if you regard it as a deficiency that the first case can't be ruled
out, then it's a deficiency that does not apply to these cases.

> And for consistency the similar situation for EMPTY content:
>
> E15a.xml:
>   <!DOCTYPE foo [
>   <!ELEMENT foo EMPTY>
>   <!ENTITY empty "">
>   ]>
>   <foo>&empty;</foo>

There is no good use for an entity reference in an EMPTY element, so
it can be ruled out.
Resolution

There is nothing we can do about that.

 

PE139 Changing XML spec to use IRIs for system IDs
Status: Resolved, not an erratum
Category: Substantive
Impacts: 1.0 1.1
Problem statement

From Richard Tobin:

In 4.2.2, replace the paragraph beginning "System identifiers (and other ..."
and the following 3-item list with:

System identifiers (and other XML strings meant to be used as URI
references) are converted to URI references as described in [IRIs RFC
3987].  They MAY contain characters that, according to [new URIs
RFC3986], must be escaped before a URI can be used to retrieve the
referenced resource.  XML processors MUST escape them as described in
section 3.1 of [IRIs RFC 3987].

We may want to include this existing text:

 Since escaping is not always a fully reversible process, it MUST be
 performed only when absolutely necessary and as late as possible in a
 processing chain. In particular, neither the process of converting a
 relative URI to an absolute one nor the process of passing a URI
 reference to a process or software component responsible for
 dereferencing it SHOULD trigger escaping.

but I am uncertain as to whether [Base URI] in the infoset should have
them escaped.
Discussion

From the minutes of 2005-04-20:

But [Richard] points out that he doesn't suggest we make this
change until and unless we change the references to 2396
to 3986.

Richard suggests we defer this erratum for now.

CONSENSUS to defer this erratum for now.
Resolution

2007-12-05: Superseded by PE161.

 

PE150 ISO 639 and 3166
Status: Published 2007-08-15
Category: Substantive
Impacts: 1.0 1.1
Problem statement

From Addison Phillips:

1. Section 1.1 contains this paragraph:

--
This specification, together with associated standards (Unicode 
[Unicode] and ISO/IEC 10646 [ISO/IEC 10646] for characters, Internet RFC 
3066 [IETF RFC 3066] for language identification tags, ISO 639 [ISO 639] 
for language name codes, and ISO 3166 [ISO 3166] for country name 
codes), provides all the information necessary to understand XML Version 
1.0 and construct computer programs to process it.
-- 

I think the references to ISO 639 and ISO 3166 should be replaced with a 
reference to the IANA Language Subtag Registry. I note that this is the 
only place that these references appear.

2. I recognize that there is difficulty in replacing the reference to 
RFC 3066 currently, even though that document is now obsolete, since 
3066bis has not been published by the RFC Editor.

3. Section A2 contains a reference [IANA-LANGCODES] which is not 
referenced anywhere. Furthermore, this reference is to the now obsolete 
and closed Language Tag registry. Changing it to the Language Subtag 
registry would be more appropriate.

The new URL is: http://www.iana.org/assignments/language-subtag-registry
Discussion

The successor to RFC 3066 has now been published as RFC 4646.

Resolution
Section 1.1 Origin and Goals

Amend the first paragraph after the list of goals, so that it reads:

This specification, together with associated standards (Unicode [Unicode] and ISO/IEC 10646 [ISO10646] for characters, Internet BCP 47 [RFC1766] and the Language Subtag Registry [IANA-LANGCODES] for language identification tags), provides all the information necessary to understand XML Version [1.0 | 1.1] and construct computer programs to process it.
Section A.1 Normative References

Change the [IETF RFC 3066] entry so that it points to IETF BCP 47.

Section A.2 Other References

Change the [IANA-LANGCODES] entry so that it points to the new registry at http://www.iana.org/assignments/language-subtag-registry.

Rationale
RFC 3066 has been replaced by RFC 4646 and RFC 4647, collectively known as IETF BCP 47. The old registry pointed to by the IANA-LANGCODES entry is now stale and closed. With the new registry, reference to ISO 639 and ISO 3166 is no longer necessary (and may even be harmful in the future, because of stability concerns).

 

PE151 missing 16 non-characters
Status: Published 2007-08-15
Category: Substantive
Impacts: 1.0 1.1
Problem statement

From Frank Ellermann:

Hi, the page http://www.w3.org/TR/REC-xml/#charsets mentions in
a note, that some characters are discouraged, in essence all C1
controls excl. NEL and the 66 non-characters. 

The 66 non-characters consist of 17 planes * 2 (??FFFE, ??FFFF)
and 32 u+FDD0 up to u+FDEF.  This XML 1.0 document says u+FDDF
instead of u+FDEF.

The page http://www.unicode.org/Public/UNIDATA/DerivedAge.txt
claims that these 32 non-characters were introduced together in
Unicode version 3.1.   The normative [Unicode3] reference in
XML 1.0 4th ed. is based on Unicode version 3.2.
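
The arithmetic in the message above can be checked with a short Python sketch. It is an illustration only; the variable names are hypothetical, and the code point ranges are the ones quoted in the message.

```python
# Last two code points of each of the 17 planes: U+xxFFFE and U+xxFFFF.
plane_ends = [
    (plane << 16) | low
    for plane in range(17)
    for low in (0xFFFE, 0xFFFF)
]

# The contiguous block of non-characters, U+FDD0 through U+FDEF inclusive.
block = list(range(0xFDD0, 0xFDEF + 1))

# The range as printed in the spec, ending at U+FDDF instead of U+FDEF.
block_as_printed = list(range(0xFDD0, 0xFDDF + 1))

total = len(plane_ends) + len(block)            # 34 + 32
missing = len(block) - len(block_as_printed)    # code points left out by the typo
```

The totals come out to 66 non-characters overall, with 16 of them omitted by the range as printed.
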
Resolution
Section 2.2 Characters

Amend the [#xFDD0-#xFDDF] range in the list of discouraged characters to read [#xFDD0-#xFDEF].

Rationale
"#xFDDF" was a typo, as can be ascertained by consulting the Unicode standard in which the whole [#xFDD0-#xFDEF] range was introduced as a block in one version (3.1) to serve as "non-characters" for internal use.

 

PE152 not obligated to to
Status: Published 2007-09-25
Category: Editorial
Impacts: 1.0 1.1
Problem statement

From Dieter Köhler:

Section 4.1, WFC: Entity Declared, last paragraph:
"... non-validating processors are not obligated to to read ..."
should be changed to
"... non-validating processors are not obligated to read ..."
Resolution
Section 4.1 Character and Entity References

Remove a duplicate "to" from the last paragraph of the description of the "Entity Declared" WFC, so that it reads:

Note that non-validating processors are not obligated to read and process entity declarations occurring in parameter entities or in the external subset; for such documents, the rule that an entity must be declared is a well-formedness constraint only if standalone='yes'.
Rationale
This was a typo.

 

PE153 facilities /not/ related to validation
Status: Published 2007-09-25
Category: Editorial
Impacts: 1.1
Problem statement

From Dieter Köhler:

In section 5.2, last paragraph:
"Applications which require DTD facilities not related to validation (such 
as the declaration of default attributes and internal entities that are or 
may be specified in external entities ) SHOULD use validating XML processors."
the first "not" seems to be wrong and the sentence contains a superfluous 
space before the closing bracket. It should be changed to:
"Applications which require DTD facilities related to validation (such as 
the declaration of default attributes and internal entities that are or may 
be specified in external entities) SHOULD use validating XML processors."
Discussion

The "not" is actually intended and should not be removed.

The superfluous space is present only in 1.1 and should be removed.

Resolution

In the 1.1 spec only:

Section 5.2 Using XML Processors

In the last sentence of the last paragraph, remove a superfluous space just before a closing parenthesis.

Rationale
This was a typo.

 

PE154 linking to WFCs in prod 60
Status: Resolved, not an erratum
Category: Editorial
Impacts: 1.0 1.1
Problem statement

From Dieter Köhler:

It also appears to me inconvenient that in section 3.3.2, prod. 60 the 
"WFC: No < in Attribute Values" and "WFC: No External Entity References" 
have no visible clue, where the text of these WFCs can be found.  The WFCs 
are linked correctly, but no reference is available for the reader of a 
print-out of the spec.  Therefore I suggest amending the text of the links 
with "see prod. 41". For consistency, I would also like to suggest changing 
the order of the WFCs to match those of prod. 41 and moving them to the end 
of the list. The list of VCs and WFCs of prod. 60 should look like:
     [VC: Required Attribute]
     [VC: Attribute Default Value Syntactically Correct]
     [VC: Fixed Attribute Default]
     [WFC: No External Entity References, see prod. 41]
     [WFC: No < in Attribute Values, see prod. 41]
Resolution

Nice suggestion, but it turns out to be too difficult to implement, as the link text is generated by the stylesheet and there is no place to indicate the desired target (prod. 41 here).

 

PE155 prod. 68, VC Entity Declared
Status: Published 2008-01-18
Category: Editorial
Impacts: 1.0 1.1
Problem statement

From Dieter Köhler:

Section 4.1, prod. 68, VC Entity Declared:
>>In a document with an external subset or parameter entity references 
with " standalone='no'  ", the Name ...<<
Here the scope of the condition >>with " standalone='no' "<< is ambiguous. 
In order to be consistent with the WFC Entity Declared the condition must 
apply to both, "external subset" and "parameter entity references", because 
in a document with an external subset and standalone='yes' a missing entity 
declaration is a well-formedness error. However the wording allows two 
options: "In a document with (A or B) with C" or "In a document with A or 
(B with C)". Of course one can rule out the second option as false on 
carefully comparing the wording of the VC Entity Declared with that of the 
WFC Entity Declared. But it is not easy to figure it out.
However, there is a second problem: The condition of "standalone='no'" is 
equivalent to the condition that no standalone declaration exists, which 
can be inferred from the rule in section 2.9: "If there are external markup 
declarations but there is no standalone document declaration, the value 
'no' is assumed." For clarification it would be good to remind the reader 
of this rule, in particular because the Courier type face of the words 
"standalone='no'" puts an emphasis on an explicit standalone declaration 
which is not intended.
To summarize my suggestion, I would recommend that the sentence
>>In a document with an external subset or parameter entity references 
with " standalone='no'  ", the Name ...<<
should be changed to something like
>>For a document with "standalone = 'no'" or no standalone declaration, if 
this document has a DTD with an external subset or parameter entity 
references in its internal subset, the Name ...<<
Resolution
Section 4.1 Character and Entity References

Change the first sentence of the text of the Entity Declared VC as follows:

In a document with an external subset or parameter entity references, if the document is not standalone (either "standalone='no'" is specified or there is no standalone declaration), then the Name given in the entity reference MUST match that in an entity declaration.
Rationale
The existing wording was ambiguous and did not explicitly address the case of an absent standalone declaration.

 

PE156 Inclusion of external entities
Status: Published 2007-12-05
Category: Editorial
Impacts: 1.0 1.1
Problem statement

From Dieter Köhler:

Section 4.4.3:
"If the entity is external, and the processor is not attempting to validate 
the XML document, the processor MAY, but need not, include the entity's 
replacement text."
Should not the same apply if the entity is internal, but declared in the 
internal subset of a DTD after a reference to a parameter entity that the 
processor did not read? (See also 4.4.2 and the WFC Entity Declared of 
prod. 68.)
Resolution
Section 5.2 Using XML Processors

Amend the last sentence of the second item of the bulleted list so that it reads:

For example, a non-validating processor may fail to normalize attribute values, include the replacement text of internal entities, or supply default attribute values, where doing so depends on having read declarations in external or parameter entities, or in the internal subset after an unread parameter entity reference.
Rationale
Improve the informativeness of the example sentence.

 

PE157 UTF-16 and Byte Order Mark
Status: Published 2007-12-05
Category: Editorial
Impacts: 1.0 1.1
Problem statement

From Dieter Köhler:

Appendix F.1 of the XML specs presents examples about how to automatically 
detect the encoding of an entity from the first characters of an XML 
encoding declaration without a byte order mark.  These examples include 
UTF-16BE and UTF-16LE. However, section 4.3.3 says that entities encoded in 
UTF-16 MUST begin with a byte order mark.

In the light of the examples it seems that the intention of the specs is to 
demand a UTF-16 byte order mark only when no XML declaration is used.  Is 
this interpretation of the specs correct?

If the answer is "yes", I would suggest to start the second paragraph of 
sect. 4.4.3 with: "In the absence of a text declaration (or an XML 
declaration respectively) entities encoded in UTF-16 MUST ..."

If the answer is "no", I would suggest to remove the two incriminated 
examples from Appendix F.1 and to add an appropriate warning.
Resolution
Section 4.3.3 Character Encoding in Entities

Change the second sentence of the first paragraph to read:

The terms "UTF-8" and "UTF-16" in this specification do not apply to related character encodings, including but not limited to UTF-16BE, UTF-16LE, or CESU-8.
Section F Autodetection of Character Encodings

Change the last sentence of the first paragraph to read:

We will consider these cases in turn.
Rationale
The former reading in 4.3.3 still caused some confusion, especially with respect to UTF-16BE and UTF-16LE.

 

PE158 UTF-8 BOM
Status: Published 2007-12-05
Category: Substantive
Impacts: 1.0 1.1
Problem statement

From John Cowan:

I took up the question of the UTF-8 BOM with the Unicode Technical
Committee after carefully reading what the Unicode Standard
versions 4.0 and 5.0 have to say on the subject, thus:

> > Am I correct in thinking that a conformant process that reads <EF
> > BB BF> from the beginning of a byte stream that purports to be in
> > the UTF-8 encoding scheme has the choice of discarding it as a BOM
> > or accepting it as a ZWNBSP?

I did not request a formal interpretative ruling, but Ken Whistler,
one of the leading lights of the UTC, replied as follows:

> I think in isolation, the answer to that would have to be
> formally, yes, because <EF BB BF> at the start of a UTF-8
> byte stream is ambiguous.
> 
> In a more complex context, where you could specify a conversion
> going on between UTF-8 and one or more UTF-16 or UTF-32-based
> encoding schemes, you could specify some instances where either
> operation (discarding and not interpreting, or retaining and
> interpreting as ZWNBSP) could be conformant or non-conformant.
> It would depend on whether the operation willy-nilly changed
> an intended BOM into a ZWNBSP (or vice versa), or retained the
> intended meaning.

(Note that the Unicode term "encoding scheme" corresponds to the
IETF/W3C term "encoding".)

I understand this to mean that if we wish to *require* <EF BB BF>
to be interpreted as a BOM in a UTF-8 document (as I think we clearly
do) we must spell the requirement out in the XML Recommendations and
cannot rely on inheriting it from Unicode.  In the case of a document
entity, there is no ambiguity: U+FEFF cannot appear at the beginning.
For an external entity, however, U+FEFF *can* appear at the beginning.
Resolution
Section 4.3.3 Character Encoding in Entities

Add a new paragraph following the second paragraph, to read:

If the replacement text of an external entity is to begin with the character U+FEFF, and no text declaration is present, then a Byte Order Mark MUST be present, whether the entity is encoded in UTF-8 or UTF-16.
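
As an illustration only, the ambiguity behind this requirement can be reproduced with Python's standard codecs. The sketch assumes that the "utf-8-sig" codec stands in for a reader that discards a leading <EF BB BF> as a BOM and that "utf-8" stands in for one that keeps it as U+FEFF; the variable names are hypothetical.

```python
import codecs

# Replacement text that is meant to begin with the character U+FEFF.
payload = "\ufeffdata"

# Without a BOM, the leading U+FEFF is byte-for-byte identical to a UTF-8 BOM.
without_bom = payload.encode("utf-8")
looks_like_bom = without_bom.startswith(codecs.BOM_UTF8)

# A reader that discards the signature loses the intended first character.
lost_character = without_bom.decode("utf-8-sig")

# With an explicit BOM in front, discarding the signature keeps the content intact.
with_bom = codecs.BOM_UTF8 + payload.encode("utf-8")
preserved = with_bom.decode("utf-8-sig")

# A reader that does not treat the signature as a BOM sees two U+FEFF characters.
kept_as_character = with_bom.decode("utf-8")
```

Only the variant with an explicit BOM survives a BOM-stripping reader unchanged, which is the situation the new paragraph addresses.
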
Rationale

 

PE159 No < in Attribute Values
Status: Published 2007-12-05
Category: Editorial
Impacts: 1.0 1.1
Problem statement

From Norm Walsh:

In reviewing a test case, I discovered there was some confusion about
this WFC:

  Well-formedness constraint: No < in Attribute Values

  The replacement text of any entity referred to directly or
  indirectly in an attribute value MUST NOT contain a <.

The person who wrote the test case concluded that this WFC made the
following document not well-formed:
            
<!DOCTYPE foo [
<!ENTITY x "&lt;">
]>
<foo attr="&x;"/>
                
I wonder if there's somewhere else in the spec that makes this clear,
or if we want to consider an editorial clarification.
Resolution

Append the following at the end of Appendix D (in XML 1.0) or Appendix C (in XML 1.1):

Section D Expansion of Entity and Character References

In the following example

<!DOCTYPE foo [ 
<!ENTITY x "&lt;"> 
]> 
<foo attr="&x;"/>

the replacement text of x is the four characters "&lt;" because references to general entities in entity values are bypassed. The replacement text of lt is a character reference to the less-than character, for example the five characters "&#60;" (see 4.6 Predefined Entities). Since neither of these contains a less-than character the result is well-formed.

If the definition of x had been

<!ENTITY x "&#60;">

then the document would not have been well-formed, because the replacement text of x would be the single character "<" which is not permitted in attribute values (see WFC: No < in Attribute Values).
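
As an illustration only, both variants can be tried against a real parser. The sketch assumes Python's standard xml.etree.ElementTree module (built on Expat); the variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Entity value uses the predefined entity: replacement text is "&lt;".
WITH_LT_ENTITY = '''<!DOCTYPE foo [
<!ENTITY x "&lt;">
]>
<foo attr="&x;"/>'''

# Entity value uses a character reference: replacement text is a literal "<".
WITH_CHAR_REF = '''<!DOCTYPE foo [
<!ENTITY x "&#60;">
]>
<foo attr="&x;"/>'''

# Well-formed: the attribute value ends up as the single character "<".
attr_value = ET.fromstring(WITH_LT_ENTITY).get("attr")

# Not well-formed: the parser is expected to reject the document.
try:
    ET.fromstring(WITH_CHAR_REF)
    rejected = False
except ET.ParseError:
    rejected = True
```

The first document yields an attribute value of "<", while the second is reported as a parse error, matching the explanation above.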

Rationale
This is an editorial clarification of a case that remained confusing.

 

PE160 Relax XML 1.0 rules for names
Status: Published 2008-01-18
Category: Substantive
Impacts: 1.0
Problem statement

The idea is to change the rules for element and attribute names of XML 1.0 to match those of XML 1.1.

Resolution
Section 2.3 Common Syntactic Constructs

Delete the following paragraph:

Characters are classified for convenience as letters, digits, or other characters. A letter consists of an alphabetic or syllabic base character or an ideographic character. Full definitions of the specific characters in each class are given in B Character Classes.

Replace the group of productions [4] to [8], including the "Names and Tokens" heading, with the following:

The first character of a Name MUST be a NameStartChar, and any other characters MUST be NameChars; this mechanism is used to prevent names from beginning with European (ASCII) digits or with basic combining characters. Almost all characters are permitted in names, except those which either are or reasonably could be used as delimiters. The intention is to be inclusive rather than exclusive, so that writing systems not yet encoded in Unicode can be used in XML names. See J Suggestions for XML Names for suggestions on the creation of names.

Document authors are encouraged to use names which are meaningful words or combinations of words in natural languages, and to avoid symbolic or white space characters in names. Note that COLON, HYPHEN-MINUS, FULL STOP (period), LOW LINE (underscore), and MIDDLE DOT are explicitly permitted.

The ASCII symbols and punctuation marks, along with a fairly large group of Unicode symbol characters, are excluded from names because they are more useful as delimiters in contexts where XML names are used outside XML documents; providing this group gives those contexts hard guarantees about what cannot be part of an XML name. The character #x037E, GREEK QUESTION MARK, is excluded because when normalized it becomes a semicolon, which could change the meaning of entity references.

Names and Tokens
[4]   NameStartChar   ::=   ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
[4a]   NameChar   ::=    NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
[5]   Name   ::=    NameStartChar (NameChar)*
[6]   Names   ::=    Name (#x20 Name)*
[7]   Nmtoken   ::=   (NameChar)+
[8]   Nmtokens   ::=    Nmtoken (#x20 Nmtoken)*
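
As an illustration only, productions [4], [4a] and [5] translate fairly directly into a regular expression. The sketch below assumes Python's standard re module; the names NAME_START, NAME_CHAR, NAME_RE and is_name are hypothetical and not part of the specification.

```python
import re

# Production [4] NameStartChar, written as the inside of a character class.
NAME_START = (
    ":A-Z_a-z"
    "\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF"
    "\u0370-\u037D\u037F-\u1FFF"
    "\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF"
    "\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD"
    "\U00010000-\U000EFFFF"
)

# Production [4a] NameChar: NameStartChar plus "-", ".", digits and a few ranges.
NAME_CHAR = NAME_START + r"\-.0-9" + "\u00B7\u0300-\u036F\u203F-\u2040"

# Production [5] Name ::= NameStartChar (NameChar)*
NAME_RE = re.compile(f"[{NAME_START}][{NAME_CHAR}]*")


def is_name(candidate: str) -> bool:
    """Hypothetical helper: True if candidate matches production [5]."""
    return NAME_RE.fullmatch(candidate) is not None


examples = {
    "foo": is_name("foo"),
    "_a-b.c1": is_name("_a-b.c1"),
    "1foo": is_name("1foo"),        # digits cannot start a name
    "-x": is_name("-x"),            # neither can HYPHEN-MINUS
    "a\u037eb": is_name("a\u037eb"),  # GREEK QUESTION MARK is excluded
}
```

Note how #x037E falls in the gap between [#x370-#x37D] and [#x37F-#x1FFF], which is how the productions exclude GREEK QUESTION MARK.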

Section B Character Classes

Replace the entire Appendix with the following


Add a new Appendix as follows:

J Suggestions for XML Names (Non-Normative)

The following suggestions define what is believed to be best practice in the construction of XML names used as element names, attribute names, processing instruction targets, entity names, notation names, and the values of attributes of type ID, and are intended as guidance for document authors and schema designers. All references to Unicode are understood with respect to a particular version of the Unicode Standard greater than or equal to 5.0; which version should be used is left to the discretion of the document author or schema designer.

The first two suggestions are directly derived from the rules given for identifiers in Standard Annex #31 (UAX #31) of the Unicode Standard, version 5.0, and exclude all control characters, enclosing nonspacing marks, non-decimal numbers, private-use characters, punctuation characters (with the noted exceptions), symbol characters, unassigned codepoints, and white space characters. The other suggestions are mostly derived from Appendix B in previous editions of this specification.

  1. The first character of any name SHOULD have a Unicode property of ID_Start, or else be '_' #x5F.

  2. Characters other than the first SHOULD have a Unicode property of ID_Continue, or be one of the characters listed in the table entitled "Characters for Natural Language Identifiers" in UAX #31, with the exception of "'" #x27 and "’" #x2019.

  3. Ideographic characters which have a canonical decomposition (including those in the ranges [#xF900-#xFAFF] and [#x2F800-#x2FFFD], with 12 exceptions) should not be used in names.

  4. Characters which have a compatibility decomposition (those with a "compatibility formatting tag" in field 5 of the Unicode Character Database -- marked by field 5 beginning with a "<") should not be used in names. This suggestion does not apply to #x0E33 THAI CHARACTER SARA AM or #x0EB3 LAO CHARACTER AM, which despite their compatibility decompositions are in regular use in those scripts.

  5. Combining characters meant for use with symbols only (including those in the ranges [#x20D0-#x20EF] and [#x1D165-#x1D1AD]) should not be used in names.

  6. The interlinear annotation characters ([#xFFF9-#xFFFB]) should not be used in names.

  7. Variation selector characters should not be used in names.

  8. Names which are nonsensical, unpronounceable, hard to read, or easily confusable with other names should not be employed.

 

PE161 LEIRIs
Status: Approved 2008-01-16, not yet published
Category: Substantive
Impacts: 1.0 1.1
Problem statement

The idea is to push the specification of how to process non-ASCII URIs in system identifiers out of the XML spec, to the upcoming revision of the IRI RFC.

Proposed resolution
Section 4.2.2 External Entities

Replace the second paragraph (starting with "System identifiers (and other XML strings meant ...") and the subsequent three-item list with the following:

System identifiers (and other XML strings meant to be used as URI references) are LEIRIs, as defined in Section 7 "Legacy Extended IRIs" of [IETF RFC 3987bis]. Failure to match the applicable syntax productions of [IETF RFC 3987bis] is not an error. Note, however, that system identifiers which do not conform to the LEIRI syntax are not in practice likely to be useful.
Section A.1 Normative References

Add a new entry:

IETF RFC 3987bis
IETF (Internet Engineering Task Force). RFC 3987bis: Internationalized Resource Identifiers (IRIs). M. Duerst, M. Suignard, 2008. (See http://www.ietf.org/rfc/rfc3987bis.txt.)
Rationale
The upcoming revision of RFC 3987 aims to provide a single referenceable spec for things like XML system identifiers that are not exactly IRIs but can be reduced to IRIs. This change refers to that spec instead of repeating similar language in many XML-related specs.

 

PE162 XML 1.1 version numbers
Status: Resolved, not an erratum
Category: Substantive
Impacts: 1.1
Problem statement

XML 1.1 processors should attempt to parse documents with version numbers of the form "1.x", for any x > 0.

Discussion
It was decided 2008-01-02 to drop this PE.
Resolution
Section 2.8 Prolog and Document Type Declaration

Alter production [26] so that it reads:

[26]   VersionNum   ::=   '1.' [1-9]

Add a new paragraph immediately after production [27] as follows:

Even though the VersionNum production matches any version number of the form '1.x', XML 1.1 documents SHOULD NOT specify a version number other than '1.1'.

 

PE163 XML 1.0 version numbers
Status: Published 2008-01-18
Category: Substantive
Impacts: 1.0
Problem statement

XML 1.0 processors should attempt to parse documents with version numbers of the form "1.x", for any x ≥ 0.

Resolution
Section 2.8 Prolog and Document Type Declaration

Alter production [26] so that it reads:

[26]   VersionNum   ::=   '1.' [0-9]+

Add a new paragraph immediately after production [27] as follows:

Even though the VersionNum production matches any version number of the form '1.x', XML 1.0 documents SHOULD NOT specify a version number other than '1.0'.

Note:

When an XML 1.0 processor encounters a document that specifies a 1.x version number other than '1.0', it will process it as a 1.0 document. This means that an XML 1.0 processor will accept 1.x documents provided they do not use any non-1.0 features.
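The amended production [26] and the behaviour described in the Note can be sketched as follows. This is an illustration only: the function name `effective_version` is hypothetical, and returning `None` for a non-matching string is an assumption of this sketch, not something the resolution specifies.

```python
import re

# Production [26] as amended: VersionNum ::= '1.' [0-9]+
VERSION_NUM = re.compile(r"1\.[0-9]+")


def effective_version(declared):
    """Return the version an XML 1.0 processor processes the document as.

    Any declared version matching production [26] is processed as '1.0';
    None is returned when the string does not match the production.
    """
    if VERSION_NUM.fullmatch(declared) is None:
        return None
    return "1.0"
```

Under this sketch, `effective_version("1.7")` yields `"1.0"`, while `effective_version("2.0")` yields `None` because the string does not match the production.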

 

PE164 New Appendix J: Suggestions for XML Names
Status: Published 2008-01-18
Category: Substantive
Impacts: 1.0
Problem statement

From John Cowan:

The currently proposed Appendix J consists of suggestions for sensible XML
1.0 5th Edition names, and is directly cloned from XML 1.1.  This text
needs revision to bring it up to speed with Unicode.  We are now in
Unicode 5.0 rather than 3.0; 3.0 has been obsolete since March 2002.
Unicode 5.0 also has a different way of recommending default identifiers which
I propose we adopt: the basic idea "Use common sense" is still the same.

Change the reference from Unicode 3.0 to Unicode 5.0.

Change suggestion 1 to read:

    The first character of any name SHOULD have a Unicode property
    of ID_Start, or be one of the characters listed in the table
    entitled "Characters for Natural Language Identifiers" in UAX
    #31, an integral part of the Unicode Standard that is published
    separately.

Change suggestion 2 to read:

    Characters other than the first SHOULD have the Unicode property
    ID_Continue, or be one of the characters listed in the table
    entitled "Characters for Natural Language Identifiers" in UAX
    #31, an integral part of the Unicode Standard that is published
    separately.

The table in question includes hyphen, period, colon, and middle dot,
as well as various script-specific characters with similar significance.

The normative references in Section A.1 need some adjustments as well.
Add:

    The Unicode Consortium. The Unicode Standard, Version 5.0.0,
    defined by: The Unicode Standard, Version 5.0 (Boston, MA,
    Addison-Wesley, 2007. ISBN 0-321-48091-0)

The obsolete references to Unicode 2.0 and 3.2 don't do anything
useful now that we no longer care about the Unicode 2.0 repertoire,
so all references to Unicode throughout the Recommendation should be
consolidated on this version.
Resolution

The part about Appendix J is integrated in PE160. The Unicode references are addressed here.

Section 2.2 Characters

In the paragraph following production [2], change the reference to [Unicode3] to point to [Unicode]:

The mechanism for encoding character code points into bit patterns may vary from entity to entity. All XML processors MUST accept the UTF-8 and UTF-16 encodings of Unicode 3.1 [Unicode]; the mechanisms for signaling which of the two is in use, or for bringing other encodings into play, are discussed later, in 4.3.3 Character Encoding in Entities.

Amend the first paragraph of the following Note to read:

Note:

Document authors are encouraged to avoid "compatibility characters", as defined in section 2.3 of [Unicode] (see also D21 in section 3.6 of [Unicode3]). The characters defined in the following ranges are also discouraged. They are either control characters or permanently undefined Unicode characters:

Section 4.2.2 External Entities

In the first item of the numbered list, change the reference to [Unicode3] to point to [Unicode]:

  1. Each character to be escaped is represented in UTF-8 [Unicode] as one or more bytes.

Section 4.3.3 Character Encoding in Entities

Amend the second paragraph to read:

Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY begin with the Byte Order Mark described by Annex H of [ISO10646-2000], section 16.8 of [Unicode], and section 2.7 of [Unicode3] (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF). This is an encoding signature, not part of either the markup or the character data of the XML document. XML processors MUST be able to use this character to differentiate between UTF-8 and UTF-16 encoded documents.
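As an illustration of using the Byte Order Mark as an encoding signature, the following minimal sketch inspects the first bytes of an entity. The function name `sniff_bom` and its return shape are hypothetical. It checks only the UTF-8 and UTF-16 signatures; other encodings (including UTF-32, whose little-endian signature begins with the same two bytes as UTF-16LE) are outside its scope.

```python
import codecs


def sniff_bom(data):
    """Return (encoding label, BOM length in bytes), or (None, 0) if no BOM is found."""
    if data.startswith(codecs.BOM_UTF8):        # EF BB BF
        return ("UTF-8", 3)
    if data.startswith(codecs.BOM_UTF16_BE):    # FE FF
        return ("UTF-16BE", 2)
    if data.startswith(codecs.BOM_UTF16_LE):    # FF FE
        return ("UTF-16LE", 2)
    return (None, 0)
```

Since the signature is not part of the markup or the character data, a caller would skip the reported number of bytes before decoding the rest of the entity.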

Amend the next-to-last paragraph to read:

It is a fatal error when an XML processor encounters an entity with an encoding that it is unable to process. It is a fatal error if an XML entity is determined (via default, encoding declaration, or higher-level protocol) to be in a certain encoding but contains byte sequences that are not legal in that encoding. Specifically, it is a fatal error if an entity encoded in UTF-8 contains any ill-formed code unit sequences, as defined in section 3.9 of Unicode 3.1 [Unicode]. Unless an encoding is determined by a higher-level protocol, it is also a fatal error if an XML entity contains no encoding declaration and its content is not legal UTF-8 or UTF-16.
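The UTF-8 requirement can be illustrated with a strict decoder. This is a sketch under stated assumptions: `FatalEncodingError` and `decode_utf8_entity` are hypothetical names, and Python's built-in strict UTF-8 codec stands in for the checks a real XML processor would perform. That codec rejects overlong forms, encoded surrogates and stray bytes.

```python
class FatalEncodingError(Exception):
    """Hypothetical error type standing in for an XML fatal error."""


def decode_utf8_entity(data):
    """Decode entity bytes as UTF-8, treating ill-formed sequences as fatal."""
    try:
        # errors="strict" makes the codec raise instead of substituting characters.
        return data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        raise FatalEncodingError(
            f"ill-formed UTF-8 at byte offset {exc.start}"
        ) from exc
```

The design point is that the decoder must not recover silently: replacing or dropping the offending bytes would hide what this paragraph defines as a fatal error.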
Section A.1 Normative References

Amend the [Unicode] entry to read:

Unicode
The Unicode Consortium. The Unicode Standard, Version 5.0.0, defined by: The Unicode Standard, Version 5.0 (Boston, MA, Addison-Wesley, 2007. ISBN 0-321-48091-0).

Remove the [Unicode3] entry.

Unicode3

 


Last updated $Date: 2009/09/15 18:26:26 $ by $Author: fyergeau $