W3C

XML 1.1

W3C Working Draft 25 April 2002

This Version:
http://www.w3.org/TR/2002/WD-xml11-20020425/
Latest Version:
http://www.w3.org/TR/xml11/
Previous Version:
http://www.w3.org/TR/2001/WD-xml11-20011213/
Editors:
John Cowan, Reuters <jcowan@reutershealth.com>

Abstract

This document describes XML 1.1, a deliverable of the XML Core Working Group as defined in the XML Blueberry Requirements. XML 1.1 was formerly known as XML Blueberry. This document takes the form of a series of alterations to the XML 1.0 Recommendation [XML1.0], and its numbered sections correspond to those of the XML 1.0 Recommendation. Sections of that Recommendation that do not appear in this document remain unchanged in XML 1.1.

Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. The latest status of this document series is maintained at the W3C.

This is a Last Call Working Draft of the XML Core Working Group (member only), for review by W3C members and other interested parties. This document has been produced as part of the XML Activity, and may eventually be advanced toward W3C Recommendation status.

Being a Working Draft document, this specification may be updated, replaced, or obsoleted by other documents at any time. The test cases described and referred to in this document may also be updated, replaced, or obsoleted at any time. It is therefore inappropriate to use W3C Working Drafts as reference material or to cite them as other than "work in progress". A list of current W3C working drafts can be found at http://www.w3.org/TR/.

This draft document will be considered by the W3C and its members according to W3C process. This document is made public for the purpose of receiving comments that inform the W3C membership and staff on issues likely to affect the implementation, acceptance, and adoption of XML 1.1.

While this and subsequent drafts of this specification will be written as a series of alterations to the XML 1.0 Recommendation to facilitate editing and review, it is likely that the final XML 1.1 Recommendation will take the form of an integral revision of the XML 1.0 specification.

We explicitly invite comments on this draft. The Last Call review period ends at 2359Z on 28 June 2002. Comments should be sent to www-xml-blueberry-comments@w3.org. This is the preferred method of providing feedback. Public comments and their responses can be accessed at http://lists.w3.org/Archives/Public/www-xml-blueberry-comments/.



Introduction

The W3C's XML 1.0 Recommendation was first issued in 1998, and despite the issuance of many errata culminating in a Second Edition of 2000, has remained (by intention) unchanged with respect to what is well-formed XML and what is not. This stability has been extremely useful for interoperability. However, the Unicode Standard on which XML 1.0 relies for character specifications has not remained static, evolving from version 2.0 to version 3.1 and beyond. Characters not present in Unicode 2.0 may already be used in XML 1.0 character data. However, they are not allowed in XML names such as element type names, attribute names, enumerated attribute values, processing instruction targets, and so on. In addition, some characters that should have been permitted in XML names were not, due to oversights and inconsistencies in Unicode 2.0.

The overall philosophy of names has changed since XML 1.0. Whereas XML 1.0 provided a rigid definition of names, wherein everything that was not permitted was forbidden, XML 1.1 names are designed so that everything that is not forbidden (for a specific reason) is permitted. Since Unicode will continue to grow past version 3.1, further changes to XML can be avoided by allowing almost any character, including those not yet assigned, in names.

In addition, XML 1.0 attempts to adapt to the line-end conventions of various modern operating systems, but discriminates against the conventions used on IBM and IBM-compatible mainframes. As a result, XML documents on mainframes are not plain text files according to the local conventions. XML 1.0 documents generated on mainframes must either violate the local line-end conventions, or employ otherwise unnecessary translation phases before parsing and after generation. Allowing straightforward interoperability is particularly important when data stores are shared between mainframe and non-mainframe systems (as opposed to being copied from one to the other). For completeness, the Unicode line separator character, #x2028, is also supported.

A new XML version, rather than a set of errata to XML 1.0, is being created because the changes affect the definition of well-formed documents. XML 1.0 processors must continue to reject documents that contain new characters in XML names or new line-end conventions. The distinction between XML 1.0 and XML 1.1 documents will be indicated by the version number information in the XML declaration at the start of each document.

2.1 Well-Formed XML Documents

Add the following point to the definition of "well-formed", renumbering point 3 to point 4:

3. It meets the character normalization constraints given in section 2.13.

2.3 Common Syntactic Constructs

Change production [3] to read:

 [3]    S    ::=    (#x9 | #x20 | #xA | #xD | #x85 | #x2028)+

Change the preceding text to read:

S (white space) consists of one or more space (#x20), tab (#x9), carriage return (#xD), line feed (#xA), newline (#x85), or Unicode line separator (#x2028) characters.
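As an illustration only, the revised production [3] can be checked with a short Python sketch. The names `S_PATTERN` and `is_xml11_whitespace` are hypothetical and are not defined by this specification.

```python
import re

# Production [3] as revised above:
#   S ::= (#x9 | #x20 | #xA | #xD | #x85 | #x2028)+
# S_PATTERN is a hypothetical name used only for this sketch.
S_PATTERN = re.compile(r"[\x09\x20\x0A\x0D\x85\u2028]+")


def is_xml11_whitespace(text: str) -> bool:
    """Return True if the whole string matches production [3]."""
    return S_PATTERN.fullmatch(text) is not None
```

Under this sketch, a string made only of #x85 and #x2028 counts as white space, while the empty string and a no-break space (#xA0) do not.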

Change production [4], and add new production [4a]:

 [4]    NameStartChar    ::=   ":" | [A-Z] | "_" | [a-z] | [#xC0-#x02FF] | [#x0370-#x037D] | [#x037F-#x2027] | [#x202A-#x218F] | [#x2800-#xD7FF] | [#xE000-#xFDCF] | [#xFDF0-#xFFEF] | [#x10000-#x10FFFF]

 [4a]    NameChar    ::=   NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F]

Change production [5] to:

 [5]    Name    ::=   NameStartChar NameChar*
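The following Python sketch shows one way productions [4], [4a], and [5] could be tested against a candidate string. The code point ranges are copied from the productions above; the function and constant names are hypothetical, and the sketch is not a normative implementation.

```python
# Ranges copied from production [4] (NameStartChar) in this draft.
NAME_START_RANGES = [
    (0x3A, 0x3A),        # ":"
    (0x41, 0x5A),        # [A-Z]
    (0x5F, 0x5F),        # "_"
    (0x61, 0x7A),        # [a-z]
    (0xC0, 0x02FF),
    (0x0370, 0x037D),
    (0x037F, 0x2027),
    (0x202A, 0x218F),
    (0x2800, 0xD7FF),
    (0xE000, 0xFDCF),
    (0xFDF0, 0xFFEF),
    (0x10000, 0x10FFFF),
]

# Additional characters allowed by production [4a] (NameChar).
NAME_EXTRA_RANGES = [
    (0x2D, 0x2D),        # "-"
    (0x2E, 0x2E),        # "."
    (0x30, 0x39),        # [0-9]
    (0xB7, 0xB7),        # MIDDLE DOT
    (0x0300, 0x036F),
]


def _in_ranges(ch: str, ranges) -> bool:
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ranges)


def is_name_start_char(ch: str) -> bool:
    return _in_ranges(ch, NAME_START_RANGES)


def is_name_char(ch: str) -> bool:
    return is_name_start_char(ch) or _in_ranges(ch, NAME_EXTRA_RANGES)


def is_name(text: str) -> bool:
    """Production [5]: Name ::= NameStartChar NameChar*"""
    return (
        len(text) > 0
        and is_name_start_char(text[0])
        and all(is_name_char(ch) for ch in text[1:])
    )
```

For example, `is_name("a-1.b")` holds, while `is_name("1st")` and `is_name("-x")` do not, because a digit or hyphen is a NameChar but not a NameStartChar.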

Insert the following three paragraphs after production 5:

The first character of a Name must be a NameStartChar, and any remaining characters must be NameChars; this mechanism is used to prevent names from beginning with Latin (ASCII) digits or with basic combining characters. Almost all characters are permitted in names, except those which either are or reasonably could be used as delimiters. The intention is to be inclusive rather than exclusive, so that writing systems not yet encoded in Unicode can be used in XML names. See Appendix B for suggestions on the creation of names.

Document authors are encouraged to use names which are meaningful words or combinations of words in natural languages, and to avoid symbolic or whitespace characters in names. Note that COLON, HYPHEN-MINUS, FULL STOP (period), LOW LINE (underscore), and MIDDLE DOT are explicitly permitted.

The ASCII symbols and punctuation marks, along with a fairly large group of Unicode symbol characters, are excluded from names because they are more useful as delimiters in contexts where XML names are used outside XML documents; excluding this group gives those contexts a hard guarantee about what cannot be part of an XML name. The character #x037E, GREEK QUESTION MARK, is excluded because when normalized it becomes a semicolon, which could change the meaning of entity references.
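The behaviour of GREEK QUESTION MARK under normalization can be observed with Python's standard `unicodedata` module. This is a small demonstration, assuming Unicode Normalization Form C as the normalization in question; the helper name `nfc` is hypothetical.

```python
import unicodedata

GREEK_QUESTION_MARK = "\u037e"


def nfc(text: str) -> str:
    """Apply Unicode Normalization Form C (hypothetical helper for this sketch)."""
    return unicodedata.normalize("NFC", text)


# U+037E has a canonical decomposition to U+003B SEMICOLON,
# so normalizing it yields an ordinary semicolon.
normalized = nfc(GREEK_QUESTION_MARK)
print(repr(normalized))
```

If #x037E were allowed in names, text such as `&a` followed by #x037E could turn into something that reads as the entity reference `&a;` once normalized.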

Change production [7] to:

 [7]    Nmtoken    ::=   NameChar+

2.8 Prolog and Document Type Declaration

Change "1.0" everywhere to "1.1".

Add the following paragraph:

XML 1.1 processors should accept XML 1.0 documents as well. If a document is well-formed or valid XML 1.0, and provided it is fully normalized as defined in Section 2.13, it may be made well-formed or valid XML 1.1 respectively simply by changing the version number.

2.11 End-of-Line Handling

Replace the second paragraph with:

To simplify the tasks of applications, the characters passed to an application by the XML processor must be as if the XML processor normalized all line breaks in external parsed entities (including the document entity) on input, before parsing, by translating all of the following to a single #xA character:

2.13 Normalization Checking [NEW]

In order to be well-formed, all XML parsed entities (including document entities) must be fully normalized as per the definition of [Charmod] supplemented by the following definitions of relevant constructs for XML:

It is a fatal error for a parsed entity not to be in fully normalized form. XML processors must, at user option, verify that each input entity is in fully normalized form; the option to not verify must be chosen only when the input text is certified, as defined by [Charmod].

The verification of full normalization must be carried out by first verifying that the entity is in include-normalized form as defined by [Charmod] and by then verifying that none of the relevant constructs listed above begins (after character references are expanded) with a composing character as defined by [Charmod]. Non-validating processors must ignore the possible denormalizations that would be caused by inclusion of external entities that they do not read.

Note: The composing characters are all Unicode characters of non-zero combining class, plus a small number of class-zero characters that nevertheless take part as a non-initial character in certain Unicode canonical decompositions. Since these characters are meant to follow base characters, restricting relevant constructs (including content) to not begin with a composing character does not meaningfully diminish the expressiveness of XML.
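A partial check for a leading composing character might be sketched as follows in Python. It covers only the characters of non-zero combining class mentioned in the note; the small set of class-zero composing characters defined by [Charmod] is deliberately not handled, and the function name is hypothetical.

```python
import unicodedata


def begins_with_combining_mark(text: str) -> bool:
    """Return True if the first character has a non-zero canonical combining class.

    Partial sketch: class-zero composing characters (see [Charmod]) are NOT detected.
    """
    return len(text) > 0 and unicodedata.combining(text[0]) != 0
```

A construct whose text starts with COMBINING ACUTE ACCENT (#x0301) would be flagged by this sketch, whereas the same accent following a base letter would not.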

If, while verifying full normalization, a processor encounters characters for which it cannot determine the normalization properties (e.g., characters introduced in a version of [Unicode] later than the one used in the implementation of the processor), then the processor may, at user option, ignore any possible denormalizations caused by these characters. The option to ignore those denormalizations should not be chosen by applications when reliability or security is critical.

Note that the normalization requirements may, in principle, make certain XML 1.0 documents not well-formed XML 1.1. The Working Group believes that such documents are either nonexistent or vanishingly rare in practice.

XML processors must not transform the input to be in fully normalized form.

The purpose of this section is to require XML processors to ensure that the creators of XML documents have properly normalized them, so that XML applications can make tests such as identity comparisons of strings without having to worry about the different possible "spellings" of characters which Unicode allows.
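The problem of multiple "spellings" can be illustrated with a short Python example. It assumes Unicode Normalization Form C as a stand-in for the normalization required by [Charmod], and the helper `same_after_nfc` is hypothetical. It shows why un-normalized input defeats identity comparison; it is not something an XML processor itself would do, since processors check normalization rather than apply it.

```python
import unicodedata

precomposed = "caf\u00e9"    # "é" as a single code point
decomposed = "cafe\u0301"    # "e" followed by COMBINING ACUTE ACCENT

# A plain identity comparison treats the two spellings as different strings.
naive_equal = precomposed == decomposed


def same_after_nfc(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC (illustration only)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

Here `naive_equal` is false although both strings display identically, while `same_after_nfc(precomposed, decomposed)` is true. Requiring normalized input lets applications rely on the plain comparison.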

The Core WG seeks guidance from the community on the appropriateness of this provision for XML 1.1.

3.3.3 Attribute-Value Normalization 

Add #x85 and #x2028 to the lists "(#x20, #xD, #xA, #x9)" and "(#xD, #xA or #x9)".
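The effect of the extended list on the white space step of attribute-value normalization can be sketched in Python as below. The sketch covers only the replacement of white space characters by #x20; handling of character references, entity references, and the further processing of non-CDATA attributes described in XML 1.0 section 3.3.3 is out of scope. All names are hypothetical.

```python
# White space characters replaced by #x20 during attribute-value normalization,
# i.e. the XML 1.0 list (#x20, #xD, #xA, #x9) plus #x85 and #x2028.
ATTR_WHITESPACE = "\x20\x0D\x0A\x09\x85\u2028"

_TO_SPACE = str.maketrans({ch: " " for ch in ATTR_WHITESPACE})


def replace_attr_whitespace(value: str) -> str:
    """Replace each white space character in the value with a single #x20."""
    return value.translate(_TO_SPACE)
```

Each white space character maps to exactly one space, so runs of white space are preserved in length by this step.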

4.3.4 Version Information in Entities [NEW]

Each entity, including the document entity, can be separately declared as XML 1.0 or XML 1.1. The version declaration appearing in the document entity determines the version of the document as a whole, and therefore the rules (XML 1.0 or XML 1.1) applied to it. An XML 1.1 document may invoke XML 1.0 external entities, so that otherwise duplicated versions of external entities, particularly DTD external subsets, need not be maintained. However, in such a case the rules of XML 1.1 are applied to the entire document.

If an entity (including the document entity) is not labeled with a version number, it is treated as if labeled as version 1.0.
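The defaulting rule can be sketched in Python as follows. The regular expression is a deliberate simplification of the XML declaration and text declaration grammar, it assumes the entity text has already been decoded to a string, and the names `_DECL_VERSION` and `declared_version` are hypothetical.

```python
import re

# Simplified match for a leading declaration such as <?xml version="1.1" ...?>.
# A real processor would follow the full XMLDecl / TextDecl productions.
_DECL_VERSION = re.compile(r"""^<\?xml\s+version\s*=\s*(["'])([^"']*)\1""")


def declared_version(entity_text: str) -> str:
    """Return the declared version, or "1.0" if the entity carries no version label."""
    match = _DECL_VERSION.match(entity_text)
    return match.group(2) if match else "1.0"
```

With this sketch, an entity that begins directly with markup, or whose declaration gives only an encoding, is treated as version 1.0.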

Appendix A References

Add the following normative references:

[XML1.0]
Tim Bray, Jean Paoli, C.M. Sperberg-McQueen, Eve Maler (editors), Extensible Markup Language (XML) 1.0 (Second Edition), 6 October 2000. (See http://www.w3.org/TR/REC-xml.)
[Charmod]
Martin J. Dürst, François Yergeau, Richard Ishida, Misha Wolf, Asmus Freytag, Tex Texin, Character Model for the World Wide Web, W3C Working Draft, 20 February 2002. (See http://www.w3.org/TR/charmod/.)

Appendix B Suggestions for XML Names (Non-Normative)

Appendix B is to be changed from a normative appendix called "Character Classes" to a non-normative one called "Suggestions for XML Names", with the following content.

The following suggestions define what is believed to be best practice in the construction of XML names used as element names, attribute names, processing instruction targets, entity names, notation names, and the values of attributes of type ID, and are intended as guidance for document authors and schema designers. All references to Unicode must be understood with respect to a particular version of the Unicode Standard greater than or equal to 3.0; which version should be used is left to the discretion of the document author or schema designer.

The first two suggestions are directly derived from the rules given for identifiers in the Unicode Standard, version 3.0, and exclude all control characters, enclosing nonspacing marks, non-decimal numbers, private-use characters, punctuation characters (with the noted exceptions), symbol characters, unassigned codepoints, and whitespace characters. The other suggestions are mostly derived from XML 1.0 Appendix B.