Background to Changes in XML 1.0 5th Edition

Status of this Document

On 6 February 2008, W3C published Extensible Markup Language (XML) 1.0 (Fifth Edition) as a Proposed Edited Recommendation. This document provides context for the primary change to the definition of names, bringing one major benefit of XML 1.1 into XML 1.0.

Rationale for Primary Change

Since XML 1.1 became a W3C Recommendation in February 2004 (with a Second Edition in August 2006), there has been substantial uptake of it as a peer of XML 1.0 in new and ongoing W3C work. This is appropriate, as XML 1.1 was explicitly not designed to replace XML 1.0, but to supplement it for the benefit of various groups against which XML 1.0 had unjustly, but unintentionally, discriminated.

However, there are very few XML 1.1 documents in the wild. The XML Core WG believes this to be the result of a vicious circle, in which widely distributed XML parsers do not support 1.1 because the parser authors believe that few document authors will use it. This becomes a self-fulfilling prophecy, as those who would benefit from XML 1.1 are rightfully concerned that documents written in it will not be widely acceptable.

After considering various other means by which to achieve the main goal of XML 1.1, that is, to deliver on XML's original promise of universality across all the world's languages, the XML Core WG proposes to change XML 1.0 to relax the restrictions on names. This provides in XML 1.0 the major end-user benefit currently achievable only by using XML 1.1, and completes the decoupling of XML from specific versions of Unicode.

To quote the XML 1.1 Recommendation:

The W3C's XML 1.0 Recommendation was first issued in 1998, and despite the issuance of many errata culminating in a Third Edition of 2004, has remained (by intention) unchanged with respect to what is well-formed XML and what is not. This stability has been extremely useful for interoperability. However, the Unicode Standard on which XML 1.0 relies for character specifications has not remained static, evolving from version 2.0 to version 4.0 and beyond. Characters not present in Unicode 2.0 may already be used in XML 1.0 character data. However, they are not allowed in XML names such as element type names, attribute names, enumerated attribute values, processing instruction targets, and so on. In addition, some characters that should have been permitted in XML names were not, due to oversights and inconsistencies in Unicode 2.0.

The overall philosophy of names has changed since XML 1.0. Whereas XML 1.0 provided a rigid definition of names, wherein everything that was not permitted was forbidden, XML 1.1 names are designed so that everything that is not forbidden (for a specific reason) is permitted. Since Unicode will continue to grow past version 4.0, further changes to XML can be avoided by allowing almost any character, including those not yet assigned, in names.

Since then, Unicode has continued its efforts to add scripts and characters in order to improve or add support for the world's languages and writing systems. This effort is by no means complete. The changes since the XML 1.0 name character inventory was fixed encompass a variety of additions to the Unicode standard, and include support for:

additional scripts, including Ethiopic, Cherokee, Canadian Syllabics, Khmer, Mongolian, Yi, Philippine, New Tai Lue, Buginese, Syloti Nagri, N'Ko, and Tifinagh
many additional Han ideographs (used predominantly for Chinese)
additional characters for scripts that were incompletely understood at the time Unicode 2.0 was released, notably scripts native to South Asia (see Richard Ishida's recent blog entry for more details)

The proposed change to XML 1.0 will relax the restrictions on names, which are used not only for element and attribute names but also for identifiers and enumerated attribute values. Those who prefer to retain the constraints on names from the previous version of XML 1.0 in their documents will be free to do so, while those who wish to use names that incorporate these additional characters will now be able to.
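As an illustration of what the relaxed rule means in practice, the following is a minimal Python sketch of a name check. The code point ranges are transcribed from the NameStartChar and NameChar productions of XML 1.0 (Fifth Edition) and should be verified against the Recommendation before being relied on; the helper names (`is_name_start_char`, `is_name_char`, `is_xml_name`) are hypothetical and are not taken from any existing library.

```python
# Sketch only: ranges transcribed from the XML 1.0 (Fifth Edition)
# NameStartChar / NameChar productions; verify against the Recommendation.
# All function names here are hypothetical.

NAME_START_RANGES = [
    (0x3A, 0x3A),        # ":"
    (0x41, 0x5A),        # A-Z
    (0x5F, 0x5F),        # "_"
    (0x61, 0x7A),        # a-z
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF),
    (0x370, 0x37D), (0x37F, 0x1FFF),
    (0x200C, 0x200D), (0x2070, 0x218F),
    (0x2C00, 0x2FEF), (0x3001, 0xD7FF),
    (0xF900, 0xFDCF), (0xFDF0, 0xFFFD),
    (0x10000, 0xEFFFF),  # includes code points not yet assigned
]

# Characters allowed after the first position, in addition to the above.
NAME_EXTRA_RANGES = [
    (0x2D, 0x2E),        # "-" and "."
    (0x30, 0x39),        # 0-9
    (0xB7, 0xB7),        # middle dot
    (0x0300, 0x036F),    # combining diacritical marks
    (0x203F, 0x2040),
]


def _in_ranges(ch, ranges):
    """Return True if the code point of ch falls in any (low, high) range."""
    cp = ord(ch)
    return any(low <= cp <= high for low, high in ranges)


def is_name_start_char(ch):
    return _in_ranges(ch, NAME_START_RANGES)


def is_name_char(ch):
    return is_name_start_char(ch) or _in_ranges(ch, NAME_EXTRA_RANGES)


def is_xml_name(s):
    """Check s against Name ::= NameStartChar (NameChar)*."""
    if not s:
        return False
    return is_name_start_char(s[0]) and all(is_name_char(c) for c in s[1:])
```

Under this sketch, a name written in Ethiopic such as `is_xml_name("\u1230\u120b\u121d")` is accepted because its characters fall in a permitted range, even though Ethiopic was not in Unicode 2.0, while `is_xml_name("1abc")` is still rejected because a digit cannot start a name. Expressing the rule as a short list of broad ranges, rather than an enumeration of individual characters, is what lets it stay stable as Unicode grows.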

Henry S. Thompson