W3C

XML Blueberry Requirements

W3C Working Draft 21 September 2001

This version:
http://www.w3.org/TR/2001/WD-xml-blueberry-req-20010921
Latest version:
http://www.w3.org/TR/xml-blueberry-req
Previous version:
http://www.w3.org/TR/2001/WD-xml-blueberry-req-20010620
Editor:
John Cowan, Reuters

Abstract 

This document lists the design principles and requirements for the Blueberry revision of the XML Recommendation, a limited revision of XML 1.0 being developed by the World Wide Web Consortium's XML Core Working Group solely to address character set issues.

Status of this document

This is a W3C Working Draft produced as a deliverable of the XML Core WG according to its charter and the current XML Activity process. A list of current W3C working drafts and notes can be found at http://www.w3.org/TR.

This document is a work in progress representing the current consensus of the W3C XML Core Working Group. It is published for review by W3C members and other interested parties. Publication as a Working Draft does not imply endorsement by the W3C membership. Comments should be sent to www-xml-blueberry-comments@w3.org, which is an automatically and publicly archived email list.

Table of Contents

1. Introduction
2. Design Principles
3. Requirements
4. References

1. Introduction

The W3C's XML 1.0 Recommendation [XML] was first issued in 1998, and despite the issuance of many errata culminating in a Second Edition of 2000, has remained (by intention) unchanged with respect to what is well-formed XML and what is not. This stability has been extremely useful for interoperability. However, the Unicode Standard [Unicode] on which XML 1.0 relies has not remained static, evolving from version 2.0 to version 3.1. Characters present in Unicode 3.1 but not in Unicode 2.0 may be used in XML character data.  However, they are not allowed in XML names such as element type names, attribute names, enumerated attribute values, processing instruction targets, and so on. In addition, some characters that should have been permitted in XML names were not, due to oversights and inconsistencies in Unicode 2.0.

As a result, fully native-language XML markup is not possible in at least the following languages: Amharic, Burmese, Canadian aboriginal languages, Cherokee, Dhivehi, Hakka Chinese (Bopomofo script), Khmer, Minnan Chinese (Bopomofo script), Mongolian (traditional script), Oromo, Syriac, Tigre, and Yi, because the characters required to write these languages did not exist in Unicode 2.0.  In addition, Chinese (particularly as used in Hong Kong) and Japanese can make use in XML names of only a subset of their complete character repertoires.
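
As a rough illustration of the gap described above, the following Python sketch flags characters in a candidate XML name that fall in Unicode blocks introduced after Unicode 2.0. The block table is a hand-picked, non-normative assumption covering scripts used by the languages listed above, and the function name `chars_added_after_unicode_2_0` is hypothetical. It is not the XML 1.0 name production and does not decide well-formedness.

```python
# Illustrative, non-normative table of Unicode blocks added after Unicode 2.0.
# Assumption: block boundaries are used as a coarse approximation; individual
# code points inside a block may be unassigned.
POST_2_0_BLOCKS = [
    (0x0700, 0x074F, "Syriac"),
    (0x0780, 0x07BF, "Thaana"),               # used for Dhivehi
    (0x1000, 0x109F, "Myanmar"),              # used for Burmese
    (0x1200, 0x137F, "Ethiopic"),             # used for Amharic, Oromo, Tigre
    (0x13A0, 0x13FF, "Cherokee"),
    (0x1400, 0x167F, "Unified Canadian Aboriginal Syllabics"),
    (0x1780, 0x17FF, "Khmer"),
    (0x1800, 0x18AF, "Mongolian"),
    (0x31A0, 0x31BF, "Bopomofo Extended"),    # used for Minnan and Hakka
    (0x3400, 0x4DBF, "CJK Unified Ideographs Extension A"),
    (0xA000, 0xA48F, "Yi Syllables"),
    (0x20000, 0x2A6DF, "CJK Unified Ideographs Extension B"),
]


def chars_added_after_unicode_2_0(name):
    """Return (character, block name) pairs for characters in the table above."""
    hits = []
    for ch in name:
        cp = ord(ch)
        for low, high, block in POST_2_0_BLOCKS:
            if low <= cp <= high:
                hits.append((ch, block))
                break
    return hits
```

Under these assumptions, a name written in Ethiopic script such as `"\u1230\u120b\u121d"` is flagged on every character, while an ASCII name such as `"price"` yields an empty list.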

The point has been made that many of these languages can be written using other scripts, notably the Latin script, which makes transliterated native markup possible.  However, exactly the same argument applies to many languages (for example, Greek) that were already fully encoded in Unicode 2.0.  Discriminating against languages simply because their scripts were not encoded in Unicode 2.0 is inherently unjust.  In addition, working with transliteration is far more painful for native readers and writers than working with the native script.

In addition, XML 1.0 attempts to adapt to the line-end conventions of various modern operating systems, but discriminates against the conventions used on IBM and IBM-compatible mainframes.  As a result, XML documents on mainframes are not plain text files according to the local conventions.  XML 1.0 documents generated on mainframes must either violate the local line-end conventions, or employ otherwise unnecessary translation phases before parsing and after generation.  Allowing straightforward interoperability is particularly important when data stores are shared between mainframe and non-mainframe systems (as opposed to being copied from one to the other).
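
The line-end issue can be sketched in Python as follows. The normalization shown follows XML 1.0 section 2.11, which turns CR LF pairs and lone CR characters into a single LF. This document does not name the mainframe line-end character; the sketch assumes it is NEL (U+0085), and the function names are hypothetical.

```python
# Assumption: the mainframe line-end character is NEL (U+0085).
NEL = "\u0085"


def normalize_line_ends_xml10(text):
    """Apply XML 1.0 line-end handling: CR LF and lone CR become LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def count_lines_xml10(text):
    """Count lines as an XML 1.0 processor would see them after normalization."""
    return len(normalize_line_ends_xml10(text).split("\n"))
```

With this sketch, `count_lines_xml10("a\r\nb")` sees two lines, but `count_lines_xml10("a" + NEL + "b")` sees only one, because NEL passes through untouched. That is the kind of mismatch with local conventions the paragraph above describes.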

A new XML version, rather than a set of errata to XML 1.0, is being created because the change affects the definition of well-formed documents.  XML 1.0 processors must continue to reject documents that contain new characters in XML names or new line-end conventions. It is presumed that the distinction between XML 1.0 and XML Blueberry will be indicated by the XML declaration.
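
A minimal Python sketch of how a processor might read the version label from the XML declaration is shown below. It is a simplification for illustration only: it assumes the document is already decoded to a string with no byte order mark, it ignores the encoding and standalone pseudo-attributes, and the function names are hypothetical. The label that XML Blueberry documents would carry is not specified here, so any value other than "1.0" is simply treated as "not XML 1.0".

```python
import re

# Matches an XML declaration at the very start of the document and captures
# the quoted value of the version pseudo-attribute.
_XML_DECL_VERSION = re.compile(r"""<\?xml\s+version\s*=\s*(["'])(.*?)\1""")


def declared_version(document):
    """Return the declared version string, or None if there is no declaration."""
    match = _XML_DECL_VERSION.match(document)
    return match.group(2) if match else None


def uses_xml10_rules(document):
    """True if the document is unlabeled or labeled as version 1.0."""
    version = declared_version(document)
    return version is None or version == "1.0"
```

A processor built along these lines would keep applying the XML 1.0 name and line-end rules whenever `uses_xml10_rules` returns true, and would apply the revised rules only to documents that declare a different, supported version.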

2. Design Principles

  1. The XML 1.0 goals listed in section 1.1 of the XML Recommendation are reaffirmed.

  2. XML Blueberry documents shall permit the full and straightforward use of writing systems supported by Unicode 3.1.

  3. XML Blueberry documents shall permit the full and straightforward use of operating environments that support Unicode 3.1.

  4. The changes required for XML 1.0 processors to also process XML Blueberry shall be as few and as small as possible.

3. Requirements

  1. XML Blueberry documents shall allow the use within XML names of all Unicode 3.1 characters, insofar as appropriate for XML.

  2. XML Blueberry documents shall support the line-end conventions associated with Unicode 3.1, insofar as appropriate for XML.

  3. The working group shall consider the issue of future updates to Unicode.

  4. The working group shall consider the issue of W3C normalization as expressed in the W3C Character Model [CharMod].

  5. In creating XML Blueberry, the working group shall not consider any revisions to XML 1.0 except those needed to accomplish these requirements.

4. References

CharMod
W3C (World Wide Web Consortium). Character Model for the World Wide Web (work in progress). [Cambridge, MA]. http://www.w3.org/TR/charmod
XML
W3C (World Wide Web Consortium). Extensible Markup Language (XML) Recommendation. Version 1.0, 2nd edition. [Cambridge, MA]. http://www.w3.org/TR/REC-xml
Unicode
The Unicode Consortium. The Unicode Standard, Version 3.1. [Reading, MA: Addison-Wesley Developers Press, 2000]. http://www.unicode.org