A Notation for Character Collections for the WWW

W3C Note 14-January-2000

This version: http://www.w3.org/TR/2000/NOTE-charcol-20000114
Latest version: http://www.w3.org/TR/charcol
Previous versions: http://www.w3.org/TR/1999/NOTE-charcol-19991105
Author: Martin J. Dürst (W3C) <duerst@w3.org>

Abstract

An XML syntax for describing collections of characters is proposed. This allows character collections to be identified by URIs and thus to be referenced from other protocols and formats. The main usage areas for character collections are schemas, forms, and stylesheets. Several constructs, in particular kernels, hulls, and alternatives, are provided to allow incomplete specifications and to increase network efficiency.

Status of this document

This document is an initial proposal made available as a NOTE by the World Wide Web Consortium (W3C) for further discussion. This indicates no endorsement of its content by the W3C at this moment. Depending on the response from interested/affected W3C Working Groups and from the W3C Members at large, this NOTE will be assigned to a Working Group (potentially the Internationalization Working Group) for further work.

This document is an editorial revision of an earlier version. A list of changes can be found in Appendix D.

Please send comments on this document to www-international@w3.org for public discussion, to the mailing list of the W3C Internationalization Interest Group (members only) for W3C-internal discussion, and directly to the author at duerst@w3.org for editorial issues. Experience reports from experimental implementations are welcome, but the W3C will not allow early implementations to constrain its ability to make changes.

A list of current W3C technical reports and publications, including Working Drafts and Notes, can be found at http://www.w3.org/TR/.

1. Introduction
2. Specification of the Notation
    2.1 Overview
    2.2 Kernels and Hulls
    2.3 Alternatives
3. Examples and Applications
    3.1 Well-known collections
    3.2 Open collections
    3.3 Efficient network access and lazy evaluation
    3.4 Built-in collections
    3.5 Character properties as collections
    3.6 CSS2: The 'unicode-range' property
    3.7 Styling
    3.8 Form Input
    3.9 Schemas and Regular Expressions
Acknowledgements
References
Appendix A: Definitions of Operators
Appendix B: DTD
Appendix C: Definition of <range> contents
Appendix D: List of Changes

1. Introduction

A precise and concise specification of operations on characters and strings is often desirable. Many of these operations depend on character types, character classes, or character properties. To simplify specifications and operations, a common notation helps. This NOTE is an attempt to develop such a notation, based on the concept of character collections.

A character collection is a set of characters. Once a notation for character collections is defined and agreed upon, collections can be easily defined and referenced via web addresses (URIs) [URI], can be enumerated, and can be constructed from other collections using set operators.

By making character collections available via a URI, they become first-class web objects. The URI of a collection serves both as an identifier (or name) of the collection and as a way to obtain the description of the collection if necessary. There is no need to define names for collections separately from the URI.

For various reasons, it is often difficult to exactly specify a set of characters. To take care of such cases, hulls and kernels are introduced. A kernel contains characters that are guaranteed to be in the set; the set may contain other characters. A hull gives an outer boundary; characters which are not in the hull are guaranteed not to be in the set.

The syntax given in this NOTE limits characters to be codepoints in the Universal Character Set (UCS, [ISO10646], [Unicode]). The term 'character' correspondingly can be read as a synonym for 'UCS codepoint' throughout this NOTE. There may be reasons for extending the syntax to include e.g. combinations of UCS codepoints for specific applications. This is left for future discussions.

On the Web, network accessibility and efficiency are primary concerns. Alternative descriptions, lazy evaluation, built-in or well-known collections, as well as kernels and hulls can be used to address these concerns. Application areas for character collections include styling, forms, and schemas. These topics are discussed in Section 3.

The term character collection has been chosen here because character set is already used (and misused) in various contexts. See [Harm] for some background.

2. Specification of the Notation

2.1 Overview

The notation to specify collections uses XML [XML]. Each XML element in the syntax specifies a collection. Most elements are operators which combine a number of collections into a new collection. For the exact semantics, see the following sections and Appendix A. The DTD for a collection specification is given in Appendix B. The notation has the following elements:

<collection>: A root element for standalone collections. If a collection is embedded (using [Namespace]) in another XML document, <collection> may be omitted.
<range>: Used to define ranges/blocks of characters in a collection. The syntax is taken from CSS2 [CSS2], which provides easy notations for blocks and ranges. See Appendix C for details.
Note: Alternatively, blocks and ranges could be provided in XML notation, but this would probably be rather clumsy.
<enum>: Enumeration of characters representing themselves. Numeric character references can (of course) be used, because they are only syntactic sugar. XML whitespace/line breaking characters are used for appropriate presentation (i.e. to break long lines). For these as well as for characters not representable in XML (e.g. C0 control codes), use <range>, and combine with <union>. <range> can express everything that <enum> can, and is often more compact, but <enum> allows a more direct specification, without using a character standard document to look up codes.
<ref xlink:href="..."/>: Used to reference another collection using a URI. Well-known collections (see Section 3.1) defined by standard organizations and other parties should be assigned stable URIs. Software suppliers should publicize the collections that their products already know (built-in collections, Section 3.4), so that compact and efficient collection descriptions can be written. See Section 3 for more details on usage. For the exact syntax and the remaining attributes, see the DTD in Appendix B and the XLink Specification [XLink].
<union>: Operator to form the union of two or more collections. Convenient to add some characters to another collection, and to build collections of characters that satisfy at least one of several conditions.
<intersection>: Operator to form the intersection of two or more collections. Convenient to build collections of characters that satisfy various conditions at the same time.
<difference>: Operator to form the difference of two collections. Convenient to exclude some characters from another collection.
Note: The difference is not symmetric. The order of the operands matters. The difference consists of the characters in the first collection that are not in the second collection (for two collections A and B, usually denoted A-B).
Note: Instead of <difference>, a complement operator could be used.
Note: Should we add <empty/> and <all/> for completeness? Empty is not needed because it can be replaced by <range/> or <enum/>. <all/>, as well as <difference>, leads to the problem of defining the overall set.
<hull>: Operator to indicate that a collection serves as a hull of the actually intended collection. See below for more details.
<kernel>: Operator to indicate that a collection serves as a kernel of the actually intended collection. See below for more details.
<alt>: Used to combine several alternative descriptions. Convenient to combine a kernel and a hull, and to provide alternatives to cover network errors.

For details on how these elements can be combined, please see the DTD in Appendix B.
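As an illustration of how the exact (two-valued) elements fit together, the following is a minimal sketch in Python, not a reference implementation. The function names parse_range and contains are hypothetical, elements are assumed to be un-namespaced, and only plain "U+hex" and "U+hex-hex" items are handled; any other <range> syntax defined in Appendix C, as well as <ref>, <kernel>, <hull>, and <alt>, is left out.

```python
import xml.etree.ElementTree as ET

def parse_range(text):
    """Parse simple items such as 'U+20-7E, U+200C' into (low, high) pairs."""
    pairs = []
    for item in text.split(","):
        item = item.strip()
        if not item:
            continue
        # Drop the leading "U+", then split an optional "-high" part.
        low, _, high = item[2:].partition("-")
        pairs.append((int(low, 16), int(high or low, 16)))
    return pairs

def contains(element, ch):
    """Return True if character ch is in the collection described by element."""
    cp = ord(ch)
    kids = list(element)
    if element.tag == "collection":
        return contains(kids[0], ch)
    if element.tag == "range":
        return any(lo <= cp <= hi for lo, hi in parse_range(element.text or ""))
    if element.tag == "enum":
        # XML whitespace inside <enum> is presentation only, so it is ignored.
        return ch in "".join((element.text or "").split())
    if element.tag == "union":
        return any(contains(k, ch) for k in kids)
    if element.tag == "intersection":
        return all(contains(k, ch) for k in kids)
    if element.tag == "difference":
        return contains(kids[0], ch) and not contains(kids[1], ch)
    raise ValueError("unsupported element: " + element.tag)

# Digits plus the letters a, b, c, with b removed again.
doc = ET.fromstring(
    "<collection><difference>"
    "<union><range>U+30-39</range><enum>abc</enum></union>"
    "<enum>b</enum>"
    "</difference></collection>"
)
```

With this sketch, contains(doc, "5") and contains(doc, "a") are true, while contains(doc, "b") is false because the second operand of <difference> removes it.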

2.2 Kernels and Hulls

Because the repertoire of characters in the Universal Character Set is growing, characters may be continuously added to a set. In this case, it is impossible to specify the set exactly, but it is possible to specify that some characters are included, and others are excluded. Also, in some cases, collections may only be used as hints; in these cases, it is not necessary to specify a collection exactly. As an example, the CSS2 specification allows authors to indicate ranges of characters for which a given font may have appropriate glyphs; this is used to avoid downloading fonts that are completely unrelated to the characters that one tries to render.

For such cases, the concepts of hull and kernel are introduced. A kernel contains characters that are guaranteed to be in the collection; the collection may contain other characters. A hull gives an outer boundary so that characters which are not in the hull are guaranteed not to be in the collection; some characters in the hull may not actually be in the collection. This is shown in the following figure.

[Figure: an oval labeled 'kernel', inside a dotted shape representing the actual collection (X), surrounded by an oval labeled 'hull'.]

In the markup for collections, <kernel>A</kernel> should be read as '(X) has a kernel A', and not as 'A has a kernel'. Similarly, <hull>B</hull> should be read as '(X) has a hull B', and not as 'B has a hull'.

A collection may not be exactly known (that's when kernels and hulls have their use), but as a concept it is assumed to be well defined. On the other hand, for each collection, there are a huge number of collections that can serve as a hull or as a kernel. There is usually a trade-off between efficiency and precision.

<union>, <intersection>, and <difference> work according to set algebra (see e.g. [Gerstling]). Given a collection description only containing <union>, <intersection>, <difference>, <enum>, and <range>, for each character c, the question "Is c in the collection?" can be answered with "yes" or "no". Kernels and hulls introduce a third answer, namely "don't know". Under set operations, these answers behave in the usual way (see Appendix A).
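The three answers can be sketched in Python as follows. This is one possible reading of the tables in Appendix A, not something the NOTE prescribes: the answers are ordered as "no" < "don't know" < "yes", so that union becomes a maximum and intersection a minimum. All names (NO, UNKNOWN, YES, complement, and the operator functions) are assumptions made for illustration.

```python
# Answers ordered as NO < UNKNOWN < YES (an assumption of this sketch).
NO, UNKNOWN, YES = 0, 1, 2

def kernel(a):
    """Only a definite 'yes' survives; every other answer becomes unknown."""
    return YES if a == YES else UNKNOWN

def hull(a):
    """Only a definite 'no' survives; every other answer becomes unknown."""
    return NO if a == NO else UNKNOWN

def union(*answers):
    """A character is in the union as soon as one operand says 'yes'."""
    return max(answers)

def intersection(*answers):
    """A character is excluded as soon as one operand says 'no'."""
    return min(answers)

def complement(a):
    """Swap YES and NO; UNKNOWN stays UNKNOWN."""
    return 2 - a

def difference(a, b):
    """A - B, read as 'in A and not in B'."""
    return min(a, complement(b))
```

For example, union(UNKNOWN, NO) is UNKNOWN, while intersection(UNKNOWN, NO) is NO, in line with the corresponding table entries.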

2.3 Alternatives

Unions and intersections do not allow a kernel to be combined with a hull in the obvious way (i.e. "both apply"). For this, the <alt> operator is defined. If one of its operands "knows" something about a character, and the other doesn't, the result is taken from the operand that "knows". Collection X in the figure in Section 2.2 can be expressed using <alt>, <kernel>, and <hull>, as follows:

<collection>
  <alt>
    <kernel>A</kernel>
    <hull>B</hull>
  </alt>
</collection>

The <alt> operator also helps to deal with network access problems. As an example, assume that the collection we want to specify can be exactly and compactly specified in two ways, relying on two different network resources that are indicated with the <ref> element. Any of these network resources may be inaccessible, but the chance that both of them are inaccessible is lower than the chance that only one of them is inaccessible. In fact, implementations may have one or the other built in (see built-in collections, Section 3.4), in which case at least one of them is guaranteed to be accessible.

On usage of a collection description, the answer from a <ref> element that is not accessible is the same as e.g. for characters not in a kernel, i.e. "don't know". However, tools used to build collections should provide a way for somebody defining a collection to know whether all the referenced resources were accessible. This is very useful because the <alt> operator introduces a fourth potential answer, namely "error": if one of the operands of <alt> defines that a certain character is included in the overall collection, and another operand defines that this character is excluded, then there is a contradiction and therefore an error.
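The behaviour of <alt> described above can be sketched as follows. The names alt, ref_answer, and unreachable are hypothetical, the lookup callable stands in for whatever retrieval mechanism an implementation uses, and treating an operand that is already "error" as contagious is an assumption of this sketch rather than something stated in this NOTE.

```python
NO, UNKNOWN, YES, ERROR = "-", "?", "+", "error"

def alt(*answers):
    """Combine alternative descriptions of the same collection."""
    definite = {a for a in answers if a in (YES, NO)}
    # One operand says "included" and another says "excluded": contradiction.
    if ERROR in answers or len(definite) == 2:
        return ERROR
    if definite:
        return definite.pop()   # take the answer from the operand that "knows"
    return UNKNOWN

def ref_answer(lookup, uri, ch):
    """Answer for a <ref>; an inaccessible resource yields "don't know"."""
    try:
        members = lookup(uri)   # lookup: uri -> set of characters (hypothetical)
    except OSError:
        return UNKNOWN
    return YES if ch in members else NO

def unreachable(uri):
    """Stand-in for a network resource that cannot be accessed."""
    raise OSError("resource not accessible")
```

Here alt(YES, UNKNOWN) yields YES, alt(YES, NO) yields ERROR, and an inaccessible <ref> simply contributes UNKNOWN, so that the other operands of <alt> decide.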

3. Examples and Applications

This section gives some usage examples, starting with examples that are more geared towards general mechanisms and continuing with examples in particular application fields.

3.1 Well-known collections

Various organizations may define character collections for general reference. As a very prominent and important example, Technical Corrigendum No. 2 to ISO/IEC 10646-1:1993 [ISO10646] provides a facility to define collections that can be identified by a collection number and a collection name. A significant number of collections are already defined in Annex A of ISO/IEC 10646-1. These collections mainly refer to scripts (e.g. collection 11, ARMENIAN), subsets of scripts (e.g. collection 12, BASIC HEBREW), and ranges of related characters (e.g. collection 34, CURRENCY SYMBOLS; please note that this collection only includes currency symbols in the range U+20A0-20CF; other currency symbols, e.g. '$', are not included).

ISO/IEC 10646-1 identifies some of the collections as fixed. A fixed collection is exactly defined and does not change in the future. As an example, collection 1, BASIC LATIN, is a fixed collection, and can be expressed as:

<collection><range>U+20-7E</range></collection>

Another kind of well-known collections are the repertoires of coded character sets and character encodings (see e.g. the International Register of Coded Character Sets to be used with Escape Sequences [ISOIR] and the IANA 'charset' registry [IANA]).

Definitions of well-known collections should be made easily available on the Internet in a computer-processable form. This might be done preferably by the organization defining the collection or alternatively by another organization. The URIs chosen should be extremely stable, and should follow a uniform naming convention for a given series of collections.

3.2 Open collections

ISO/IEC 10646-1 [ISO10646] also has the concept of open collections. Open collections can include unassigned codepoints, and if characters are assigned to such a codepoint, they automatically become members of the collection. The notation of open collections is based on using a hull. As an example, collection 24 (MALAYALAM) can be defined as:

<collection><hull><range>U+D00-D7F, U+200C, U+200D</range></hull></collection>

This definition is somewhat imprecise because there are many characters which we actually know to be in the collection, while the above definition just tells us that they may be in the collection. This can be fixed in at least two ways. One way is to add a kernel that lists the currently defined characters:

<collection>
  <alt>
    <hull><range>U+D00-D7F, U+200C, U+200D</range></hull>
    <kernel>
      <range>
        U+D02-D03, U+D05-D0C, U+D0E-D10, U+D12-D28, U+D2A-D39,
        U+D3E-D43, U+D46-D48, U+D4A-D4D, U+D57, U+D60-D61, U+D66-D6F,
        U+200C, U+200D
      </range>
    </kernel>
  </alt>
</collection>

The other alternative is to assume the existence of a continuously updated collection containing all the currently defined characters:

<collection>
  <intersection>
    <range>U+D00-D7F, U+200C, U+200D</range>
    <ref href=" ... "/>
              <!-- pointer to collection of characters
                   currently defined in ISO/IEC 10646 -->
  </intersection>
</collection>

For network efficiency, the two solutions above can be combined with an <alt> operator. This is discussed in the next section.

3.3 Efficient network access and lazy evaluation

There are clearly large speed differences between accessing data locally and over the network, and in some cases, data only available over the network may not be available at all. The evaluation of collections should rely on standard network performance improvement mechanisms wherever possible, and should do this in a transparent way. In particular, caching is very useful to avoid repeated download of collection descriptions. It is expected that collection descriptions change extremely slowly, if at all. Servers serving collection data should be configured properly in order to let proxies and clients take full advantage of caching wherever possible.

Appropriate design of collection descriptions and implementations can also increase network efficiency. An implementation only needs to download a collection description referenced in a <ref> if it cannot decide otherwise whether a character in question is in the collection or not. This is called lazy evaluation. A collection definition can be written to help such implementations, by providing simple cutoffs. For example, assume that a collection definition contains an enumeration for all Han characters defined in JIS X 0208-1997 (the standard Japanese set of Kanji characters available on computers in Japan). This is about a third of the CJK range U+4E00-9FA5, sprinkled all over very irregularly. Rather than just referencing this collection with a <ref>, the following syntax should be used:

<collection>
  <alt>
    <hull><range>U+4E00-9FA5</range></hull>
    <ref href=" ... "/> <!-- pointer to JIS X 0208-1997 collection -->
  </alt>
</collection>
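The lazy evaluation described above can be sketched as follows. The class LazyRef, the function in_collection, the example URI, and the fake_fetch stand-in are all hypothetical; the three code points returned by fake_fetch are placeholder data, not the JIS X 0208 repertoire. The sketch returns a plain yes/no answer and does not check whether the referenced collection contradicts the hull, which a full <alt> evaluation would report as an error.

```python
class LazyRef:
    """A <ref> whose description is fetched only when it is first needed."""

    def __init__(self, uri, fetch):
        self.uri = uri
        self.fetch = fetch          # callable: uri -> set of code points
        self.members = None
        self.fetch_count = 0

    def contains(self, cp):
        if self.members is None:
            self.members = self.fetch(self.uri)
            self.fetch_count += 1
        return cp in self.members

def in_collection(cp, hull_low, hull_high, ref):
    """Evaluate <alt><hull>range</hull><ref/></alt> lazily."""
    if not (hull_low <= cp <= hull_high):
        return False                # outside the hull: decided without a download
    return ref.contains(cp)

def fake_fetch(uri):
    """Stand-in for a network download; returns placeholder sample data."""
    return {0x4E00, 0x4E01, 0x4E03}

ref = LazyRef("http://example.org/jisx0208.charcol", fake_fetch)
```

Testing a Latin letter such as U+0041 against this collection is decided by the hull alone and leaves ref.fetch_count at 0; only a code point inside U+4E00-9FA5 triggers the (single) download.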

It should be noted that instead of an <alt> containing a <hull>, an <intersection> could also be used here. An <intersection> should be used for true intersections (i.e. when as a result of the intersection, characters drop out from both collections). A hull should be used when the collection is indeed not completely known. Another relation between intersections and hulls is that it would be possible to replace the hull element by an intersection with an element representing an unknown collection, e.g.

<hull><range>U+4E00-9FA5</range></hull>

could be replaced by

<intersection><range>U+4E00-9FA5</range><unknown/></intersection>

The same thing can be done with unions and kernels, i.e. a kernel can be replaced by a union of the kernel's contents and an unknown collection.
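The two replacements can be checked mechanically against the tables in Appendix A. The following sketch assumes, as an illustration only, that answers are encoded as numbers ordered "no" < "don't know" < "yes", that intersection and union are the minimum and maximum under that order, and that the unknown collection answers "don't know" for every character.

```python
NO, UNKNOWN, YES = 0, 1, 2

def hull(a):
    """Table for <hull>: only a definite 'no' is kept."""
    return NO if a == NO else UNKNOWN

def kernel(a):
    """Table for <kernel>: only a definite 'yes' is kept."""
    return YES if a == YES else UNKNOWN

# <hull>A</hull> versus <intersection>A<unknown/></intersection>, and
# <kernel>A</kernel> versus <union>A<unknown/></union>, for every answer of A.
identities_hold = all(
    hull(a) == min(a, UNKNOWN) and kernel(a) == max(a, UNKNOWN)
    for a in (NO, UNKNOWN, YES)
)
```

Under these assumptions, identities_hold evaluates to True.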

In order to allow efficient network access and caching, it is also important to realize in which cases a collection definition should be referenced, and in which cases it should be provided inline. Minor changes to existing collections and efficiency cuts should be provided inline to reduce network traffic. Well-known collections should be provided by reference to avoid duplication of data in caches and in memory.

Caching mechanisms can not only help with efficient network access, they may also be used as one means to determine whether a collection is supposed to stay constant or whether it may change over time. This information is very important when designing collections referencing other collections. A designer of a collection wants to make sure that the referenced collection not only corresponds to his/her intentions at the time the new collection is designed, but also later on. In some cases, a constant collection (or a snapshot of an evolving collection) is desired, in other cases, an evolving collection is desired. This NOTE does not address the issues of metadata and trust regarding collections, because these issues are largely common to other Web-oriented notations, because they are largely orthogonal to the notation chosen, and because a stable registry for collections is already provided by ISO/IEC 10646 [ISO10646].

3.4 Built-in collections

Operating systems, libraries, and applications contain more and more data about characters, and this data can be used to reduce network traffic to access collection information as follows.

If a <ref> element is met when querying a collection definition, the URI contained in the href attribute is compared with an internal list of collections known to the system. Such a collection is called a built-in collection. It may be built into the operating system or the collection implementation, or may be available in other parts of the configuration, e.g. through a kind of registry mechanism.

If the URI matches with a collection known to the system, the system checks the character in question using the necessary operations. Different collections may be stored and accessed in different ways on the same system; for example, one collection may be defined as "the characters for which glyphs are available in a certain system font", whereas another collection may be defined as "all the characters that have a certain property according to the built-in property table".

Different systems may have different collections available built-in. Because checking a built-in definition is much more efficient than checking a collection that has to be referenced over the network, an efficient way to define a collection is to define it based on several built-in collections, each of them available on a different system.

As an example, assume that we want to define a collection that can be seen as collection abcde from os1 plus some small additions, or as collection xyz from os2 minus a small set of characters. This can be written as:

<collection>
  <alt>
    <union>
      <ref href="http://www.os1.com/collections/abcde.charcol"/>
      <range> ... </range>
    </union>
    <difference>
      <ref href="http://www.os2.com/collections/xyz.charcol"/>
      <range> ... </range>
    </difference>
  </alt>
</collection>

On os1, the collection abcde will be built-in, and no network access will be needed; the same holds for the collection xyz on os2.
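The lookup of built-in collections can be sketched as follows. The table BUILT_IN, the functions resolve and download, and the contents assigned to the abcde collection are hypothetical; comparing URIs as plain strings is also an assumption, since this NOTE does not define how URIs are matched.

```python
BUILT_IN = {
    # URI -> membership test; the lambda stands in for data held by the system.
    "http://www.os1.com/collections/abcde.charcol": lambda cp: 0x61 <= cp <= 0x65,
}

downloads = []                      # records every simulated network access

def download(uri):
    """Stand-in for fetching and parsing a collection description."""
    downloads.append(uri)
    return lambda cp: False

def resolve(uri, fetch):
    """Return a membership test for uri, preferring built-in collections."""
    if uri in BUILT_IN:
        return BUILT_IN[uri]        # no network access needed
    return fetch(uri)               # fall back to the network

is_member = resolve("http://www.os1.com/collections/abcde.charcol", download)
```

On a system that knows the abcde collection, the list downloads stays empty after this call; resolving the xyz URI on the same system would go through download instead.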

3.5 Character properties as collections

Characters have a series of properties, such as whether they are letters, digits, or symbols, or, for letters, whether they are upper-case, or lower-case, or caseless. The Unicode Standard [Unicode] defines a large number of such properties.

Properties and collections in many cases can be seen as different views of the same data, because they are related in the following way:

All characters that have the same value for a given property together form a collection.
If the membership of characters in a series of collections is mutually exclusive (i.e. a character can only be in one collection of the series; in other words, the series defines a partition of the characters), then this series of collections can be modeled as a property, with a different property value for each collection of the series.

Because many character properties are well known, the corresponding collections can be treated as well-known collections (see Section 3.1). Because many systems store character properties very efficiently internally, such collections can easily be treated as built-in collections (see Section 3.4). Character collections can then be used to easily define local changes to properties by adding or removing a few characters.
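The relation between properties and collections can be illustrated with Python's standard unicodedata module, which reports the data of the Unicode version shipped with the interpreter (not necessarily the version cited in this NOTE). The general category is used here merely as one example of a property, and the function name collections_by_category is hypothetical.

```python
import unicodedata
from collections import defaultdict

def collections_by_category(code_points):
    """Group code points into one collection per general category value."""
    groups = defaultdict(set)
    for cp in code_points:
        groups[unicodedata.category(chr(cp))].add(cp)
    return dict(groups)

# Partition the BASIC LATIN collection U+20-7E by general category.
basic_latin = range(0x20, 0x7F)
by_category = collections_by_category(basic_latin)
```

Each code point lands in exactly one of the resulting collections, so the series of collections forms a partition; for instance, by_category["Lu"] holds the upper-case letters and by_category["Nd"] the decimal digits of the range.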

3.6 CSS2: The 'unicode-range' property

CSS2 [CSS2, http://www.w3.org/TR/REC-CSS2/fonts.html#dataqual] provides a 'unicode-range' property to reduce network traffic for font downloading. This property is defined to be inclusive, i.e. the range may contain characters for which the font in question does not have any glyphs available. This can be explicitly expressed by a <hull> in our syntax. When character collections are accessible via URIs, it would be possible to extend the syntax of the 'unicode-range' property to allow URIs as values.

Also, the 'unicode-range' property is currently not explicitly defined to be exclusive, i.e. it does not seem to be disallowed to use a glyph from the font in question even if the corresponding character is not contained in the value of the 'unicode-range' property. For finer control over font composition and glyph selection, it may be desirable to explicitly define that 'unicode-range' is exclusive, or to define another property for that purpose. Because the current range notation in CSS2 can become rather verbose, character collections as proposed in this document may help.

3.7 Styling

Text presentation and styling could often be specified conveniently by reference to a character collection. Examples include:

Line breaking: The collection of characters at which, before which, or after which to break a line, or at, before or after which not to break a line (e.g. Japanese Kinsoku,...).
Space handling: What to consider as a space, and how to collapse.
Character orientation: For characters not usually written in a given writing direction, indicating whether the character orientation should be adapted to the line direction or stay as in the original writing direction, or use some other direction.
Script-dependent font selection and styling.
Letter-spacing
Hanging punctuation

Character collections or references to character collections could appear either on the selector/template side of styling rules or as formatting property values. Using them on the selector/template side would define a general mechanism for all or a selected class of properties, i.e. it might be possible to say something like: Color all these characters red. Using character collections as formatting property values would be less general, limited to those cases where character collections are most useful.

3.8 Form Input

In the context of HTML [HTML] form input, it is desirable to have an indication of the characters expected or allowed in a particular text field. This is useful for two reasons:

The server processing the form input may not be able to process some characters in some fields. If this can already be detected at the client, the number of round trips can be reduced, and it may be easier to help the user. As an example, only the following characters may be allowed in a numeric field:
```
<enum>0123456789</enum>
```
Although a wide range of characters may be allowed, a certain kind of characters is expected, and the client software may be able to select an appropriate input method/keyboard configuration for the user. As an example, a Greek form may indicate that Greek letters are expected in a family name field (although Latin letters may be accepted), but that Latin letters are expected in an email address field.

The two uses above should be clearly distinguished. Please note that while the first use fits the approach of testing individual characters very nicely, the second use does not directly fit this approach. The collection of characters covered by each input method/keyboard configuration available locally would have to be matched against the collection of characters expected in an input field. This may take considerably more time than checking only individual characters. However, there are various ways to reduce this effort. One way is to use very few characters 'typical' of the collections describing the available input method/keyboard configuration for sampling against the collection expected in an input field. Another is to limit the choice of collections allowed for describing the range of expected characters to some collections typically covered by input methods.

Limiting the choice of collection by explicitly listing the allowed URIs, while using the collection mechanism to clearly and unambiguously define the collection, may be a reasonable solution in other cases. It can limit implementation efforts while preserving forward compatibility.
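The first use described in this section, checking individual characters on the client before the form is submitted, can be sketched as follows. The function name disallowed_characters is hypothetical, and the collection is represented as a plain Python set rather than being read from a collection description.

```python
def disallowed_characters(value, allowed):
    """Return the characters of value that are not in the allowed collection."""
    return [ch for ch in value if ch not in allowed]

# The collection <enum>0123456789</enum> for a numeric field.
DIGITS = set("0123456789")
```

A client could call disallowed_characters("20o0", DIGITS), obtain ["o"], and point the user to the offending character without a round trip to the server.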

3.9 Schemas and Regular Expressions

The XML Schema effort [Schema] is currently working on ways to define restrictions on XML attribute and element contents. Both Cobol 'pictures' and regular expressions rely on character classes, but these classes are not general and flexible enough for international purposes. Character collections may provide more flexibility; a way to bind a character collection to a letter of a picture or an escape sequence of a regular expression would have to be provided.

Acknowledgements

V.S. Umamaheswaran (Uma) and Bruce Paterson for information on ISO/IEC 10646 collections. W3C WG/IG members and W3C team members, in particular Bert Bos, Dan Connolly, John Cowan, Masayasu Ichikawa, Rick Jelliffe, Mike Ksar, Chris Lilley, Masahiro Sekiguchi, Tex Texin, Andrea Vine, and François Yergeau for discussions and suggestions for improvement. All these contributions are very gratefully acknowledged, but any shortcomings in this NOTE should only be blamed on the author.

References

[CSS2]: Bert Bos, Håkon Wium Lie, Chris Lilley, Ian Jacobs, Eds., Cascading Style Sheets, level 2 (CSS2 Specification), W3C Recommendation 12-May-1998, http://www.w3.org/TR/REC-CSS2/.
[IANA]: IANA charset registry, http://www.isi.edu/in-notes/iana/assignments/character-sets.
[ISO10646]: ISO/IEC 10646-1:1993, Information technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1: Architecture and Basic Multilingual Plane, and its amendments.
[ISOIR]: ISO International Register of Coded Character Sets To Be Used With Escape Sequences, http://www.itscj.ipsj.or.jp/ISO-IR/.
[Gerstling]: Judith L. Gersting, Mathematical Structures for Computer Science, 4th edition, W. H. Freeman & Co., New York, 1998.
[Harm]: Daniel Connolly, Character Set Considered Harmful, W3C Note, 02-May-1995, http://www.w3.org/MarkUp/html-spec/charset-harmful.
[HTML]: Dave Raggett, Arnaud Le Hors, Ian Jacobs, Eds., HTML 4.0 Specification, W3C Recommendation 18-December-1997 (revised on 24-April-1998), http://www.w3.org/TR/REC-html40/.
[Namespace]: Tim Bray, Dave Hollander, Andrew Layman, Eds., Namespaces in XML, W3C Recommendation 14-January-1999, http://www.w3.org/TR/REC-xml-names/.
[Schema]: Paul V. Biron, Ashok Malhotra, Eds., XML Schema Part 2: Datatypes, work in progress, http://www.w3.org/TR/xmlschema-2/.
[Unicode]: The Unicode Consortium, The Unicode Standard, Version 3.0. Refer also to http://www.unicode.org/unicode/standard/versions/.
[URI]: T. Berners-Lee, R. Fielding, L. Masinter, Uniform Resource Identifiers (URI): Generic Syntax, RFC 2396, August 1998, http://www.ietf.org/rfc/rfc2396.txt.
[XLink]: S. DeRose, D. Orchard, B. Trafford, Eds., XML Linking Language (XLink), work in progress, http://www.w3.org/1999/07/WD-xlink-19990726, 26-July-1999.
[XML]: Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, Eds., Extensible Markup Language (XML) 1.0, W3C Recommendation 10-February-1998, http://www.w3.org/TR/REC-xml.

Appendix A: Definitions of Operators

This appendix exactly defines what each of the collection operators means. For operators that can take more than two operands, only the definition for two operands is given; this is sufficient because all these operators are associative. All the operators that allow more than two operands are also commutative. For each operator, a table and a textual description are given; the two are equivalent. In the tables, the symbols "+", "-", and "?" stand for included, not included, and unknown.

<hull> A </hull>

A+	A?	A-
?	?	-

For any character x and collection A:

If x is in A, then it's unknown whether x is in <hull>A</hull> or not.
If it is unknown whether x is in A or not, then it is unknown whether x is in <hull>A</hull> or not.
If x is not in A, then x is not in <hull>A</hull>.

<kernel> A </kernel>

A+	A?	A-
+	?	?

For any character x and collection A:

If x is in A, then x is in <kernel>A</kernel>.
If it is unknown whether x is in A or not, then it is unknown whether x is in <kernel>A</kernel> or not.
If x is not in A, then it's unknown whether x is in <kernel>A</kernel> or not.

<union> A B ... </union>

	A+	A?	A-
B+	+	+	+
B?	+	?	?
B-	+	?	-

For any character x, collection A, and collection B:

If x is in A, then
- if x is in B, then x is in <union>AB</union>.
- if it is unknown whether x is in B or not, then x is in <union>AB</union>.
- if x is not in B, then x is in <union>AB</union>.
If it is unknown whether x is in A or not, then
- if x is in B, then x is in <union>AB</union>.
- if it is unknown whether x is in B or not, then it's unknown whether x is in <union>AB</union> or not.
- if x is not in B, then it's unknown whether x is in <union>AB</union> or not.
If x is not in A, then
- if x is in B, then x is in <union>AB</union>.
- if it is unknown whether x is in B or not, then it's unknown whether x is in <union>AB</union> or not.
- if x is not in B, then x is not in <union>AB</union>.

<intersection> A B ... </intersection>

	A+	A?	A-
B+	+	?	-
B?	?	?	-
B-	-	-	-

For any character x, collection A, and collection B:

If x is in A, then
- if x is in B, then x is in <intersection>AB</intersection>.
- if it is unknown whether x is in B or not, then it's unknown whether x is in <intersection>AB</intersection> or not.
- if x is not in B, then x is not in <intersection>AB</intersection>.
If it is unknown whether x is in A or not, then
- if x is in B, then it's unknown whether x is in <intersection>AB</intersection> or not.
- if it is unknown whether x is in B or not, then it's unknown whether x is in <intersection>AB</intersection> or not.
- if x is not in B, then x is not in <intersection>AB</intersection>.
If x is not in A, then
- if x is in B, then x is not in <intersection>AB</intersection>.
- if it is unknown whether x is in B or not, then x is not in <intersection>AB</intersection>.
- if x is not in B, then x is not in <intersection>AB</intersection>.
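
The definition applies to each character separately. The Python sketch below illustrates this with two toy collections over a four-character universe, each modelled as a dictionary from character to state. It reads the <intersection> table as the smaller of the two states under the ordering "-" < "?" < "+", which matches every cell of the table but is an interpretation made for this example. All names in the code are hypothetical.

```python
# Assumed ordering of the three states: not included < unknown < included.
ORDER = {"-": 0, "?": 1, "+": 2}


def intersection2(a: str, b: str) -> str:
    """Two-operand intersection of states: the smaller state under ORDER."""
    return min(a, b, key=ORDER.__getitem__)


def intersect_collections(coll_a: dict, coll_b: dict) -> dict:
    """Apply the operator character by character.

    Both arguments must map the same characters to "+", "-", or "?".
    """
    return {x: intersection2(coll_a[x], coll_b[x]) for x in coll_a}


if __name__ == "__main__":
    A = {"a": "+", "b": "+", "c": "?", "d": "-"}
    B = {"a": "+", "b": "?", "c": "-", "d": "+"}
    print(intersect_collections(A, B))
```

In this example, "a" is in the intersection, "b" is unknown, and "c" and "d" are not in it, since a single "not included" decides the result regardless of the other operand.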

<difference> A B </difference>

	A+	A?	A-
B+	-	-	-
B?	?	?	-
B-	+	?	-

For any character x, collection A, and collection B:

If x is in A, then
- if x is in B, then x is not in <difference>AB</difference>.
- if it is unknown whether x is in B or not, then it's unknown whether x is in <difference>AB</difference> or not.
- if x is not in B, then x is in <difference>AB</difference>.
If it is unknown whether x is in A or not, then
- if x is in B, then x is not in <difference>AB</difference>.
- if it is unknown whether x is in B or not, then it's unknown whether x is in <difference>AB</difference> or not.
- if x is not in B, then it's unknown whether x is in <difference>AB</difference> or not.
If x is not in A, then
- if x is in B, then x is not in <difference>AB</difference>.
- if it is unknown whether x is in B or not, then x is not in <difference>AB</difference>.
- if x is not in B, then x is not in <difference>AB</difference>.
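
The <difference> table can be reproduced by intersecting A with the complement of B, where the complement exchanges "+" and "-" and leaves "?" unchanged. This Note does not define a complement operator; COMPLEMENT in the Python sketch below is a helper assumed only for this illustration, as are the other names.

```python
from itertools import product

# Assumed ordering of the three states: not included < unknown < included.
ORDER = {"-": 0, "?": 1, "+": 2}

# Hypothetical helper, not an operator of the notation.
COMPLEMENT = {"+": "-", "?": "?", "-": "+"}

# The <difference> table, keyed by (state in A, state in B).
DIFFERENCE_TABLE = {
    ("+", "+"): "-", ("?", "+"): "-", ("-", "+"): "-",
    ("+", "?"): "?", ("?", "?"): "?", ("-", "?"): "-",
    ("+", "-"): "+", ("?", "-"): "?", ("-", "-"): "-",
}


def difference(a: str, b: str) -> str:
    """State of x in <difference>A B</difference>."""
    return min(a, COMPLEMENT[b], key=ORDER.__getitem__)


if __name__ == "__main__":
    assert all(
        difference(a, b) == DIFFERENCE_TABLE[(a, b)]
        for a, b in product("+?-", repeat=2)
    )
    # Operand order matters: the operator is not commutative.
    print(difference("+", "-"), difference("-", "+"))  # prints + -
```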

<alt> A B ... </alt>

	A+	A?	A-
B+	+	+	error
B?	+	?	-
B-	error	-	-

For any character x, collection A, and collection B:

If x is in A, then
- if x is in B, then x is in <alt>AB</alt>.
- if it is unknown whether x is in B or not, then x is in <alt>AB</alt>.
- if x is not in B, then there is an error with <alt>AB</alt>.
If it is unknown whether x is in A or not, then
- if x is in B, then x is in <alt>AB</alt>.
- if it is unknown whether x is in B or not, then it's unknown whether x is in <alt>AB</alt> or not.
- if x is not in B, then x is not in <alt>AB</alt>.
If x is not in A, then
- if x is in B, then there is an error with <alt>AB</alt>.
- if it is unknown whether x is in B or not, then x is not in <alt>AB</alt>.
- if x is not in B, then x is not in <alt>AB</alt>.
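
A minimal Python sketch of <alt> follows. It treats "?" as carrying no information, returns the state the alternatives agree on, and signals the "error" cells of the table with an exception. The exception class AltConflict and the function names are hypothetical; this Note does not prescribe how an implementation reports the error.

```python
from functools import reduce


class AltConflict(ValueError):
    """Raised when two alternatives contradict each other (hypothetical name)."""


def alt2(a: str, b: str) -> str:
    """Two-operand <alt>: an unknown operand defers to the other one."""
    if a == "?":
        return b
    if b == "?" or a == b:
        return a
    raise AltConflict(f"alternatives disagree: {a!r} vs {b!r}")


def alt(*operands: str) -> str:
    """<alt> over two or more operands, folded left to right."""
    return reduce(alt2, operands)


if __name__ == "__main__":
    print(alt("?", "?", "+"))  # prints +
    try:
        alt("+", "?", "-")
    except AltConflict as exc:
        print("error:", exc)
```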

Appendix B: DTD

This is the DTD for the syntax defined in this NOTE. A namespace URI is currently not defined.

<!ENTITY % coll "(hull|kernel|union|intersection|difference|alt|ref|range|enum)">
<!ELEMENT collection (%coll;)>
<!ELEMENT hull (%coll;)>
<!ELEMENT kernel (%coll;)>
<!ELEMENT union (%coll;, (%coll;)+)>
<!ELEMENT intersection (%coll;, (%coll;)+)>
<!ELEMENT difference (%coll;, %coll;)>
<!ELEMENT alt (%coll;, (%coll;)+)>
<!ELEMENT ref EMPTY>
<!ATTLIST ref
          xmlns:xlink CDATA #FIXED "http://www.w3.org/XML/XLink/0.9"
          xlink:type (simple|extended|locator|arc) #FIXED "simple" 
          xlink:role CDATA #FIXED "character collection"
          xlink:title CDATA #IMPLIED
          xlink:show (new|parsed|replace) #FIXED 'parsed'
          xlink:actuate (user|auto) #FIXED 'auto'
          xlink:href CDATA #REQUIRED>
<!ELEMENT range (#PCDATA)>
<!ELEMENT enum  (#PCDATA)>
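
The content models above fix how many child elements each operator may have. The following Python sketch, using the standard xml.etree.ElementTree module, checks only these child counts for an instance document. It is not a validating parser: attributes, character data, and the contents of <range> and <enum> are not checked, and the names CHILD_COUNT and violations, as well as the sample documents, are made up for this example.

```python
import xml.etree.ElementTree as ET

# Element name -> (minimum, maximum) number of child elements; None = unbounded.
CHILD_COUNT = {
    "collection": (1, 1),
    "hull": (1, 1), "kernel": (1, 1),
    "union": (2, None), "intersection": (2, None), "alt": (2, None),
    "difference": (2, 2),
    "ref": (0, 0), "range": (0, 0), "enum": (0, 0),
}


def violations(elem: ET.Element) -> list:
    """Return a list of child-count problems at or below elem."""
    if elem.tag not in CHILD_COUNT:
        return [f"unknown element <{elem.tag}>"]
    found = []
    children = list(elem)
    low, high = CHILD_COUNT[elem.tag]
    if len(children) < low or (high is not None and len(children) > high):
        found.append(f"<{elem.tag}> has {len(children)} child element(s)")
    for child in children:
        if child.tag == "collection":
            found.append(f"<collection> is not allowed inside <{elem.tag}>")
        else:
            found.extend(violations(child))
    return found


GOOD = """<collection>
  <union>
    <range>U+0041-005A</range>
    <kernel><range>U+0061-007A</range></kernel>
  </union>
</collection>"""

BAD = "<collection><difference><range>U+0041-005A</range></difference></collection>"

if __name__ == "__main__":
    print(violations(ET.fromstring(GOOD)))
    print(violations(ET.fromstring(BAD)))
```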

Appendix C: Definition of <range> contents

This is intended to reflect the syntax in the CSS2 specification (http://www.w3.org/TR/REC-CSS2/fonts.html#dataqual) as closely as possible. It is one of the low-level building blocks for collections that is not captured by XML structure. To define the syntax, the notation of the XML 1.0 Recommendation [XML 1.0] is used, and some common definitions are taken from there. (Note: The CSS2 syntax allows \f in whitespace, but XML does not, and therefore it is not allowed in <range>.)

The content of <range>, a RangeEnumeration, is a comma-separated list of Uranges.

RangeEnumeration ::= S? Urange ( S? ',' S? Urange )* S?

Note: Is it possible in CSS to have a comma at the end?

An Urange is either given as an Ublock (allowing question marks) or with a starting and ending Uvalue. The Urange includes all the characters between the starting and the ending Uvalues, including the starting and ending Uvalues themselves.

Urange ::= ( 'U' '+' Ublock ) | ( 'U' '+' Uvalue '-' Uvalue )

An Ublock is a (potentially empty) sequence of hex characters followed by a (potentially empty) sequence of question marks, for a total of at least one and at most six symbols. The definition is a bit lengthy because the XML notation does not provide a repetition indicator.

Ublock ::= '?' '?'? '?'? '?'? '?'? '?'?
           | [0-9a-fA-F]
             ('?'? '?'? '?'? '?'? '?'?
              | [0-9a-fA-F]
                ('?'? '?'? '?'? '?'?
                 | [0-9a-fA-F]
                   ('?'? '?'? '?'?
                    | [0-9a-fA-F]
                      ('?'? '?'?
                       | [0-9a-fA-F]
                         ('?'?
                          | [0-9a-fA-F]
                         )
                      )
                   )
                )
             )
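
For comparison, in a notation that does offer bounded repetition the same constraint is short. The Python sketch below uses the standard re module: a lookahead limits the total length to between one and six symbols, and the body requires all hexadecimal digits to precede all question marks. The names UBLOCK and is_ublock are chosen for this example.

```python
import re

# (?=...) bounds the total length to 1..6 symbols;
# the body enforces "hex digits first, then question marks".
UBLOCK = re.compile(r"(?=[0-9a-fA-F?]{1,6}\Z)[0-9a-fA-F]*\?*\Z")


def is_ublock(text: str) -> bool:
    """True if text matches the Ublock production."""
    return UBLOCK.match(text) is not None


if __name__ == "__main__":
    for candidate in ["?", "123??", "ABCDEF", "", "1?2", "1234567"]:
        print(repr(candidate), is_ublock(candidate))
```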

An Ublock is equivalent to an explicit range expressed with a hyphen by using the Ublock in both Uvalues and replacing the question marks in the first Uvalue by '0's, and the question marks in the second Uvalue by 'F's. As an example, U+123?? is the same as U+12300-123FF.

An Uvalue is a sequence of at least one and at most six hexadecimal digits:

Uvalue ::= [0-9a-fA-F] [0-9a-fA-F]? [0-9a-fA-F]? [0-9a-fA-F]? [0-9a-fA-F]? [0-9a-fA-F]?
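
The replacement rule for question marks can be sketched in Python as follows. The function urange_bounds, a name invented for this example, accepts a single Urange in either form and returns its first and last code points. It does not check that the first value of an explicit range is not larger than the second, since the syntax above does not state such a requirement.

```python
import re

UVALUE = r"[0-9a-fA-F]{1,6}"
# 'U' '+' Uvalue '-' Uvalue
EXPLICIT = re.compile(rf"U\+({UVALUE})-({UVALUE})\Z")
# 'U' '+' Ublock: hex digits, then question marks, 1 to 6 symbols in total
BLOCK = re.compile(r"U\+(?=[0-9a-fA-F?]{1,6}\Z)([0-9a-fA-F]*\?*)\Z")


def urange_bounds(text: str) -> tuple:
    """Return (first, last) code points of a single Urange, both inclusive."""
    match = EXPLICIT.match(text)
    if match:
        return int(match.group(1), 16), int(match.group(2), 16)
    match = BLOCK.match(text)
    if match:
        block = match.group(1)
        # Question marks become 0 for the start and F for the end.
        return int(block.replace("?", "0"), 16), int(block.replace("?", "F"), 16)
    raise ValueError(f"not a Urange: {text!r}")


if __name__ == "__main__":
    first, last = urange_bounds("U+123??")
    print(f"{first:X}-{last:X}")  # prints 12300-123FF
```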

Please note that while this notation is originally taken from Unicode [Unicode], Unicode uses a different notation for cases with more than four hexadecimal digits, replacing the '+' with a '-' and requiring eight digits (e.g. U-00012300). The definition of 'unicode-range' mentions a max value of U+7FFFFFFF, but the syntax only allows values up to U+FFFFFF.

Some more examples, from the CSS2 specification (adjusted for the fact that character collection specifications in general are more detailed, and therefore can be more verbose, than CSS2 unicode-range values, and updated for proposed registrations):

U+20A7: no wild cards - it indicates a single character position (the Spanish peseta currency symbol)
U+215?: one wild card, covers the range 2150 to 215F (the fractions)
U+00??: two wild cards, covers the range 0000 to 00FF (Latin-1)
U+E??: two wild cards, covers 0E00 to 0EFF (the Thai and Lao scripts)
U+AC00-D7FF: the range is AC00 to D7FF (the Hangul Syllables area)
U+370-3FF, U+1F??: This covers the range 0370 to 03FF (Modern Greek) plus 1F00 to 1FFF (Ancient polytonic Greek).
U+3000-303F, U+3100-312F, U+32??, U+33??, U+4E00-9FFF, U+F900-FAFF, U+FE30-FE4F: A bit verbose, this suggests that this large font contains only Chinese characters from ISO 10646, without including any characters that are uniquely Japanese or Korean. The range is 3000 to 303F (CJK symbols and punctuation) plus 3100 to 312F (Bopomofo) plus 3200 to 32FF (enclosed CJK letters and months) plus 3300 to 33FF (CJK compatibility zone) plus 4E00 to 9FFF (CJK unified ideographs) plus F900 to FAFF (CJK compatibility ideographs) plus FE30 to FE4F (CJK compatibility forms).
A more likely representation for a typical Chinese font would be: U+3000-33FF, U+4E00-9FFF
U+13800-13BFF: This covers a tentative reservation for Aztec pictograms, covering the range 3800 to 3BFF in plane 1.
U+1680-169C: This covers Irish Ogham covering the range 1680 to 169C as available in Unicode 3.0.

Appendix D: List of Changes

Changes since http://www.w3.org/TR/1999/NOTE-charcol-19991105:

Added some explanation about URIs as identifiers/names to Introduction.
Clarified that <difference> is non-commutative in Overview.
Added a reference to set theory basics to Section 2.2 and References.
Changed the figure in Section 2.2 to be able to refer to kernel and hull. Improved and added wording for kernel and hull descriptions.
Added <enum> to the last paragraph of Section 2.2.
Removed the word 'difference' from Section 3.4 to avoid confusion.
Added Masayasu Ishikawa to the Acknowledgements.
Removed spurious sentence from definition of <union> in Appendix A.
Exchanged rows and columns in tables in Appendix A to have '?' always last. Changed textual description appropriately.
