XML Schema Part 2: Datatypes

World Wide Web Consortium Working Draft 06-May-1999

This version

http://www.w3.org/1999/05/06-xmlschema-2/
(with accompanying schema and DTD)

Latest version:

http://www.w3.org/TR/xmlschema-2/

Editors:

Paul V. Biron (Kaiser Permanente, for Health Level Seven) <Paul.V.Biron@kp.org>
Ashok Malhotra (IBM) <petsa@us.ibm.com>

Status of this Document

This is a W3C Working Draft for review by members of the W3C and other interested parties in the general public.

It has been reviewed by the XML Schema Working Group and the Working Group has agreed to its publication. Note that not all sections of the draft represent the current consensus of the WG. Different sections of the specification may well command different levels of consensus in the WG. Public comments on this draft will be instrumental in the WG's deliberations.

Please review and send comments to www-xml-schema-comments@w3.org (archive).

The facilities described herein are in a preliminary state of design. The Working Group anticipates substantial changes, both in the mechanisms described herein, and in additional functions yet to be described. The present version should not be implemented except as a check on the design and to allow experimentation with alternative designs. The Schema WG will not allow early implementation to constrain its ability to make changes to this specification prior to final release.

A list of current W3C working drafts can be found at http://www.w3.org/TR. They may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use W3C Working Drafts as reference material or to cite them as other than "work in progress".

Ed. Note: Several "note types" are used throughout this draft:

issue [Issue (issue-name): ]

something on which the editors are seeking comment

editorial note [Ed. Note: ]

something the editors wish to call to the attention of the reader. To be removed prior to the recommendation becoming final.

note [Note: ]

something the editors wish to call to the attention of the reader. To remain in the final recommendation.

Abstract

This document specifies a language for defining datatypes to be used in XML Schemas and, possibly, elsewhere.

1. Introduction
    1.1 Purpose
    1.2 Requirements
    1.3 Scope
    1.4 Terminology
    1.5 Organization
2. Type System
    2.1 Datatype
    2.2 Value space
    2.3 Lexical Space
    2.4 Characterizing operations
        2.4.1 Equal
    2.5 Facets
        2.5.1 Fundamental facets
        2.5.2 Constraining or Non-fundamental facets
    2.6 Datatype dichotomies
        2.6.1 Atomic vs. aggregate datatypes
        2.6.2 Primitive vs. generated datatypes
        2.6.3 Built-in vs. user-generated datatypes
3. Built-in datatypes
    3.1 Namespace considerations
    3.2 Primitive datatypes
        3.2.1 ID
        3.2.2 IDREF
        3.2.3 IDREFS
        3.2.4 ENTITY
        3.2.5 ENTITIES
        3.2.6 NMTOKEN
        3.2.7 NMTOKENS
        3.2.8 NOTATION
        3.2.9 string
        3.2.10 boolean
        3.2.11 number
        3.2.12 dateTime
        3.2.13 binary
        3.2.14 uri
    3.3 Generated datatypes
        3.3.1 integer
        3.3.2 decimal
        3.3.3 real
        3.3.4 date
        3.3.5 time
        3.3.6 timePeriod
4. User-generated datatypes
5. Conformance

Appendices

A. Schema for Datatype Definitions (normative)
B. DTD for Datatype Definitions (normative)
C. Built-in Generated Datatype Definitions (normative)
D. Pictures
E. Regular Expressions
F. References
G. Open Issues

1. Introduction

1.1 Purpose

The [XML] specification defines a limited array of facilities for applying datatypes to document content in that documents may contain or refer to DTDs that assign types to elements and attributes. However, document authors, including authors of traditional documents and those transporting data in XML, often require a high degree of type checking to ensure robustness in document understanding and data interchange.

The two instances below offer typical examples of XML in which datatypes are implicit. The first, a data-oriented instance, represents a billing invoice; the second, a document-oriented instance, represents a memo or perhaps an email message in XML.

<invoice>
   <orderDate>Jan 21, 1999</orderDate>
   <shipDate>Jan 25, 1999</shipDate>
   <billingAddress>
      <name>Ashok Malhotra</name>
      <street>123 IBM Ave.</street>
      <city>Hawthorne</city>
      <state>NY</state>
      <zip>10532-0000</zip>
   </billingAddress>
   <voice>555-1234</voice>
   <fax>555-4321</fax>
</invoice>

<memo importance="high"
      date="03/23/1999">
   <from>Paul V. Biron</from>
   <to>Ashok Malhotra</to>
   <subject>Latest draft</subject>
   <body>
      We need to discuss the latest
      draft <emph>immediately</emph>.
      Either email me at <email>
      mailto:paul.v.biron@kp.org</email>
      or call <phone>555-9876</phone>
   </body>
</memo>

The invoice contains several dates and telephone numbers, a state (which comes from an enumerated list of values), and a zip code (which has a definable regularity of expression). The memo contains many of the same types of information: a date, a telephone number, an email address and an "importance" value (which undoubtedly comes from an enumerated list, such as "low", "medium" or "high"). Applications which process invoices and memos need to raise exceptions if something that was supposed to be a date or a telephone number does not conform to the rules for valid dates or telephone numbers.

In both cases, validity constraints exist on the content of the instances that are not expressible in XML DTDs. The limited datatyping facilities in XML have prevented validating XML processors from supplying the rigorous type checking required in these situations. The result has been that application writers have had to implement type checking in an ad hoc manner. This specification addresses the needs of both document authors and application writers for a robust, extensible datatype system for XML which could be incorporated into XML processors. As discussed below, these datatypes can be used in other XML-related standards as well.
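
To make the ad hoc approach concrete, the following Python sketch checks a few fields of an abbreviated copy of the invoice instance by hand. The date format, the zip and telephone patterns, and the name check_invoice are illustrative assumptions; none of them is defined by this specification.

```python
import re
import xml.etree.ElementTree as ET
from datetime import datetime

INVOICE = """<invoice>
   <orderDate>Jan 21, 1999</orderDate>
   <shipDate>Jan 25, 1999</shipDate>
   <billingAddress>
      <zip>10532-0000</zip>
   </billingAddress>
   <voice>555-1234</voice>
</invoice>"""

ZIP_RE = re.compile(r"\d{5}(-\d{4})?")   # assumed zip code pattern
PHONE_RE = re.compile(r"\d{3}-\d{4}")    # assumed local telephone pattern

def is_date(text):
    """Return True if text looks like 'Jan 21, 1999' (assumed format)."""
    try:
        datetime.strptime(text, "%b %d, %Y")
        return True
    except ValueError:
        return False

def check_invoice(xml_text):
    """Return a list of (element name, offending content) pairs."""
    root = ET.fromstring(xml_text)
    errors = []
    for tag in ("orderDate", "shipDate"):
        value = root.findtext(tag, default="").strip()
        if not is_date(value):
            errors.append((tag, value))
    zip_code = root.findtext("billingAddress/zip", default="").strip()
    if not ZIP_RE.fullmatch(zip_code):
        errors.append(("zip", zip_code))
    voice = root.findtext("voice", default="").strip()
    if not PHONE_RE.fullmatch(voice):
        errors.append(("voice", voice))
    return errors
```

Every application that consumes invoices would need to repeat checks of this kind, which is the duplication a shared datatype system is meant to remove.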

1.2 Requirements

The [XML Schema Requirements] document spells out concrete requirements to be fulfilled by this specification. It states that the XML Schema Language must:

provide for primitive data typing, including byte, date, integer, sequence, SQL & Java primitive data types, etc.;
define a type system that is adequate for import/export from database systems (e.g., relational, object, OLAP);
distinguish requirements relating to lexical data representation vs. those governing an underlying information set;
allow creation of user-defined datatypes, such as datatypes that are derived from existing datatypes and which may constrain certain of their properties (e.g., range, precision, length, mask).

1.3 Scope

This portion of the XML Schema Language discusses datatypes that can be used in an XML Schema. These datatypes can be specified for element content that would be specified as #PCDATA and attribute values of various types in a DTD. It is the intention of this specification that it be usable outside of the context of XML Schemas for a wide range of other XML-related activities such as [XSL] and [RDF Schema].

For the most part, this specification discusses what are sometimes referred to as scalar datatypes in that they constrain the lexical representation of a single literal. In some cases, as for example in [IDREFS], [ENTITIES] and [NMTOKENS], the value may consist of a list or set of literals separated by spaces. This is an example of what is called an aggregate datatype. Future versions of this specification will contain a more general mechanism for defining and using aggregate (collection) datatypes such as sets, bags and records.

1.4 Terminology

Ed. Note: if necessary, insert a terminology list (e.g., may, must, datatype valid, etc.)

1.5 Organization

Following this introduction, the remainder of this specification is organized into four main sections: [Type System] describes the conceptual framework upon which the datatype system is constructed; [Built-in datatypes] details the complete list of datatypes which all conforming processors must implement; [User-generated datatypes] discusses how to define specialized types from the built-in types; and [Conformance] specifies the general rules concerning conforming processors.

2. Type System

This section describes the conceptual framework behind the type system defined in this specification. The framework has been influenced by the [ISO 11404] standard on language-independent datatypes as well as the datatypes for [SQL] and for programming languages such as Java.

The datatypes discussed in this specification are computer representations of well known abstract concepts such as integer and date. It is not the place of this specification to define these concepts. Many other publications provide excellent definitions.

Two concepts are essential for an understanding of datatypes as they are discussed here: a value space is an abstract collection of permitted values for the datatype. Each datatype also has a space of valid lexical representations or literals. A single value in the value space may map to several valid literals.

2.1 Datatype

[Definition: ] In this specification, a datatype has a set of distinct values, called its value space, and is characterized by facets and/or properties of those values and by operations on or resulting in those values. Further, each datatype is characterized by a space consisting of valid lexical representations for each value in the value space.

2.2 Value space

A value space is an abstract collection of permitted values for the datatype. Value spaces have certain properties. For example, they always have the concept of cardinality and equality and may have the concept of order by which individual values within the value space can be compared to one another. Value spaces may also support operations on values such as addition.

[Definition: ] A value space is the collection of permitted values for a given datatype. The value space of a given datatype can be defined in one of the following ways:

enumerated outright, sometimes referred to as an extensional definition
defined axiomatically from fundamental notions, sometimes referred to as an intensional definition
defined as the subset of values from an already defined value space with a given set of properties
defined as a combination of values from some already defined value space by a specific construction procedure
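
The four ways of defining a value space listed above can be sketched in Python as follows. This is only an illustration: the names are hypothetical, and an infinite value space is modelled here as a membership predicate rather than as a true axiomatic definition.

```python
from itertools import product

# 1. Enumerated outright (extensional definition).
boolean_values = {True, False}

# 2. Defined by a rule (intensional definition). The space is infinite,
#    so it is modelled as a membership predicate rather than a set.
def is_integer_value(v):
    return isinstance(v, int) and not isinstance(v, bool)

# 3. Subset of an already defined value space with a given property.
def is_percentage(v):
    return is_integer_value(v) and 0 <= v <= 100

# 4. Combination of values from an already defined value space by a
#    construction procedure (here: ordered pairs of booleans).
boolean_pairs = set(product(boolean_values, repeat=2))
```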

2.3 Lexical Space

In addition to its value space, each datatype has a lexical representation space. [Definition: ] The lexical space for a datatype consists of a set of valid literals. Each value in the datatype's value space maps to one or more valid literals in its lexical space. For example, "100.0" and "1.0E2" are two different representations for the same value. Depending on the situation, either or both of these representations might be acceptable. The type system defined in this specification provides a mechanism for schema designers to control the value set as well as the acceptable lexical representations of the values in the value space of a datatype. Each [Primitive datatypes] definition includes a detailed description of the default lexical space.
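
The many-to-one relationship between literals and values can be illustrated with a short Python sketch. Here Python's Decimal type stands in for the value space; that choice, and the function name value_of, are assumptions made for illustration only.

```python
from decimal import Decimal

def value_of(literal):
    """Map a numeric literal to its value (Decimal stands in for the value space)."""
    return Decimal(literal)

# Two distinct literals...
literals_differ = "100.0" != "1.0E2"
# ...that denote the same value.
same_value = value_of("100.0") == value_of("1.0E2")
```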

2.4 Characterizing operations

Many different datatypes may share the same value space. As a result, a datatype is only partially defined by its value space. [Definition: ] The characterizing operations for a datatype are those operations (such as "add" or "append") on or resulting in values of the datatype which distinguish this datatype from other datatypes having value spaces which are identical except possibly for substitution of literals.

Characterizing operations can be useful in choosing the appropriate datatype for particular purposes, such as mapping to or from common programming languages or database environments.

Ed. Note: Currently, no characterizing operations are defined on the built-in datatypes provided by this specification; additionally, there is no means to specify characterizing operations on user-generated datatypes. This will be addressed in a future draft.

This discussion of characterizing operations in the definition of datatype is for pedagogical purposes only and does not imply that conforming processors must implement those operations, nor does it imply that expressions (containing operators) which evaluate to a given datatype will be accepted by conforming XML processors.

2.4.1 Equal

Every value space supports the notion of equality, with the following rules:

for any two instances of values from the value space (a, b), either a is equal to b, denoted a = b, or a is not equal to b, denoted a ≠ b;
there is no pair of instances (a, b) of values from the value space such that both a = b and a ≠ b;
for every value a from the value space, a = a;
for any two instances (a, b) of values from the value space, a = b if and only if b = a;
for any three instances (a, b, c) of values from the value space, if a = b and b = c, then a = c.

On every datatype, the operation Equal is defined in terms of the equality property of the value space: for any values a, b drawn from the value space, Equal(a,b) is true if a = b, and false otherwise.
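
As an illustration, the reflexivity, symmetry and transitivity rules can be checked mechanically on a finite sample of values. The following Python sketch assumes a hypothetical helper named is_equivalence; passing it on a sample does not prove the rules for the whole value space.

```python
from itertools import product

def is_equivalence(values, equal):
    """Check reflexivity, symmetry and transitivity of `equal` on a finite sample."""
    values = list(values)
    reflexive = all(equal(a, a) for a in values)
    symmetric = all(
        equal(a, b) == equal(b, a) for a, b in product(values, repeat=2)
    )
    transitive = all(
        equal(a, c)
        for a, b, c in product(values, repeat=3)
        if equal(a, b) and equal(b, c)
    )
    return reflexive and symmetric and transitive

def exact(a, b):
    return a == b

def near(a, b):
    # "Within one unit" is symmetric and reflexive but not transitive.
    return abs(a - b) <= 1
```

A relation such as near fails the check, which is why it could not serve as the equality of a value space.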

2.5 Facets

[Definition: ] A facet is a single defining aspect of a concept or an object. Generally speaking, each facet of an item characterizes that item along independent aspects or dimensions.

The facets of a datatype serve to distinguish those aspects of one datatype which differ from other datatypes. Rather than being defined solely in terms of a prose description, the datatypes in this specification are defined in terms of the synthesis of facet values which together determine the value space and properties of the datatype.

Facets are of two types: fundamental facets that define the datatype and non-fundamental or constraining facets that constrain the permitted values of a datatype.

2.5.1 Fundamental facets

Datatypes are characterized by properties of their value spaces. These optional properties are discussed in this section. Each of these properties gives rise to a facet that serves to characterize the datatype.

2.5.1.1 Order

[Definition: ] A value space, and hence a datatype, is said to be ordered if there exists an order relation defined for that value space. Order relations have the following rules:

for every pair (a, b) from the value space, either a ≤ b or b ≤ a, or a = b;
for every triple (a, b, c) from the value space, if a ≤ b and b ≤ c, then a ≤ c.

If a value space is ordered, then the datatype will have a corresponding characterizing operation (see [Characterizing operations]), called InOrder(a, b), defined by:

for every (a, b) from the value space, InOrder(a, b) is true if a ≤ b, and false otherwise.

There may exist several possible order relations for a given value space. Additionally, there may exist multiple datatypes with the same value space. In such cases, each datatype will define a different order relation on the value space.
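
The following Python sketch illustrates InOrder and the two order rules on a finite sample, and shows two different order relations over the same values. The function and relation names are hypothetical.

```python
from itertools import product

def in_order(a, b, le):
    """InOrder(a, b): true if a <= b under the order relation `le`."""
    return bool(le(a, b))

def is_order_on(values, le):
    """Check the two order rules (comparability, transitivity) on a finite sample."""
    values = list(values)
    comparable = all(le(a, b) or le(b, a) for a, b in product(values, repeat=2))
    transitive = all(
        le(a, c)
        for a, b, c in product(values, repeat=3)
        if le(a, b) and le(b, c)
    )
    return comparable and transitive

def ascending(a, b):
    return a <= b

def descending(a, b):
    return a >= b

def divides(a, b):
    # Not every pair is comparable, so this is not an order in the sense above.
    return b % a == 0
```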

Ed. Note: Currently, no order relations are defined on the built-in datatypes provided by this specification; additionally, there is no means to specify an order relation on user-generated datatypes. This will be addressed in a future draft.

2.5.1.2 Bounds

Some ordered value spaces, and hence some datatypes, are said to be bounded. [Definition: ] A value space is bounded above if there exists a unique value U in the value space such that, for all values v in the value space, v ≤ U. The value U is said to be an upper bound of the value space. [Definition: ] A value space is bounded below if there exists a unique value L in the space such that, for all values v in the value space, L ≤ v. The value L is then said to be a lower bound of the value space.

[Definition: ] A datatype is bounded if its value space has both an upper and a lower bound.
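
A minimal Python sketch of these definitions, assuming an ordered value space of integers described only by its optional bounds. The class OrderedSpace and the two sample spaces are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class OrderedSpace:
    """Hypothetical model of an ordered value space of integers."""
    lower: Optional[int] = None   # None: no lower bound exists
    upper: Optional[int] = None   # None: no upper bound exists

    @property
    def bounded_below(self) -> bool:
        return self.lower is not None

    @property
    def bounded_above(self) -> bool:
        return self.upper is not None

    @property
    def bounded(self) -> bool:
        # Bounded means both an upper and a lower bound exist.
        return self.bounded_below and self.bounded_above

    def contains(self, v: int) -> bool:
        above_lower = self.lower is None or self.lower <= v
        below_upper = self.upper is None or v <= self.upper
        return above_lower and below_upper

unsigned_byte = OrderedSpace(lower=0, upper=255)   # bounded
non_negative = OrderedSpace(lower=0)               # bounded below only
```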

2.5.1.3 Cardinality

[Definition: ] Every value space has associated with it the concept of cardinality. Some value spaces are finite, some are countably infinite while still others are uncountably infinite. A datatype is said to have the cardinality of its value space. It is sometimes useful to categorize value spaces (and hence, datatypes) as to their cardinality. There are three significant cases:

value spaces that are finite
value spaces that are countably infinite and exact (see [Exact and Approximate])
value spaces that are countably infinite and approximate (see [Exact and Approximate])

Every conceptually finite value space is necessarily exact. No computational datatype is uncountably infinite.

Ed. Note: Currently, cardinality is not specified for the built-in datatypes provided by this specification; additionally, there is no means to specify a cardinality on user-generated datatypes. This will be addressed in a future draft.

2.5.1.4 Exact and Approximate

The computational model of a datatype may limit the degree to which values of the datatype can be distinguished. If every value in the value space of the conceptual datatype is distinguishable in the computational model from every other value in the value space, then the datatype is said to be exact.

Certain mathematical datatypes having values which do not have finite representations are said to be approximate, in the following sense:

Let M be the mathematical datatype and C be the corresponding computational datatype, and let P be the mapping from the value space of M to the value space of C. Then for every value v' in C, there is a corresponding value v in M and a real value h such that P(x) = v' for all x in M such that |v - x| < h. That is, v' is the approximation in C to all values in M which are "within distance h of value v". Furthermore, for at least one value v' in C, there is more than one value y in M such that P(y) = v', and thus C is not an exact model of M.

In this specification, all approximate datatypes have computational models which specify, via parametric values, a degree of approximation, that is, they require a certain minimum set of values of the mathematical datatype to be distinguishable in the computational datatype.
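
The sense in which a computational datatype is approximate can be illustrated in Python. In this sketch, exact rationals (Fraction) stand in for the mathematical datatype M and binary floating point for the computational datatype C; both choices are assumptions made for illustration.

```python
from fractions import Fraction

def P(x):
    """Map a value of M (exact rationals) to C (binary floating point)."""
    return float(x)

x = Fraction(1, 10)   # the mathematical value one tenth
y = Fraction(P(x))    # the exact rational value of the float nearest to 1/10

# x and y are distinct values in M that map to the same value in C.
collides = (x != y) and (P(x) == P(y))
```

Because at least two distinct values of M share one image in C, C is not an exact model of M.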

Ed. Note: Currently, exactness is not specified for the built-in datatypes provided by this specification; additionally, there is no means to specify exactness for user-generated datatypes. This will be addressed in a future draft.

2.5.1.5 Numeric

A datatype is said to be numeric if its values are conceptually quantities (in some mathematical number system). A datatype whose values do not have this property is said to be non-numeric.

2.5.2 Constraining or Non-fundamental facets

Constraining facets are optional properties that can be applied to a datatype to (further) constrain its value space. Constraining the value space consequently constrains the allowed lexical representations. Adding constraining facets to a [Base type] is used to define [User-generated datatypes].

Ed. Note: should we consider units/dimensionality now? or wait for a further draft? Note that [timePeriod] implicitly has units.

2.5.2.1 Length

[Definition: ] For the [string] datatype, length specifies the maximum number of allowable characters in the string. For the [binary] datatype it specifies the maximum length in bytes.

Ed. Note: We need to ultimately reconcile the notion of string length with the resolution of the i18n issues around character, indexing, etc.

2.5.2.2 Maximum Length

[Definition: ] The maxlength facet indicates the maximum length, in bytes, of a [string] datatype for which the length facet is not specified.

2.5.2.3 Lexical representation

The datatypes defined in this specification are defined in terms of abstract value spaces and their properties as opposed to how values are lexically represented in XML instances. However, the lexical representation of values is of prime importance in many applications. Because of this importance, each [Primitive datatypes] definition includes a detailed description of its default [Lexical Space]. [Definition: ] The lexical representation facet can be used to constrain the allowable representations, or literals, for values of a datatype. The meaning of the lexical representation facet depends on the datatype to which it is applied.

For example, for [string], values for the lexical representation facet are either [Pictures] or [Regular Expressions], while for [dateTime], values are derived from [ISO 8601] and [SQL].

2.5.2.4 Enumeration

[Definition: ] Presence of an enumeration facet constrains the value space of the datatype to the values in the specified list. The enumeration facet can be applied to any datatype. No order or any other relationship is implied between the elements of the enumeration list.
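
A minimal Python sketch of an enumeration check, using the "importance" values suggested in the introduction as an assumed enumeration list. An unordered frozenset is used deliberately, since no order is implied between the elements.

```python
# Assumed enumeration list for the memo's "importance" attribute.
IMPORTANCE = frozenset({"low", "medium", "high"})

def satisfies_enumeration(value, allowed):
    """Return True if value is one of the enumerated values."""
    return value in allowed
```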

2.5.2.5 maxInclusive

[Definition: ] The maxInclusive facet determines the upper bound of the value space for a datatype with the [Order] property. The maximum value specified with this facet is inclusive in the sense that the value specified for the facet is itself included in the value space for the datatype.

2.5.2.6 maxExclusive

[Definition: ] The maxExclusive facet determines the upper bound of the value space for a datatype with the [Order] property. The maximum value specified with this facet is exclusive in the sense that the value specified for the facet is itself excluded from the value space for the datatype.

2.5.2.7 minInclusive

[Definition: ] The minInclusive facet determines the lower bound of the value space for a datatype with the [Order] property. The minimum value specified with this facet is inclusive in the sense that the value specified for the facet is itself included in the value space for the datatype.

2.5.2.8 minExclusive

[Definition: ] The minExclusive facet determines the lower bound of the value space for a datatype with the [Order] property. The minimum value specified with this facet is exclusive in the sense that the value specified for the facet is itself excluded from the value space for the datatype.
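
The difference between the inclusive and exclusive variants of the four bounding facets can be sketched in Python as follows. The function name and keyword parameters are hypothetical and only mirror the facet names; the sketch assumes values that support Python's comparison operators.

```python
def satisfies_bounds(value, min_inclusive=None, min_exclusive=None,
                     max_inclusive=None, max_exclusive=None):
    """Return True if value lies within every bound that is specified."""
    if min_inclusive is not None and value < min_inclusive:
        return False
    if min_exclusive is not None and value <= min_exclusive:
        return False
    if max_inclusive is not None and value > max_inclusive:
        return False
    if max_exclusive is not None and value >= max_exclusive:
        return False
    return True
```

For example, the value 10 satisfies a maxInclusive of 10 but not a maxExclusive of 10.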

2.6 Datatype dichotomies

It is useful to categorize the datatypes defined in this specification along various dimensions, forming a set of characterization dichotomies.

2.6.1 Atomic vs. aggregate datatypes

The first distinction to be made is that between atomic and aggregate datatypes.

[Definition: ] Atomic datatypes are those having values which are intrinsically indivisible.
[Definition: ] Aggregate datatypes are those having values which can be decomposed into two or more component values.

For example, a date that is represented as a single character string could be the value of an atomic date datatype; while another date represented as separate "month", "day" and "year" elements would be the value of an aggregate date datatype. Not surprisingly, the distinction is analogous to that between an XML element whose content model is #PCDATA and one with element content.

As discussed above, this specification focuses mainly on atomic datatypes. Later versions will address aggregate datatypes in more detail. Note that the XML attribute types [IDREFS], [ENTITIES] and [NMTOKENS] can be thought of as aggregate (list) types.

A datatype which is atomic in this specification need not be an "atomic" datatype in any programming language used to implement this specification.

2.6.2 Primitive vs. generated datatypes

[Definition: ] Primitive datatypes are those that are not defined in terms of other datatypes; they exist ab initio.
[Definition: ] Generated datatypes are those that are defined in terms of other datatypes.

For example, a [number] is a well defined mathematical concept that cannot be defined in terms of other datatypes while a [date] is a special case of the more general datatype [dateTime].

The datatypes defined by this specification fall into both the primitive and the generated categories. It is felt that a judiciously chosen set of primitive datatypes will serve the widest possible audience by providing a set of convenient datatypes that can be used as is, as well as providing a rich enough base from which the variety of datatypes needed by schema designers can be generated.

A datatype which is primitive in this specification need not be a "primitive" datatype in any programming language used to implement this specification.

2.6.2.1 Base type

[Definition: ] Every generated datatype is defined in terms of an existing datatype, referred to as the base type. Base types may be either primitive or generated.

In the example above, [date] is referred to as a subtype of the base type [dateTime]. The value space of a subtype is a subset of the value space of the base type.
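
The subset relationship between a subtype and its base type can be sketched in Python. The Datatype class, the predicates, and the smallInteger type below are hypothetical, and the choice of number as the base of integer is an assumption; the point is only that a value must be accepted by every base type before the subtype's own constraint is applied.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass(frozen=True)
class Datatype:
    name: str
    accepts: Callable[[object], bool]    # constraint added by this datatype
    base: Optional["Datatype"] = None    # None for a primitive datatype

    def contains(self, value):
        """A value belongs to the datatype only if every base type contains it."""
        if self.base is not None and not self.base.contains(value):
            return False
        return self.accepts(value)

number = Datatype(
    "number",
    lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
)
integer = Datatype(
    "integer",
    lambda v: isinstance(v, int) or v.is_integer(),
    base=number,
)
small_integer = Datatype(
    "smallInteger",
    lambda v: -128 <= v <= 127,
    base=integer,
)
```

By construction, every value contained in small_integer is also contained in integer and in number.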

2.6.3 Built-in vs. user-generated datatypes

[Definition: ] Built-in datatypes are those which are entirely defined in this specification, and may be either primitive or generated;
[Definition: ] User-generated datatypes are those generated datatypes whose base types are built-in datatypes or user-generated datatypes and are defined by individual schema designers by giving values to constraining facets.

Conceptually there is no difference between the built-in generated datatypes included in this specification and the user-generated datatypes which will be created by individual schema designers. The built-in generated datatypes are those which are believed to be so common that if they were not defined in this specification many schema designers would end up reinventing them. Furthermore, including these generated datatypes in this specification serves to demonstrate the mechanics and utility of the datatype generation facilities of this specification.

A datatype which is built-in in this specification need not be a "built-in" datatype in any programming language used to implement this specification.

3. Built-in datatypes

Issue (nulls): A future revision of this specification will provide a general mechanism for specifying the difference between "null" and "not present" for all datatypes. Exactly what that mechanism will be is an open issue at this point.

3.1 Namespace considerations

The built-in datatypes defined by this specification are designed so that systems other than XML Schema may be able to use them. To facilitate such usage, the built-in datatypes in this specification come from the XML Datatype namespace, the namespace defined by this specification. This applies to both built-in primitive and built-in generated datatypes.

NOTE: The exact URLs for the namespace(s) defined by this W3C specification are still an open issue. This issue has been raised with the XML Coordination Group (issue 1999-0201-07 Standardizing W3C namespace URIs) for general coordination and resolution.

Each user-generated datatype is also associated with a unique namespace. However, user-generated datatypes do not come from the XML Datatype namespace; rather, they come from the namespace of the schema in which they are defined. Note that associating a namespace with a user-generated datatype is not a general purpose extensibility mechanism and does not apply to primitive datatypes. Suppose a schema author wanted to introduce a new set of primitive datatypes, say a core set of mathematical datatypes not based on the Number datatype defined as a built-in primitive by this specification. Such a schema author might try to define those datatypes, associate a unique namespace with them and expect schema processors to understand them. Unfortunately, such a scenario would not work. Each such datatype would need specialized validation code and there are still many unresolved issues regarding standard mechanisms for sharing such code.

As described in more detail in [User-generated datatypes], each user-generated datatype must be defined in terms of a base type included in this specification or a user-generated datatype by assigning facets which serve to constrain the value set of the user-generated datatype to a subset of the base type. Such a mechanism works because all schema processors are required to be able to validate datatypes defined by subsetting the value space of a datatype included in this specification.

3.2 Primitive datatypes

The primitive datatypes are described below. For each primitive datatype we discuss the fundamental facets, if any, and the constraining facets, if any.

3.2.1 ID

This is the ID datatype from [XML]. It applies only to attribute values. ID has no fundamental or constraining facets.

Validity Constraint: ID

Values of type [ID] must match the Name production. A name must not appear more than once in an XML document as a value of this type; i.e., ID values must uniquely identify the elements which bear them.
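
The uniqueness part of this constraint can be sketched in Python. Which attributes are of type ID is normally declared in a DTD or schema; this sketch simply assumes that every attribute named "id" is of type ID, and it does not check the Name production.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def duplicate_ids(xml_text, id_attr="id"):
    """Return the ID values that appear more than once, in sorted order."""
    root = ET.fromstring(xml_text)
    counts = Counter(
        el.get(id_attr) for el in root.iter() if el.get(id_attr) is not None
    )
    return sorted(value for value, n in counts.items() if n > 1)
```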

Ed. Note: There are several situations in which we need better reference mechanisms than those provided by ID and IDREF/IDREFS. For example, it would be desirable to extend IDs and IDREFs to be typed and scoped to better represent primary key/foreign key relationships in a database. XSL has recently introduced the concept of xsl:key and xsl:keyref whereby a single property of an element can be used as a key. We need such a mechanism for XML as a whole and it would be nice if this were extended to support multi-part keys.

3.2.2 IDREF

This is the IDREF datatype from [XML]. It applies only to attribute values. IDREF has no fundamental or constraining facets.

Validity Constraint: IDREF

Values of type IDREF must match the Name production; each Name must match the value of an ID attribute on some element in the XML document; i.e. IDREF values must match the value of some ID attribute.

3.2.3 IDREFS

This is the IDREFS datatype from [XML]. It applies only to attribute values. IDREFS has no fundamental or constraining facets.

Validity Constraint: IDREFS

Values of type IDREFS must match the Names production; each Name must match the value of an ID attribute on some element in the XML document; i.e., each IDREF value must match the value of some ID attribute.
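
A Python sketch of the reference check for IDREFS follows. The attribute names "id" and "refs" are assumptions standing in for attributes declared as ID and IDREFS; an IDREFS value is treated as a list of names separated by white space.

```python
import xml.etree.ElementTree as ET

def dangling_refs(xml_text, id_attr="id", idrefs_attr="refs"):
    """Return the referenced names that match no ID value in the document."""
    root = ET.fromstring(xml_text)
    ids = {el.get(id_attr) for el in root.iter() if el.get(id_attr) is not None}
    dangling = []
    for el in root.iter():
        # An absent attribute contributes no references.
        for name in (el.get(idrefs_attr) or "").split():
            if name not in ids:
                dangling.append(name)
    return dangling
```

An attribute of type IDREF is the special case in which the list holds exactly one name.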

3.2.4 ENTITY

This is the ENTITY datatype from [XML]. It applies only to attribute values. ENTITY has no fundamental or constraining facets.

Validity Constraint: Entity Name

Values of type ENTITY must match the Name production; each Name must match the name of an unparsed entity declared in the schema.

3.2.5 ENTITIES

This is the ENTITIES datatype from [XML]. It applies only to attribute values. ENTITIES has no fundamental or constraining facets.

Validity Constraint: Entity Names

Values of type ENTITIES must match the Names production; each Name must match the name of an unparsed entity declared in the schema.

3.2.6 NMTOKEN

This is the NMTOKEN datatype from [XML]. It restricts the contents to a valid XML name token. NMTOKEN applies only to attribute values and has no fundamental or constraining facets.

Validity Constraint: Name Token

Values of type NMTOKEN must match the Nmtoken production; values of type NMTOKENS must match Nmtokens.

Issue (nmtoken-primitive-or-generated): should NMTOKEN be defined as a primitive (as above) or as a subtype of [string] with a regular expression facet such as "[a-zA-Z0-9_-]+" (or whatever the regular expression actually should be to match the Nmtoken production)? A similar issue also applies to all of the XML attribute types, [ID], [IDREF], [IDREFS], [ENTITY], [ENTITIES] and [NOTATION].
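
To show why the regular expression quoted in the issue is only a placeholder, the following Python sketch compares it with a slightly wider pattern. The Nmtoken production in [XML] also admits "." and ":" as well as many non-ASCII letters, so the second pattern is itself only an assumed ASCII approximation, not the production.

```python
import re

ISSUE_REGEX = re.compile(r"[a-zA-Z0-9_-]+")       # pattern quoted in the issue
ASCII_NMTOKEN = re.compile(r"[a-zA-Z0-9._:-]+")   # assumed ASCII-only approximation

def matches(pattern, text):
    """Return True if the whole of text matches pattern."""
    return pattern.fullmatch(text) is not None
```

A token such as "v1.0" is accepted by the wider pattern but rejected by the pattern quoted in the issue.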

3.2.7 NMTOKENS

This is the NMTOKENS datatype from [XML]. It restricts the contents to a list of valid XML name tokens. NMTOKENS applies only to attribute values and has no fundamental or constraining facets.

Validity Constraint: Name Tokens

Values of type NMTOKENS must match the Nmtokens production.

3.2.8 NOTATION

This is the NOTATION datatype from [XML]. NOTATION has no fundamental or constraining facets.

Validity Constraint: Notation Attributes

Values of this type must match one of the notation names included in the declaration; all notation names in the declaration must be declared.

3.2.9 string

[Definition: ] The string datatype represents character strings in XML. The value space of the string datatype is the set of finite sequences of UCS characters ([ISO 10646] and [Unicode]). A UCS character (or just character, for short) is an atomic unit of communication; it is not further specified except to note that every UCS character has a corresponding UCS code point, which is an integer.
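
The correspondence between characters and code points can be illustrated in Python, whose str type is also a sequence of code points. The sketch also shows that the length of a string in characters can differ from its length in bytes; UTF-8 is used here as an assumed encoding.

```python
def code_points(s):
    """Return the UCS code point of each character in s."""
    return [ord(ch) for ch in s]

word = "caf\u00e9"                                  # "café"
length_in_characters = len(word)                    # counts code points
length_in_utf8_bytes = len(word.encode("utf-8"))    # counts encoded bytes
```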

3.2.9.1 Lexical Representation

The string datatype has an optional constraining facet called [Lexical representation]. The value of this facet is either a picture or a regular expression. Picture types are discussed in Appendix [Pictures] and regular expression constraints are discussed in Appendix [Regular Expressions]. If this facet is not present, there is no restriction on the lexical representation.

Issue (picture-or-regex): Should the values of the [Lexical representation] facet be pictures, regular expressions, both, or some other mechanism?

3.2.9.2 Length

The string datatype has an optional constraining facet called length. If length is specified, the datatype is a fixed-length character string. If length is not specified, the datatype is a variable-length character string.

3.2.9.3 Maximum Length

The string datatype has an optional constraining facet called maxLength. If maxLength is specified for a variable-length string, it represents an upper bound on the length of the string. The length and maxLength facets cannot both be defined for the same datatype. The absolute maximum length of a variable-length character string depends on the XML parser implementation.

3.2.9.4 Maximum and Minimum Values

The string datatype also has the following constraining facets:

maxInclusive
maxExclusive
minInclusive
minExclusive

Clearly, the effect of these constraining facets depends on the collating sequence used to define the order property for strings.

Ed. Note: The issue of collating sequences for strings is complex. It will be discussed in detail in a subsequent version of this specification.

3.2.10 boolean

[Definition: ] The boolean datatype has the value space required to support the mathematical concept of binary-valued logic: {true, false}.

Issue (three-valued-logic): Do we need to add a third value "unknown" to the value space to support three-valued logic? SQL supports this. Will the general mechanism for "nulls" to be defined in a future revision handle this case?

3.2.10.1 Lexical Representation

An instance of a datatype that is defined as boolean can have the following legal lexical values {0, 1, true, false, yes, no}. If a lexical representation facet is not present in the datatype definition then all these lexical values are allowed. A lexical representation facet can be added to the datatype definition to restrict the lexical values to a subset of the above.
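
The relationship between the six lexical values and the two-element value space can be sketched in Python as follows. The function name `parse_boolean` and its `allowed` parameter (standing in for a lexical representation facet) are hypothetical, and the pairing of 1 and yes with true, and of 0 and no with false, is an assumption that this draft does not state explicitly.

```python
# Lexical values listed above, mapped to the value space {true, false}.
# Assumption: "1" and "yes" denote true; "0" and "no" denote false.
BOOLEAN_LEXICAL = {
    "0": False, "false": False, "no": False,
    "1": True, "true": True, "yes": True,
}


def parse_boolean(text, allowed=None):
    """Map a lexical value to a bool.

    `allowed` models a lexical representation facet: an optional subset of
    the six legal lexical values. None means no facet is present.
    """
    permitted = set(BOOLEAN_LEXICAL) if allowed is None else set(allowed)
    if not permitted <= set(BOOLEAN_LEXICAL):
        raise ValueError("the facet may only restrict to a subset of the legal values")
    if text not in permitted:
        raise ValueError(f"{text!r} is not a permitted boolean literal")
    return BOOLEAN_LEXICAL[text]
```

With no facet, `parse_boolean("yes")` is accepted; with `allowed={"true", "false"}` the same literal is rejected.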

3.2.11 number

[Definition: ] The number datatype is the standard mathematical concept of number, including the integers, reals, rationals, etc.

Number has the following constraining facets:

maxInclusive
maxExclusive
minInclusive
minExclusive

Ed. Note: The motivation for the number datatype was to allow the user to specify a value that could take any legal lexical representation for a numeric quantity. This turns out to be problematic, however, due to the large number of extant representations some of which cannot be distinguished without extra information. Having to define a lexical representation facet for number seems to defeat its purpose and, thus, the number datatype may be removed from this specification.

3.2.12 dateTime

[Definition: ] The dateTime datatype represents a combination of date and time values as defined in [SQL] and in [ISO 8601] encoded as a single string. An optional facet specifies the lexical representation. If this is not specified, all lexical representation formats conforming to [ISO 8601] Section 5 and [SQL] are acceptable.

Issue (non-gregorian-dates): Both standards are limited to Gregorian dates. As an internationalization issue, do we want support for non-Gregorian dates? This issue also applies to [date], [time] and [timePeriod].

3.2.12.1 Lexical Representation

If this facet is specified its value must correspond to a legal representation for combinations of dates and times as defined in [SQL] and sections 5.1 and 5.4 of [ISO 8601].

Issue (dateTime-lexical-representation): We need to spell out the various SQL and ISO 8601 representations (e.g., CCYYMMDD and CCYY-MM-DD, etc.) in detail here, or in a (non-normative) appendix. We may also want to support additional formats; e.g., neither SQL nor ISO 8601 seems to support the 12/25/1999 format for date. A lexical representation for dateTime as a collection of elements may also be desirable. This issue also applies to [date], [time] and [timePeriod].

3.2.13 binary

[Definition: ] The binary datatype represents strings (blobs) of binary data. It has two fundamental facets. The optional length facet specifies the length of the data. If the length is not specified the default is unlimited length. The optional "encoding" facet specifies the encoding which may be "hex" for hexadecimal digits or "base64" for MIME style Base64 data.
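
The two encodings named above can be illustrated with a short Python sketch. The function name `decode_binary` is hypothetical, and two points are assumptions that this draft leaves open: that whitespace inside the literal is insignificant, and that the length facet counts decoded octets rather than encoded characters.

```python
import base64
import binascii


def decode_binary(text, encoding, length=None):
    """Decode a binary literal; `encoding` is "hex" or "base64"."""
    compact = "".join(text.split())  # assumption: whitespace is insignificant
    if encoding == "hex":
        data = bytes.fromhex(compact)  # raises ValueError on invalid digits
    elif encoding == "base64":
        try:
            data = base64.b64decode(compact, validate=True)
        except binascii.Error as exc:
            raise ValueError("invalid base64 data") from exc
    else:
        raise ValueError(f"unknown encoding {encoding!r}")
    # Assumption: the length facet counts decoded octets.
    if length is not None and len(data) != length:
        raise ValueError(f"expected {length} octets, got {len(data)}")
    return data
```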

3.2.14 uri

[Definition: ] The uri datatype represents a Uniform Resource Identifier (URI) Reference as defined in [RFC 2396]. It has no fundamental or constraining facets.

Issue (uri-scheme-facet): should we have a facet to allow a limitation to a specific scheme? It might be useful to be able to say that something was not only a URI, but that it was a "mailto" and not a "http://...".

3.3 Generated datatypes

This section gives conceptual definitions for all built-in generated datatypes defined by this specification, including a description of the facets which apply to each datatype. The concrete syntax used to define generated datatypes (whether built-in or user-generated) is given in section [User-generated datatypes] and the complete definitions of the built-in generated datatypes (written in that concrete syntax) are provided in Appendix [Built-in Generated Datatype Definitions (normative)].

3.3.1 integer

[Definition: ] The integer datatype corresponds to the standard mathematical concept of integer numbers. The value space of the integer datatype is the infinite set {...,-2,-1,0,1,2,...}, although computer implementations restrict this to a finite set. The basetype of integer is [number].

Integer has the following constraining facets:

maxInclusive
maxExclusive
minInclusive
minExclusive

3.3.1.1 Lexical representation

If this optional facet is not specified, standard integer representations are acceptable. These consist of a string of digits with an optional sign or optional parentheses to indicate a negative number. Optional commas may appear after every three digits. For example: -1, 12678967543233, (100,000).

This facet must be specified if other lexical representations are desired such as the European format that allows periods after every three digits.
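
One possible reading of the standard integer representations described above is sketched below in Python. The function name `parse_integer` is hypothetical. The sketch assumes that "optional sign" admits both "+" and "-", that a sign and parentheses are not combined, and that commas, when used, must group the digits in threes counted from the right.

```python
import re

# Digits either grouped in threes by commas ("100,000") or ungrouped ("100000").
_DIGITS = r"(?:\d{1,3}(?:,\d{3})+|\d+)"
# Either an optional sign followed by digits, or digits in parentheses
# (the parentheses mark a negative number).
_INTEGER = re.compile(
    rf"(?P<sign>[+-])?(?P<plain>{_DIGITS})|\((?P<paren>{_DIGITS})\)"
)


def parse_integer(text):
    """Convert a standard integer representation to an int."""
    match = _INTEGER.fullmatch(text)
    if match is None:
        raise ValueError(f"{text!r} is not a standard integer representation")
    if match.group("paren") is not None:
        return -int(match.group("paren").replace(",", ""))
    value = int(match.group("plain").replace(",", ""))
    return -value if match.group("sign") == "-" else value
```

The three examples from the text parse to -1, 12678967543233 and -100000 respectively, while a malformed grouping such as `1,00` is rejected.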

3.3.2 decimal

[Definition: ] The decimal datatype restricts allowable values to numbers with an exact fractional part. The basetype of decimal is [number].

Decimal has the following required fundamental facets:

precision: the total number of digits in the number.
scale: the number of digits after the decimal point. Must be less than or equal to precision.

Decimal has the following constraining facets:

maxInclusive
maxExclusive
minInclusive
minExclusive

3.3.2.1 Lexical representation

If this optional facet is not specified, standard decimal representations are acceptable. These consist of a string of digits separated by a period as a decimal indicator, in accordance with the scale and precision facets, with an optional sign or optional parentheses to indicate a negative number. Optional commas may appear after every three digits. For example: -1.23, 12678967.543233, (100,000.00).

This facet must be specified if other lexical representations are desired such as the European format that uses a comma instead of a period as the decimal indicator.

3.3.3 real

[Definition: ] The real datatype is a computational approximation to the standard mathematical concept of real numbers. These are often called floating point numbers. The basetype of real is [number].

Real has the following constraining facets:

maxInclusive
maxExclusive
minInclusive
minExclusive

3.3.3.1 Lexical representation

Real values have a single standard lexical representation consisting of a mantissa followed by the character "E" followed by an exponent. The exponent must be an integer with optional sign without parentheses or commas. The mantissa must be a decimal number with optional sign without parentheses or commas. For example: -1E4, 1267.43233E12, 12.78E-2.
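
The standard lexical representation of real can be approximated with a regular expression, as in the following Python sketch. The function name `parse_real` is hypothetical. The sketch assumes that the "E" is mandatory and upper-case, as in the examples, and that a period in the mantissa must have digits on both sides.

```python
import re

# Mantissa: optional sign, digits, optional fractional part.
# Exponent: "E" followed by an optionally signed integer.
_REAL = re.compile(r"[+-]?\d+(?:\.\d+)?E[+-]?\d+")


def parse_real(text):
    """Convert a standard real representation to a float."""
    if _REAL.fullmatch(text) is None:
        raise ValueError(f"{text!r} is not a standard real representation")
    return float(text)
```

Under these assumptions a plain decimal such as `1.5`, or a mantissa with commas such as `1,000E3`, is rejected.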

3.3.4 date

[Definition: ] The date datatype represents a date value as defined in [ISO 8601] encoded as a single string. The basetype of date is [dateTime]. An optional fundamental facet specifies the lexical format. If this is not specified, all lexical representation formats conforming to [ISO 8601] Section 5.2 and [SQL] are acceptable. For example, 1985-04-12, 19850412.

Issue (other-date-representations): Both ISO and SQL allow only the minus sign, "-", as a separator. Do we want to allow other separators such as the solidus, "/", or colon, ":"? We also need to discuss the aggregate representation for dates.

3.3.4.1 Lexical Representation

This optional facet can be used to restrict the allowed lexical representations. Its value must correspond to a legal representation for dates as defined in Section 5.2 of [ISO 8601] or [SQL].

3.3.5 time

[Definition: ] The time datatype represents a time value as defined in [ISO 8601] encoded as a single string. The basetype of time is [dateTime]. An optional fundamental facet specifies the lexical format. If this is not specified, all lexical representation formats conforming to [ISO 8601] Section 5.3 and [SQL] are acceptable. For example, 23:20:50, 232050.

3.3.5.1 Lexical Representation

This optional facet can be used to restrict the allowed lexical representations. Its value must correspond to a legal representation for time as defined in Section 5.3 of [ISO 8601] or [SQL].

3.3.6 timePeriod

[Definition: ] The timePeriod datatype represents a period of time as defined in [ISO 8601] encoded as a single string. The basetype of timePeriod is [dateTime]. A timePeriod is one of:

a duration of time with a specific start and end. For example, 19990412T232050/19990415T021000, in [ISO 8601] syntax, where the start and end are separated by the solidus, "/", and the date and time by the letter "T".
a duration of time without a specified start and end (e.g., 1 second, 3 months). This corresponds to the [SQL] datatype "interval". For example, 2001-01-05-12.00.00.0000 in SQL syntax.
a duration of time with a specific start but not a specific end. For example, 19990412T232050/P1Y3M15DT12H30M in [ISO 8601] syntax, where the letter "T" separates the date components of the period from its time components.
a duration of time with a specific end but not a specific start. For example, P1Y3M15DT12H30M/19990415T021000 in [ISO 8601] syntax.

The [ISO 8601] formats need to be extended to support:

A minus sign immediately following the "P" to indicate negative periods.

An optional fundamental facet specifies the lexical format. If this is not specified, all lexical representation formats conforming to [ISO 8601] Section 5.5 and [SQL] are acceptable.

3.3.6.1 Lexical Representation

If this facet is specified its value must correspond to a legal representation for time periods as defined in section 5.5 of [ISO 8601] or [SQL].

4. User-generated datatypes

A user-generated datatype can be defined from a built-in datatype by adding optional constraining facets. For example, someone may want to define a datatype called heightInInches from the built-in datatype real by supplying maxInclusive and minInclusive facets. In this case, heightInInches is the name of the new user-generated datatype, real is its base type, and maxInclusive and minInclusive are the constraining facets.

<datatype name="heightInInches">
   <basetype name="real" URI="http://www.w3.org/xmlschemas/datatypes" />
   <minInclusive>
      0.0
   </minInclusive>
   <maxInclusive>
      120.0
   </maxInclusive>
</datatype>
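
The effect of the bounds facets in the definition above can be sketched in Python. The helper names `in_bounds` and `is_height_in_inches` are hypothetical, and Python's `float()` is used only as a stand-in for the lexical rules of the basetype, which it does not reproduce exactly.

```python
def in_bounds(value, min_inclusive=None, max_inclusive=None,
              min_exclusive=None, max_exclusive=None):
    """Return True if value satisfies every bounds facet that is supplied."""
    # Each test is written as "not (value OP bound)" so that unordered
    # values such as NaN fail every bound instead of passing silently.
    if min_inclusive is not None and not value >= min_inclusive:
        return False
    if max_inclusive is not None and not value <= max_inclusive:
        return False
    if min_exclusive is not None and not value > min_exclusive:
        return False
    if max_exclusive is not None and not value < max_exclusive:
        return False
    return True


def is_height_in_inches(text):
    """Check a literal against the heightInInches facets (0.0 to 120.0)."""
    try:
        value = float(text)  # stand-in for the lexical check of the basetype
    except ValueError:
        return False
    return in_bounds(value, min_inclusive=0.0, max_inclusive=120.0)
```

Both end points are accepted because the facets are inclusive: `is_height_in_inches("0.0")` and `is_height_in_inches("120.0")` hold, while `"120.5"` is rejected.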

Ed. Note: The abstract syntax proposed here (and the productions) are preliminary as they allow datatype definitions which are logically inconsistent (e.g., they allow numeric facets on non-numeric datatypes). This will be corrected in future drafts, as the XML Schema language comes to allow the specification of tighter constraints.

Ed. Note: This section needs more explanatory text describing the productions and their relationship to the conceptual framework described in sections [Type System] and [Built-in datatypes].

Datatype definitions

`[1]`	`datatypeDefn`	`::=`	`NCName basetype facets*`	`[ Constraint: Unique datatype definitions ]`
`[2]`	`basetype`	`::=`	`datatypename`
`[3]`	`facets`	`::=`	`ordered \| unordered`	`[ Constraint: Appropriate facets ]`

The following is the definition for a possible built-in generated datatype "currency". Such a datatype definition could appear in the schema which defines datatypes for XML Schemas and shows that a generated datatype can have the same value space as its basetype, which might mean that it is just an alias or renaming of the basetype. In this case, the specification would probably also define some "semantics" for currency which went beyond those of decimal.

<datatype name="currency">
   <basetype name="dt:decimal"/>
</datatype>

Constraint: Unique datatype definitions

The name of the datatype being defined must be unique among the datatypes defined in the containing schema.

Constraint: Appropriate facets

If the value space of the basetype is ordered, then only ordered facets may appear in a datatype definition.

Datatype names

`[4]`	`datatypename`	`::=`	`builtinname \| usergenname`
`[5]`	`builtinname`	`::=`	`ID \| IDREF \| IDREFS \|`
			`NMTOKEN \| NMTOKENS \|`
			`ENTITY \| ENTITIES \|`
			`string \| uri \| binary \|`
			`number \| integer \| real \| decimal \|`
			`dateTime \| date \| time \| timePeriod`
`[6]`	`usergenname`	`::=`	`NCName schemaRef`	`[ Constraint: Datatype name ]`

NOTE: The production labeled datatypename above is not to be confused with that labeled datatypeName in Section 3.3.1 of [Structural Schemas].

Constraint: Datatype name

The name specified must be the name of a datatype defined in the schema in which the user-generated datatype is defined.

Facets

`[7]`	`ordered`	`::=`	`bounds \| numeric`
`[8]`	`unordered`	`::=`	`lexicalRepresentation \| enumeration \| length \| maxLength`

Ordered facets

`[9]`	`bounds`	`::=`	`(minInclusive \| maxInclusive)? (minExclusive \| maxExclusive)?`
`[10]`	`maxInclusive`	`::=`	`literalValue`	`[ Constraint: Literal type ]`
`[11]`	`minInclusive`	`::=`	`literalValue`	`[ Constraint: Literal type ]`
`[12]`	`minExclusive`	`::=`	`literalValue`	`[ Constraint: Literal type ]`
`[13]`	`maxExclusive`	`::=`	`literalValue`	`[ Constraint: Literal type ]`

Constraint: Literal type

The literal value specified must be of the same type as the basetype in the datatype definition in which this facet appears.

Numeric facets

`[14]`	`numeric`	`::=`	`precision \| scale`
`[15]`	`precision`	`::=`	`integerLiteral`
`[16]`	`scale`	`::=`	`integerLiteral`

The following is the definition for a user-generated datatype which could be used to represent monetary amounts for use in a financial management application which does not allow figures above $1M and allows only whole cents. This definition would appear in a schema authored by an end-user (i.e., not in the schema for schemas) and shows how to define a datatype by specifying facet values which constrain the range of the basetype in a manner specific to the basetype. This is different from specifying max/min values as discussed before.

This type could just as well have been defined with the potential built-in generated type "currency", defined above, as its basetype.

<datatype name="amount">
   <basetype name="decimal" URI="http://www.w3.org/xmlschemas/datatypes" />
   <precision>
      8
   </precision>
   <scale>
      2
   </scale>
</datatype>
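
The interaction of precision and scale in the "amount" definition can be sketched in Python with the standard `decimal` module. The function name `fits_decimal` is hypothetical. The sketch treats both facets as upper bounds, in the manner of SQL, so that at most precision minus scale digits may precede the decimal point; this draft does not say whether the facets are exact counts or limits, so that reading is an assumption.

```python
from decimal import Decimal, InvalidOperation


def fits_decimal(text, precision, scale):
    """Return True if text fits within the given precision and scale.

    Assumption: at most `scale` digits after the decimal point and at most
    `precision - scale` digits before it.
    """
    try:
        value = Decimal(text)
    except InvalidOperation:
        return False
    if not value.is_finite():
        return False
    _sign, digits, exponent = value.as_tuple()
    fraction_digits = max(-exponent, 0)
    integer_digits = max(len(digits) + exponent, 0)
    return fraction_digits <= scale and integer_digits <= precision - scale
```

With precision 8 and scale 2, the largest accepted figure is 999999.99, which matches the stated limit of amounts below $1M in whole cents.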

Unordered facets

`[17]`	`length`	`::=`	`integerLiteral`
`[18]`	`maxLength`	`::=`	`integerLiteral`
`[19]`	`enumeration`	`::=`	`literal+`
`[20]`	`lexicalRepresentation`	`::=`	`lexical+`
`[21]`	`lexical`	`::=`	`lexicalSpec`	`[ Constraint: Lexical specification ]`

Constraint: Lexical specification

The lexical specification must be of the "correct" kind, i.e., a dateTime lexical representation for datatypes generated from [dateTime] etc.

The following example is a definition for a user-generated datatype which limits the possible lexical representations of dates to the two main forms found in [ISO 8601] section 5.2.1.1. This datatype definition would appear in a schema authored by an end-user and shows how to define a datatype by restricting the lexical form of its literals. The example also shows how this datatype would be used in an element definition.

<datatype name="myDate">
   <basetype name="date"  URI="http://www.w3.org/xmlschemas/datatypes" />
   <lexicalRepresentation>
      <lexical>
         CCYYMMDD
      </lexical>
      <lexical>
         CCYY-MM-DD
      </lexical>
   </lexicalRepresentation>
</datatype>

<elementType name="shippingDate">
   <datatypeRef name="myDate"/>
</elementType>

Given the definitions above, the following might occur in an instance document.

...
<shippingDate>19990510</shippingDate>
...
<shippingDate>1999-05-10</shippingDate>

Both of the above shipping dates refer to the same "abstract" date, May 10, 1999.
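
The way two lexical forms denote a single value can be sketched in Python. The function name `parse_my_date` and the mapping from the two pictures to `strptime` patterns are assumptions made for illustration.

```python
from datetime import date, datetime

# The two lexical forms of myDate and the strptime pattern assumed for each.
MY_DATE_FORMATS = {
    "CCYYMMDD": "%Y%m%d",
    "CCYY-MM-DD": "%Y-%m-%d",
}


def parse_my_date(text):
    """Return the date denoted by a myDate literal in either lexical form."""
    text = text.strip()
    for picture, pattern in MY_DATE_FORMATS.items():
        # strptime tolerates missing zero padding, so the length is checked
        # against the picture first.
        if len(text) != len(picture):
            continue
        try:
            return datetime.strptime(text, pattern).date()
        except ValueError:
            continue
    raise ValueError(f"{text!r} is not a valid myDate literal")
```

Both `parse_my_date("19990510")` and `parse_my_date("1999-05-10")` yield the same `date` value, while a form outside the facet, such as `05/10/1999`, is rejected.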

The following example is a datatype definition for a user-generated datatype which limits the possible literal values of dates to the four US holidays enumerated. This datatype definition would appear in a schema authored by an end-user and shows how to define a datatype by enumerating the values in its value space. The enumerated values must be type-valid literals for the basetype.

<datatype name="holidays">
   <basetype name="date" URI="http://www.w3.org/xmlschemas/datatypes" />
   <enumeration>
      <literal>
         --0101    <!-- New Year's Day -->
      </literal>
      <literal>
         --0704    <!-- 4th of July -->
      </literal>
      <literal>
         --1125    <!-- Thanksgiving -->
      </literal>
      <literal>
         --1225    <!-- Christmas -->
      </literal>
   </enumeration>
</datatype>

Literals

`[22]`	`literal`	`::=`	`literalValue`
`[23]`	`literalValue`	`::=`	`stringLiteral \| numericLiteral \| dateTimeLiteral \| uriLiteral`
`[24]`	`stringLiteral`	`::=`	`(see [string])`
`[25]`	`numericLiteral`	`::=`	`integerLiteral \| realLiteral \| decimalLiteral`
`[26]`	`integerLiteral`	`::=`	`(see [integer])`
`[27]`	`realLiteral`	`::=`	`(see [real])`
`[28]`	`decimalLiteral`	`::=`	`(see [decimal])`
`[29]`	`dateTimeLiteral`	`::=`	`(see [dateTime])`
`[30]`	`uriLiteral`	`::=`	`(see [uri])`

Issue (definition-overriding): should it be possible to specify a value for a non-fundamental facet on an element or attribute of a given datatype in an instance document or in an element or attribute definition in a schema (see Section 3.4.4 of [Structural Schemas])? If so, what syntax should be used? This needs to be coordinated with the structural schema editorial team.

5. Conformance

The XML specification [XML] defines two levels of conformance. Well-formed documents conform to valid XML syntax but may or may not obey the constraints defined by a DTD. Valid XML documents conform to the structure laid down in a DTD. Thus, if a DTD defines an attribute as an ID, instances of XML documents conforming to the DTD can only be valid if the values of such attributes are valid XML names and are unique in the document. By introducing additional datatypes to XML, this specification extends the notion of validity in the sense that values defined to have a certain datatype in the schema must conform to the lexical representations allowed for that datatype. That is all that is expected of an XML processor: this specification defines neither expressions nor operations on datatypes.

In some cases, datatypes will not be specified in the schema but will be specified in XML documents. In other cases, datatypes in the documents will be specialized versions of datatypes specified for the same component in the schema. Validating XML processors should be able to validate the format of values in XML documents in these cases as well.

Appendices

A. Schema for Datatype Definitions (normative)

<?xml version='1.0'?> 
<!DOCTYPE schema PUBLIC '-//W3C//DTD XSDL 19990506//EN' 
                        'http://www.w3.org/1999/05/06-xsdl/WD-xsdl.dtd'>

<schema xmlns='http://www.w3.org/TR/1999/WD-xdtl-19990506.xsd'
        name='http://www.w3.org/TR/1999/WD-xdtl-19990506.xsd'
        version='0.1'>
<modelGroup name="ordered">
   <choice>
      <modelGroupRef name="bounds"/>
      <modelGroupRef name="numeric"/>
   </choice>
</modelGroup>
<modelGroup name="bounds">
   <choice>
      <sequence>
         <elementTypeRef name="minInclusive" minOccur="0" maxOccur="1"/>
         <elementTypeRef name="maxInclusive" minOccur="0" maxOccur="1"/>
      </sequence>
      <sequence>
         <elementTypeRef name="minExclusive" minOccur="0" maxOccur="1"/>
         <elementTypeRef name="maxExclusive" minOccur="0" maxOccur="1"/>
      </sequence>
   </choice>
</modelGroup>
<modelGroup name="numeric">
   <choice>
      <elementTypeRef name="precision"/>
      <elementTypeRef name="scale"/>
   </choice>
</modelGroup>
<modelGroup name="unordered">
   <choice>
      <elementTypeRef name="lexicalRepresentation"/>
      <elementTypeRef name="enumeration"/>
      <elementTypeRef name="length"/>
      <elementTypeRef name="maxLength"/>
   </choice>
</modelGroup>

<elementType name="datatype">
   <sequence>
      <elementTypeRef name="basetype"/>
      <choice minOccur="0" maxOccur="*">
         <modelGroupRef name="ordered"/>
         <modelGroupRef name="unordered"/>
      </choice>
   </sequence>
   <attrDecl name="name" required="true">
      <datatypeRef name="NMTOKEN"/>
   </attrDecl>
   <attrDecl name="export">
      <datatypeRef name="boolean">
         <default>true</default>
      </datatypeRef>
   </attrDecl> 
</elementType>
<elementType name="basetype">
   <empty/>
   <attrDecl name="name" required="true">
      <datatypeRef name="NMTOKEN"/>
   </attrDecl>
   <attrDecl name="schemaAbbrev"> 
      <datatypeRef name="NMTOKEN"/> 
   </attrDecl>
   <attrDecl name="schemaName">
      <datatypeRef name="uri"/>
   </attrDecl>
</elementType>
<elementType name="maxExclusive">
   <datatypeRef name="string"/> <!-- the datatype depends on the basetype -->
</elementType>
<elementType name="minExclusive">
   <datatypeRef name="string"/> <!-- the datatype depends on the basetype -->
</elementType>
<elementType name="maxInclusive">
   <datatypeRef name="string"/> <!-- the datatype depends on the basetype -->
</elementType>
<elementType name="minInclusive">
   <datatypeRef name="string"/> <!-- the datatype depends on the basetype -->
</elementType>
<elementType name="precision">
   <datatypeRef name="integer"/>
</elementType>
<elementType name="scale">
   <datatypeRef name="integer"/>
</elementType>
<elementType name="length">
   <datatypeRef name="integer"/>
</elementType>
<elementType name="maxLength">
   <datatypeRef name="integer"/>
</elementType>
<elementType name="enumeration">
   <sequence minOccur="1" maxOccur="*">
      <elementTypeRef name="literal"/>
   </sequence>
</elementType>
<elementType name="literal">
   <datatypeRef name="string"/>  <!-- the datatype depends on the basetype -->
</elementType>
<elementType name="lexicalRepresentation">
   <sequence minOccur="1" maxOccur="*">
      <elementTypeRef name="lexical"/>
   </sequence>
</elementType>
<elementType name="lexical">
   <datatypeRef name="string"/>  <!-- the datatype depends on the basetype -->
</elementType>
</schema>

B. DTD for Datatype Definitions (normative)

<!ENTITY % numeric "precision | scale">
<!ENTITY % bounds "((minInclusive | minExclusive)?, 
  (maxInclusive | maxExclusive)?)">
<!ENTITY % ordered "%bounds; | %numeric;">
<!ENTITY % unordered "lexicalRepresentation | enumeration 
  | length | maxLength">

<!ELEMENT datatype (basetype, (%ordered; | %unordered;)*)>
<!ATTLIST datatype
	name NMTOKEN #REQUIRED
	export (true|false) "true">

<!ELEMENT basetype EMPTY>
<!ATTLIST basetype
   name NMTOKEN #REQUIRED
   schemaAbbrev NMTOKEN #IMPLIED
   schemaName CDATA #IMPLIED>

<!ELEMENT maxExclusive (#PCDATA)>
<!ELEMENT minExclusive (#PCDATA)>
<!ELEMENT maxInclusive (#PCDATA)>
<!ELEMENT minInclusive (#PCDATA)>

<!ELEMENT precision (#PCDATA)>
<!ELEMENT scale (#PCDATA)>

<!ELEMENT length (#PCDATA)>
<!ELEMENT maxLength (#PCDATA)>
<!ELEMENT enumeration (literal)+>
<!ELEMENT literal (#PCDATA)>
<!ELEMENT lexicalRepresentation (lexical)+>
<!ELEMENT lexical (#PCDATA)>

C. Built-in Generated Datatype Definitions (normative)

This section gives the datatype definitions for all built-in generated datatypes. These definitions are to appear in the "schema for schemas" and not in schema instances written by end-users.

Ed. Note: this section needs to be expanded to include all built-in generated datatypes defined in [Generated datatypes]

<datatype name="date">
   <basetype name="dateTime" URI="http://www.w3.org/xmlschemas/datatypes" />
   <lexicalRepresentation>
      <!--ISO 8601 section 5.2.1.1 and SQL-->
      <lexical>
         CCYYMMDD  <!-- 19850412 ==> April 12, 1985-->
      </lexical>
      <lexical>
         CCYY-MM-DD<!-- 1985-04-12 ==> April 12, 1985-->
      </lexical>
      <!--ISO 8601 section 5.2.1.2-->
      <lexical>
         CCYY-MM   <!-- 1985-04 ==> April, 1985-->
      </lexical>
      <lexical>
         CCYY      <!-- 1985 ==> 1985-->
      </lexical>
      <lexical>
         CC        <!-- 19 ==> the 1900's-->
      </lexical>
      <!--ISO 8601 section 5.2.1.3-->
      <lexical>
         YYMMDD    <!-- 850412 ==> April 12, '85 (in the current century)-->
      </lexical>
      <lexical>
         YY-MM-DD  <!-- 85-04-12 ==> April 12, '85 (in the current century)-->
      </lexical>
      <lexical>
         -YYMM     <!-- -8504 ==> April, '85 (in the current century)-->
      </lexical>
      <lexical>
         -YY-MM    <!-- -85-04 ==> April, '85 (in the current century)-->
      </lexical>
      <lexical>
         -YY       <!-- -85 ==> '85 (in the current century)-->
      </lexical>
      <lexical>
         --MMDD    <!-- --0412 ==> April 12-->
      </lexical>
      <lexical>
         --MM-DD   <!-- --04-12 ==> April 12-->
      </lexical>
      <lexical>
         --MM      <!-- --04 ==> April-->
      </lexical>
      <lexical>
         ---DD     <!-- ---12 ==> the 12th-->
      </lexical>
      <!--ISO 8601 section 5.2.2.1-->
      <lexical>
         CCYYDDD   <!-- 1985102 ==> April 12, 1985 (i.e., the 102nd day of 1985)-->
      </lexical>
      <lexical>
         CCYY-DDD  <!-- 1985-102 ==> April 12, 1985 (i.e., the 102nd day of 1985)-->
      </lexical>
      <!--ISO 8601 section 5.2.2.2-->
      <lexical>
         YYDDD     <!-- 85102 ==> April 12, '85 (in the current century) (i.e., the 102nd day of '85)-->
      </lexical>
      <lexical>
         YY-DDD    <!-- 85-102 ==> April 12, '85 (in the current century) (i.e., the 102nd day of '85)-->
      </lexical>
      <lexical>
         -DDD      <!-- -102 ==> April 12 (i.e., the 102nd day of the year)-->
      </lexical>
      <!--ISO 8601 section 5.2.3.1-->
      <lexical>
         CCYYWwwD  <!-- 1985W155 ==> April 12, 1985 (i.e., the 5th day of the 15th week of 1985)-->
      </lexical>
      <lexical>
         CCYY-Www-D<!-- 1985-W15-5 ==> April 12, 1985 (i.e., the 5th day of the 15th week of 1985)-->
      </lexical>
      <!--ISO 8601 section 5.2.3.2-->
      <lexical>
         CCYYWww   <!-- 1985W15 ==> the 15th week of 1985-->
      </lexical>
      <lexical>
         CCYY-Www  <!-- 1985-W15 ==> the 15th week of 1985-->
      </lexical>
      <!--ISO 8601 section 5.2.3.3-->
      <lexical>
         YYWwwD    <!-- 85W155 ==> April 12, '85 (in the current century) (i.e., the 5th day of the 15th week of '85)-->
      </lexical>
      <lexical>
         YY-Www-D  <!-- 85-W15-5 ==> April 12, '85 (in the current century) (i.e., the 5th day of the 15th week of '85)-->
      </lexical>
      <lexical>
         YYWww     <!-- 85W15 ==> the 15th week of '85 (in the current century)-->
      </lexical>
      <lexical>
         YY-Www    <!-- 85-W15 ==> the 15th week of '85 (in the current century)-->
      </lexical>
      <lexical>
         -YWwwD    <!-- -5W155 ==> April 12, ''5 (in the current decade)-->
      </lexical>
      <lexical>
         -Y-Www-D  <!-- -5-W15-5 ==> April 12, ''5 (in the current decade)-->
      </lexical>
      <lexical>
         -WwwD     <!-- -W155 ==> April 12 (in the current year)-->
      </lexical>
      <lexical>
         -Www-D    <!-- -W15-5 ==> April 12 (in the current year)-->
      </lexical>
      <lexical>
         -Www      <!-- -W15 ==> 15th week in the current year-->
      </lexical>
      <lexical>
         -W-D      <!-- -W-5 ==> Friday (of the current week) (i.e., the 5th day)-->
      </lexical>
      <lexical>
         ---D      <!-- ---5 ==> Friday (of any week in any year)-->
      </lexical>
   </lexicalRepresentation>
</datatype>

D. Pictures

"Pictures", similar to those in [COBOL] picture clauses, can be used to constrain the format of strings and in some cases control their conversion to numbers. A picture is an alphanumeric string consisting of character symbols. Each symbol, which is usually one character but may be two characters, is a placeholder that stands for a set of characters. For example, the picture "A" stands for a single alphabetic character.

The following is a list of picture symbols and their meanings.

A: A single alphabetic character.
B: A single blank character.
E: The character E, used to indicate floating point numbers.
S: The leftmost character of a picture indicating a signed number. The characters "+" or "-" may appear in the S position.
V: An implied decimal sign. The input 1234 validated by a picture 99V99 is converted into 12.34.
X: Any character.
Z: The leftmost leading numeric character that can be replaced by a space character when the content of that character position is a zero.
9: Any numeric character.
1: Any boolean character (0 or 1).
0, /, -, ., and ,: Each represents itself.
cs: A placeholder for an appropriate currency symbol.

Here are some examples of picture constraints:

  $123,45.90   satisfies picture $999,99.99
  $123,45.90   satisfies picture XXXX,XX.XX
  123-45-5678  satisfies picture 999-99-9999  (Social Security Number)
  24E80        satisfies picture 99E99        (floating point)
  23.45        satisfies picture 99.99
  2345         satisfies picture 99V99        (translates to 23.45)
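
Checking a value against a picture can be sketched in Python by translating the picture into a regular expression. The function names `picture_to_regex` and `satisfies` are hypothetical. The sketch covers only the symbols A, B, X, 9 and 1; it treats every other character (including "$" and "E") as standing for itself, takes "alphabetic" to mean ASCII letters, and does not handle the symbols S, V and Z or the two-character symbol cs. All of these choices are assumptions.

```python
import re

# Regular-expression fragment for each supported picture symbol.
_SYMBOLS = {
    "A": "[A-Za-z]",  # assumption: ASCII letters only
    "B": " ",
    "X": ".",
    "9": "[0-9]",
    "1": "[01]",
}
# Symbols whose meaning depends on position or conversion; not handled here.
_UNSUPPORTED = set("SVZ")


def picture_to_regex(picture):
    """Translate a picture into a compiled regular expression."""
    parts = []
    for symbol in picture:
        if symbol in _UNSUPPORTED:
            raise ValueError(f"picture symbol {symbol!r} is not supported by this sketch")
        # Any symbol not in the table stands for itself.
        parts.append(_SYMBOLS.get(symbol, re.escape(symbol)))
    return re.compile("".join(parts), re.DOTALL)


def satisfies(value, picture):
    """Return True if the whole value matches the picture."""
    return picture_to_regex(picture).fullmatch(value) is not None
```

Under these assumptions, every example above that avoids the V symbol is accepted by `satisfies`.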

E. Regular Expressions

Ed. Note: The following description of regular expressions is copied (with slight modification) by permission from the documentation of the [Perl] programming language.

Issue (perl-regex): Should the final recommendation use Perl's regular expression "extensions"?

[Definition: ] Regular expressions, similar to those in [Perl], can be used to constrain the format of strings. A regular expression is an alphanumeric string consisting of character symbols. Each symbol, which is usually one character but may be two characters, is a placeholder that stands for a set of characters.

Any single character matches itself, unless it is a metacharacter with a special meaning described here or above. You can cause characters that normally function as metacharacters to be interpreted literally by prefixing them with a "\" (e.g., "\." matches a ".", not any character; "\\" matches a "\"). A series of characters matches that series of characters in the target string, so the pattern blurfl would match "blurfl" in the target string.

You can specify a character class, by enclosing a list of characters in [], which will match any one character from the list. If the first character after the "[" is "^", the class matches any character not in the list. Within a list, the "-" character is used to specify a range, so that a-z represents all characters between "a" and "z", inclusive. If you want "-" itself to be a member of a class, put it at the start or end of the list, or escape it with a backslash. (The following all specify the same class of three characters: [-az], [az-], and [a\-z]. All are different from [a-z], which specifies a class containing twenty-six characters.)

Certain characters are used as metacharacters. The following list contains all of the metacharacters and their meanings.

\: Quote the next metacharacter
^: Match the beginning of the line
.: Match any character (except newline)
$: Match the end of the line (or before newline at the end)
|: Alternation
(): Grouping
[]: Character class
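
The following sketch shows several of these metacharacters in use. It relies on Python's re module as an assumed stand-in for the Perl-style behaviour listed above; the helper found and the sample strings are hypothetical.

```python
import re

def found(pattern, text):
    """True if pattern matches somewhere in text."""
    return re.search(pattern, text) is not None

# "\" quotes the next metacharacter, so "\." matches only a literal ".".
literal_dot = found(r"^a\.c$", "a.c")
escaped_dot_rejects_b = found(r"^a\.c$", "abc")

# "." matches any character except newline.
any_char = found(r"^a.c$", "abc")
dot_on_newline = found(r"^a.c$", "a\nc")

# "|" separates alternatives and "()" groups them.
alternation = found(r"^(cat|dog)s$", "dogs")

# "$" matches at the end, or before a newline at the end.
end_before_newline = found(r"^dogs$", "dogs\n")

print(literal_dot, escaped_dot_rejects_b, any_char,
      dot_on_newline, alternation, end_before_newline)
```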

Within a regular expression, the following standard quantifiers are recognized:

*: Match 0 or more times
+: Match 1 or more times
?: Match 1 or 0 times
{n}: Match exactly n times
{n,}: Match at least n times
{n,m}: Match at least n but not more than m times
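
A minimal sketch of the quantifiers, again assuming that Python's re module follows the Perl-style rules listed above. Each case is a hypothetical pattern and input, paired with the result the list predicts when the whole input must match.

```python
import re

def whole(pattern, text):
    """True if pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

# (pattern, input, expected result)
CASES = [
    (r"ab*", "a", True),        # "*": zero occurrences of "b" is allowed
    (r"ab*", "abbb", True),
    (r"ab+", "a", False),       # "+": at least one "b" is required
    (r"ab+", "ab", True),
    (r"ab?", "ab", True),       # "?": one "b" or none
    (r"ab?", "abb", False),
    (r"a{3}", "aaa", True),     # "{n}": exactly n
    (r"a{3}", "aa", False),
    (r"a{2,}", "aaaaa", True),  # "{n,}": n or more
    (r"a{2,}", "a", False),
    (r"a{2,3}", "aaa", True),   # "{n,m}": between n and m
    (r"a{2,3}", "aaaa", False),
]

outcomes = [whole(pattern, text) for pattern, text, _ in CASES]
print(outcomes)
```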

The following character sequences also have special meaning within a regular expression.

\t: tab
\n: newline
\r: return
\033: octal char 033
\x1B: hex char 1B
\w: Match a "word" character (alphanumeric plus "_")
\W: Match a non-word character
\s: Match a whitespace character
\S: Match a non-whitespace character
\d: Match a digit character
\D: Match a non-digit character
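
The sequences above can be tried out as follows. This is a sketch that assumes Python's re module as a stand-in for the Perl-style syntax. The re.ASCII flag is used so that \w, \s and \d are limited to ASCII characters, which is an assumption about the intended meaning rather than something this draft states. The helper name classes is hypothetical.

```python
import re

ESC = chr(0x1B)  # the character with hex code 1B, which is octal 033

# The octal and hex notations name the same character.
octal_ok = re.fullmatch(r"\033", ESC) is not None
hex_ok = re.fullmatch(r"\x1B", ESC) is not None

SHORTHANDS = [r"\w", r"\W", r"\s", r"\S", r"\d", r"\D"]

def classes(ch):
    """Return the shorthand classes that match the single character ch."""
    return [name for name in SHORTHANDS if re.fullmatch(name, ch, re.ASCII)]

for sample in ("7", "_", "\t", "-"):
    print(repr(sample), classes(sample))
```

Every character falls into exactly one class of each pair (\w or \W, \s or \S, \d or \D), so classes always returns three names.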

Ed. Note: We should probably define XML-specific character sequences for things like Nmtoken, Name, etc., as well as ones for the character classes listed in XML 1.0, Appendix B (Character Classes).

Regular expressions may also contain the following zero-width assertions:

\b: Match a word boundary
\B: Match a non-(word boundary)

A word boundary (\b) is defined as a spot between two characters that has a \w on one side of it and a \W on the other side of it (in either order), counting the imaginary characters off the beginning and end of the string as matching a \W.
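
To illustrate the definition, the sketch below locates the word "cat" in a hypothetical sample string, using Python's re module as an assumed stand-in for the Perl-style assertions.

```python
import re

TEXT = "cat concat cat's"  # hypothetical sample text

# \bcat\b: "cat" with a word boundary on both sides.
whole_word = [m.start() for m in re.finditer(r"\bcat\b", TEXT)]

# \Bcat: "cat" that does not start at a word boundary.
inside_word = [m.start() for m in re.finditer(r"\Bcat", TEXT)]

print(whole_word, inside_word)
```

The match at offset 0 relies on the imaginary \W before the start of the string. The match at offset 11 ends at a boundary because the apostrophe is a \W character.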

  555-1212     is matched by \d{3}-\d{4}           (phone number)
  888-555-1212 is matched by (\d{3}-)?\d{3}-\d{4}  (phone number with optional area code)
  $123,45.90   is matched by \$\d{3},\d{2}\.\d{2}
  123-45-5678  is matched by \d{3}-?\d{2}-?\d{4}   (Social Security Number)
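
The examples above can be checked mechanically. The sketch below assumes Python's re module and requires each pattern to match the entire value (re.fullmatch). That is an assumption, since this draft does not say whether a pattern must cover the whole string. The extra inputs after the table are hypothetical.

```python
import re

# (value, pattern) pairs taken from the examples above.
EXAMPLES = [
    ("555-1212", r"\d{3}-\d{4}"),
    ("888-555-1212", r"(\d{3}-)?\d{3}-\d{4}"),
    ("$123,45.90", r"\$\d{3},\d{2}\.\d{2}"),
    ("123-45-5678", r"\d{3}-?\d{2}-?\d{4}"),
]

def whole(pattern, text):
    """True if pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

all_match = all(whole(pattern, value) for value, pattern in EXAMPLES)

# The area code group is optional, so a local number also matches.
local_number_ok = whole(EXAMPLES[1][1], "555-1212")

# Both hyphens in the Social Security Number pattern are optional.
no_hyphens_ok = whole(EXAMPLES[3][1], "123455678")

# A value with too few digits is rejected.
short_number_ok = whole(EXAMPLES[0][1], "55-1212")

print(all_match, local_number_ok, no_hyphens_ok, short_number_ok)
```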

F. References

COBOL

COBOL Standard. See http://www.dkuug.dk/jtc1/sc22/wg4/

ISO 10646

ISO (International Organization for Standardization). ISO/IEC 10646-1993 (E). Information technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1: Architecture and Basic Multilingual Plane. [Geneva]: International Organization for Standardization, 1993 (plus amendments AM 1 through AM 7).

ISO 11404