W3C

XML Schema Part 2: Datatypes Second Edition

W3C Proposed Edited Recommendation 18 March 2004

This version:
http://www.w3.org/TR/2004/PER-xmlschema-2-20040318/
Latest version:
http://www.w3.org/TR/xmlschema-2/
Previous version:
http://www.w3.org/TR/2001/REC-xmlschema-2-20010502/
Editors:
Paul V. Biron, Kaiser Permanente, for Health Level Seven <Paul.V.Biron@kp.org>
Ashok Malhotra, Microsoft, formerly of IBM <ashokma@microsoft.com>

This document is also available in these non-normative formats: XML, XHTML with visible change markup, Independent copy of the schema for schema documents, A schema for built-in datatypes only, in a separate namespace, and Independent copy of the DTD for schema documents.


Abstract

XML Schema: Datatypes is part 2 of the specification of the XML Schema language. It defines facilities for defining datatypes to be used in XML Schemas as well as other XML specifications. The datatype language, which is itself represented in XML 1.0, provides a superset of the capabilities found in XML 1.0 document type definitions (DTDs) for specifying datatypes on elements and attributes.

Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This is a W3C Proposed Edited Recommendation, intended to become the third part of the Second Edition of XML Schema. It is here made available for review by W3C members and other interested parties. Note that a Candidate Recommendation draft has not been deemed necessary by the Working Group, as there are no substantial implementation issues arising as a result of this edition, which aims only to incorporate the published corrigenda to the first edition.

Please send comments on this Proposed Edited Recommendation to www-xml-schema-comments@w3.org, including 2E PER in the subject line, no later than 16 April 2004.

Publication as a Proposed Edited Recommendation does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document has been produced by the W3C XML Schema Working Group as part of the W3C XML Activity. The goals of the XML Schema language are discussed in the XML Schema Requirements document. The authors of this document are the members of the XML Schema Working Group. Different parts of this specification have different editors.

Documentation of intellectual property possibly relevant to this recommendation may be found at the Working Group's public IPR disclosure page.

The English version of this specification is the only normative version. Information about translations of this document is available at http://www.w3.org/2001/05/xmlschema-translations.

This second edition is not a new version; as a convenience to readers, it merely incorporates the changes dictated by the corrections to errors found in the first edition, as agreed by the XML Schema Working Group. A separate list of all such corrections is available at http://www.w3.org/2001/05/xmlschema-errata. The errata list for this second edition is available at http://www.w3.org/2004/03/xmlschema-errata.

Please report errors in this document to www-xml-schema-comments@w3.org (archive).

Table of Contents

1 Introduction
    1.1 Purpose
    1.2 Requirements
    1.3 Scope
    1.4 Terminology
    1.5 Constraints and Contributions
2 Type System
    2.1 Datatype
    2.2 Value space
    2.3 Lexical space
    2.4 Facets
    2.5 Datatype dichotomies
3 Built-in datatypes
    3.1 Namespace considerations
    3.2 Primitive datatypes
    3.3 Derived datatypes
4 Datatype components
    4.1 Simple Type Definition
    4.2 Fundamental Facets
    4.3 Constraining Facets
5 Conformance

Appendices

A Schema for Datatype Definitions (normative)
B DTD for Datatype Definitions (non-normative)
C Datatypes and Facets
    C.1 Fundamental Facets
D ISO 8601 Date and Time Formats
    D.1 ISO 8601 Conventions
    D.2 Truncated and Reduced Formats
    D.3 Deviations from ISO 8601 Formats
E Adding durations to dateTimes
    E.1 Algorithm
    E.2 Commutativity and Associativity
F Regular Expressions
    F.1 Character Classes
G Glossary (non-normative)
H References
    H.1 Normative
    H.2 Non-normative
I Acknowledgements (non-normative)


1 Introduction

1.1 Purpose

The [XML 1.0 (Second Edition)] specification defines limited facilities for applying datatypes to document content in that documents may contain or refer to DTDs that assign types to elements and attributes. However, document authors, including authors of traditional documents and those transporting data in XML, often require a higher degree of type checking to ensure robustness in document understanding and data interchange.

The table below offers two typical examples of XML instances in which datatypes are implicit: the instance on the left represents a billing invoice, the instance on the right a memo or perhaps an email message in XML.

Data oriented
<invoice>
  <orderDate>1999-01-21</orderDate>
  <shipDate>1999-01-25</shipDate>
  <billingAddress>
   <name>Ashok Malhotra</name>
   <street>123 Microsoft Ave.</street>
   <city>Hawthorne</city>
   <state>NY</state>
   <zip>10532-0000</zip>
  </billingAddress>
  <voice>555-1234</voice>
  <fax>555-4321</fax>
</invoice>

Document oriented
<memo importance='high'
      date='1999-03-23'>
  <from>Paul V. Biron</from>
  <to>Ashok Malhotra</to>
  <subject>Latest draft</subject>
  <body>
    We need to discuss the latest
    draft <emph>immediately</emph>.
    Either email me at <email>
    mailto:paul.v.biron@kp.org</email>
    or call <phone>555-9876</phone>
  </body>
</memo>

The invoice contains several dates and telephone numbers, the postal abbreviation for a state (which comes from an enumerated list of sanctioned values), and a ZIP code (which takes a definable regular form). The memo contains many of the same types of information: a date, telephone number, email address and an "importance" value (from an enumerated list, such as "low", "medium" or "high"). Applications which process invoices and memos need to raise exceptions if something that was supposed to be a date or telephone number does not conform to the rules for valid dates or telephone numbers.

In both cases, validity constraints exist on the content of the instances that are not expressible in XML DTDs. The limited datatyping facilities in XML have prevented validating XML processors from supplying the rigorous type checking required in these situations. The result has been that individual applications writers have had to implement type checking in an ad hoc manner. This specification addresses the need of both document authors and applications writers for a robust, extensible datatype system for XML which could be incorporated into XML processors. As discussed below, these datatypes could be used in other XML-related standards as well.

1.2 Requirements

The [XML Schema Requirements] document spells out concrete requirements to be fulfilled by this specification, which state that the XML Schema Language must:

  1. provide for primitive data typing, including byte, date, integer, sequence, SQL and Java primitive datatypes, etc.;
  2. define a type system that is adequate for import/export from database systems (e.g., relational, object, OLAP);
  3. distinguish requirements relating to lexical data representation vs. those governing an underlying information set;
  4. allow creation of user-defined datatypes, such as datatypes that are derived from existing datatypes and which may constrain certain of their properties (e.g., range, precision, length, format).

1.3 Scope

This portion of the XML Schema Language discusses datatypes that can be used in an XML Schema. These datatypes can be specified for element content that would be specified as #PCDATA and attribute values of various types in a DTD. It is the intention of this specification that it be usable outside of the context of XML Schemas for a wide range of other XML-related activities such as [XSL] and [RDF Schema].

1.4 Terminology

The terminology used to describe XML Schema Datatypes is defined in the body of this specification. The terms defined in the following list are used in building those definitions and in describing the actions of a datatype processor:

[Definition:]   for compatibility
A feature of this specification included solely to ensure that schemas which use this feature remain compatible with [XML 1.0 (Second Edition)].
[Definition:]  may
Conforming documents and processors are permitted to but need not behave as described.
[Definition:]  match
(Of strings or names:) Two strings or names being compared must be identical. Characters with multiple possible representations in ISO/IEC 10646 (e.g. characters with both precomposed and base+diacritic forms) match only if they have the same representation in both strings. No case folding is performed. (Of strings and rules in the grammar:) A string matches a grammatical production if it belongs to the language generated by that production.
[Definition:]  must
Conforming documents and processors are required to behave as described; otherwise they are in ·error·.
[Definition:]  error
A violation of the rules of this specification; results are undefined. Conforming software ·may· detect and report an error and ·may· recover from it.
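
The string sense of ·match· defined above can be illustrated with a minimal Python sketch. The function name strings_match is hypothetical and not part of this specification; the sketch only assumes that Python compares strings code point by code point, without normalization or case folding.

```python
import unicodedata

def strings_match(a: str, b: str) -> bool:
    """Compare code point by code point: no normalization, no case folding."""
    return a == b

precomposed = "\u00e9"   # e with acute accent as a single code point
decomposed = "e\u0301"   # base letter followed by a combining acute accent

# Different representations of the same character do not match.
print(strings_match(precomposed, decomposed))
# Case differences are not folded away.
print(strings_match("Name", "name"))
# Only after an explicit normalization step (outside the definition) do they agree.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
```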

1.5 Constraints and Contributions

This specification provides three different kinds of normative statements about schema components, their representations in XML and their contribution to the schema-validation of information items:

[Definition:]   Constraint on Schemas
Constraints on the schema components themselves, i.e. conditions components ·must· satisfy to be components at all. Largely to be found in Datatype components (§4).
[Definition:]   Schema Representation Constraint
Constraints on the representation of schema components in XML. Some but not all of these are expressed in Schema for Datatype Definitions (normative) (§A) and DTD for Datatype Definitions (non-normative) (§B).
[Definition:]   Validation Rule
Constraints expressed by schema components which information items ·must· satisfy to be schema-valid. Largely to be found in Datatype components (§4).

2 Type System

This section describes the conceptual framework behind the type system defined in this specification. The framework has been influenced by the [ISO 11404] standard on language-independent datatypes as well as the datatypes for [SQL] and for programming languages such as Java.

The datatypes discussed in this specification are computer representations of well known abstract concepts such as integer and date. It is not the place of this specification to define these abstract concepts; many other publications provide excellent definitions.

2.1 Datatype

[Definition:]  In this specification, a datatype is a 3-tuple, consisting of a) a set of distinct values, called its ·value space·, b) a set of lexical representations, called its ·lexical space·, and c) a set of ·facet·s that characterize properties of the ·value space·, individual values or lexical items.

2.2 Value space

[Definition:]  A value space is the set of values for a given datatype. Each value in the value space of a datatype is denoted by one or more literals in its ·lexical space·.

The ·value space· of a given datatype can be defined in one of the following ways:

  • defined axiomatically from fundamental notions (intensional definition) [see ·primitive·]
  • enumerated outright (extensional definition) [see ·enumeration·]
  • defined by restricting the ·value space· of an already defined datatype to a particular subset with a given set of properties [see ·derived·]
  • defined as a combination of values from one or more already defined ·value space·(s) by a specific construction procedure [see ·list· and ·union·]

·value space·s have certain properties. For example, they always have the property of ·cardinality·, some definition of equality and might be ·ordered·, by which individual values within the ·value space· can be compared to one another. The properties of ·value space·s that are recognized by this specification are defined in Fundamental facets (§2.4.1).

2.3 Lexical space

In addition to its ·value space·, each datatype also has a lexical space.

[Definition:]  A lexical space is the set of valid literals for a datatype.

For example, "100" and "1.0E2" are two different literals from the ·lexical space· of float which both denote the same value. The type system defined in this specification provides a mechanism for schema designers to control the set of values and the corresponding set of acceptable literals of those values for a datatype.
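
As a rough illustration of two literals denoting one value, the following Python sketch maps both literals to a number and compares the results. It is an approximation under stated assumptions: Python's float is double precision rather than the single-precision float of this specification, Python's float() accepts some literals that are not in the ·lexical space· of float, and the helper name float_value is hypothetical.

```python
def float_value(literal: str) -> float:
    """Map a literal to the value it denotes (approximated with a Python float)."""
    return float(literal.strip())

a = float_value("100")
b = float_value("1.0E2")

# Two distinct literals, one value.
print("100" == "1.0E2")   # the literals differ as strings
print(a == b)             # the values they denote are the same
```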

Note:  The literals in the ·lexical space·s defined in this specification have the following characteristics:
Interoperability:
The number of literals for each value has been kept small; for many datatypes there is a one-to-one mapping between literals and values. This makes it easy to exchange the values between different systems. In many cases, conversion from locale-dependent representations will be required on both the originator and the recipient side, both for computer processing and for interaction with humans.
Basic readability:
Textual, rather than binary, literals are used. This makes hand editing, debugging, and similar activities possible.
Ease of parsing and serializing:
Where possible, literals correspond to those found in common programming languages and libraries.

2.3.1 Canonical Lexical Representation

While the datatypes defined in this specification have, for the most part, a single lexical representation (i.e., each value in the datatype's ·value space· is denoted by a single literal in its ·lexical space·), this is not always the case. The example in the previous section showed two literals for the datatype float which denote the same value. Similarly, there ·may· be several literals for one of the date or time datatypes that denote the same value using different timezone indicators.

[Definition:]  A canonical lexical representation is a set of literals from among the valid set of literals for a datatype such that there is a one-to-one mapping between literals in the canonical lexical representation and values in the ·value space·.

2.4 Facets

[Definition:]  A facet is a single defining aspect of a ·value space·. Generally speaking, each facet characterizes a ·value space· along independent axes or dimensions.

The facets of a datatype serve to distinguish those aspects of one datatype which differ from other datatypes. Rather than being defined solely in terms of a prose description, the datatypes in this specification are defined in terms of the synthesis of facet values which together determine the ·value space· and properties of the datatype.

Facets are of two types: fundamental facets that define the datatype and non-fundamental or constraining facets that constrain the permitted values of a datatype.

2.4.1 Fundamental facets

[Definition:]   A fundamental facet is an abstract property which serves to semantically characterize the values in a ·value space·.

All fundamental facets are fully described in Fundamental Facets (§4.2).

2.4.2 Constraining or Non-fundamental facets

[Definition:]  A constraining facet is an optional property that can be applied to a datatype to constrain its ·value space·.

Constraining the ·value space· consequently constrains the ·lexical space·. Adding ·constraining facet·s to a ·base type· is described in Derivation by restriction (§4.1.2.1).

All constraining facets are fully described in Constraining Facets (§4.3).

2.5 Datatype dichotomies

It is useful to categorize the datatypes defined in this specification along various dimensions, forming a set of characterization dichotomies.

2.5.1 Atomic vs. list vs. union datatypes

The first distinction to be made is that between ·atomic·, ·list· and ·union· datatypes.

For example, a single token which ·match·es Nmtoken from [XML 1.0 (Second Edition)] could be the value of an ·atomic· datatype (NMTOKEN); while a sequence of such tokens could be the value of a ·list· datatype (NMTOKENS).

2.5.1.1 Atomic datatypes

·atomic· datatypes can be either ·primitive· or ·derived·. The ·value space· of an ·atomic· datatype is a set of "atomic" values, which for the purposes of this specification, are not further decomposable. The ·lexical space· of an ·atomic· datatype is a set of literals whose internal structure is specific to the datatype in question.

2.5.1.2 List datatypes

Several type systems (such as the one described in [ISO 11404]) treat ·list· datatypes as special cases of the more general notions of aggregate or collection datatypes.

·list· datatypes are always ·derived·. The ·value space· of a ·list· datatype is a set of finite-length sequences of ·atomic· values. The ·lexical space· of a ·list· datatype is a set of literals whose internal structure is a space-separated sequence of literals of the ·atomic· datatype of the items in the ·list·.

[Definition:]   The ·atomic· or ·union· datatype that participates in the definition of a ·list· datatype is known as the itemType of that ·list· datatype.

Example
<simpleType name='sizes'>
  <list itemType='decimal'/>
</simpleType>
<cerealSizes xsi:type='sizes'> 8 10.5 12 </cerealSizes>
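
How the cerealSizes literal above maps to a sequence of decimal values can be sketched in Python as follows. This is only an illustration: parse_decimal_list is a hypothetical helper, str.split() is assumed to be an adequate stand-in for splitting at space boundaries, and Python's Decimal constructor accepts more literals than the ·lexical space· of decimal.

```python
from decimal import Decimal
from typing import List

def parse_decimal_list(literal: str) -> List[Decimal]:
    """Split a list literal at whitespace and map each item to a decimal value."""
    return [Decimal(token) for token in literal.split()]

sizes = parse_decimal_list(" 8 10.5 12 ")
print(sizes)        # three decimal values
print(len(sizes))   # the list has three items
```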

A ·list· datatype can be ·derived· from an ·atomic· datatype whose ·lexical space· allows space (such as string or anyURI) or from a ·union· datatype any of whose {member type definitions} has a ·lexical space· that allows space. In such a case, regardless of the input, list items will be separated at space boundaries.

Example
<simpleType name='listOfString'>
  <list itemType='string'/>
</simpleType>
<someElement xsi:type='listOfString'>
this is not list item 1
this is not list item 2
this is not list item 3
</someElement>

In the above example, the value of the someElement element is not a ·list· of ·length· 3; rather, it is a ·list· of ·length· 18.
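
The count of 18 can be checked with a short Python sketch. It assumes that str.split() approximates separation at space boundaries; the variable names are illustrative only.

```python
# The content of someElement from the example, including its line breaks.
literal = """
this is not list item 1
this is not list item 2
this is not list item 3
"""

# Items are separated at whitespace boundaries, so each word is its own item.
items = literal.split()
print(len(items))
```

Each of the three lines contributes six space-separated items, which is why the ·length· is 18 rather than 3.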

When a datatype is ·derived· from a ·list· datatype, the following ·constraining facet·s apply:

  • ·length·
  • ·maxLength·
  • ·minLength·
  • ·enumeration·
  • ·pattern·
  • ·whiteSpace·

For each of ·length·, ·maxLength· and ·minLength·, the unit of length is measured in number of list items. The value of ·whiteSpace· is fixed to the value collapse.

For ·list· datatypes the ·lexical space· is composed of space-separated literals of its ·itemType·. Hence, any ·pattern· specified when a new datatype is ·derived· from a ·list· datatype is matched against each literal of the ·list· datatype and not against the literals of the datatype that serves as its ·itemType·.

Example
<xs:simpleType name='myList'>
	<xs:list itemType='xs:integer'/>
</xs:simpleType>
<xs:simpleType name='myRestrictedList'>
	<xs:restriction base='myList'>
		<xs:pattern value='123 (\d+\s)*456'/>
	</xs:restriction>
</xs:simpleType>
<someElement xsi:type='myRestrictedList'>123 456</someElement>
<someElement xsi:type='myRestrictedList'>123 987 456</someElement>
<someElement xsi:type='myRestrictedList'>123 987 567 456</someElement>
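
The effect of applying the pattern to the whole list literal, rather than to each item, can be sketched in Python. This is an approximation: Python's re module is not the regular expression language of this specification, re.fullmatch is used here to imitate the implicit anchoring of ·pattern·, and matches_list_pattern is a hypothetical helper.

```python
import re

# Same expression as the pattern facet in the example above.
LIST_PATTERN = re.compile(r"123 (\d+\s)*456")

def matches_list_pattern(literal: str) -> bool:
    """Match the pattern against the entire list literal, not item by item."""
    return LIST_PATTERN.fullmatch(literal) is not None

for literal in ["123 456", "123 987 456", "123 987 567 456", "987"]:
    print(literal, matches_list_pattern(literal))
```

The last literal, "987", is a valid single integer item but is rejected because the list literal as a whole does not match.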

The ·canonical lexical representation· for the ·list· datatype is defined as the lexical form in which each item in the ·list· has the canonical lexical representation of its ·itemType·.

2.5.1.3 Union datatypes

The ·value space· and ·lexical space· of a ·union· datatype are the union of the ·value space·s and ·lexical space·s of its ·memberTypes·. ·union· datatypes are always ·derived·. Currently, there are no ·built-in· ·union· datatypes.

Example
A prototypical example of a ·union· type is the maxOccurs attribute on the element element in XML Schema itself: it is a union of nonNegativeInteger and an enumeration with the single member, the string "unbounded", as shown below.
  <attributeGroup name="occurs">
    <attribute name="minOccurs" type="nonNegativeInteger"
    	use="optional" default="1"/>
    <attribute name="maxOccurs" use="optional" default="1">
      <simpleType>
        <union>
          <simpleType>
            <restriction base='nonNegativeInteger'/>
          </simpleType>
          <simpleType>
            <restriction base='string'>
              <enumeration value='unbounded'/>
            </restriction>
          </simpleType>
        </union>
      </simpleType>
    </attribute>
  </attributeGroup>

Any number (greater than 1) of ·atomic· or ·list· ·datatype·s can participate in a ·union· type.

[Definition:]   The datatypes that participate in the definition of a ·union· datatype are known as the memberTypes of that ·union· datatype.

The order in which the ·memberTypes· are specified in the definition (that is, the order of the <simpleType> children of the <union> element, or the order of the QNames in the memberTypes attribute) is significant. During validation, an element or attribute's value is validated against the ·memberTypes· in the order in which they appear in the definition until a match is found. The evaluation order can be overridden with the use of xsi:type.

Example
For example, given the definition below, the first instance of the <size> element validates correctly as an integer (§3.3.13), the second and third as string (§3.2.1).
  <xsd:element name='size'>
    <xsd:simpleType>
      <xsd:union>
        <xsd:simpleType>
          <xsd:restriction base='integer'/>
        </xsd:simpleType>
        <xsd:simpleType>
          <xsd:restriction base='string'/>
        </xsd:simpleType>
      </xsd:union>
    </xsd:simpleType>
  </xsd:element>
  <size>1</size>
  <size>large</size>
  <size xsi:type='xsd:string'>1</size>
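
The ordered evaluation of ·memberTypes· in the example above can be sketched in Python. This is a simplified model under stated assumptions: the two acceptance functions only approximate the ·lexical space·s of integer and string, the names MEMBER_TYPES and validate_union are hypothetical, and the xsi_type argument stands in for an xsi:type attribute naming one of the member types.

```python
import re
from typing import Optional

def is_integer(literal: str) -> bool:
    """Approximate check for the lexical space of integer."""
    return re.fullmatch(r"[+-]?[0-9]+", literal) is not None

def is_string(literal: str) -> bool:
    """Every character string is accepted in this sketch."""
    return True

# Member types in the order in which they appear in the union definition.
MEMBER_TYPES = [("integer", is_integer), ("string", is_string)]

def validate_union(literal: str, xsi_type: Optional[str] = None) -> Optional[str]:
    """Return the name of the first member type that accepts the literal."""
    for name, accepts in MEMBER_TYPES:
        if xsi_type is not None and name != xsi_type:
            continue  # xsi:type overrides the default evaluation order
        if accepts(literal):
            return name
    return None

print(validate_union("1"))                     # first match wins
print(validate_union("large"))                 # falls through to string
print(validate_union("1", xsi_type="string"))  # order overridden
```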

The ·canonical lexical representation· for a ·union· datatype is defined as the lexical form in which the values have the canonical lexical representation of the appropriate ·memberTypes·.

Note:  A datatype which is ·atomic· in this specification need not be an "atomic" datatype in any programming language used to implement this specification. Likewise, a datatype which is a ·list· in this specification need not be a "list" datatype in any programming language used to implement this specification. Furthermore, a datatype which is a ·union· in this specification need not be a "union" datatype in any programming language used to implement this specification.

2.5.2 Primitive vs. derived datatypes

Next, we distinguish between ·primitive· and ·derived· datatypes.

  • [Definition:]  Primitive datatypes are those that are not defined in terms of other datatypes; they exist ab initio.
  • [Definition:]  Derived datatypes are those that are defined in terms of other datatypes.

For example, in this specification, float is a well-defined mathematical concept that cannot be defined in terms of other datatypes, while an integer is a special case of the more general datatype decimal.

[Definition:]   The simple ur-type definition is a special restriction of the ur-type definition whose name is anySimpleType in the XML Schema namespace. anySimpleType can be considered as the ·base type· of all ·primitive· datatypes. anySimpleType is considered to have an unconstrained lexical space and a ·value space· consisting of the union of the ·value space·s of all the ·primitive· datatypes and the set of all lists of all members of the ·value space·s of all the ·primitive· datatypes.

The datatypes defined by this specification fall into both the ·primitive· and ·derived· categories. It is felt that a judiciously chosen set of ·primitive· datatypes will serve the widest possible audience by providing a set of convenient datatypes that can be used as is, as well as providing a rich enough base from which the variety of datatypes needed by schema designers can be ·derived·.

In the example above, integer is ·derived· from decimal.

Note:  A datatype which is ·primitive· in this specification need not be a "primitive" datatype in any programming language used to implement this specification. Likewise, a datatype which is ·derived· in this specification need not be a "derived" datatype in any programming language used to implement this specification.

As described in more detail in XML Representation of Simple Type Definition Schema Components (§4.1.2), each ·user-derived· datatype ·must· be defined in terms of another datatype in one of three ways: 1) by assigning ·constraining facet·s which serve to restrict the ·value space· of the ·user-derived· datatype to a subset of that of the ·base type·; 2) by creating a ·list· datatype whose ·value space· consists of finite-length sequences of values of its ·itemType·; or 3) by creating a ·union· datatype whose ·value space· consists of the union of the ·value space·s of its ·memberTypes·.

2.5.2.1 Derived by restriction

[Definition:]  A datatype is said to be ·derived· by restriction from another datatype when values for zero or more ·constraining facet·s are specified that serve to constrain its ·value space· and/or its ·lexical space· to a subset of those of its ·base type·.

[Definition:]  Every datatype that is ·derived· by restriction is defined in terms of an existing datatype, referred to as its base type. base types can be either ·primitive· or ·derived·.

2.5.2.2 Derived by list

A ·list· datatype can be ·derived· from another datatype (its ·itemType·) by creating a ·value space· that consists of a finite-length sequence of values of its ·itemType·.

2.5.2.3 Derived by union

One datatype can be ·derived· from one or more datatypes by ·union·ing their ·value space·s and, consequently, their ·lexical space·s.

2.5.3 Built-in vs. user-derived datatypes

Conceptually there is no difference between the ·built-in· ·derived· datatypes included in this specification and the ·user-derived· datatypes which will be created by individual schema designers. The ·built-in· ·derived· datatypes are those which are believed to be so common that if they were not defined in this specification many schema designers would end up "reinventing" them. Furthermore, including these ·derived· datatypes in this specification serves to demonstrate the mechanics and utility of the datatype generation facilities of this specification.

Note:  A datatype which is ·built-in· in this specification need not be a "built-in" datatype in any programming language used to implement this specification. Likewise, a datatype which is ·user-derived· in this specification need not be a "user-derived" datatype in any programming language used to implement this specification.

3 Built-in datatypes

[Figure: Diagram of the built-in type hierarchy, rooted at anyType and anySimpleType. Caption: Built-in Datatypes]

Each built-in datatype in this specification (both ·primitive· and ·derived·) can be uniquely addressed via a URI Reference constructed as follows:

  1. the base URI is the URI of the XML Schema namespace
  2. the fragment identifier is the name of the datatype

For example, to address the int datatype, the URI is:

  • http://www.w3.org/2001/XMLSchema#int

Additionally, each facet definition element can be uniquely addressed via a URI constructed as follows:

  1. the base URI is the URI of the XML Schema namespace
  2. the fragment identifier is the name of the facet

For example, to address the maxInclusive facet, the URI is:

  • http://www.w3.org/2001/XMLSchema#maxInclusive

Additionally, each facet usage in a built-in datatype definition can be uniquely addressed via a URI constructed as follows:

  1. the base URI is the URI of the XML Schema namespace
  2. the fragment identifier is the name of the datatype, followed by a period (".") followed by the name of the facet

For example, to address the usage of the maxInclusive facet in the definition of int, the URI is:

  • http://www.w3.org/2001/XMLSchema#int.maxInclusive
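
The three construction rules above can be sketched as small Python helpers. The function names are hypothetical and not defined by this specification; the base URI is the XML Schema namespace name given in Namespace considerations (§3.1).

```python
# Namespace name of the XML Schema definition language (see §3.1).
XSD_NS = "http://www.w3.org/2001/XMLSchema"

def datatype_uri(datatype: str) -> str:
    """Base URI plus the datatype name as fragment identifier."""
    return f"{XSD_NS}#{datatype}"

def facet_uri(facet: str) -> str:
    """Base URI plus the facet name as fragment identifier."""
    return f"{XSD_NS}#{facet}"

def facet_usage_uri(datatype: str, facet: str) -> str:
    """Base URI plus datatype name, a period, and the facet name."""
    return f"{XSD_NS}#{datatype}.{facet}"

print(datatype_uri("int"))
print(facet_uri("maxInclusive"))
print(facet_usage_uri("int", "maxInclusive"))
```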

3.1 Namespace considerations

The ·built-in· datatypes defined by this specification are designed to be used with the XML Schema definition language as well as other XML specifications. To facilitate usage within the XML Schema definition language, the ·built-in· datatypes in this specification have the namespace name:

  • http://www.w3.org/2001/XMLSchema

To facilitate usage in specifications other than the XML Schema definition language, such as those that do not want to know anything about aspects of the XML Schema definition language other than the datatypes, each ·built-in· datatype is also defined in the namespace whose URI is:

  • http://www.w3.org/2001/XMLSchema-datatypes

This applies to both ·built-in· ·primitive· and ·built-in· ·derived· datatypes.

Each ·user-derived· datatype is also associated with a unique namespace. However, ·user-derived· datatypes do not come from the namespace defined by this specification; rather, they come from the namespace of the schema in which they are defined (see XML Representation of Schemas in [XML Schema Part 1: Structures]).

3.2 Primitive datatypes

        3.2.1 string
        3.2.2 boolean
        3.2.3 decimal
        3.2.4 float
        3.2.5 double
        3.2.6 duration
        3.2.7 dateTime
        3.2.8 time
        3.2.9 date
        3.2.10 gYearMonth
        3.2.11 gYear
        3.2.12 gMonthDay
        3.2.13 gDay
        3.2.14 gMonth
        3.2.15 hexBinary
        3.2.16 base64Binary
        3.2.17 anyURI
        3.2.18 QName
        3.2.19 NOTATION

The ·primitive· datatypes defined by this specification are described below. For each datatype, the ·value space· and ·lexical space· are defined, ·constraining facet·s which apply to the datatype are listed and any datatypes ·derived· from this datatype are specified.

·primitive· datatypes can only be added by revisions to this specification.

3.2.1 string

[Definition:]  The string datatype represents character strings in XML. The ·value space· of string is the set of finite-length sequences of characters (as defined in [XML 1.0 (Second Edition)]) that ·match· the Char production from [XML 1.0 (Second Edition)]. A character is an atomic unit of communication; it is not further specified except to note that every character has a corresponding Universal Character Set code point, which is an integer.

Note:  Many human languages have writing systems that require child elements for control of aspects such as bidirectional formatting or ruby annotation (see [Ruby] and Section 8.2.4 Overriding the bidirectional algorithm: the BDO element of [HTML 4.01]). Thus, string, as a simple type that can contain only characters but not child elements, is often not suitable for representing text. In such situations, a complex type that allows mixed content should be considered. For more information, see Section 5.5 Any Element, Any Attribute of [XML Schema Language: Part 0 Primer].
Note:  As noted in ordered, the fact that this specification does not specify an ·order-relation· for ·string· does not preclude other applications from treating strings as being ordered.

3.2.3 decimal

[Definition:]  decimal represents a subset of the real numbers, which can be represented by decimal numerals. The ·value space· of decimal is the set of numbers that can be obtained by multiplying an integer by a non-positive power of ten, i.e., expressible as i × 10^-n where i and n are integers and n >= 0. Precision is not reflected in this value space; the number 2.0 is not distinct from the number 2.00. The ·order-relation· on decimal is the order relation on real numbers, restricted to this subset.

Note:  All ·minimally conforming· processors ·must· support decimal numbers with a minimum of 18 decimal digits (i.e., with a ·totalDigits· of 18). However, ·minimally conforming· processors ·may· set an application-defined limit on the maximum number of decimal digits they are prepared to support, in which case that application-defined maximum number ·must· be clearly documented.
3.2.3.2 Canonical representation

The canonical representation for decimal is defined by prohibiting certain options from the Lexical representation (§3.2.3.1). Specifically, the preceding optional "+" sign is prohibited. The decimal point is required. Leading and trailing zeroes are prohibited subject to the following: there must be at least one digit to the right and to the left of the decimal point which may be a zero.
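
Non-normative illustration: the canonicalization rules above can be sketched in Python as follows. The function name canonical_decimal and the regular expression used to accept literals are assumptions made for this sketch, not part of this specification; the sketch shows one possible way to apply the rules rather than a reference implementation.

```python
import re
from decimal import Decimal

# Assumed lexical form for this sketch: optional sign, digits, optional fraction.
DECIMAL_LITERAL = re.compile(r"[+-]?(?:[0-9]+(?:\.[0-9]*)?|\.[0-9]+)")


def canonical_decimal(literal: str) -> str:
    """Return a canonical form: no '+', a decimal point, no redundant zeroes."""
    if DECIMAL_LITERAL.fullmatch(literal) is None:
        raise ValueError(f"not a decimal literal: {literal!r}")
    text = format(Decimal(literal), "f")  # fixed-point text, never an exponent
    negative = text.startswith("-")
    integer_part, _, fraction_part = text.lstrip("-").partition(".")
    integer_part = integer_part.lstrip("0") or "0"    # keep one digit on the left
    fraction_part = fraction_part.rstrip("0") or "0"  # keep one digit on the right
    if integer_part == "0" and fraction_part == "0":
        negative = False  # the value space has a single zero
    return ("-" if negative else "") + integer_part + "." + fraction_part
```

Under these assumptions, the literals 2.0 and 2.00 both canonicalize to 2.0, which reflects the statement that precision is not part of the value space.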

3.2.4 float

[Definition:]  float is patterned after the IEEE single-precision 32-bit floating point type [IEEE 754-1985]. The basic ·value space· of float consists of the values m × 2^e, where m is an integer whose absolute value is less than 2^24, and e is an integer between -149 and 104, inclusive. In addition to the basic ·value space· described above, the ·value space· of float also contains the following three special values: positive and negative infinity and not-a-number (NaN). The ·order-relation· on float is: x < y iff y - x is positive for x and y in the value space. Positive infinity is greater than all other non-NaN values. NaN equals itself but is ·incomparable· with (neither greater than nor less than) any other value in the ·value space·.

Note:  "Equality" in this Recommendation is defined to be "identity" (i.e., values that are identical in the ·value space· are equal and vice versa). Identity must be used for the few operations that are defined in this Recommendation. Applications using any of the datatypes defined in this Recommendation may use different definitions of equality for computational purposes; [IEEE 754-1985]-based computation systems are examples. Nothing in this Recommendation should be construed as requiring that such applications use identity as their equality relationship when computing.

Any value ·incomparable· with the value used for the four bounding facets (·minInclusive·, ·maxInclusive·, ·minExclusive·, and ·maxExclusive·) will be excluded from the resulting restricted ·value space·. In particular, when "NaN" is used as a facet value for a bounding facet, since no other float values are ·comparable· with it, the result is a ·value space· either having NaN as its only member (the inclusive cases) or that is empty (the exclusive cases). If any other value is used for a bounding facet, NaN will be excluded from the resulting restricted ·value space·; to add NaN back in requires union with the NaN-only space.

This datatype differs from that of [IEEE 754-1985] in that there is only one NaN and only one zero. This makes the equality and ordering of values in the data space differ from that of [IEEE 754-1985] only in that for schema purposes NaN = NaN.

A literal in the ·lexical space· representing a decimal number d maps to the normalized value in the ·value space· of float that is closest to d in the sense defined by [Clinger, WD (1990)]; if d is exactly halfway between two such values then the even value is chosen.

3.2.4.1 Lexical representation

float values have a lexical representation consisting of a mantissa followed, optionally, by the character "E" or "e", followed by an exponent. The exponent ·must· be an integer. The mantissa must be a decimal number. The representations for exponent and mantissa must follow the lexical rules for integer and decimal. If the "E" or "e" and the following exponent are omitted, an exponent value of 0 is assumed.

The special values positive and negative infinity and not-a-number have lexical representations INF, -INF and NaN, respectively. Lexical representations for zero may take a positive or negative sign.

For example, -1E4, 1267.43233E12, 12.78e-2, 12, -0, 0 and INF are all legal literals for float.

3.2.4.2 Canonical representation

The canonical representation for float is defined by prohibiting certain options from the Lexical representation (§3.2.4.1). Specifically, the exponent must be indicated by "E". Leading zeroes and the preceding optional "+" sign are prohibited in the exponent. If the exponent is zero, it must be indicated by "E0". For the mantissa, the preceding optional "+" sign is prohibited and the decimal point is required. Leading and trailing zeroes are prohibited subject to the following: number representations must be normalized such that there is a single digit which is non-zero to the left of the decimal point and at least a single digit to the right of the decimal point unless the value being represented is zero. The canonical representation for zero is 0.0E0.

3.2.5 double

[Definition:]  The double datatype is patterned after the IEEE double-precision 64-bit floating point type [IEEE 754-1985]. The basic ·value space· of double consists of the values m × 2^e, where m is an integer whose absolute value is less than 2^53, and e is an integer between -1075 and 970, inclusive. In addition to the basic ·value space· described above, the ·value space· of double also contains the following three special values: positive and negative infinity and not-a-number (NaN). The ·order-relation· on double is: x < y iff y - x is positive for x and y in the value space. Positive infinity is greater than all other non-NaN values. NaN equals itself but is ·incomparable· with (neither greater than nor less than) any other value in the ·value space·.

Note:  "Equality" in this Recommendation is defined to be "identity" (i.e., values that are identical in the ·value space· are equal and vice versa). Identity must be used for the few operations that are defined in this Recommendation. Applications using any of the datatypes defined in this Recommendation may use different definitions of equality for computational purposes; [IEEE 754-1985]-based computation systems are examples. Nothing in this Recommendation should be construed as requiring that such applications use identity as their equality relationship when computing.

Any value ·incomparable· with the value used for the four bounding facets (·minInclusive·, ·maxInclusive·, ·minExclusive·, and ·maxExclusive·) will be excluded from the resulting restricted ·value space·. In particular, when "NaN" is used as a facet value for a bounding facet, since no other double values are ·comparable· with it, the result is a ·value space· either having NaN as its only member (the inclusive cases) or that is empty (the exclusive cases). If any other value is used for a bounding facet, NaN will be excluded from the resulting restricted ·value space·; to add NaN back in requires union with the NaN-only space.

This datatype differs from that of [IEEE 754-1985] in that there is only one NaN and only one zero. This makes the equality and ordering of values in the data space differ from that of [IEEE 754-1985] only in that for schema purposes NaN = NaN.

A literal in the ·lexical space· representing a decimal number d maps to the normalized value in the ·value space· of double that is closest to d; if d is exactly halfway between two such values then the even value is chosen. This is the best approximation of d ([Clinger, WD (1990)], [Gay, DM (1990)]), which is more accurate than the mapping required by [IEEE 754-1985].

3.2.5.1 Lexical representation

double values have a lexical representation consisting of a mantissa followed, optionally, by the character "E" or "e", followed by an exponent. The exponent ·must· be an integer. The mantissa must be a decimal number. The representations for exponent and mantissa must follow the lexical rules for integer and decimal. If the "E" or "e" and the following exponent are omitted, an exponent value of 0 is assumed.

The special values positive and negative infinity and not-a-number have lexical representations INF, -INF and NaN, respectively. Lexical representations for zero may take a positive or negative sign.

For example, -1E4, 1267.43233E12, 12.78e-2, 12, -0, 0 and INF are all legal literals for double.

3.2.5.2 Canonical representation

The canonical representation for double is defined by prohibiting certain options from the Lexical representation (§3.2.5.1). Specifically, the exponent must be indicated by "E". Leading zeroes and the preceding optional "+" sign are prohibited in the exponent. If the exponent is zero, it must be indicated by "E0". For the mantissa, the preceding optional "+" sign is prohibited and the decimal point is required. Leading and trailing zeroes are prohibited subject to the following: number representations must be normalized such that there is a single digit which is non-zero to the left of the decimal point and at least a single digit to the right of the decimal point unless the value being represented is zero. The canonical representation for zero is 0.0E0.
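
Non-normative illustration: a minimal Python sketch of the canonical form for double values. It works on values rather than literals, because Python's own float parser accepts forms that are not in the lexical space defined here. The function name canonical_double is hypothetical, and choosing the shortest digit string that round-trips (what Python's repr produces) is an assumption of this sketch; the rules above only constrain the shape of the representation.

```python
import math
from decimal import Decimal


def canonical_double(value: float) -> str:
    """Format a double as d.dddE[-]n with a single non-zero leading digit."""
    if math.isnan(value):
        return "NaN"
    if math.isinf(value):
        return "INF" if value > 0 else "-INF"
    if value == 0:
        return "0.0E0"  # only one zero in the value space, so -0.0 maps here too
    exact = Decimal(repr(value))  # repr gives the shortest round-tripping digits
    sign, digits, _ = exact.as_tuple()
    significant = "".join(map(str, digits)).rstrip("0")
    mantissa = significant[0] + "." + (significant[1:] or "0")
    # adjusted() is the exponent of the most significant digit.
    return ("-" if sign else "") + mantissa + "E" + str(exact.adjusted())
```

For instance, the values denoted by the literals -1E4 and 12.78e-2 come out as -1.0E4 and 1.278E-1 respectively.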

3.2.6 duration

[Definition:]   duration represents a duration of time. The ·value space· of duration is a six-dimensional space where the coordinates designate the Gregorian year, month, day, hour, minute, and second components defined in § 5.5.3.2 of [ISO 8601], respectively. These components are ordered in their significance by their order of appearance i.e. as year, month, day, hour, minute, and second.

Note:  All ·minimally conforming· processors ·must· support year values with a minimum of 4 digits (i.e., YYYY) and a minimum fractional second precision of milliseconds or three decimal digits (i.e. s.sss). However, ·minimally conforming· processors ·may· set an application-defined limit on the maximum number of digits they are prepared to support in these two cases, in which case that application-defined maximum number ·must· be clearly documented.
3.2.6.1 Lexical representation

The lexical representation for duration is the [ISO 8601] extended format PnYnMnDTnHnMnS, where nY represents the number of years, nM the number of months, nD the number of days, 'T' is the date/time separator, nH the number of hours, nM the number of minutes and nS the number of seconds. The number of seconds can include decimal digits to arbitrary precision.

The values of the Year, Month, Day, Hour and Minutes components are not restricted but allow an arbitrary unsigned integer, i.e., an integer that conforms to the pattern [0-9]+. Similarly, the value of the Seconds component allows an arbitrary unsigned decimal. Following [ISO 8601], at least one digit must follow the decimal point if it appears. That is, the value of the Seconds component must conform to the pattern [0-9]+(\.[0-9]+)?. Thus, the lexical representation of duration does not follow the alternative format of § 5.5.3.2.1 of [ISO 8601].

An optional preceding minus sign ('-') is allowed, to indicate a negative duration. If the sign is omitted a positive duration is indicated. See also ISO 8601 Date and Time Formats (§D).

For example, to indicate a duration of 1 year, 2 months, 3 days, 10 hours, and 30 minutes, one would write: P1Y2M3DT10H30M. One could also indicate a duration of minus 120 days as: -P120D.

Reduced precision and truncated representations of this format are allowed provided they conform to the following:

  • If the number of years, months, days, hours, minutes, or seconds in any expression equals zero, the number and its corresponding designator ·may· be omitted. However, at least one number and its designator ·must· be present.
  • The seconds part ·may· have a decimal fraction.
  • The designator 'T' must be absent if and only if all of the time items are absent. The designator 'P' must always be present.

For example, P1347Y, P1347M and P1Y2MT2H are all allowed; P0Y1347M and P0Y1347M0D are allowed. P-1347M is not allowed although -P1347M is allowed. P1Y2MT is not allowed.
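
Non-normative illustration: the lexical rules above can be approximated by a single regular expression. The Python sketch below is one possible encoding, with hypothetical names DURATION_LITERAL and is_duration_literal; the lookaheads enforce that at least one component is present and that 'T' appears only when a time item follows.

```python
import re

DURATION_LITERAL = re.compile(
    r"-?P(?=[0-9]|T)"                    # 'P' is required; something must follow
    r"(?:[0-9]+Y)?(?:[0-9]+M)?(?:[0-9]+D)?"
    r"(?:T(?=[0-9])"                     # 'T' only if a time item follows
    r"(?:[0-9]+H)?(?:[0-9]+M)?(?:[0-9]+(?:\.[0-9]+)?S)?)?"
)


def is_duration_literal(literal: str) -> bool:
    """Check a string against the duration lexical form sketched above."""
    return DURATION_LITERAL.fullmatch(literal) is not None


for sample in ["P1Y2M3DT10H30M", "-P120D", "P0Y1347M0D", "P-1347M", "P1Y2MT"]:
    print(sample, is_duration_literal(sample))
```

The sketch only checks the lexical form; it does not map a literal to a value.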

3.2.6.2 Order relation on duration

In general, the ·order-relation· on duration is a partial order since there is no determinate relationship between certain durations such as one month (P1M) and 30 days (P30D). The ·order-relation· of two duration values x and y is x < y iff s+x < s+y for each qualified dateTime s in the list below. These values for s cause the greatest deviations in the addition of dateTimes and durations. Addition of durations to time instants is defined in Adding durations to dateTimes (§E).

  • 1696-09-01T00:00:00Z
  • 1697-02-01T00:00:00Z
  • 1903-03-01T00:00:00Z
  • 1903-07-01T00:00:00Z

The following table shows the strongest relationship that can be determined between example durations. The symbol <> means that the order relation is indeterminate. Note that because of leap-seconds, a seconds field can vary from 59 to 60. However, because of the way that addition is defined in Adding durations to dateTimes (§E), they are still totally ordered.

 Relation
P1Y    > P364D    <> P365D    <> P366D    < P367D
P1M    > P27D     <> P28D     <> P29D     <> P30D     <> P31D     < P32D
P5M    > P149D    <> P150D    <> P151D    <> P152D    <> P153D    < P154D

Implementations are free to optimize the computation of the ordering relationship. For example, the following table can be used to compare durations of a small number of months against days.

Months              1    2    3    4    5    6    7    8    9   10   11   12   13  ...
Days   Minimum     28   59   89  120  150  181  212  242  273  303  334  365  393  ...
       Maximum     31   62   92  123  153  184  215  245  276  306  337  366  397  ...
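
Non-normative illustration: the order relation of this section can be sketched in Python by adding each duration to the four reference dateTimes and comparing the results. In this sketch a duration is modelled as a hypothetical (months, seconds) pair, and add_duration is a simplification of the addition in Adding durations to dateTimes (§E) that suffices here only because every reference dateTime falls on the first day of a month. Leap seconds are ignored.

```python
from datetime import datetime, timedelta

REFERENCE_INSTANTS = [
    datetime(1696, 9, 1),
    datetime(1697, 2, 1),
    datetime(1903, 3, 1),
    datetime(1903, 7, 1),
]
DAY = 86400  # seconds


def add_duration(start: datetime, months: int, seconds: float) -> datetime:
    """Add months first, then seconds (valid as-is because start.day == 1)."""
    month_index = start.year * 12 + (start.month - 1) + months
    year, month0 = divmod(month_index, 12)
    return start.replace(year=year, month=month0 + 1) + timedelta(seconds=seconds)


def compare_durations(x, y) -> str:
    """Return '<', '>', '=' or '<>' (indeterminate) for (months, seconds) pairs."""
    outcomes = set()
    for s in REFERENCE_INSTANTS:
        left, right = add_duration(s, *x), add_duration(s, *y)
        outcomes.add("<" if left < right else ">" if left > right else "=")
    # Determinate only if all four reference instants agree.
    return outcomes.pop() if len(outcomes) == 1 else "<>"


def day_range(months: int):
    """Minimum and maximum number of days spanned by the given months."""
    spans = [(add_duration(s, months, 0) - s).days for s in REFERENCE_INSTANTS]
    return min(spans), max(spans)


print(compare_durations((1, 0), (0, 30 * DAY)))  # P1M against P30D
print(day_range(13))
```

Under these assumptions, day_range reproduces the minimum and maximum rows of the months-against-days table.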
3.2.6.3 Facet Comparison for durations

In comparing duration values with minInclusive, minExclusive, maxInclusive and maxExclusive facet values, indeterminate comparisons should be considered as "false".

3.2.6.4 Totally ordered durations

Certain derived datatypes of durations can be guaranteed to have a total order. For this, they must have fields from only one row in the list below and the time zone must either be required or prohibited.

  • year month
  • day hour minute second

For example, a datatype could be defined to correspond to the [SQL] datatype Year-Month interval that required a four digit year field and a two digit month field but required all other fields to be unspecified. This datatype could be defined as below and would have a total order.

<simpleType name='SQL-Year-Month-Interval'>
    <restriction base='duration'>
      <pattern value='P\p{Nd}{4}Y\p{Nd}{2}M'/>
    </restriction>
</simpleType>

3.2.7 dateTime

[Definition:]   dateTime values may be viewed as objects with integer-valued year, month, day, hour and minute properties, a decimal-valued second property, and a boolean timezoned property. Each such object also has one decimal-valued method or computed property, timeOnTimeline, whose value is always a decimal number; the values are dimensioned in seconds, the integer 0 is 0001-01-01T00:00:00 and the value of timeOnTimeline for other dateTime values is computed using the Gregorian algorithm as modified for leap-seconds. The timeOnTimeline values form two related "timelines", one for timezoned values and one for non-timezoned values. Each timeline is a copy of the ·value space· of decimal, with integers given units of seconds.

The ·value space· of dateTime is closely related to the dates and times described in ISO 8601. For clarity, the text above specifies a particular origin point for the timeline. It should be noted, however, that schema processors need not expose the timeOnTimeline value to schema users, and there is no requirement that a timeline-based implementation use the particular origin described here in its internal representation. Other interpretations of the ·value space· which lead to the same results (i.e., are isomorphic) are of course acceptable.
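
Non-normative illustration: a minimal Python sketch of a timeOnTimeline computation using the origin described above. The function name time_on_timeline is hypothetical. Python's datetime follows the proleptic Gregorian calendar, supports only years 1 to 9999, and knows nothing about leap seconds, so the sketch departs from the definition above in that respect.

```python
from datetime import datetime, timezone
from decimal import Decimal

ORIGIN = datetime(1, 1, 1)  # 0001-01-01T00:00:00 maps to 0


def time_on_timeline(value: datetime) -> Decimal:
    """Seconds from the origin as an exact decimal (leap seconds ignored)."""
    if value.tzinfo is not None:
        value = value.astimezone(timezone.utc)  # timezoned values are UTC
    delta = value.replace(tzinfo=None) - ORIGIN
    return (Decimal(delta.days) * 86400
            + delta.seconds
            + Decimal(delta.microseconds) / 1_000_000)


print(time_on_timeline(datetime(1, 1, 2)))
```

As the text notes, an implementation is free to use a different origin internally as long as the results are isomorphic.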

All timezoned times are Coordinated Universal Time (UTC, sometimes called "Greenwich Mean Time"). Other timezones indicated in lexical representations are converted to UTC during conversion of literals to values. "Local" or untimezoned times are presumed to be the time in the timezone of some unspecified locality as prescribed by the appropriate legal authority; currently there are no legally prescribed timezones which are durations whose magnitude is greater than 14 hours. The value of each numeric-valued property (other than timeOnTimeline) is limited to the maximum value within the interval determined by the next-higher property. For example, the day value can never be 32, and cannot even be 29 for month 02 and year 2002 (February 2002).

Note:  The date and time datatypes described in this recommendation were inspired by [ISO 8601]. '0001' is the lexical representation of the year 1 of the Common Era (1 CE, sometimes written "AD 1" or "1 AD"). There is no year 0, and '0000' is not a valid lexical representation. '-0001' is the lexical representation of the year 1 Before Common Era (1 BCE, sometimes written "1 BC").

Those using this (1.0) version of this Recommendation to represent negative years should be aware that the interpretation of lexical representations beginning with a '-' is likely to change in subsequent versions.

[ISO 8601] makes no mention of the year 0; in [ISO 8601:1998 Draft Revision] the form '0000' was disallowed and this recommendation disallows it as well. However, [ISO 8601:2000 Second Edition], which became available just as we were completing version 1.0, allows the form '0000', representing the year 1 BCE. A number of external commentators have also suggested that '0000' be allowed, as the lexical representation for 1 BCE, which is the normal usage in astronomical contexts. It is the intention of the XML Schema Working Group to allow '0000' as a lexical representation in the dateTime, date, gYear, and gYearMonth datatypes in a subsequent version of this Recommendation. '0000' will be the lexical representation of 1 BCE (which is a leap year), '-0001' will become the lexical representation of 2 BCE (not 1 BCE as in this (1.0) version), '-0002' of 3 BCE, etc.

Note: See the conformance note in (§3.2.6) which applies to this datatype as well.
3.2.7.1 Lexical representation

The ·lexical space· of dateTime consists of finite-length sequences of characters of the form: '-'? yyyy '-' mm '-' dd 'T' hh ':' mm ':' ss ('.' s+)? (zzzzzz)?, where

  • '-'? yyyy is a four-or-more digit optionally negative-signed numeral that represents the year; if more than four digits, leading zeros are prohibited, and '0000' is prohibited (see the Note above (§3.2.7); also note that a plus sign is not permitted);
  • the remaining '-'s are separators between parts of the date portion;
  • the first mm is a two-digit numeral that represents the month;
  • dd is a two-digit numeral that represents the day;
  • 'T' is a separator indicating that time-of-day follows;
  • hh is a two-digit numeral that represents the hour; '24' is permitted if the minutes and seconds represented are zero, and the dateTime value so represented is the first instant of the following day (the hour property of a dateTime object in the ·value space· cannot have a value greater than 23);
  • ':' is a separator between parts of the time-of-day portion;
  • the second mm is a two-digit numeral that represents the minute;
  • ss is a two-integer-digit numeral that represents the whole seconds;
  • '.' s+ (if present) represents the fractional seconds;
  • zzzzzz (if present) represents the timezone (as described below).

For example, 2002-10-10T12:00:00-05:00 (noon on 10 October 2002, Central Daylight Savings Time as well as Eastern Standard Time in the U.S.) is 2002-10-10T17:00:00Z, five hours later than 2002-10-10T12:00:00Z.
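
Non-normative illustration: the lexical form and the conversion to UTC can be sketched in Python as below. The names DATETIME_LITERAL and parse_datetime are hypothetical. The sketch is deliberately narrower than the lexical space: it rejects negative years and years of more than four digits because Python's datetime cannot represent them, and it truncates fractional seconds to microseconds.

```python
import re
from datetime import datetime, timedelta, timezone

DATETIME_LITERAL = re.compile(
    r"(-?)([0-9]{4,})-([0-9]{2})-([0-9]{2})"
    r"T([0-9]{2}):([0-9]{2}):([0-9]{2})(\.[0-9]+)?"
    r"(Z|[+-][0-9]{2}:[0-9]{2})?"
)


def parse_datetime(literal: str) -> datetime:
    """Parse a dateTime literal; timezoned results are normalized to UTC."""
    match = DATETIME_LITERAL.fullmatch(literal)
    if match is None:
        raise ValueError(f"not a dateTime literal: {literal!r}")
    sign, year, month, day, hour, minute, second, fraction, zone = match.groups()
    if year == "0000":
        raise ValueError("'0000' is not a valid year")
    if sign or len(year) > 4:
        raise ValueError("year outside the range this sketch supports")
    micro = int((fraction or ".")[1:7].ljust(6, "0"))  # truncate to microseconds
    extra_day = timedelta(0)
    if hour == "24":
        # '24' is allowed only for the first instant of the following day.
        if minute != "00" or second != "00" or (fraction or "").strip(".0"):
            raise ValueError("hour 24 requires zero minutes and seconds")
        hour, extra_day = "00", timedelta(days=1)
    tz = None
    if zone == "Z":
        tz = timezone.utc
    elif zone:
        offset = timedelta(hours=int(zone[1:3]), minutes=int(zone[4:6]))
        tz = timezone(-offset if zone[0] == "-" else offset)
    # datetime() itself rejects impossible dates such as 2002-02-29.
    value = datetime(int(year), int(month), int(day), int(hour),
                     int(minute), int(second), micro, tzinfo=tz) + extra_day
    return value.astimezone(timezone.utc) if tz is not None else value


print(parse_datetime("2002-10-10T12:00:00-05:00").isoformat())
```

Untimezoned literals are returned as naive datetime objects, which keeps the two timelines distinct.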

For further guidance on arithmetic with dateTimes and durations, see Adding durations to dateTimes (§E).

3.2.7.4 Order relation on dateTime

dateTime value objects on either timeline are totally ordered by their timeOnTimeline values; between the two timelines, dateTime value objects are ordered by their timeOnTimeline values when their timeOnTimeline values differ by more than fourteen hours, with those whose difference is a duration of 14 hours or less being ·incomparable·.

In general, the ·order-relation· on dateTime is a partial order since there is no determinate relationship between certain instants. For example, there is no determinate ordering between (a) 2000-01-20T12:00:00 and (b) 2000-01-20T12:00:00Z. Based on timezones currently in use, (a) could vary from 2000-01-20T12:00:00+12:00 to 2000-01-20T12:00:00-13:00. It is, however, possible for this range to expand or contract in the future, based on local laws. Because of this, the following definition uses a somewhat broader range of indeterminate values: +14:00..-14:00.

The following definition uses the notation S[year] to represent the year field of S, S[month] to represent the month field, and so on. The notation (Q & "-14:00") means adding the timezone -14:00 to Q, where Q did not already have a timezone. This is a logical explanation of the process. Actual implementations are free to optimize as long as they produce the same results.

The ordering between two dateTimes P and Q is defined by the following algorithm:

A. Normalize P and Q. That is, if there is a timezone present, but it is not Z, convert it to Z using the addition operation defined in Adding durations to dateTimes (§E).

  • Thus 2000-03-04T23:00:00+03:00 normalizes to 2000-03-04T20:00:00Z

B. If P and Q either both have a time zone or both do not have a time zone, compare P and Q field by field from the year field down to the second field, and return a result as soon as it can be determined. That is:

  1. For each i in {year, month, day, hour, minute, second}
    1. If P[i] and Q[i] are both not specified, continue to the next i
    2. If P[i] is not specified and Q[i] is, or vice versa, stop and return P <> Q
    3. If P[i] < Q[i], stop and return P < Q
    4. If P[i] > Q[i], stop and return P > Q
  2. Stop and return P = Q

C. Otherwise, if P contains a time zone and Q does not, compare as follows:

  1. P < Q if P < (Q with time zone +14:00)
  2. P > Q if P > (Q with time zone -14:00)
  3. P <> Q otherwise, that is, if (Q with time zone +14:00) < P < (Q with time zone -14:00)

D. Otherwise, if P does not contain a time zone and Q does, compare as follows:

  1. P < Q if (P with time zone -14:00) < Q.
  2. P > Q if (P with time zone +14:00) > Q.
  3. P <> Q otherwise, that is, if (P with time zone +14:00) < Q < (P with time zone -14:00)

Examples:

Determinate                                   Indeterminate
2000-01-15T00:00:00 < 2000-02-15T00:00:00     2000-01-01T12:00:00 <> 1999-12-31T23:00:00Z
2000-01-15T12:00:00 < 2000-01-16T12:00:00Z    2000-01-16T12:00:00 <> 2000-01-16T12:00:00Z
                                              2000-01-16T00:00:00 <> 2000-01-16T12:00:00Z
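
Non-normative illustration: steps A to D above can be sketched in Python for fully specified values. The function name compare_datetimes is hypothetical; naive datetime objects stand for untimezoned values and aware ones for timezoned values. Giving an untimezoned value the time zone +14:00 corresponds to subtracting 14 hours to reach UTC, and -14:00 to adding 14 hours. The field-by-field handling of unspecified fields in step B is not modelled.

```python
from datetime import datetime, timedelta, timezone

FOURTEEN_HOURS = timedelta(hours=14)


def _as_utc_naive(value: datetime) -> datetime:
    """Step A: normalize a timezoned value to UTC and drop the tzinfo."""
    return value.astimezone(timezone.utc).replace(tzinfo=None)


def compare_datetimes(p: datetime, q: datetime) -> str:
    """Return '<', '>', '=' or '<>' (indeterminate)."""
    p_zoned, q_zoned = p.tzinfo is not None, q.tzinfo is not None
    if p_zoned == q_zoned:                      # step B: same timeline
        return "<" if p < q else ">" if p > q else "="
    if p_zoned:                                 # step C
        p_utc = _as_utc_naive(p)
        if p_utc < q - FOURTEEN_HOURS:          # P < (Q with time zone +14:00)
            return "<"
        if p_utc > q + FOURTEEN_HOURS:          # P > (Q with time zone -14:00)
            return ">"
        return "<>"
    q_utc = _as_utc_naive(q)                    # step D
    if p + FOURTEEN_HOURS < q_utc:              # (P with time zone -14:00) < Q
        return "<"
    if p - FOURTEEN_HOURS > q_utc:              # (P with time zone +14:00) > Q
        return ">"
    return "<>"


utc = timezone.utc
print(compare_datetimes(datetime(2000, 1, 15, 12), datetime(2000, 1, 16, 12, tzinfo=utc)))
print(compare_datetimes(datetime(2000, 1, 16, 12), datetime(2000, 1, 16, 12, tzinfo=utc)))
```

Under these assumptions the two calls correspond to the second determinate and second indeterminate rows of the examples table.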

3.2.8 time

[Definition:]  time represents an instant of time that recurs every day. The ·value space· of time is the space of time of day values as defined in § 5.3 of [ISO 8601]. Specifically, it is a set of zero-duration daily time instances.

Since the lexical representation allows an optional time zone indicator, time values are partially ordered because it may not be possible to determine the order of two values one of which has a time zone and the other does not. The order relation on time values is the Order relation on dateTime (§3.2.7.4) using an arbitrary date. See also Adding durations to dateTimes (§E). Pairs of time values that either both have or both lack a time zone indicator are totally ordered.

Note: See the conformance note in (§3.2.6) which applies to the seconds part of this datatype as well.
3.2.8.1 Lexical representation

The lexical representation for time is the left truncated lexical representation for dateTime: hh:mm:ss.sss with optional following time zone indicator. For example, to indicate 1:20 pm for Eastern Standard Time which is 5 hours behind Coordinated Universal Time (UTC), one would write: 13:20:00-05:00. See also ISO 8601 Date and Time Formats (§D).

3.2.8.2 Canonical representation

The canonical representation for time is defined by prohibiting certain options from the Lexical representation (§3.2.8.1). Specifically, either the time zone must be omitted or, if present, the time zone must be Coordinated Universal Time (UTC) indicated by a "Z". Additionally, the canonical representation for midnight is 00:00:00.
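
Non-normative illustration: a small Python sketch of this canonicalization. The function name canonical_time and the date used for the conversion are arbitrary choices made for the sketch, and the trimming of trailing zeroes in fractional seconds is an assumption, since the text above does not address it.

```python
from datetime import date, datetime, time, timedelta, timezone

ARBITRARY_DATE = date(2000, 1, 2)  # any date works; only the time of day is kept


def canonical_time(value: time) -> str:
    """Render a time with no zone, or converted to UTC and marked with 'Z'."""
    if value.tzinfo is None:
        normalized, suffix = value, ""
    else:
        moment = datetime.combine(ARBITRARY_DATE, value)  # keeps the tzinfo
        normalized, suffix = moment.astimezone(timezone.utc).time(), "Z"
    text = normalized.strftime("%H:%M:%S")
    if normalized.microsecond:
        text += ("." + f"{normalized.microsecond:06d}").rstrip("0")
    return text + suffix


eastern = timezone(timedelta(hours=-5))
print(canonical_time(time(13, 20, tzinfo=eastern)))
```

With these assumptions, the literal 13:20:00-05:00 from the lexical representation example is rendered in UTC with a trailing Z.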

3.2.9 date

[Definition:]   The ·value space· of date consists of top-open intervals of exactly one day in length on the timelines of dateTime, beginning on the beginning moment of each day (in each timezone), i.e. '00:00:00', up to but not including '24:00:00' (which is identical with '00:00:00' of the next day). For nontimezoned values, the top-open intervals disjointly cover the nontimezoned timeline, one per day. For timezoned values, the intervals begin at every minute and therefore overlap.

A "date object" is an object with year, month, and day properties just like those of dateTime objects, plus an optional timezone-valued timezone property. (As with values of dateTime timezones are a special case of durations.) Just as a dateTime obje