XML Query Data Model

W3C Working Draft 15 February 2001

This version:: http://www.w3.org/TR/2001/WD-query-datamodel-20010215
Latest version:: http://www.w3.org/TR/query-datamodel
Previous version:: http://www.w3.org/TR/2000/WD-query-datamodel-20000511
Editors:: Mary Fernandez (AT&T Labs) < mff@research.att.com >
Jonathan Robie (Software AG) <Jonathan.Robie@SoftwareAG-USA.com>

Abstract

This document defines the W3C XML Query Data Model, which is the foundation of the W3C [XML Query Algebra]. Together, these two documents [will] provide a precise semantics for [XML Query Language], the XML Query Language.

Status of this document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. The latest status of this document series is maintained at the W3C.

This is a Public Working Draft for review by W3C Members and other interested parties. It is a draft document and may be updated, replaced or made obsolete by other documents at any time. It is inappropriate to use W3C Working Drafts as reference material or to cite them as other than "work in progress". This is work in progress and does not imply endorsement by the W3C membership.

This document has been produced as part of the [XML Activity], following the procedures set out for the W3C Process. The document has been written by the [XML Query Working Group].

The XML Query Working Group feels that the contents of this Working Draft are reasonably stable, and therefore encourages feedback.

Comments on this document should be sent to the W3C mailing list www-xml-query-comments@w3.org (archived at http://lists.w3.org/Archives/Public/www-xml-query-comments/).

A list of current W3C Recommendations and other technical documents can be found at http://www.w3.org/TR/.

Revision Description

1 Introduction
2 Concepts
    2.1 Types and signatures
        2.1.1 Schema types
    2.2 Post schema-validated Infoset
3 Nodes
    3.1 Data-model instance
    3.2 Document
    3.3 Elements
    3.4 Attributes
    3.5 Namespaces
    3.6 Processing instructions
    3.7 Comments
    3.8 Reference node
    3.9 Values
    3.10 Informationl items
4 Collections
    4.1 Ordered forests
    4.2 Unordered forests
5 Operators and functions on data-model values
6 Example
7 Constraints on the data model
8 References

Appendices

A XML Information Set mapping
    A.1 Notation and pseudo-code syntax
    A.2 Information items
    A.3 Document item
    A.4 Element items
    A.5 Attribute items
    A.6 Namespace items
    A.7 Processing instruction items
    A.8 Comment items
    A.9 Other information items
B Candidate Operators for XML Query Algebra
    B.1 Serialization operators
    B.2 Order operators
    B.3 Accessors for reference values
C Issues
    C.1 Issues
    C.2 Alphabetic list of issues
        C.2.1 Open Issues
        C.2.2 Resolved (or redundant) Issues

1 Introduction

This document defines the W3C XML Query Data Model, which is the foundation of the W3C [XML Query Algebra]. Together, these two documents [will] provide a precise semantics of the XML Query Language.

Several XML data models have been developed in the W3C Activities. The [XML Information Set] provides a description of the information available in a well-formed XML document. The [XPath Recommendation], which is used by both [XSL Transformations] and [XML Pointer Language (XPointer)], contains a data model and a mapping that relates the XPath data model to the XML Information Set (hence ``Infoset''). The [Document Object Model] is an API for HTML and XML documents, but it does imply an underlying abstract data model. The XML Schema Working Group is defining features, such as structures ([XMLSchema Part 1]) and datatypes ([XMLSchema Part 2]), that extend an instance of the XML Information Set with more precise type information.

The XML Query Data Model defines formally the information contained in the input to an XML Query processor; in other words, an XML Query processor evaluates a query on an instance of the XML Query Data Model. Our model is based on the XML Information Set, but it requires the following new features to meet the XML Query Working Group's requirements:

Support for XML Schema types. ([XMLSchema Part 1], [XMLSchema Part 2])
Representation of Collections of Documents and of Simple and Complex Values. ([XML Query Requirements])
Representation of References. ([XML Query Requirements])

In this document, we provide a precise and formal definition of how values in the XML Query Data Model are constructed and accessed; these operators are the foundation of the XML Query Algebra, and therefore, require a more formal treatment than is provided in the definitions of the XPath and Infoset data models. For comparison, we note wherever the XML Query Data Model differs from that of XPath, and we provide a formal definition of the mapping from the Infoset to the XML Query Data Model in Appendix A.

An instance of the XML Query data model represents one or more complete XML documents or document parts. As with the Infoset, the XML Query Data Model specifies what information in the documents is accessible, but it does not specify the programming-language interfaces or bindings used to represent or access the data. We assume that the information contained in an instance of the XML Query Data Model is static for the duration query evaluation; in particular, the meaning of a query applied to input data that changes during query evaluation is undefined.

We define the data model using a functional notation. We chose this notation because it is simple and permits a precise and concise definition of the data model. The notation is presented in the next section. Although the notation has a functional style, we emphasize that the data model can be realized in a variety of programming languages and styles, for example, as object classes and methods in an object-oriented language.

There is a close correspondence between the data model and the algebraic operations that are applied to instances of the data model. In the definition of the data model, we mention several algebraic operators. These and other candidate operators are identified in the Candidate Operators for XML Query Algebra appendix. The complete algebra will be defined in a future document.

2 Concepts

Because XML documents are tree-structured, we define the XML Query data model using conventional terminology for trees. Various mathematical representations of trees exist (see [CLR]). The graph-theoretic representation models a tree as a tuple (N, E, r), where N is a set of tree nodes, E is a one-to-many relation of edges from parent nodes to children nodes, and r is the distinguished root node of the tree. The tree-constructor representation models a tree by recursive application of a tree-creation function to a list of children trees [Knuth]. In either representation, a tree can have data associated with its nodes (node-labeled), with its edges (edge-labeled), or with both. The XPath and Infoset data models are examples of node-labeled, tree-constructor representations.

The XML Query Data Model (hence ``data model'') is a node-labeled, tree-constructor representation, but also includes a concept of node identity. The addition of node identity simplifies the representation of XML reference values, e.g., IDREF, XPointer, and URI values. Unlike the graph model, the data model does not support both nodes and edges as first-class concepts. We note, however, whenever a graph model may be more flexible or concise.

2.1 Types and signatures

The data model defines the structure of various kinds of tree nodes; functions to construct tree nodes, called constructors; and functions to access nodes' structure, called accessors. This presentation is similar to that in the definition of the XSLT data model [Wadler].

We use the terms node type to refer to the kind of a tree node; schema type to refer to an XML Schema type; and unit type to refer to either a node or schema type. We use the term collection type to refer to ordered forests, unordered forests, and disjoint unions. The term data-model type refers to any node, schema, or collection type. The term signature refers to a constructor's or accessor's function type, i.e., the function's zero or more input types and its one output type.

To make the definition of the data model concise, we use the following notation:

NodeType

SimpleSchemaType

UnitType ::= NodeType the union of node types and schema types

| SimpleSchemaType

DataModelType ::= UnitType a node type or schema type

| [ UnitType ] an ordered forest of values with type UnitType

| { UnitType } an unordered forest of values with type UnitType

| UnitType1 | ... | UnitTypeN the disjoint union of values with types UnitType1,..., UnitTypeN

| TypeName a type name

Both ordered forest and unordered forests may contain duplicates.

The expression TypeName = DataModelType denotes that the string TypeName is a type name that represents the type DataModelType. A type name may appear wherever its associated type is expected.

An occurrence of the type symbol DataModelType denotes a data model value or instance of that type.

The signatures of constructors and accessors use the following notation:

f : (V1, ..., Vm) -> Vn denotes a function f that takes values with types V1, ..., Vm and returns an output value with type Vn

2.1.1 Schema types

Nodes contain values that are instances of schema types. The following symbols denote schema types:

SchemaType ::= SimpleSchemaType

| ComplexSchemaType

A simple type is either primitive (e.g., string, boolean, float, double, ID, IDREF) or derived (e.g., language, NMTOKEN, long, etc., or user defined) [XMLSchema Part 1]. A complex type defines the structure and content of an element [XMLSchema Part 2].

It may be necessary for a query to have access to a value's type during query execution, i.e., at run time. Because an instance of an XML Schema is also an XML document (see XML Representation of Schemas and Schema Components in [XMLSchema Part 1]), a document's schema can be represented as an instance of the data model.

Def_Type = RefNode denotes a reference to the data model representation of schema type Type.

Because Def_Type is a reference to an element value in the data model, i.e., an ElemNode, all of the accessors defined for ElemNode can be applied to the referents of Def_Type values.

2.2 Post schema-validated Infoset

We assume that an instance of the data model is derived from an instance of the XML Information Set [XML Information Set] after XML Schema validation. Such an instance is called a ``PSV Infoset'', for post schema-validated Infoset.

Schema information may be associated with a document in two ways. A document may refer directly to a DTD, and a document may be paired with an XML Schema, which may be derived from one or more XML Schema documents. It is also possible for an XML document to have no DTD or associated schema, i.e., it is ``schemaless''. For a document, schema pair, the PSV Infoset provides the schema. For a document that refers directly to a DTD, we assume a mapping from the DTD to an equivalent schema is provided, and the PSV Infoset provides this equivalent schema. For a schemaless document, a default schema is provided by the data model, and we specify in this document that default schema. In all three cases, a schema can be provided for the input document and the corresponding data-model instance.

3 Nodes

The basic concept in the data model is a Node. A Node is one of nine node types: document, element, value, attribute, namespace (NS), processing instruction (PI), comment, information item, or node reference. This definition specifies that a Node is the disjoint union of nine node types:

Node =  DocNode | ElemNode | ValueNode   | AttrNode
     |  NSNode  | PINode   | CommentNode | InfoItemNode | RefNode

The nine node types are defined in the following subsections. Note that an implementation of the data model in an object-oriented language might choose to make Node an interface (or an abstract class) and to make each node type a concrete class that implements the Node interface (or a sub-class of the abstract class Node).

A Node has nine accessors, each of which returns true if the Node value is of the specified type.

isDocNode      : Node -> boolean
isElemNode     : Node -> boolean
isValueNode    : Node -> boolean
isAttrNode     : Node -> boolean
isNSNode       : Node -> boolean
isPINode       : Node -> boolean
isCommentNode  : Node -> boolean
isInfoItemNode : Node -> boolean
isRefNode      : Node -> boolean

3.1 Data-model instance

An XML document is represented by its distinguished document node, which is a DocNode. A document part is a sub-tree of a document and is represented by an ElemNode, ValueNode, PINode, or CommentNode. An instance of the data model represents one or more complete documents or document parts and may be unordered or ordered. Therefore, a data-model Instance is an ordered or unordered forest of one or more nodes:

Instance = [ DocNode | ElemNode | ValueNode | PINode 
           | CommentNode | RefNode ] 
         | { DocNode | ElemNode | ValueNode | PINode 
            | CommentNode| RefNode  }

3.2 Document

A document is represented by a unique DocNode, which corresponds to a document information item in the Infoset. A DocNode has the constructor docNode, which takes a URI reference and a non-empty ordered forest of its children nodes:

docNode  : (uriReference, [ ElemNode | PINode | CommentNode]) 
         -> DocNode

The ordered forest of children nodes may contain zero or more PINodes and CommentNodes and must contain exactly one ElemNode.

The accessor uri returns a DocNode's URI reference value, and the accessor children returns a DocNode's ordered forest of children:

uri      : DocNode -> uriReference
children : DocNode -> [ ElemNode | PINode | CommentNode ]

The root accessor returns the unique ElemNode in a DocNode's children ordered forest:

root     : DocNode -> ElemNode

3.3 Elements

An ElemNode has two constructors named elemNode. The first constructor takes a tag value, an unordered forest of NSNode namespaces, an unordered forest of AttrNode attributes, an ordered forest of children nodes, and a reference to the node's type:

elemNode : (QNameValue, { NSNode }, { AttrNode }, 
            [ ElemNode | ValueNode | PINode | CommentNode | InfoItemNode | RefNode ], 
            Def_Type)
         -> ElemNode

The second constructor takes an unordered set of NSNode namespaces, an ordered forest of nodes that includes the element's attributes and children nodes, and a reference to the node's type. In the ordered forest of nodes, attributes must precede all other nodes:

elemNode : (QNameValue, { NSNode },  
            [ AttrNode | ElemNode | ValueNode | PINode | CommentNode | InfoItemNode | RefNode ], 
            Def_Type)
         -> ElemNode

An ElemNode's tag is a qualified name value, QNameValue. An ElemNode's set of namespaces contain one namespace node for each distinct namespace that is declared explicitly on the element. The node's children is an ordered forest of ElemNode, ValueNode, PINode, CommentNode, and InfoItemNode values.

We assume that the element is an instance of the type represented by Def_Type, i.e., the document ``type checks'' or is valid with respect to the given schema. For a schemaless document, the data model defines a default schema; in this case, the Def_Type value for all ElemNodes is the ur-type definition [XMLSchema Part 1].

The accessors name, namespaces, attributes, children, and type returns an ElemNode's constituent parts. The nodes accessor returns an ordered forest containing the element's attributes followed by all its children.

name       : ElemNode -> QNameValue
namespaces : ElemNode -> { NSNode }
attributes : ElemNode -> { AttrNode }
children   : ElemNode -> [ ElemNode | ValueNode | PINode 
                         | CommentNode | InfoItemNode | RefNode ] 
nodes      : ElemNode -> [ AttrNode | ElemNode | ValueNode | PINode 
                         | CommentNode | InfoItemNode | RefNode ] 
type       : ElemNode -> Def_Type

The accessor parent returns a the unique parent of an ElemNode or if no parent exists, the NaR, not a reference value. If an ElemNode is the root node of a document, parent will return the corresponding DocNode, unless a document node does not exist, in which case parent returns NaR, the ``not a reference'' value.

parent     : ElemNode -> ElemNode | DocNode | NaR

NOTE: The XPath data model does not model the complex type of a node. This definition adds Def_Type to an element node, and it also adds ValueNode and InfoItemNode as permissible children of an element node.

3.4 Attributes

An AttrNode has the constructor attrNode, which takes the attribute's name and value:

attrNode : (QNameValue, ValueNode) -> AttrNode

The accessors name and value, return an attribute's constituent parts:

name     : AttrNode  -> QNameValue
value    : AttrNode  -> ValueNode

For a schemaless document, the data model defines a default schema; in this case, the type of all AttrNode's values is the simple type

<simpleType base="string"/>.

The accessor parent returns the containing element of an AttrNode, or if no parent exists, the NaR, not a reference value.

parent   : AttrNode -> ElemNode | NaR

NOTE: The XPath data model only permits attribute values to be strings. This definition permits attribute values to be of any simple type.

3.5 Namespaces

An NSNode has the constructor nsNode, which takes an namespace prefix, which may be null, and the absolute URI of the namespace being declared, which may be null:

nsNode : (string | null, uriReference | null) -> NSNode

The accessors prefix and uri return a NSNode's constituent parts:

prefix : NSNode -> string | null
uri    : NSNode -> uriReference | null

A namespace node may contain: a non-null prefix and a non-null uri; a null prefix and a non-null uri; or a null prefix and a null uri. A namespace node may not contain a non-null prefix and a null uri.

The accessor parent returns a the containing element of an NSNode.

parent : NSNode -> ElemNode

3.6 Processing instructions

A PINode has the constructor piNode, which takes a local name that is the processing instruction's target and a string value:

piNode : (string, string) -> PINode

The local name is a string value that must have type NCNAME, i.e., its type value refers to the definition of the derived type NCName :

<simpleType name="NCName" base="Name"/>

The accessors target, value return a PINode's constituent parts.

target : PINode -> string
value  : PINode -> string

The accessor parent returns a PINode's unique parent, or NaR, if it does not exist:

parent : PINode -> ElemNode | DocNode  | NaR

3.7 Comments

A CommentNode has the constructor commentNode, which takes a string value:

commentNode : string -> CommentNode

The accessor value returns a CommentNode's value, and the accessor parent returns a CommentNode's unique parent, or NaR, if it does not exist.

value    : CommentNode -> string
parent   : CommentNode -> ElemNode | DocNode | NaR

3.8 Reference node

The data model provides reference nodes as a mechanism to encapsulate the identity of nodes in a given instance of the data model. A RefNode has the constructor refNode, which takes a a Node.

refNode    : Node    -> RefNode

The accessor deref returns the the node referred to by a reference node.

deref      : RefNode -> Node

The accessor parent returns a the unique parent of a RefNode or if no parent exists, the NaR, not a reference value.

parent     : RefNode -> ElemNode | NaR

The function ref is surjective, i.e., it is onto. The function deref is the inverse of ref, i.e., for all nodes n, n == deref(ref(n)) is true, where == is an implementation-dependent equality operator over node identity.

The mechanism for implementing a reference node is implementation dependent, for example, a reference node might be represented by a key value, an object identifier, an XPointer value, etc.

Some accessors map values (e.g., IDREF or key values) into reference values. If a key value does not represent a valid reference, the ``not a reference'' value, NaR, may be used.

3.9 Values

A ValueNode is the disjoint union of fourteen kinds of simple-type values; each value kind has an associated constructor.

ValueNode = StringValue  | BoolValue     | FloatValue    | DoubleValue   
          | DecimalValue | TimeDurValue  | RecurDurValue | BinaryValue 
          | URIRefValue  | IDValue       | IDREFValue    | QNameValue    
          | ENTITYValue  | NOTATIONValue

The data model provides constructors for the fourteen primitive XML Schema datatypes:

stringValue  : (string, Def_string, [InfoItemNode]) -> StringValue
boolValue    : (boolean, Def_boolean)                    -> BoolValue
floatValue   : (float, Def_float)                        -> FloatValue
doubleValue  : (double, Def_double)                      -> DoubleValue
decimalValue : (decimal, Def_decimal)                    -> DecimalValue
timeDurValue : (timeDuration, Def_TimeDuration)          -> TimeDurValue
recurDurValue: (recurringDuration, Def_recurringDuration)-> RecurDurValue
binaryValue  : (binary, Def_binary)                      -> BinaryValue
urirefValue  : (uriReference, Def_uriReference)          -> URIRefValue
idValue      : (ID, Def_ID)                              -> IDValue
idrefValue   : (IDREF, Def_IDREF)                        -> IDREFValue
qnameValue   : (uriReference | null, NCName, Def_QName)  -> QNameValue
entityValue  : (ENTITY, Def_ENTITY)                      -> ENTITYValue
notationValue: (NOTATION, Def_NOTATION)                  -> NOTATIONValue

Ed. Note: We note here that the data model does not currently represent key values and key reference values as described in XML Schema Part 1 : Structures [XMLSchema Part 1]. In a future draft of this document, keys and key references will be represented in the data model.

[A XML Information Set mapping] specifies how ValueNodes are constructed from information items. The data model assumes that the character information items for an atomic value (e.g., string, integer, floating-point number) are not interleaved with other information items (e.g., PIs or comments). If the character information items for an atomic value are interleaved with other information items (e.g., PIs or comments), those information items are not preserved in the data model.

Derived datatypes are modeled as instances of their primitive base type. In particular, wherever an instance of Def_S is expected, we assume an instance of S' is permitted if and only if S' is derived from simple type S. For example, the derived simple type long has base type integer, so an instance of long can be modeled by an IntegerValue whose type is Ref(Def_long), i.e., it refers to the definition of the derived type long:

<simpleType name="long" base="integer"/>

The explicit enumeration of each kind of value guarantees that the type of the value's instance and its type representation correspond. For example, a StringValue always contains a string value and a reference to the representation of a primitive or derived type whose base type is string. For example, a product's ``Sku'' number could be represented by the string value:

StringValue("926-AA", Ref(Def_Sku), [])

where the Sku type is derived from string:

<simpleType name="Sku" base="string">
  <pattern value="\d{3}-[A-Z]{2}"/>
</simpleType>

The accessor type returns a ValueNode's type:

type      : ValueNode   -> Def_S

The accessor parent returns a ValueNode's unique parent, or NaR, if it does not exist:

parent : ValueNode -> ElemNode | DocNode  | NaR

The accessor string returns a ValueNode's string representation:

string    : ValueNode   -> StringValue

The accessor infoItems returns the ordered forest of information items associated with a StringValue; this ordered forest may contain the character information items from which the string value was created:

infoItems : StringValue -> [ InfoItemNode ]

The accessors localname and namespace return a QNameValue's constituent parts:

localname : QNameValue  -> NCName
namespace : QNameValue  -> uriReference | null

The accessor referent returns an ordered forest of element nodes associated with an IDREFValue or URIRefValue:

referent  : (IDREFValue | URIRefValue) -> [ ElemNode | NaR ]

If a referent is not in the data-model instance, referent returns NaR, the ``not a reference'' value. In a data-model instance of a PSV infoset, every IDREF and keyref value is guaranteed to refer to a ElemNode in the data-model instance. This may not be the case for a URI value, which could refer to an element in an arbitrary document that is not contained in the data-model instance. In this case, referent may return an ordered forest containing NaR.

A ValueNode has fifteen additional accessors, each of which returns true if the ValueNode is of the specified type.

isStringValue   : ValueNode -> boolean 
isBoolValue     : ValueNode -> boolean 
isFloatValue    : ValueNode -> boolean 
isDoubleValue   : ValueNode -> boolean 
isDecimalValue  : ValueNode -> boolean 
isTimeDurValue  : ValueNode -> boolean 
isRecurDurValue : ValueNode -> boolean 
isBinaryValue   : ValueNode -> boolean 
isURIRefValue   : ValueNode -> boolean 
isIDValue       : ValueNode -> boolean 
isIDREFValue    : ValueNode -> boolean 
isQNameValue    : ValueNode -> boolean 
isENTITYValue   : ValueNode -> boolean 
isNOTATIONValue : ValueNode -> boolean

NOTE: ValueNode replaces the XPath data model's text node.

3.10 Informationl items

An InfoItemNode has the constructor infoItemNode:

infoItemNode : InfoItem -> InfoItemNode

An InfoItemNode is an opaque value that preserves those information items that are not of interest to the data model or query language, but are required by the underlying Infoset implementation. It has no accessors.

4 Collections

In addition to Nodes, the data model supports two kinds of collections: ordered forests and unordered forests. It is convenient to think of ordered forests as lists and unordered forests as bags. Unlike conventional lists and bags, however, collections are ``flat'', i.e., collections may not be nested in other collections. Collections may contain duplicate values.

4.1 Ordered forests

The constructor [] constructs the empty ordered forest. The constructor cons creates a new ordered forest containing the unit value (i.e., a Node or SimpleSchemaType) of its first argument followed by the values in its second argument. The append constructor creates a new ordered forest containing the values in the its second argument appended to the values in its first argument.

[]     : [ UnitType ]
cons   : (UnitType, [ UnitType ] ) -> [ UnitType ]
append : ([ UnitType ], [ UnitType ]) -> [ UnitType ]

For convenience, the expression [ v ], where v is a unit value, denotes cons(v, []), and [ l ] denotes l, where l is an ordered forest, that is, [] is the identity function on ordered forests.

For example,

[ [ 1 ] ] = [ 1 ]
append(append([ 1 ], [ 2 ]), [ 3 ]) = [ 1, 2, 3 ]

An ordered forest has three accessors. The empty accessor returns true if its argument is the ordered forest list and false otherwise. The head accessor returns the first value in a non-empty ordered forest. The tail accessor returns all items in a non-empty ordered forest excluding its first item.

empty : [ UnitType ] -> boolean
head  : [ UnitType ] -> UnitType
tail  : [ UnitType ] -> [ UnitType ]

The value equality operator = on ordered forests is defined as follows:

fun equal(l1, l2) { 
  if empty(l1) and empty(l2) then 
    true
  else if not empty(l1) and not empty(l2) then {
    if head(l1) = head(l2) then 
      equal(tail(l1), tail(l2))
    else false
  } else false
}

4.2 Unordered forests

The constructor {} constructs an empty unordered forest. The constructor add creates a new unordered forest containing the unit value of its first argument and the values in its second argument.

{}  : { UnitType } 
add : (UnitType, { UnitType } ) -> { UnitType }

An unordered forest has two accessors. The empty accessor returns true if its argument is the empty unordered forest and false otherwise. The some accessor returns any value in a non-empty unordered forest; it is a non-deterministic function.

empty : { UnitType } -> boolean
some  : { UnitType } -> UnitType

For convenience, the expression { v }, where v is a unit value, denotes add(v, {}), and { b } denotes b, where b is an unordered forest.

The function union creates a new unordered forest containing the disjoint union of the values in each of its arguments. The function diff creates a new unordered forest containing the values that occur in its first argument, but not in its second argument. The function intersect creates a new unordered forest containing the values that occur in both its first and second arguments. The three functions use the identity equality operator '==' to compare values.

union     : ({ UnitType }, { UnitType }) -> { UnitType }
diff      : ({ UnitType }, { UnitType }) -> { UnitType }
intersect : ({ UnitType }, { UnitType }) -> { UnitType }

5 Operators and functions on data-model values

In addition to constructors and accessors, the data model provides several basic functions on data model values.

The data model includes two infix equality operators on all DataModelTypes: = and ==. The = operator denotes equality of node values, and the == operator denotes equality of node identities.

=   : (DataModelType, DataModelType) -> boolean
==  : (DataModelType, DataModelType) -> boolean

The data model includes one infix inequality operator on NodeTypes: <. The < operator compares the document order of two nodes in a document. Document (or global) order refers to the total order among all elements in a document. Global order is defined as the order of appearance of the element nodes when performing a pre-order, depth-first traveral of a the document. This corresponds to the order of appearance of their opening tags in the XML serialization and is equivalent to the definition used in [XPath Recommendation]. The < operator returns true if the first node is before (and different from) the second node in document order. It returns false if the first node is equal to or after the second node in document order. Its result is undefined if the two nodes are in different documents.

<   : (NodeType, NodeType) -> boolean

It is often useful to make ordered forests unordered and unordered forests ordered. The listtobag function takes any ordered forest and returns an unordered forest. The bagtolist function takes any unordered forest and returns an ordered forest whose nodes are sorted by document order (e.g., by using the < operator defined above).

listtobag : [ UnitType ] -> { UnitType }
bagtolist : { UnitType } -> [ UnitType ]

The copy function takes any data-model value and returns a copy of that value. For Node values, a ``deep'' copy of the node and all of its descendent nodes is returned. For collections, a new collection is constructed that contains a copy of every value in the collection.

copy : DataModelType -> DataModelType

6 Example

We use the following XML document to illustrate the information contained in an instance of the data model:

<?xml version=1.0?>
<p:part xmlns:p="http://www.mywebsite.com/PartSchema"
      xsi:schemaLocation = "http://www.mywebsite.com/PartSchema
                            http://www.mywebsite.com/PartSchema"
      name="nutbolt">
  <mfg>Acme</mfg>
  <price>10.50</price>
</p:part>

The XML schema for this document is:

<xsd:schema xmlns:xsd="http://www.w3.org/1999/XMLSchema"
  targetNamespace="http://www.mywebsite.com/PartSchema">
  <xsd:element name="part" type="part_type">
    <xsd:complexType name="part_type">
      <xsd:element name = "mfg" type="xsd:string"/>
      <xsd:element name = "price" type="xsd:decimal"/>
      <xsd:attribute name = "name" type="xsd:string"/>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>

For this example, we chose an XML document that includes an XML Schema to illustrate the relationship between document content and its associated schema type information. In general, an XML Schema is not required, that is, the data model can represent a schemaless, well-formed XML document with the default typing described in this document.

The XML document is represented by the data-model accessors below. The value D1 represents a DocNode; the values E1, E2, etc. represent ElemNodes; the values A1, etc. represent AttrNodes; the values N1, etc. represent NSNodes.

children(D1)   = [ E1 ]
root(D1)       = E1
                                       
name(E1)       = QNameValue("http://www.mywebsite.com/PartSchema", 
                            "part", Ref(Def_QName))
children(E1)   = [ E2, E3 ]  
attributes(E1) = { A1 }           
namespaces(E1) = { N1 }           
type(E1)       = Ref(Def_part_type)
parent(E1)     = D1

name(A1)       = QNameValue(null, "name", Ref(Def_QName))
value(A1)      = StringValue("nutbolt", Ref(Def_string))
parent(A1)     = E1
                                  
prefix(N1)     = StringValue("p", Ref(Def_string))
uri(N1)        = URIRefValue("http://www.mywebsite.com/PartSchema", 
                             Ref(Def_uriReference))
parent(N1)     = E1

name(E2)       = QNameValue(null, "mfg", Ref(Def_QName))
children(E2)   = [ StringValue("Acme", Ref(Def_string)) ]
attributes(E2) = {}
namespaces(E2) = {}
type(E2)       = Ref(Def_string)
parent(E2)     = E1
				       
name(E3)       = QNameValue(null, "price", Ref(Def_QName))
children(E3)   = [ DecimalValue(10.50, Ref(Def_decimal)) ]
attributes(E3) = {}
namespaces(E3) = {}
type(E3)       = Ref(Def_decimal)
parent(E3)     = E1

A graphical depiction of the data-model instance, containing the information described in the text above, is also included.

Graphical depiction of data model instance.

Recall that an XML schema is also an XML document, which can be represented in the data model. The complex and simple types in the example schema are represented below:

name(Def_part_type)       = QNameValue("http://www.w3.org/1999/XMLSchema", 
                                       "complexType", Ref(Def_QName))
children(Def_part_type)   = [ E4, E5, E6 ]
attributes(Def_part_type) = { A2 }
namespaces(Def_part_type) = {}
type(Def_part_type)       = Ref(Def_complexType)

name(A2)       = QNameValue(null, "name", Ref(Def_QName))
value(A2)      = StringValue("part_type", Ref(Def_NCName), [])

name(E4)       = QNameValue("http://www.w3.org/1999/XMLSchema", 
                            "element", Ref(Def_QName))
children(E4)   = []
attributes(E4) = { A3, A6 }
namespaces(E4) = {}
type(E4)       = Ref(Def_element)

name(A3)       = QNameValue(null, "name", Ref(Def_QName))  
value(A3)      = StringValue("mfg", Ref(Def_NCName), [])

name(A6)       = QNameValue(null, "type", Ref(Def_QName))       
value(A6)      = QNameValue("http://www.w3.org/1999/XMLSchema", 
                                "string", Ref(Def_QName), [])

name(E5)       = QNameValue("http://www.w3.org/1999/XMLSchema", 
                            "element", Ref(Def_QName))    
children(E5)   = []
attributes(E5) = { A4, A7 }
namespaces(E5) = {}
type(E5)       = Ref(Def_element)

name(A4)       = QNameValue(null, "name", Ref(Def_QName))
value(A4)      = StringValue("price", Ref(Def_NCName))

name(A7)       = QNameValue(null, "type", Ref(Def_QName))  
value(A7)      = QNameValue("http://www.w3.org/1999/XMLSchema", 
                                "decimal", Ref(Def_QName), [])

name(E6)       = QNameValue("http://www.w3.org/1999/XMLSchema", 
                            "attribute", Ref(Def_QName))  
children(E6)   = []
attributes(E6) = { A5, A8 }
namespaces(E6) = {}
type(E6)       = Ref(Def_attribute)

name(A5)       = QNameValue(null, "name", Ref(Def_QName))       
value(A5)      = StringValue("name", Ref(Def_NCName), [])

name(A8)       = QNameValue(null, "type", Ref(Def_QName))       
value(A8)      = QNameValue("http://www.w3.org/1999/XMLSchema", 
                                "string", Ref(Def_QName), [])

For conciseness, we eliminate the definitions of builtin schema components, e.g., complexType, and of primitive simple types, e.g., string.

7 Constraints on the data model

A data-model instance must satisfy the following constraints:

Node references: We assume that the representation of a node reference is defined by the query system that implements the data model, not by the query language itself. Node references are not serialized, i.e., they exist only for use by the query system, and they are not guaranteed to be globally unique or persistent, although some implementation of the data model may choose to support persistent node references. Multiple techniques may exist for implementing references, and there may be multiple techniques for implementing identity.
Node identity: the function ref is one-to-one and onto, i.e., n1 == n2 holds if and only if n1 and n2 are the same node. We assume that the operator == is an implementation-dependent equality operator over node references.
Unique parent: the parent accessor is a many-to-one function, i.e., a node has exactly one parent.
Parent-child relationships: Given two ElemNodes p and n, parent(n) == p holds if and only if n is in children(p).
Similarly, given a DocNode d and a node n, parent(n) == d if and only if n is in children(d).
Finally, given a AttrNode (or NSNode) a and a node n, parent(a) == n if and only if a is in attributes(n) (namespaces(n)).
Duplicate-free list of children: Given a node n and any two nodes r1 and r2 at distinct positions in children(n), not(r1 == r2) must hold i.e., the ordered list of children nodes is duplicate free.
Element constructor : in the ordered forest of nodes that includes the element's attributes and children nodes, attributes must precede all other nodes.
Value nodes: [A XML Information Set mapping] specifies how ValueNodes are constructed from information items. The data model assumes that the character information items for an atomic value (e.g., string, integer, floating-point number) are not interleaved with other information items (e.g., PIs or comments). If the character information items for an atomic value are interleaved with other information items (e.g., PIs or comments), those information items are not preserved in the data model.

8 References

CLR: T. Cormen, C. Leiserson, R. Rivest, Introduction to Algorithms, MIT Press, 1991.
Document Object Model: World Wide Web Consortium, Document Object Model. http://www.w3.org/TR/DOM-Level-2-Core/
Knuth: D. Knuth, The Art of Computer Programming, Vol. 1, Addison-Wesley, 1973.
Wadler: P. Wadler, A formal semantics of patterns in XSLT. Available at: http://www.cs.bell-labs.com/who/wadler/papers/xsl-semantics/xsl-semantics.pdf
Canonical XML: World Wide Web Consortium, Canonical XML. http://www.w3.org/TR/xml-c14n.
XML Activity: World Wide Web Consortium, XML Activity. http://www.w3.org/XML/
XML Information Set: World Wide Web Consortium, XML Information Set (Infoset). Available at: http://www.w3.org/TR/xml-infoset
XPath Recommendation: World Wide Web Consortium, XML Path Language (XPath). Available at: http://www.w3.org/TR/xpath.html
XML Pointer Language (XPointer): World Wide Web Consortium, XML Pointer Language (XPointer). Available at: http://www.w3.org/TR/xptr.
XSL Transformations: World Wide Web Consortium, XML Path Language (XPath). http://www.w3.org/TR/xslt
XMLSchema Part 1: World Wide Web Consortium, XML Schema Part 1 : Structures. Available at: http://www.w3.org/TR/xmlschema-1
XMLSchema Part 2: World Wide Web Consortium, XML Schema Part 2 : Datatypes. Available at: http://www.w3.org/TR/xmlschema-2
XML Query Algebra: World Wide Web Consortium, XML Query Algebra. Available at: http://www.w3.org/TR/query-algebra/
XML Query Working Group: World Wide Web Consortium, XML Query Working Group. http://www.w3.org/XML/Activity#query-wg
XML Query Language: World Wide Web Consortium, XML Query Language. Available at: http://www.w3.org/TR/xquery/
XML Query Requirements: World Wide Web Consortium, XML Query Requirements. Available at: http://www.w3.org/TR/2000/WD-xmlquery-req-20000131

A XML Information Set mapping

The XML Information Set (http://www.w3.org/TR/xml-infoset) provides a description of the information available in a well-formed XML document. In this appendix, we define the Infoset in the same notation as the data model, and we define functions that map from an instance of the Infoset to an instance of the data model in a pseudo-code notation.

Ed. Note: In a future draft of this document, a mapping from the XML Query data model into the Infoset will be provided.

A.1 Notation and pseudo-code syntax

We name accessors of the PSV Infoset using the convention psv_<item-name>_<property> . For example, psv_elem_attributes is the accessor that returns an element item's attributes property. The comments in the accessor definitions specify whether a property is core or peripheral.

The following operations are defined on lists and are used in the function definitions:

[] constructs the empty list.
[n] constructs a list with one element n.
h :: t constructs a list with head element h and tail list t.

It is often necessary to deconstruct a list value, for example, into its head element and tail list. We use the same notation to match and deconstruct list values. For example, this function maps all "a" values in a list to "b" values by deconstructing and matching the argument list L:

 
fun a2b(L) {
 case L of []           => []
        |  [ "a" ] :: t => [ "b" ] :: (a2b t)
        |  h :: t       => h :: (a2b t)
}
a2b [ "a", "c", "a" ] = [ "b", "c", "b" ]

The case expression matches its list argument L against each list pattern and evaluates the right-hand-side of the first pattern that matches L.

The function concat_string concatenates two strings:

concat_string : (string, string) -> string

For example,

list_append [ 1, 2 ] [ 3, 4 ] = [ 1, 2, 3, 4 ]
concat_string "First" "Last" = "FirstLast"

The function list_flatten takes a list of lists and returns the flattened list:

list_flatten : [ [ U ] ] -> [ U ]

For example,

list_flatten([ [1], [2, 3] ]) = [ 1, 2, 3 ]

The function set_map applies a function to every member of a set and returns the transformed set; list_map is analogous for lists:

set_map  : ((U1 -> U2), { U1 }) -> { U2 }
list_map : ((U1 -> U2), [ U1 ]) -> [ U2 ]

For example, given the function addOne:

fun add_one(x) { return x + 1 }
set_map  add_one { 1, 2 } = { 2, 3 }
list_map add_one [ 1, 2 ] = [ 2, 3 ]

A.2 Information items

The basic concept in the Infoset is an InfoItem. An InfoItem is one of ten item types:

InfoItem = DocItem      | ElemItem    | CDATAItem   | AttrItem     
         | NSItem       | PIItem      | CommentItem | SkipEntityItem 
         | EntityMarker | CDATAMarker

No constructors are defined for Item values, because they are constructed by an Infoset processor, not the data model.

A.3 Document item

The following accessors are defined on a document information item:

/* core */
psv_doc_children : DocItem -> [ /* core */ 
                                PIItem | ElemItem 
                                /* peripheral */ 
                              | CommentItem | DocTypeItem ]
psv_doc_notations : DocItem-> { NotationItem }
psv_doc_base_uri  : DocItem -> uriReference
 
/* core : unparsed entities; peripheral : parsed entities */
psv_doc_entities  : DocItem -> { EntityItem }

Exactly one ElemItem is required in the psv_doc_children list and it is the root element of the document tree.

The following pseudo-code documents the mapping from an instance of the PSV infoset to an instance of the data model. The function dm_docNode maps a document information item (DocItem) to a document node (DocNode):

dm_docNode : DocItem -> DocNode
fun dm_docNode(d) { 
  kids = list_map dm_node (psv_doc_children d)
  return docNode(uriValue(psv_doc_base_uri d, ref(Def_uriReference)), kids)
}

In the definition of kids, list_map applies the function dm_node to each child of the DocItem value d; this constructs a new list of children nodes, each of which has type Node. The application of the constructor docNode constructs the document node in the data model.

A.4 Element items

The following accessors are defined on element information items:

/* core */
psv_elem_uri       : ElemItem -> uriReference
psv_elem_local_name: ElemItem -> string 

psv_elem_children  : ElemItem -> [ /* core */
                                   PIItem
                                 | ElemItem
                                 | SkipEntityItem
                                 | CDATAItem
  
                                   /* peripheral */
                                 | CommentItem
                                   /* start, end marker pairs : */
                                 | (EntityMarker, EntityMarker)
                                 | (CDATAMarker, CDATAMarker)
                                 ]
/* core */
psv_elem_attrs       : ElemItem -> { AttrItem } 
psv_elem_decl_ns     : ElemItem -> { NSItem } /* declared namespaces */
psv_elem_base_uri    : ElemItem -> uriReference
psv_elem_parent      : ElemItem -> (ElemItem | DocItem)

/* Infoset item for schema type */
psv_elem_schema_type : ElemItem -> ElemItem

/* peripheral */
psv_elem_inscope_ns  : ElemItem -> { NSItem } /* in-scope namespaces */

The function dm_elemNode maps an ElemItem to an ElemNode:

dm_elemNode : ElemItem -> ElemNode
fun dm_elemNode(e) { 
  name      = qnameValue(psv_elem_uri e, psv_elem_local_name e, Def_QName))
  nsnodes   = set_map dm_nsNode (psv_elem_decl_ns e)
  attrnodes = set_map dm_attrNode (psv_elem_attrs e)
  kids      = dm_collapse_strings(list_map dm_node (psv_elem_children e))
  type      = dm_schema_type(psv_elem_schema_type e)
  newkids   = if (isSimpleType(type)) (list_map (dm_coerce_string type) kids) 
              else kids 
  return elemNode(name, nsnodes, attrnodes, newkids, type)
}

This function extracts components from the ElemItem, transforms them into instances of data-model values, and then constructs an ElemNode value.

The function dm_schema_type returns a reference to the Def_Type value that corresponds to the schema type represented by its information-item argument.

dm_schema_type   : ElemItem -> Def_Type

We assume above that if an element's type is a simple type, its list of children must contain one string value. The function dm_coerce_string coerces its string-valued second argument into an instance of the simple type referenced by its first argument.

dm_coerce_string : (Def_S, StringValue) -> S

Ed. Note: This mapping does not capture an element item's keyref values, which are maintained in the element item's identity constraints table. The identity constraints table is created for ``bookkeeping purposes'' by an XML Schema processor (see [XMLSchema Part 1]). The mapping of an element item's identity constraints table into keyref values will be specified in a future draft of this document.

The function dm_node maps an information item to a reference to a data-model node. All information items are preserved in the data model, but the document type, skipped entity reference, entity markers, and character data markers are represented by the opaque data type InfoItemNode.

dm_node : InfoItem -> Node
fun dm_node(i) { 
  case i of 
    PIItem e            => dm_piNode e
  | ElemItem e        => dm_elemNode e
  | CommentItem e     => dm_commNode e
  | CharDataItem e  => dm_char2string e
  | r as DocTypeItem     => infoItemNode r
  | r as SkipEntityItem  => infoItemNode r
  | p as (EntityMarker, EntityMarker) => infoItemNode p
  | p as (CDATAMarker,  CDATAMarker)  => infoItemNode p
}

The function dm_char2string takes one CDATA item and maps it to a StringValue with a string value of length one. Note that the StringValue constructor preserves the original CDATA item in a InfoItemNode.

dm_char2string : CDATAItem -> StringValue
fun dm_char2string(c) {
  /* convert (psv_cdata_code c) to string of length 1 */
  stringValue(code2string(psv_cdata_code c), Def_string), 
              [infoItemNode (ref c)])
}

The function dm_collapse_strings synthesizes a single string value from multiple CDATA items. It does this by collapsing one or more consecutive StringValues in its argument list, each of which must have the simple schema type string, into one StringValue whose content is the concatenation of the values of the consecutive StringValues. All other node values are returned unchanged. This function assumes that the character information items for an atomic value (e.g., string, integer, floating-point number) are not interleaved with other information items (e.g., PIs or comments). If the character information items for an atomic value are interleaved with other information items (e.g., PIs or comments), those information items are not preserved in the data model.

dm_collapse_strings : [ Node ] -> [ Node ] 
fun dm_collapse_strings(nodelist) {
  case nodelist of : 
    []     => return [] 
  | [ n ]  => return [ n ]
  | h1 as StringValue(s1, Def_string, i1)) :: t1 =>
    { case t1 of :
        []   =>   return [ h1 ] 
        /* Collapse two consecutive strings and 
           apply dm_collapse_strings recursively */
      | StringValue(s2, Def_string, i2)) :: t2 =>
          return dm_collapse_strings(
                   ref(stringValue(concat_string(s1, s2), 
                                   Def_string), 
                                   list_append(i1, i2))) :: t2)
      | h :: t => return (h :: (dm_collapse_strings t))
    }
}

A.5 Attribute items

The following accessors are defined for attribute information items:

/* core */
psv_attr_children     : AttrItem -> [ CDATAItem ]
psv_attr_uri          : AttrItem -> uriReference
psv_attr_local_name   : AttrItem -> string 
psv_attr_owner_elem   : AttrItem -> ElemItem

/* Infoset item for schema type */
psv_attr_schema_type  : AttrItem -> ElemItem

/* peripheral */
psv_attr_type         : AttrItem -> string
psv_attr_spec_flag    : AttrItem -> SpecifiedFlag
psv_attr_default      : AttrItem -> [ CDATAItem ] 
psv_attr_start_entity : AttrItem -> EntityMarker
psv_attr_end_entity   : AttrItem -> EntityMarker

The function dm_attrNode maps an AttrItem to an AttrNode:

dm_attrNode : AttrItem -> AttrNode 
fun dm_attrNode(a) {
   name = qnameValue(psv_attr_uri a, psv_attr_local_name a, Def_QName))
   /* An attribute value should be one string */
   value = head(dm_collapse_strings(list_map dm_node (psv_attr_children a)))
   type = dm_schema_type(psv_attr_schema_type a)
   /* convert value to appropriate type */
   return attrNode(name, (dm_coerce_string type value))
}

A.6 Namespace items

The following accessors are defined for namespaces:

/* core */ 
psv_ns_prefix        : NSItem -> string
psv_ns_uri           : NSItem -> uriReference
psv_ns_value         : NSItem -> [ CDATAItem | EntityMarker ]
psv_ns_owner_elem    : NSItem -> ElemItem

The function dm_nsNode maps an NSItem to an NSNode:

dm_nsNode : NSItem -> NSNode 
fun dm_nsNode(i) {
   /* what happens to NS node's CDATA children? */
   return nsNode(stringValue(psv_ns_prefix i, Def_string), []), 
                 urirefValue(psv_ns_uri i, Def_uriReference)
}

A.7 Processing instruction items

The following accessors are defined on processing-instruction information items:

/* core */
psv_pi_target        : PIItem -> string
psv_pi_value         : PIItem -> string
psv_pi_base_uri      : PIItem -> uriReference
psv_pi_parent        : PIItem 
                     -> ElemItem | DocItem | DocTypeItem

The function dm_piNode maps an PIItem to an PINode:

dm_piNode : PIItem -> PINode 
fun dm_piNode(i) {
  return piNode(stringValue(psv_pi_target i, Def_NCName, []),
                stringValue(psv_pi_value i, Def_string, []))
}

A.8 Comment items

The following accessor is defined on comment information items:

/* core */
psv_comm_value        : CommentItem -> string
psv_comm_parent       : CommentItem 
                      -> ElemItem | DocItem | DocTypeItem

The function dm_commNode maps an CommentItem to an CommentNode:

dm_commNode : CommentItem -> CommentNode 
fun dm_commNode(i) {
  return  commNode(stringValue(psv_comm_value i, Def_string, []))
}

A.9 Other information items

The data model represents document type, skipped entity reference, entity markers, and character data markers as instances of the opaque data type InfoItemNode. For completeness, the following accessors are defined for these information items.

/* Skip Entity Item : core */ 
psv_skip_entity_item : SkipEntityItem -> string
psv_skip_entity_ref  : SkipEntityItem -> EntityItem
psv_skip_entity_parent : SkipEntityItem -> ElemItem

/* CDATA Item : core */
psv_cdata_code       : CDATAItem -> Code
psv_cdata_parent     : CDATAItem -> ElemItem
/* CDATA Item : peripheral */ 
psv_cdata_whitespace : CDATAItem -> boolean
psv_cdata_predefined : CDATAItem -> boolean

/* Document Type Item :  peripheral */
psv_doctype_entity   : DocTypeItem -> EntityItem
psv_doctype_children : DocTypeItem -> [ CommentItem | PIItem ]
psv_doctype_parent   : DocTypeItem -> DocItem

/* Entity Item : core */
psv_entity_name      : EntityItem -> string
psv_entity_sysid     : EntityItem -> string
psv_entity_pubid     : EntityItem -> string
psv_entity_notat     : EntityItem -> NotationItem
psv_entity_base_uri  : EntityItem -> uriReference
 
/* Entity Item : peripheral */
psv_entity_type      : EntityItem -> string
psv_entity_content   : EntityItem -> string
psv_entity_encoding  : EntityItem -> string
psv_entity_standalone: EntityItem -> string /* "yes", "no", "notpresent" */

/* Entity Marker : core */
psv_entity_marker_entity : EntityMarker -> EntityItem
psv_entity_marker_parent : DocTypeItem 
                         -> ElemItem | AttrItem | NSItem

/* Notation Item : core */
psv_notation_name    : NotationItem -> string
psv_notation_sysid   : NotationItem -> string
psv_notation_pubid   : NotationItem -> string
psv_notation_base_uri: NotationItem -> uriReference

/* CDATA Marker : core */
psv_cdata_marker_parent : CDataMarker 
                        -> ElemItem | AttrItem | NSItem

B Candidate Operators for XML Query Algebra

There is a close correspondence between the data model and the algebraic operations that are applied to instances of the data model. In this appendix, we identify those algebraic operators that are candidate operators for the algebra.

B.1 Serialization operators

In the XPath data model, every kind of node has an associated string value, i.e., a string serialization of the node's value. The data model already provides an operator for serializing a ValueNode as a string. The algebra also may include one or more operators to serialize a Node as a string:

serialize : Node -> StringValue

One serialization operator may emit a node value in [Canonical XML]. Other operators may choose other serializations.

B.2 Order operators

Given an instance of the data model, an ordering relation on its nodes can be defined. For example, document order, i.e., the absolute order of nodes in an input document, is a property that a user of the XML Query Language may want to query.

Given a single input document, a global, total order can be defined on nodes. In general, however, it may not be possible to define a global order, for example when a data model instance contains nodes from multiple input documents. Therefore, it may be useful and necessary to have multiple ordering relations, e.g., one that is global and one that is partial:

globalOrder  : (Instance, Node) -> IntegerValue
partialOrder : (Instance, Node) -> IntegerValue

Order operators must be defined over a particular instance of the data model. The functions return an integer order value given an instance of the data model and a node contained in that instance.

B.3 Accessors for reference values

Although reference values are often contained in attribute nodes, they typically represent element-to-element relationships. For convenience, the algebra may define additional accessor functions that produce the references associated with a particular element.

For example, the idrefs function takes an ElemNode and returns a list of references to element nodes that are referenced by IDREF values in the element's attribute and children components; this accessor is defined in terms of existing accessors:

idrefs : ElemNode -> [ ElemNode ]
fun idrefs(n) {
  list_append(
    list_flatten [ referent(v) | a <- list(attributes(n)), v = value(a), 
                isIDREFValue(v) ],
    list_flatten [ referent(c) | c <- children(n), isIDREFValue(c) ])
   )
}

The function is defined using a list comprehension, which is based on a familiar notation for sets. The second expression above:

[ referent(c) | c <- children(n), isIDREFValue(c) ]

is read as follows: return the list of the referent values of each node c such that c is a child of n and c is a IDREFValue. Since referent produces a list, the result of this expression is a list of lists: list_flatten takes a list of lists and returns the flattened list, and list_append concatenates the two lists into one. The attributes accessor returns an unordered, not an ordered, forest so the list function converts (non-deterministically) an ordered forest to an ordered one.

The links accessor is analogous to idrefs, but returns non-null URI values that occur in the element's attributes and children components; it is defined as:

links : ElemNode -> [ ElemNode ]
fun links(n) {
  list_append(
    list_flatten [ referent(v) | a <- list(attributes(n)), v = value(a), 
                isURIRefValue(v), not(referent(v) = NaR) ],
    list_flatten [ referent(v) | c <- children(n), isURIRefValue(c),
                not(referent(c) = NaR)])
}

C Issues

The issues in [C Issues] serve as a design history for this document. The ordering of issues is irrelevant. Each issue has a unique id of the form Issue-<dddd> (where d is a digit). This can be used for referring to the issue by <url-of-this-document>#Issue-<dddd>. Furthermore, each issue has a mnemonic header, a date, an optional description, and an optional resolution. For convenience, resolved issues are displayed in green. Some of the issues contain references to W3C internal archives. These are marked with "W3C-members only". Some of the descriptions of the resolved issues are obsolete w.r.t. to the current version of the document.

C.1 Issues

Issue-0001: PSV Infoset identity constraints

Date: Oct-2000

What should be data-model representation, if any, of PSV Infoset identity-constraint tables?

Issue-0002: Representation of atomic values

Date: Oct-2000

This function assumes that the character information items for an atomic value (e.g., string, integer, floating-point number) are not interleaved with other information items (e.g., PIs or comments). The treatment of such interleaved values is not handled in this definition. This issue is addressed in threads beginning at: http://lists.w3.org/Archives/Member/w3c-archive/2000Jun/0090.html (W3C-members only) and http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Sep/0079.html (W3C-members only).

MF: The data model does not preserve information items interleaved with the character info items of an atomic value.

Issue-0003: Example parent

Date: Oct-2000

Remark Michael: An IDREF cannot point to an empty string.

Issue-0004: Schema/DTD

Date: Oct-2000

A document may refer to a DTD and have an associated schema.

Issue-0005: Lists of Simple Values

Date: Oct-2000

The current data model draft takes only into account singleton value-nodes. It must represent lists of simple-type values as well. See http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000May/0060.html.(W3C-members only)

Peter suggests having a special-purpose kind of a ValueNode that represents lists of simple types. An advantage of this approach is that the constraint that lists of simple types be homogeneous/monomorphic can be enforced. However, lists/forests already can be modeled in current data model, without adding more complexity. For example, an attribute's value could be modeled as a list of ValueNodes:

value : AttrNode -> [ ValueNode ]

A disadvantage of this approach is that the monomorphism constraint on lists derived from simple types is not enforced. However, given a type system for Query, such a constraint could be enforced. So Mary is in favor of not having a special-purpose kind of ValueNode to represent lists, but instead model them by forests directly in the data model.

Issue-0006: Collections

Date: Oct-2000

We need a more thorough definition of collections, perhaps in a separate section, which includes bags and defines collections formally.

In particular, the algebra (probably) will not support arbitrarily nested collections (i.e., lists of lists, sets of sets, etc.). We need to specify how collections are constructed. For example, in the data model, the basic collection is a forest, i.e., a list of Nodes. The forest constructor creates a singleton forest from one Node; or it creates a forest from two forests by concatenating the two forests:

Forest = Node | [ Node ]

forest : Node -> Forest
fun forest(Node n) =  [ n ]

union : (Forest, Forest) -> Forest
fun union(f1, f2) = list_append f1 f2

bagunion : (NodeBag, NodeBag) -> NodeBag
setunion : (NodeSet, NodeSet) -> NodeSet

Similar constructors would exist for unordered forests with and without duplicates.

unordered : Forest -> NodeBag unique : NodeBag -> NodeSet set = unique o unordered : Forest -> NodeSet

Added section on Collections [4 Collections].

Issue-0007: ValueNodes

Date: Oct-2000

An alternative representation is to have a single ValueNode whose base type is string:

valueNode : (string, Def_S, [ InfoItemNode]) -> ValueNode

This representation is more closely aligned with other node types in the data model, but it makes the simple type of leaf-node values opaque.

Peter Fankhauser compares and constrasts these options in : http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Apr/0174.html (W3C-members only)

Issue-0008: Node vs edge centric data model

Date: Oct-2000

Cite:

Let me summarize my issues with a node-centric datamodel right at the beginning. The first two are mentioned in the doc later on:

As long as (1) the data represents a tree, (2) easy bi-directional is not required, (3) projection/extension operations with object-preserving semantics are not required, a node-centric datamodel is isomorphic to an edge-centric datamodel and is easier to represent and understand.

As soon as anyone of the above requirements change, an edge model has several advantages:

data represents a graph: naming the edges (relationships) becomes a must, since the names are now on the relationships and not on the objects. Uniform treatment of all edges (even the so far anonymous containment edges) makes defining operations easier since they are more orthogonal. With the possibility of distinguishing "type" from "name", even subelement names now semantically represent relationship names. For example, ShipAddr and BillAddr in
<Order> <ShipAddr dt:dt="Address">...</ShipAddr> <BillAddr dt:dt="Address">...</BillAddr> </Order>
are denoting relationships (ownership to be exact) from the order element to the Address elements.
As soon as backwards pointers are introduced into a node-centric model, the representation becomes more complex and less elegant. Transforming data becomes more complex since the backwards pointer becomes part of the object state. Thus, if I define views where an element changes the parent, in the edge-centric case, this just adds a new relationship, the object state is unchanged, in the node-centric approach, I need to express now two parents in the object state.
Projection/extension operations. Assume that I pose a query that projects name and address but hides the age of a person element. In the edge-centric approach, this means that the query logically transforms the graph context on which the query operates by removing the age edge from the context without touching the object state (the objects keeps its basetype), in the node-centric approach, the object state needs to change since the context transformation will remove the attribute property age. While both operations transform the context, I find the former to be more elegant than the later.

MF: To align with XPath and the Algebra, the data model is node centric.

Issue-0009: Schema info

Date: Oct-2000

Cite: Sometimes one wants to use different schemata over the same basic XML fragment. So I would rather start with that in principle, the data model is schemaless and can provide the data model of any XML fragment given a schema. Thus, the schema postprocessing becomes a datamodel transformation that we make explicit (and that could be optimized with other operations that transform the datamodel graph).

Issue-0010: Node identity

Date: Oct-2000

Should the data model require that an implementation guarantee that the identity of a node is always preserved?

MF: The data model always preserves node identity; the only operator that does not preserve node identity is copy.

Issue-0011: Access to facets

Date: Oct-2000

In XML Schema, facets such as ``nullable'' is associated with an element declaration, which is a element name, complex type pair. If the query language needs access to such facets, we may need to replace Def_Type by a reference to the element declaration.

Issue-0012: Representation of reference values

Date: Oct-2000

Cite: The current representation of reference values is too much IDREF(S) centric. I would prefer a more general representation for XLink and the schema (and potentially graph operation) introduced reference mechanisms.

Issue-0013: Equality operators on collections

Date: 17-Jan-2001

Equality operators '=' on collections are not defined.

Issue-0014: Elements with unordered children

Date: 17-Jan-2001

Should the element constructor elemNode also permit unordered forests of children?

Issue-0015: Semantics of value equality operator '='

Date: 02-Feb-2001

The semantics of the value equality operator '=' is undefined

C.2 Alphabetic list of issues

C.2.1 Open Issues

Number: 11

[Issue-0011: Access to facets]
[Issue-0014: Elements with unordered children]
[Issue-0013: Equality operators on collections]
[Issue-0003: Example parent ]
[Issue-0005: Lists of Simple Values]
[Issue-0001: PSV Infoset identity constraints]
[Issue-0012: Representation of reference values]
[Issue-0004: Schema/DTD]
[Issue-0009: Schema info]
[Issue-0015: Semantics of value equality operator '=']
[Issue-0007: ValueNodes]

C.2.2 Resolved (or redundant) Issues

Number: 4

[Issue-0006: Collections]
[Issue-0010: Node identity]
[Issue-0008: Node vs edge centric data model]
[Issue-0002: Representation of atomic values ]

NodeType
SimpleSchemaType
UnitType	::=	NodeType	the union of node types and schema types
	\|	SimpleSchemaType
DataModelType	::=	UnitType	a node type or schema type
	\|	[ UnitType ]	an ordered forest of values with type UnitType
	\|	{ UnitType }	an unordered forest of values with type UnitType
	\|	UnitType1 \| ... \| UnitTypeN	the disjoint union of values with types UnitType1,..., UnitTypeN
	\|	TypeName	a type name

XML Query Data Model

W3C Working Draft 15 February 2001

Abstract

Status of this document

Revision Description

Table of contents

Appendices

1 Introduction

2 Concepts

2.1 Types and signatures

2.1.1 Schema types

2.2 Post schema-validated Infoset

3 Nodes

3.1 Data-model instance

3.2 Document

3.3 Elements

3.4 Attributes

3.5 Namespaces

3.6 Processing instructions

3.7 Comments

3.8 Reference node

3.9 Values

3.10 Informationl items

4 Collections

4.1 Ordered forests

4.2 Unordered forests

5 Operators and functions on data-model values

6 Example

7 Constraints on the data model

8 References

A XML Information Set mapping

A.1 Notation and pseudo-code syntax

A.2 Information items

A.3 Document item

A.4 Element items

A.5 Attribute items

A.6 Namespace items

A.7 Processing instruction items

A.8 Comment items

A.9 Other information items

B Candidate Operators for XML Query Algebra

B.1 Serialization operators

B.2 Order operators

B.3 Accessors for reference values

C Issues

C.1 Issues

C.2 Alphabetic list of issues

C.2.1 Open Issues

C.2.2 Resolved (or redundant) Issues