W3C

The XML Query Algebra

W3C Working Draft 15 February 2001

This version:
http://www.w3.org/TR/2001/WD-query-algebra-20010215
Latest version:
http://www.w3.org/TR/query-algebra/
Previous version:
http://www.w3.org/TR/2000/WD-query-algebra-20001204
Editors:
Peter Fankhauser (GMD-IPSI) <fankhaus@darmstadt.gmd.de>
Mary Fernández (AT&T Labs - Research) <mff@research.att.com>
Ashok Malhotra (IBM) <petsa@us.ibm.com>
Michael Rys (Microsoft) <mrys@microsoft.com>
Jérôme Siméon (Bell Labs, Lucent Technologies) <simeon@research.bell-labs.com>
Philip Wadler (Avaya) <wadler@avaya.com>

Abstract

This document introduces the XML Query Algebra (``the Algebra'') as a formal basis for an XML query language.

Status of this document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. The latest status of this document series is maintained at the W3C.

This is a Public Working Draft for review by W3C Members and other interested parties. It is a draft document and may be updated, replaced or made obsolete by other documents at any time. It is inappropriate to use W3C Working Drafts as reference material or to cite them as other than "work in progress". This is work in progress and does not imply endorsement by the W3C membership.

This document has been produced as part of the W3C XML Activity, following the procedures set out for the W3C Process. The document has been written by the XML Query Working Group.

The purpose of this document is to present the current state of the XML Query Algebra and to elicit feedback on its current state. The XML Query Working Group feels that it has made good progress on this document but that it is subject to change in future versions. Comments on this document should be sent to the W3C mailing list www-xml-query-comments@w3.org (archived at http://lists.w3.org/Archives/Public/www-xml-query-comments/). Important issues remain open - see [B.3.1 Open Issues]. In particular, the reader should note the following issues related to compatibility of the Algebra with related XML activities.

The XML Query Working Group is working diligently to achieve compatibility with these XML activities.

A list of current W3C Recommendations and other technical documents can be found at http://www.w3.org/TR/.

Table of contents

1 Introduction
2 The Algebra by Example
    2.1 Data and types
    2.2 Projection
    2.3 Atomic data
    2.4 Iteration
    2.5 Selection
    2.6 Quantification
    2.7 Join
    2.8 Unordered forests
    2.9 Parent and cast operators
    2.10 References and node identity
    2.11 Restructuring and grouping
    2.12 Querying order
    2.13 Sorting
    2.14 Aggregation
    2.15 Expanded names
    2.16 Comments and processing instructions
    2.17 Mixed Content
    2.18 Functions
    2.19 Structural recursion
    2.20 Functions for all well-formed documents
    2.21 Top-level queries and XML results
3 Algebra Components
    3.1 Expanded names
    3.2 Wildcards
    3.3 Atomic types
    3.4 Types : abstract syntax
    3.5 Types : concrete syntax
    3.6 Expressions
    3.7 Operators
    3.8 Built-in functions
4 Static Semantics : Type-Inference Rules
    4.1 Relating data to types
    4.2 Equivalences and subtyping
    4.3 Environments
    4.4 Expressions
    4.5 Operators
    4.6 Built-in functions
    4.7 Typing unordered forests
    4.8 Iteration expressions
    4.9 Match expressions
    4.10 Top-level expressions
5 Dynamic Semantics : Value-Inference Rules
    5.1 Semantic objects
    5.2 Environments
    5.3 Expressions
    5.4 Operators
    5.5 Built-in functions
    5.6 Iteration expressions
    5.7 Match expressions
    5.8 Top-level expressions
6 References

Appendices

A Equivalences
    A.1 Relating projection to iteration
    A.2 Reducible expressions
    A.3 Laws
B Issues
    B.1 Introduction
    B.2 Issues list
    B.3 Alphabetic list of issues
        B.3.1 Open Issues
        B.3.2 Resolved (or redundant) Issues

1 Introduction

This document proposes an algebra for XML Query (``the Algebra''). In this document, ``query-analysis time'' refers to when an Algebra expression is parsed and type checked, that is before the value of the expression is computed, and ``query-evaluation time'' refers to when an Algebra expression is evaluated, that is, when it is reduced to a value. We sometimes use the phrases query-analysis time and ``compile time'' interchangeably as well as the phrases query-evaluation time and ``run time''.

This work builds on long-standing traditions in the database community. In particular, we have been inspired by systems such as SQL, OQL, and nested relational algebra (NRA). We have also been inspired by systems such as Quilt, UnQL, XDuce, XML-QL, XPath, XQL, XSLT, and YaTL. We give citations for all these systems below.

In the database world, it is common to translate a query language into an algebra; this happens in SQL, OQL, and NRA, among others. The purpose of the algebra is twofold. First, an algebra is used to give a semantics for the query language, so the operations of an algebra should be well-defined. Second, an algebra is used to support query optimization, so it should possess a rich set of laws. The Algebra is powerful enough to capture the semantics of several XML query languages, such as XML-QL, XQL, and XQuery. The laws we give include analogues of most of the laws of relational algebra.

It is also common for a query language to exploit schemas or types; this happens in SQL, OQL, and NRA, among others. The purpose of types is twofold. Types can be used to detect certain kinds of errors at query-analysis time and to support query optimization. Given the type of input values, a query applied to those values, and the expected type of the query's output value, the Algebra's type system can detect at query-analysis time if the query's output value has the expected output type.

DTDs and XML Schema can be thought of as providing something like types for XML. The XML Query algebra uses a simple type system that captures the essence of [XML Schema Part 1]. The type system is close to that used in XDuce [HP2000]. On this basis, the XML Query algebra is statically typed. This allows an implementation of the Algebra to determine and check at query-analysis time the output type of a query on documents conforming to an input type. Compare this to an untyped or dynamically typed query language, where each individual output has to be validated against a schema at query-evaluation time, and there is no guarantee that this check will always succeed.

To define the Algebra completely, we present a static semantics and a dynamic semantics. The static semantics is presented as type inference rules, which relate Algebra expressions to types and and specify under what conditions an expression is well typed. The semantics is static, because ill-typed expressions are identified at query-analysis time, i.e., before the query is evaluated. The dynamic, or operational, semantics is presented as value inference rules, which relate Algebra expressions to values. The Algebra's values are defined in the [XML Query Data Model]; for example, they include XML atomic values, attributes, and elements, as well as other values. A dynamic semantics guarantees that every expression can be reduced to a value and may serve as the basis for a query interpreter or compiler.

The document is organized as follows. A tutorial introduction is presented in [2 The Algebra by Example]. After reading the tutorial, the reader will have seen the entire algebra and should be able to write example queries. The grammar for the Algebra's types and expressions are given in [3 Algebra Components]. We give the static typing rules for the Algebra in [4 Static Semantics : Type-Inference Rules] and then the dynamic semantics for the Algebra in [5 Dynamic Semantics : Value-Inference Rules]. These sections formalize the information presented informally in [2 The Algebra by Example]. Although these two sections contain the most challenging material, we have tried to make the content as accessible as possible. Readers only interested in learning about the Algebra need not read these sections, however, we expect that implementors of the Algebra will read them.

In [B.2 Issues list], we discuss open issues and problems. We present some equivalence and optimization laws of the Algebra in [A Equivalences].

Cited literature includes: monads [Mog89], [Mog91], [Wad92], [Wad93], [Wad95], NRA [BNTW95], [Col90], [LW97], [LMW96], OQL [BK93], [BKD90], [CM93], Quilt [Quilt], SQL [Date97], UnQL [BFS00], XDuce [HP2000], XMSL-QL [XMLQL99], XPath [XPath], XQL [XQL99], XSLT [XSLT 99], and YaTL [YAT99].

2 The Algebra by Example

This section introduces the main features of the Algebra, using examples based on accessing a database of books.

2.1 Data and types

Consider the following sample data:

    <bib>
      <book year="1999" isbn="1-55860-622-X">
        <title>Data on the Web</title>
        <author>Abiteboul</author>
        <author>Buneman</author>
        <author>Suciu</author>
      </book>
      <book year="2001" isbn="1-XXXXX-YYY-Z">
        <title>XML Query</title>
        <author>Fernandez</author>
        <author>Suciu</author>
      </book>
    </bib>

Here is a fragment of an XML Schema for such data:

    <xsd:group name="Bib">
      <xsd:element name="bib">
        <xsd:complexType>
          <xsd:group ref="Book"
                     minOccurs="0" maxOccurs="unbounded"/>
        </xsd:complexType>
      </xsd:element>
    </xsd:group>
    
    <xsd:group name="Book">
      <xsd:element name="book">
        <xsd:complexType>
          <xsd:attribute name="year" type="xsd:integer"/>
          <xsd:attribute name="isbn" type="xsd:string"/>
          <xsd:element name="title" type="xsd:string"/>
          <xsd:element name="author" type="xsd:string" maxOccurs="unbounded"/>
        </xsd:complexType>
      </xsd:element>
    </xsd:group>

This data and schema is represented in the Algebra as follows:

    type Bib =
      bib [ Book{0, *} ]
    type Book =
      book [
        @year  [ Integer ] &
        @isbn  [ String ],
        title  [ String ], 
        author [ String ]{1, *}
      ]
    let bib0 : Bib =
      bib [
        book [
          @year  [ 1999 ], 
          @isbn  [ "1-55860-622-X" ], 
          title  [ "Data on the Web" ], 
          author [ "Abiteboul" ], 
          author [ "Buneman" ], 
          author [ "Suciu" ]
        ], 
        book [
          @year  [ 2001 ], 
          @isbn  [ "1-XXXXX-YYY-Z" ], 
          title  [ "XML Query" ], 
          author [ "Fernandez" ], 
          author [ "Suciu" ]
        ]
      ] 

The expression above defines two types, Bib and Book, and defines one global variable, bib0.

The Bib type corresponds to a single bib element, which contains a forest of zero or more Book elements. We use the term forest to refer to a sequence of (zero or more) attributes and elements. Every attribute or element can be viewed as a forest of length one.

The Book type corresponds to a single book element, which which also contains a forest of zero or more attributes and elements. It contains one year attribute and one isbn attribute, followed by one title element, followed by one or more author elements. The & operator, called the interleave operator, indicates that the year and isbn attributes may occur in any order. A isbn attribute and a title or author element contains a string value, and a year attribute contains an integer.

The let expression above binds the variable bib0 to a literal XML value, which is the data model representation of the earlier XML document. The variable bib0 is globally available, i.e., it may appear in any other expression in the query. A global variable is defined once. It is an error to redefine a global variable. The Algebra also provides a let expression for binding a local variable. It is introduced in [2.6 Quantification]. The value of a global or local variable is immutable, that is, once a variable is defined, its value does not change. The value of bib0 is a bib element that contains two book elements.

The Algebra is a strongly typed language, therefore the value of bib0 must be an instance of its declared type, or the expression is ill-typed. Here the value of bib0 is an instance of the Bib type, because it contains one bib element, which contains two book elements, each of which contain an integer-valued year attribute, a string-valued isbn attribute, a string-valued title element, and one or more string-valued author elements.

For convenience, we define a second global variable book0 also bound to a literal value, which is equal to the first book in bib0.

    let book0 : Book = 
        book [
          @year  [ 1999 ], 
          @isbn  [ "1-55860-622-X" ], 
          title  [ "Data on the Web" ], 
          author [ "Abiteboul" ], 
          author [ "Buneman" ], 
          author [ "Suciu" ]
        ]

2.2 Projection

One of the Algebra's most basic operations is projection. The Algebra uses a notation similar in syntax and meaning to path navigation in XPath. The following expression returns all author elements contained in book elements contained in bib0:

    bib0/book/author
==> author [ "Abiteboul" ], 
    author [ "Buneman" ], 
    author [ "Suciu" ], 
    author [ "Fernandez" ], 
    author [ "Suciu" ]
:   author [ String ] {0, *}
 

Note that in the result, the document order of author elements is preserved.

The above example and the ones that follow have three parts. First is an expression in the Algebra. Second, following the ==> is the value of this expression. Third, following the : is the type of the expression, which is (of course) also a legal type for the value.

It may be unclear why the type of bib0/book/author contains zero or more authors, even though the type of a book element contains one or more authors. Let's look at the derivation of the result type by looking at the type of each sub-expression:

    bib0             : Bib
    bib0/book        : Book{0, *}
    bib0/book/author : author [ String ]{0, *}
 

Recall that Bib, the type of bib0, may contain zero or more Book elements, therefore the expression bib0/book might contain zero book elements, in which case, bib0/book/author would contain no authors.

This illustrates an important feature of the type system: the type of an expression depends only on the type of its sub-expressions. It also illustrates the difference between an expression's value at query-evaluation time and its type at query-analysis time. Since the type of bib0 is Bib, the best type for bib0/book/author is one listing zero or more authors, even though for the given value of bib0, the expression will always contain exactly five authors.

Its also possible to project on attributes. This expression produces the year attribute of book0 whose type is @year [ String ].

    book0/@year
==> @year [ 1999 ]
:   @year [ String ]
 

2.3 Atomic data

One may access atomic data (strings, integers, or booleans) using the keyword data(). For instance, if we wish to select all author names in a book, rather than all author elements, we could write the following.

    book0/author/data()
==> "Abiteboul",
    "Buneman",
    "Suciu"
:   String {1, *}

Similarly, it is possible to project the atomic values of attributes. The following returns the year the book was published.

    book0/@year/data()
==> 1999
:   Integer

This notation is similar to the use of text() in XPath. We chose the keyword data() because, as the second example shows, not all data items are strings.

2.4 Iteration

Another common operation is to iterate over elements in a document so that their content can be transformed into new content. Here is an example of how to process each book to list the author before the title, and remove the year and isbn.

    for b in bib0/book do
      book [ b/author, b/title ]
==> book [
      author [ "Abiteboul" ], 
      author [ "Buneman" ], 
      author [ "Suciu" ], 
      title  [ "Data on the Web" ]
    ], 
    book [
      author [ "Fernandez" ], 
      author [ "Suciu" ], 
      title  [ "XML Query" ]
    ]
:   book [
      author[ String ]{1, *}, 
      title[ String ]
    ]{0, *} 

The for expression iterates over all book elements in bib0, and binds the variable b to each such element. For each element bound to b, the inner expression constructs a new book element containing the book's authors followed by its title. The transformed elements appear in the same order as they occur in bib0.

In the result type, a book element is guaranteed to contain one or more authors followed by one title. Let's look at the derivation of the result type to see why:

    bib0/book        : Book{0, *}
    b                : Book
    b/author         : author [ String ]{1, *}
    b/title          : title  [ String ]

The type system can determine that b is always Book, therefore the type of b/author is author[ String ]{1, *}, and the type of b/title is title[ String ].

In general, the value of a for expression is a forest, i.e., zero or more data-model values as defined in [XML Query Data Model]. If the body of the for expression itself yields a forest, then all of the forests are concatenated together. For instance, the expression:

    for b in bib0/book do
      b/author

is exactly equivalent to the expression bib0/book/author.

Here we have explained the typing of for expressions by example. In fact, the typing rules are rather subtle and will be explained in detail in [4 Static Semantics : Type-Inference Rules].

2.5 Selection

To select values that satisfy some predicate, we use the where expression. For example, the following expression selects all book elements in bib0 that were published before 2000.

    for b in bib0/book do 
      where b/@year/data() <= 2000 do
        b
==> book [
     @year  [ 1999 ], 
     @isbn  [ "1-55860-622-X" ], 
     title  [ "Data on the Web" ], 
     author [ "Abiteboul" ], 
     author [ "Buneman" ], 
     author [ "Suciu" ]
    ] 
:   Book{0, *} 

In general, an expression of the form:

where e1 do e2

is converted to the form

if e1 then e2 else ()

where e1 and e2 are expressions. Here () is an expression that stands for the empty sequence, a forest that contains no attributes or elements. We also write () for the type of the empty sequence.

According to this rule, the expression above translates to

for b in bib0/book do
  if b/@year/data() <= 2000 then b else () 

and this has the same value and the same type as the preceding expression.

2.6 Quantification

The following expression selects all book elements in bib0 that have some author named "Buneman".

    for b in bib0/book do
      for a in b/author/data() do
        where a = "Buneman" do
          b
==> book [
     @year  [ 1999 ], 
     @isbn  [ "1-55860-622-X" ], 
     title  [ "Data on the Web" ], 
     author [ "Abiteboul" ], 
     author [ "Buneman" ], 
     author [ "Suciu" ]
    ] 
:   Book{0, *}

In contrast, we can use the empty operator to find all books that have no author whose name is Buneman:

    for b in bib0/book do
      where empty(for a in b/author do
                    where a/data() = "Buneman" do
                      a) do
        b
==> book [
      @year  [ 2001 ], 
      @isbn  [ "1-XXXXX-YYY-Z" ], 
      title  [ "XML Query" ], 
      author [ "Fernandez" ], 
      author [ "Suciu" ]
    ]
:   Book{0, *}

The empty expression checks that its argument is the empty sequence ().

We can also use the empty operator to find all books where all the authors are Buneman, by checking that there are no authors that are not Buneman:

    for b in bib0/book do
      where empty(for a in b/author do
                    where a/data() != "Buneman" do
                      a) do
        b
==> ()
:   Book{0, *}

There are no such books, so the result is the empty sequence. Appropriate use of empty (possibly combined with not) can express universally or existentially quantified expressions.

Here is a good place to introduce the let expression, which binds a local variable to a value. Introducing local variables may improve readability. For example, the following expression is exactly equivalent to the previous one.

    for b in bib0/book do
      let nonbunemans = (for a in b/author do
                           where a/data() != "Buneman" do
                             a) do
        where empty(nonbunemans) do
          b

Local variables can also be used to avoid repetition when the same subexpression appears more than once in a query.

The scope of a local variable v is the body of the let expression that follows the keyword do up to any nested let expression that redefines v.

2.7 Join

Another common operation is to join values from one or more documents. To illustrate joins, we give a second data source that defines book reviews:

    type Reviews  =
      reviews [
        book [ 
          title  [ String ], 
          review [ String ]
        ]{0, *}
      ]
    let review0 : Reviews =
      reviews [
        book [
          title  [ "XML Query" ], 
          review [ "A darn fine book." ] 
        ], 
        book [
          title  [ "Data on the Web" ], 
          review [ "This is great!" ] 
        ]
      ] 

The Reviews type contains one reviews element, which contains zero or more book elements; each book contains a title and a review.

We can use nested for expressions to join the two sources review0 and bib0 on title values. The result combines the title, authors, and reviews for each book.

    for b in bib0/book do
      for r in review0/book do
        where b/title/data() = r/title/data() do
          book [ b/title, b/author, r/review ]
==> 
    book [
      title  [ "Data on the Web" ], 
      author [ "Abiteboul" ], 
      author [ "Buneman" ], 
      author [ "Suciu" ]
      review [ "A darn fine book." ] 
    ], 
    book [
      title  [ "XML Query" ], 
      author [ "Fernandez" ], 
      author [ "Suciu" ]
      review [ "This is great!" ] 
    ]
:   book [
      title [ String ], 
      author [ String ] {1, *}, 
      review [ String ]
    ] {0, *}

Note that the outer-most for expression determines the order of the result. Readers familiar with optimization of relational join queries know that relational joins commute, i.e., they can be evaluated in any order. This is not true for the Algebra: changing the order of the first two for expressions would produce different output. In [2.8 Unordered forests], we introduce support for unordered forests, which permits commutable joins.

It is beyond the scope of this document to describe algorithms for evaluating nested loop joins. See [Graefe93] for a survey.

2.8 Unordered forests

As discussed in [2.7 Join] joins do not commute on ordered forests. In databases, ordering often does not matter. To permit commutable joins, and to allow for other query optimization techniques, the Algebra also supports unordered forests. At type level, unordered forests are indicated by {Type}. This type specifies that the order of book elements in the reviews element is insignificant:

    type Reviews1 =
         reviews [{ 
           book [
                title  [String],
                review [String]
              ]{0,*} 
         }]

To transform an ordered forest to an unordered forest, we use the listtobag operator:

    let review1 : Reviews1 =
      reviews [listtobag(review0/*)]
==> reviews [{
      book [
        title  [ "XML Query" ],
        review [ "A darn fine book." ]
      ],
      book [
        title  [ "Data on the Web" ],
        review [ "This is great!" ]
      ]
    }]: Reviews1

This query transforms the ordered forest returned by review0/* into an unordered forest. The order of nodes in the forest is now insignificant, thus review1 could also be bound to:

    let review1 : Reviews1 =
      reviews [listtobag(review0/*)]
==> reviews [{
      book [
        title  [ "Data on the Web" ],
        review [ "This is great!" ]
      ],
      book [
        title  [ "XML Query" ],
        review [ "A darn fine book." ]
      ]
    }]: Reviews1

Likewise, we can transform the ordered forest of books in bib0 to an unordered forest. Now we can perform a commutable join:

    for b in listtobag(bib0/book) do
      for r in review1/book do
         where b/title/data() = r/title/data() do
            book [b/title, b/author, r/review]
==> {book [
       title  [ "Data on the Web" ],
       author [ "Abiteboul" ],
       author [ "Buneman" ],
       author [ "Suciu" ]
       review [ "A darn fine book." ]
     ],
     book [
       title  [ "XML Query" ],
       author [ "Fernandez" ],
       author [ "Suciu" ]
       review [ "This is great!" ]
     ]}
:   {book [
       title [ String ],
       author [ String ] {1, *},
       review [ String ]
     ] {0, *}}

Because both the inputs and the results are unordered forests, the outer for expression and the inner for expression can be exchanged without changing the semantics:

    for r in review1/book do
      for b in listtobag(bib0/book) do
         where b/title/data() = r/title/data() do
            book [b/title, b/author, r/review]

Note that it is not necessary to explicitly transform bib0/book into an unordered forest, whenever one forest of a nested iteration is unordered, all other forests are cast to an unordered forest as well.

Sometimes it is convenient to transform an unordered forest to an ordered forest. The operator bagtolist sorts an unordered forest by document order.

2.9 Parent and cast operators

Many of the previous queries select values and return them or use them in the construction of new values. Once a value is selected, however, the previous queries do not access the original source of the selected value, i.e., the document or hierarchy of elements in which the selected value is contained. It is sometimes useful, however, to access or preserve the original context of selected nodes.

Consider the following example, which contains a new bibliography of articles in bib1:

    type Bib1 = bib [ Article{0, *} ]
    type Article =
      article [
        @year   [ Integer ],
        title   [ String ], 
        journal [ String ], 
        author  [ String ]{1, *}
      ]
    let bib1 : Bib1 =
      bib [
        article [
          @year   [ 2000 ], 
          title   [ "Queries and computation on the web" ],
          journal [ "Theoretical Computer Science" ], 
          author  [ "Abiteboul" ], 
          author  [ "Vianu" ]
        ]
      ] 

Assume there exists a full-text search function, contains, which given a set of documents, selects elements that contain a particular keyword. (This function is not defined in the Algebra, but is used here to illustrate a point.) The details of function application and declaration are given in [2.18 Functions].

The following expression returns those elements in bib0 and bib1 that contain the keyword "Abiteboul". Note that the result type of the expression is AnyTree{0,*}. This is because the contains function cannot know apriori which elements, if any, contain a given keyword.

    for a in contains(bib0,bib1; "Abiteboul") do
      a
==> author [ "Abiteboul" ], author [ "Abiteboul" ]
:   AnyTree{0,*}

The result above does not provide the context in which the two author elements occur. Even if the contains function did return more context, it might be useful to browse the context in which they occurred, for example, by accessing their parent and/or sibling elements. The built-in function parent accesses the parent of an attribute or element. For example, this expression returns more useful information than the previous one:

    for a in contains(bib0,bib1; "Abiteboul") do
      parent(a)
==> book [
      @year  [ 1999 ], 
      @isbn  [ "1-55860-622-X" ], 
      title  [ "Data on the Web" ], 
      author [ "Abiteboul" ], 
      author [ "Buneman" ], 
      author [ "Suciu" ]
    ], 
    article [
      @year   [ 2000 ], 
      title   [ "Queries and computation on the web" ],
      journal [ "Theoretical Computer Science" ], 
      author  [ "Abiteboul" ], 
      author  [ "Vianu" ]
    ]
:   AnyElement{0,*}

Note that the result type of the expression is AnyElement{0,*}. When applied to an attribute or element value, the parent function always has return type AnyElement{0,1}. This is because the Algebra's type system only preserves type information about an attribute or element's content, not about its containing parent.

It is possible to recover more precise type information with the dynamic or run-time cast operator, which attempts to cast at run time an expression to a given type. If the expression does not have the given type, a run-time error is raised. Dynamic casts are necessary when it not possible to determine at query-analysis time the most precise type of a value; they are sometimes called ``down casts''.

For example, the use of parent in the following expression loses some useful type information, that is that p is a Book. We can recover more precise information by casting p to the Book type:

    for p in parent(book0/title) do
      cast p : Book
==> book [
      @year  [ 1999 ], 
      @isbn  [ "1-55860-622-X" ], 
      title  [ "Data on the Web" ], 
      author [ "Abiteboul" ], 
      author [ "Buneman" ], 
      author [ "Suciu" ]
    ]
:   Book{0,1}

The result type is Book{0,1} because the parent function has type AnyElement{0,1}. If we try erroneously to cast p to an Article, the error value is returned. Its type is Ø, the empty choice:

    for p in parent(book0/title) do
      cast p : Article
==> error
:   0

We have already seen many examples of static or compile-time casting. A static cast permits the type of an expression to be changed and checked at query-analysis time; they are sometimes called ``up casts''. For example, consider the type Book0, which permits a book to have zero or more authors.

    type Book0 =
      book [
        @year  [ Integer ] &
        @isbn  [ String ],
        title  [ String ], 
        author [ String ]{0, *}
      ]

The explicit-type expression e : t statically casts a value to the given type. For example, the expression below statically casts book0 to Book0; this is permissible because the type of book0 at query-analysis time is a sub-type of Book0.

    book0 : Book0
==> book [
      @year  [ 1999 ], 
      @isbn  [ "1-55860-622-X" ], 
      title  [ "Data on the Web" ], 
      author [ "Abiteboul" ], 
      author [ "Buneman" ], 
      author [ "Suciu" ]
    ]
:   Book0

If we try erroneously to cast book0 to a more precise type (e.g., a book with 4 or more authors), a type error will occur at query-analysis time.

2.10 References and node identity

The uses of the parent operator in [2.9 Parent and cast operators] show that it is possible to access the original context of document nodes. This is possible because the XML Query Data Model supports node identity, that is, every instance of a node (e.g., element, attribute, processing instruction, and comment) in the data model has a unique identity. We can compare the identity of two nodes for equality using the == operator. For example, in the following expression, two distinct element nodes are created and bound to variables a1 and a2. Although the two nodes are structurally equal, their identities are not equal:

   
    let a1 = author [ "Suciu" ] do 
      let a2 = author [ "Suciu" ] do 
        a1 == a2
==> false
:   Boolean

In general, all the Algebra's operators preserve node identity. There is one exception: the element constructor, which given a tag name and a forest of children nodes, constructs a new element. A new element's content does not refer directly to the given children nodes, but to copies of these nodes. For example, the following expression constructs an element with name newbook and content book0/author, book0/title:

    let book1 = newbook [ book0/author, book0/title ] do
      book1
==> newbook [
      author [ "Abiteboul" ], 
      author [ "Buneman" ], 
      author [ "Suciu" ], 
      title  [ "Data on the Web" ]
    ]
:   book [
      author[ String ]{1, *}, 
      title[ String ]
    ] 

The newbook element contains copies of the nodes in the forest book0/author, book0/title, not the original nodes in book0. Copying guarantees that a node is always the parent of its child nodes and a node is always the child of its parent; these constraints are invariants of [XML Query Data Model]. For example, we would expect that the following expression is always true:

    parent(book1/author) == book1 
If the element constructor did not copy its arguments, anomalies such as the following could occur:
    parent(book1/author) == book0 
that is, the parent of book1's child node is not book1, and this would violate the XML Query Data Model's parent-child invariant.

Sometimes it is useful to construct elements that do preserve the identity of its child nodes, for example, when constructing a view of one or more XML documents. In this case, we want the new element to contain references to, not copies of, the original nodes. The ref operator constructs a reference to a node. For example, book2 below contains references to the nodes in book0:

    let book2 = 
      newbook [ for a in book0/author do ref(a), 
                ref(book0/title) ] do
        book2
==> newbook [
      ref(author [ "Abiteboul" ]), 
      ref(author [ "Buneman" ]), 
      ref(author [ "Suciu" ]), 
      ref(title  [ "Data on the Web" ])
    ]
:   newbook [
      ref(author[ String ]){1, *}, 
      ref(title[ String ])
    ] 

Note that the type of the expression contains reference types. The deref operator dereferences a reference value. In the following, it returns the elements in book0.

    for v in book2/* do deref(v)
==> author [ "Abiteboul" ], 
    author [ "Buneman" ], 
    author [ "Suciu" ],
    title  [ "Data on the Web" ]
:   author[ String ]{1, *}, 
    title[ String ] 

For convenience, the expression above can be also be written as:

    book2/deref() 

2.11 Restructuring and grouping

Often it is useful to regroup elements in an XML document. For example, each book element in bib0 groups one title with multiple authors. This expression groups each author with the titles of his/her publications.

    for a in distinct_value(bib0/book/author/data()) do
      biblio [
        author[a], 
        for b in bib0/book do
          for a2 in b/author/data() do
            where a = a2 do
              b/title
      ]
==> biblio [
      author [ "Abiteboul" ], 
      title  [ "Data on the Web" ]
    ], 
    biblio [
      author [ "Buneman" ], 
      title  [ "Data on the Web" ]
    ], 
    biblio [
      author [ "Suciu" ], 
      title  [ "Data on the Web" ], 
      title  [ "XML Query" ]
    ], 
    biblio [
      author [ "Fernandez" ], 
      title  [ "XML Query" ]
    ]
:   biblio [
      author [ String ], 
      title  [ String ] {0, *}
    ] {0, *}

Readers may recognize this expression as a self-join of books on authors. The expression distinct_value(bib0/book/author/data()) produces a forest of author names whose values are all distinct. The outer for expression binds a to the name of each author element, and the inner for expression selects the title of each book that has some author whose name equals a.

Here distinct_value is an example of a built-in function. It produces a forest of nodes whose values are all distinct, i.e., there are no duplicate values; the order of the resulting forest is not defined. The builtin function distinct_node produces a forest of nodes whose identities are all distinct.

The type of the result expression may seem surprising: each biblio element may contain zero or more title elements, even though in bib0 every author co-occurs with a title. Recognizing such a constraint is outside the scope of the type system, so the resulting type is not as precise as we would like.

2.12 Querying order

Often it is useful to query the order of elements in an ordered forest or a document. There are two kinds of order among elements: local order and document (or global) order. The Algebra supports querying of local and global order.

Local order refers to the order among sibling elements in an ordered forest. To query local order, the index function pairs an integer index with each element in an ordered forest:

    index(book0/author)
==> pair [ fst [ 1 ], snd [ author [ ref("Abiteboul") ] ] ],
    pair [ fst [ 2 ], snd [ author [ ref("Buneman") ] ] ],
    pair [ fst [ 3 ], snd [ author [ ref("Suciu") ] ] ]
:   pair [ fst [ Integer ], snd [ ref(author [ String ]) ] ] {1, *}

The index function uses reference in order to preserve node identity when accessing local order. Note that the result type takes into account that at least one pair exists in the result, as book0/author always contains one or more authors.

Once we have paired authors with an integer index, we can select the first two authors:

    for p in index(book0/author) do
      where (p/fst/data() <= 2) do
        p/snd/deref()
==> author [ "Abiteboul" ],
    author [ "Buneman" ]
:   author [ String ] {0, *}

The for expression iterates over all pair elements produced by the index expression. It selects elements whose index value in the fst element is between one and two inclusive, and it returns the original content by dereferencing the content of the snd element.

The result type may be surprising, because the Book type guarantees that each book has at least one author. However, the type system cannot determine that the conditional where expression will always succeed, so the inner expression may produce zero results. (A sophisticated analysis might improve type precision, but is likely to require much work for little benefit.)

Document (or global) order refers to the total order among all elements in a document. Global order is defined as the order of appearance of the element nodes when performing a pre-order, depth-first traveral of a tree. This corresponds to the order of appearance of their opening tags in the XML serialization. This is equivalent to the definition used in [XPath].

To query global order, the < operator can be applied between two element nodes. It returns true if the first node is before (and different from) the second node in document order. It returns false if the first node is equal to or after the second node in document order. It raises an error if the nodes are in different documents. For example, the nodes bib0 and review0 are unrelated therefore comparing their order raises an error:

    bib0 < review0 
==> error
:   0

The > operator, defined similiarly, can be derived using the < operator and the node equality operator ==.

Using global order, the following expression returns all author nodes appearing after a book written in 2001:

    for b in bib0/book do
      where b/@year/data() = 2001 do
        for a in bib0/book/author do
          where b < a do a
==> author[ "Fernandez" ],
    author[ "Suciu" ]
:   author[ String ]{0,*}

Note that the root element of a document is before any other element. More generally, an element is before all of its children. For example, the set of elements that are before bib0 is empty:

    
    empty(for b in bib0/book do
            where bib0 > b do 
              b)
==> true
:   Boolean

The Algebra supports global order only for elements within the same document. Support for global order among elements in distinct documents is discussed in [Issue-0003:  Global Order].

2.13 Sorting

To sort a forest, the Algebra provides a sort expression, whose form is: sort Var in Exp1 by Exp2. The variable Var ranges over the items in the forest Exp1 and sorts those items using the key value defined by Exp2. For example, this expression sorts book elements in review0 by their titles. Note that the variable b appears in the key expression b/title.

    sort b in review0/book by b/title
==> book [ title  [ "Data on the Web" ], 
           review [ "This is great!" ] ], 
    book [ title  [ "XML Query" ], 
           review [ "This is pretty good too!" ] ]
:   book [ title [ String ], review [ String ] ] ] {0, *}

The sort expression is a restricted form of higher-order function, i.e., it takes a function as an argument. In this case, sort takes a single function, which extracts the key value from each element. The sort expression requires that the less-than inequality, <, be defined for the type of Var.

2.14 Aggregation

We have already seen several several built-in functions: index, distinct_value, and sort. In addition to these functions, the Algebra has five built-in aggregation functions: avg, count, max, min, and sum.

This expression selects books that have more than two authors:

    for b in bib0/book do
      where count(b/author) > 2 do
        b
==> book [
      @year  [ 1999 ], 
      @isbn  [ "1-55860-622-X" ], 
      title  [ "Data on the Web" ], 
      author [ "Abiteboul" ], 
      author [ "Buneman" ], 
      author [ "Suciu" ]
    ]
:   Book{0, *}

All the aggregation functions take a forest with repetition type and return an integer value; count returns the number of elements in the forest.

2.15 Expanded names

So far, all our examples of attributes and elements use unqualified local names, i.e., names that do not include an explicit namespace URI. It is also possible to specify and match on the expanded name of an attribute or element. The expanded name {Namespace}LocalName consists of a namespace URI Namespace and a local name LocalName.

Consider an inventory of books that contains data from both http://www.BooksRUs.com and http://www.cheapBooks.com. In this example, the first element contains values whose names are defined in the BooksRUs.com namespace, and the second element contains values whose names are defined in the cheapBooks.com namespace:

    let inventory = 
      inv [ 
        {"http://www.BooksRUs.com/books.xsd"}book [
          @year [ 1999 ], 
          @isbn [ "1-55860-622-X" ],
          {"http://www.BooksRUs.com/books.xsd"}title  [ "Data on the Web" ], 
          {"http://www.BooksRUs.com/books.xsd"}author [ "Abiteboul" ], 
          {"http://www.BooksRUs.com/books.xsd"}author [ "Buneman" ], 
          {"http://www.BooksRUs.com/books.xsd"}author [ "Suciu" ]
        ], 
        {"http://www.cheapBooks.com/ourschema.xsd"}book [
          @year [ 2001 ], 
          {"http://www.cheapBooks.com/ourschema.xsd"}title  [ "XML Query" ], 
          {"http://www.cheapBooks.com/ourschema.xsd"}author [ "Fernandez" ], 
          {"http://www.cheapBooks.com/ourschema.xsd"}author [ "Suciu" ]
          {"http://www.cheapBooks.com/ourschema.xsd"}isbn   [ "1-XXXXX-YYY-Z" ], 
        ]
       ] 
     ] : Inventory
    type Inventory = inv [ InvBook{0, *}]

In this example, elements imported from existing schemas each refer to a single namespace, thus the definition of InvBook is:

    type BooksRUBook = 
      {"http://www.BooksRUs.com/books.xsd"}book [
        @year [ Integer ] &
        @isbn [ String ],
        {"http://www.BooksRUs.com/books.xsd"}title  [ String ], 
        {"http://www.BooksRUs.com/books.xsd"}author [ String ]{1, *}
      ]
    type CheapBooksBook = 
      {"http://www.cheapBooks.com/ourschema.xsd"}book [
        @year  [ Integer ],
        {"http://www.cheapBooks.com/ourschema.xsd"}title  [ String ], 
        {"http://www.cheapBooks.com/ourschema.xsd"}author [ String ]{1, *},
        {"http://www.cheapBooks.com/ourschema.xsd"}isbn   [ String ],
      ]    
    type InvBook = BooksRUBook | CheapBooksBook

Here vertical bar (|) is used to indicate a choice between types: each InvBook is either a BooksRUBook or a CheapBooksBook.

We have already seen how to project on the constant name of an attribute or element. It is also useful to project on wildcards, which are used to match names with any namespace and/or any local name. For example, this expression matches elements with any local name and with namespace URI http://www.BooksRUs.com/books.xsd:

    inventory/{"http://www.BooksRUs.com/books.xsd"}* 
==> {"http://www.BooksRUs.com/books.xsd"}book [
      {"http://www.BooksRUs.com/books.xsd"}title  [ "Data on the Web" ], 
      {"http://www.BooksRUs.com/books.xsd"}author [ "Abiteboul" ], 
      {"http://www.BooksRUs.com/books.xsd"}author [ "Buneman" ], 
      {"http://www.BooksRUs.com/books.xsd"}author [ "Suciu" ]
    ]
:   {"http://www.BooksRUs.com/books.xsd"}book [
      {"http://www.BooksRUs.com/books.xsd"}title  [ String ], 
      {"http://www.BooksRUs.com/books.xsd"}author [ String ] {0, *}, 
    ]

Similarly, this expression first projects elements in any namespace whose local name is book and then projects on their year attributes:

    inventory/{*}book/@{*}year
==> @year  [ 1999 ], 
    @year  [ 2001 ]
:   (@year  [ Integer ] | 
     @year [ Integer ]){0,*} 

The expression Exp/a is shorthand for Exp/{}a, that is, the expanded name with the null namespace. Similarly, * is shorthand for {}*.

2.16 Comments and processing instructions

In an XML document, comments and processing instructions (PIs) may appear anywhere outside other markup[XML]. Processing instructions (PIs) permit documents to contain instructions for applications. Comments and PIs are not part of the document's character data. An XML processor may, but need not, make the text of comments available to an application, but it must pass PIs to the application. The PI begins with a target used to identify the application to which the instruction is directed.

The Algebra supports comments and processing instructions in types and expressions. The type expression pic t denotes a value in which zero or more PIs and comments may be interleaved arbitrarily with the nodes in type t. For example, the two element types BibPIC and BookPIC permit PIs and comments to be interleaved with their content:

    type BibPIC  = bib [ pic BookPIC{0,*} ]
    type BookPIC = 
      book [ 
        @year  [ Integer ] &
        @isbn  [ String ],
        pic    (title  [ String ], author [ String ]{1, *})
      ]
Note that in the book element, the pic operator is only applied to its element content, not its attribute content, because comments and processing instructions may not occur in attributes.

We can construct PI and comment values using the built-in constructors pi and comment:

     let bibpc0 : BibPIC = 
      bib [ 
        comment("Canonical XML Query Algebra example."), 
        bib [
          book [
            @year  [ 1999 ], 
            @isbn  [ "1-55860-622-X" ], 
            comment("First book example"), 
            processing_instruction("Publisher.asp", "publisher=http://www.mkp.com")
            title  [ "Data on the Web" ], 
            author [ "Abiteboul" ], 
            author [ "Buneman" ], 
            author [ "Suciu" ]
          ], 
          book [
            @year  [ 2001 ], 
            @isbn  [ "1-XXXXX-YYY-Z" ], 
            title  [ "XML Query" ], 
            comment("Second book example"), 
            author [ "Fernandez" ], 
            author [ "Suciu" ]
          ]
        ]
      ]

Finally, we can project on processing instructions and comments, in the same way we project on children, attributes, and atomic content:

    
    bibpc0/book/comment()
==> comment("First book example"), comment("Second book example")
:   Comment{0,*}

    bibpc0/book/processing_instruction()
==> processing_instruction("Publisher.asp", "publisher=http://www.mkp.com")
:   PI{0,*}

Comments and processing instructions may be ignored by an XML processor, in which case they would not even be accessible to a query processor. If they are not ignored, however, comments and processing instructions are typed values and are treated like any other value in an algebraic expression.

2.17 Mixed Content

An element type has mixed content when elements of that type may contain character data, optionally interspersed with child elements[XML]. The type expression mixed t denotes a value in which zero or more String values may be interleaved arbitrarily with the nodes in type t. For example, the content of the review element contains a reviewer element, which may be interleaved with string values:

    type ReviewsMixed  =
      reviews [
        book [ 
          title  [ String ], 
          review [ mixed reviewer [ String ] ]
        ]{0, *}
      ]

Here are two examples of mixed content, in which the text of the book review may be interleaved with the name of the reviewer:

    let reviewmix0 : ReviewsMixed =
      reviews [
        book [
          title  [ "XML Query" ], 
          review [ "A darn fine book: ", reviewer [ "XML On-line" ] ] 
        ], 
        book [
          title  [ "Data on the Web" ], 
          review [ "The ", reviewer [ "publisher" ], " says 'This is great!'" ] 
        ]
      ] 

It is often useful to concatenate all the string values of a mixed-content element to recover its complete text value. We use the builtin function string; as defined in XPath [XPath], the string value of a node is determined by its kind, e.g., element, attribute, etc.

    for b in reviewmix0/book do 
      string(b/review)
==> "A darn fine book : XML On-line",
    "The publisher says 'This is great!'" 

2.18 Functions

Functions can make queries more modular and concise. Recall that we used the following query to find all books that do not have "Buneman" as an author.

    for b in bib0/book do
      where empty(for a in b/author do
                    where a/data() = "Buneman" do
                      a) do
        b
==> book [
      @year  [ 2001 ], 
      @isbn  [ "1-XXXXX-YYY-Z" ], 
      title  [ "XML Query" ], 
      author [ "Fernandez" ], 
      author [ "Suciu" ]
    ]
:   Book{0, *}

A different way to formulate this query is to first define a function that takes a string s and a book b as arguments, and returns true if book b does not have an author with name s.

     fun notauthor (s : String; b : Book) : Boolean =
       empty(for a in b/author do
               where a/data() = s do
                 a)

The query can then be re-expressed as follows.

    for b in bib0/book do
      where notauthor("Buneman"; b) do
        b
==> book [
      @year  [ 2001 ], 
      @isbn  [ "1-XXXXX-YYY-Z" ], 
      title  [ "XML Query" ], 
      author [ "Fernandez" ], 
      author [ "Suciu" ]
    ]
:   Book{0, *}

We use semicolon rather than comma to separate function arguments, since comma is used to concatenate forests.

Note that a function declaration includes the types of all its arguments and the type of its result. This is necessary for the type system to guarantee that applications of functions are type correct.

In general, any number of functions may be declared at the top-level. The order of function declarations does not matter, and each function may refer to any other function. Among other things, this allows functions to be recursive (or mutually recursive), which supports structural recursion, the subject of the next section.

Functions make the Algebra extensible. We have seen examples of built-in functions (sort and distinct_value) and examples of user-defined functions (notauthor). In addition to built-in and user-defined functions, the Algebra could support externally defined functions, i.e., functions that are not defined in the Algebra itself, but in some external language. This would make special-purpose implementations of, for example, full-text search functions available in the Algebra. We discuss support for externally defined functions in [Issue-0009:  Externally defined functions].

2.19 Structural recursion

XML documents can be recursive in structure, for example, it is possible to define a part element that directly or indirectly contains other part elements. In the Algebra, we use recursive types to define documents with a recursive structure, and we use recursive functions to process such documents. (We can also use mutually recursive functions for more complex recursive structures.)

For instance, here is a recursive type defining a part hierarchy.

    type Part = Basic | Composite
    type Basic = 
      basic [
        cost [ Integer ]
      ]
    type Composite = 
      composite [
        assembly_cost [ Integer ], 
        subparts [ Part{1, *} ]
      ]

And here is some sample data.

    let part0 : Part =
      composite [
        assembly_cost [ 12 ], 
        subparts [
          composite [
            assembly_cost [ 22 ], 
            subparts [
              basic [ cost [ 33 ] ]
            ]
          ], 
          basic [ cost [ 7 ] ]
        ]
      ]

Here vertical bar (|) is used to indicate a choice between types: each part is either basic (no subparts), and has a cost, or is composite, and includes an assembly cost and subparts.

We might want to translate to a second form, where every part has a total cost and a list of subparts (for a basic part, the list of subparts is empty).

    type Part2 =
      part [
        total_cost [ Integer ], 
        subparts [ Part2{0, *} ]
      ]

Here is a recursive function that performs the desired transformation. It uses a new construct, the match expression.

    fun convert(p : Part) : Part2 =
      match p 
        case b : Basic do
          part [
            total_cost [ b/cost/data() ], 
            subparts []
          ]
        case c : Composite do 
          let s = (for y in c/subparts/*) do 
                     convert(y)) do
            part [
              total_cost [
                q/assembly_cost/data() + 
                  sum(s/total_cost/data())
              ], 
              subparts[ s ]
            ]
        else error

Each branch of the match expression is labeled with a type, Basic or Composite, and with a corresponding variable, b or c. The evaluator checks the type of the value of p at query-evaluation time, also known as run time, and evaluates the corresponding branch. If the first branch is taken then b is bound to the value of p, and the branch returns a new part with total cost the same as the cost of b, and with no subparts. If the second branch is taken, then c is bound to the value of p. The function is recursively applied to each of the subparts of c, giving an ordered forest of new subparts s. The branch returns a new part with total cost computed by adding the assembly cost of c to the sum of the total cost of each subpart in s, and with subparts s.

One might wonder why b and c are required, since they have the same value as p. The reason why is that p, b, and c have different types.

    p : Part
    b : Basic
    c : Composite

The types of b and c are more precise than the type of p, because which branch is taken depends upon the type of value in p.

Applying the query to the given data gives the following result.

    convert(part0)
==> part [
      total_cost [ 74 ], 
      subparts [
        part [
          total_cost [ 55 ], 
          subparts [
            part [
              total_cost [ 33 ], 
              subparts []
            ]
          ]
        ], 
        part [
          total_cost [ 7 ], 
          subparts []
        ]
      ]
    ]

Of course, a match expression may be used in any query, not just in a recursive one.

2.20 Functions for all well-formed documents

Recursive types allow us to define a type that matches any well-formed XML document. This type is called AnyTree :

    type AnyTree        = AnyScalar | AnyElement | AnyAttribute
    type AnySimpleType  = AnyScalar | AnyScalar{0,*}
    type AnyAttribute   = @*[ AnySimpleType ]
    type AnyElement     = *[ AnyComplexType ]
    type AnyComplexType = AnyTree{0, *}
    type AnyType        = AnySimpleType | AnyComplexType

AnyTree stands for any scalar, element, or attribute value. Here AnyScalar is a built-in scalar type. It stands for the most general scalar type, and all other scalar types (like Integer or String) are subtypes of it. AnySimpleType stands for the most general simple type, which is a scalar or a list of scalars. AnyAttribute stands for the most general attribute type. The asterisk (*) is used to indicate a wild-card type, i.e., a type whose name is not known statically. The type AnyAttribute may have any name, and its content must have type AnySimpleType, i.e., it may contain atomic values, but no elements. The AnyElement stands for the most general element type, which may have any name, and its content must be a complex type, which is a repetition of zero or more AnyTrees. Finally, an AnyType is either an AnySimpleType or an AnyComplexType. In other words, any element, attribute, or atomic value has type AnyType.

The use of * is a significant extension to XML Schema, because XML Schema has no type corresponding to *[t], where t is some type other than AnyComplexType. It is not clear that this extension is necessary, since the more restrictive expressiveness of XML Schema wildcards may be adequate.

In particular, our earlier data also has type AnyTree.

    book0 : AnyTree
==> book [
      @year  [ 1999 ], 
      @isbn  [ "1-55860-622-X" ], 
      title  [ "Data on the Web" ], 
      author [ "Abiteboul" ], 
      author [ "Buneman" ], 
      author [ "Suciu" ]
    ]
:   AnyTree

A specific type can be indicated for any expression in the query language, by writing a colon and the type after the expression.

As an example of a function that can be applied to all well-formed documents, we define a recursive function that converts any XML data into HTML. We first give a simplified definition of HTML.

    type HTML_body =
      ( AnyScalar
      | b [ HTML_body ]
      | ul [ li [ HTML_body ] {0, *} ]
      ) {0, *}

An HTML body consists of a sequence of zero or more items, each of which is either a scalar, or a b element, where the content is an HTML body, or an ul element, where the children are li elements, each of which has as content an HTML body.

Now, here is the function that performs the conversion.

    fun html_of_xml( x : AnyTree ) : Html_Body =
      match x 
        case s : AnyScalar do s
        case v1 : AnyAttribute do 
          b [ name(v1) ], 
          ul [ for y in v1/* do li [ html_of_xml(y) ] ],
        case v2 : AnyElement do 
          b [ name(v2) ], 
          ul [ for y in attributes(v2) do li [ html_of_xml(y) ], 
          ul [ for y in v2/* do li [ html_of_xml(y) ] ]
        else error

The first branch of the match expression checks whether the value of x is a subtype of AnyScalar, and if so then s is bound to that value, so if this branch is taken then s is the same as x, but with a more precise type (it must be a scalar, not an element). This branch returns the scalar.

The second branch checks whether the value of x is a subtype of AnyAttribute. As before, v1 is the same as x but with a more precise type (it must be an attribute, not a scalar). This branch returns a b element containing the name of the attribute, and a ul element containing one li element for each value of the attribute. The function is recursively applied to get the content of the li element. The last branch is analogous to the second, but it matches an element instead of an attribute, and it applies html_of_xml to each of the element's attributes and children.

Applying the query to the book element above gives the following result.

    html_of_xml(book0)
==> b [ "book" ], 
    ul [
      li [ b [ "year" ],   ul [ li [ 1999 ] ] ], 
      li [ b [ "isbn" ],   ul [ li [ "1-55860-622-X" ] ] ],  
      li [ b [ "title" ],  ul [ li [ "Data on the Web" ] ] ], 
      li [ b [ "author" ], ul [ li [ "Abiteboul" ] ] ], 
      li [ b [ "author" ], ul [ li [ "Buneman" ] ] ], 
      li [ b [ "author" ], ul [ li [ "Suciu" ] ] ]
    ]
:   HTML_Body

2.21 Top-level queries and XML results

A query consists of a sequence of top-level expressions, or query items, where each query item is either a type declaration, a function declaration, a global variable declaration, or a query expression. The order of query items is immaterial; all type, function, and global variable declarations may be mutually recursive.

Each query expression is evaluated in the environment specified by all of the declarations. (Typically, all of the declarations will precede all of the query expressions, but this is not required.) We have already seen examples of type, function, and global variable declarations. An example of a top-level query is:

    query html_of_xml(book0)

To transform any expression into a top-level query, we simply precede the expression by the query keyword. The result of a top-level query can be serialized into an XML document by the application in which the Algebra is used.

3 Algebra Components

In this section, we summarize the Algebra and present the grammars for types and expressions. We start by defining the non-terminal and terminal symbols that appear in the type and expression grammars.

3.1 Expanded names

A Namespace is a URI, and a LocalName is an NCName, as in the Namespace recommendation [XML Names]. We let Namespace range over namespaces and the null value, which represents the absence of a namespace, and LocalName range over NCNames. The symbol Name ranges over expanded names. An expanded name is a (namespace, local name) pair.

namespace Namespace ::= URI
| The null namespace is denoted by the empty string
expanded name Name | {Namespace}LocalName

The unqualified expression LocalName is a syntactic shorthand for {}LocalName. In the remainder of the document, all expanded names include a namespace qualifier.

3.2 Wildcards

A wildcard denotes a set of names. A wildcard item is of the form {*}* denoting any name in any namespace, {Namespace}* denoting any name in namespace Namespace, {*}LocalName denoting the local name LocalName in any namespace, or {Namespace}LocalName denoting just the name with namespace Namespace and local name LocalName. A wildcard consists of wildcard items, union of wildcards, or difference of wildcards. We let WItem range over wildcard items, and Wild range over wildcard expressions.

wildcard item WItem ::= {*}* any namespace, any local name
| {Namespace}* namespace Namespace, any local name
| {*}LocalName any namespace, local name LocalName
| Name a constant namespace and local name
| * shorthand for {}*
wildcard Wild ::= WItem all names in Wild
| Wild1|Wild2 any name in Wild1 or Wild2
| Wild1!Wild2 all names in Wild1 and not in Wild2

For example, the wildcard {*}*!{"http://www.foo.org/baz}*!{"http://www.xsd.com/xsi}String denotes any name in any namespace except for names in namespace http://www.foo.org/baz and for local name String in namespace http://www.xsd.com/xsi.

Wildcards denote sets of expanded names, so we can define a containment relation, <:wild that relate sets of wildcards. The inequalities that hold for this relation include:

{Namespace}LocalName <:wild {*}LocalName
{Namespace}LocalName <:wild {Namespace}*
{Namespace}* <:wild {*}*
{*}LocalName <:wild {*}*
{Namespace}LocalName <:wild {*}*

The last inequality holds by transitivity. Note, however, that {*}LocalName <:wild {Namespace}* does not necessarily hold.

3.3 Atomic types

The Algebra takes as given the atomic datatypes from XML Schema Part 2 [XML Schema Part 2]. We let p range over all atomic datatypes.

atomic datatype p ::= Integer | Float | String
| Boolean | URI | NCName, etc.

Typically, an atomic datatype is either a primitive datatype, or is derived from another atomic datatype by specifying a set of facets. A type hierarchy is induced between scalar types by containment of facets. Note that lists of atomic datatypes are specified using repetition and unions are specified using alternation, as defined in [3.4 Types : abstract syntax]

The built-in atomic data type AnyScalar stands for the most general scalar type, and all other scalar types (like Integer or String) are subtypes of it.

3.4 Types : abstract syntax

[Figure 1]  contains the abstract syntax for the Algebra's type system. A type is closely related to a model group as defined in [MSL]. MSL uses standard regular expression notation for model groups and we do the same for algebra types. This type system captures the essence of [XML Schema Part 1] and is similar to the type system used in XDuce [HP2000].


type variable y    
type t ::= y type variable
    | () empty sequence
    | Ø empty choice
    | Wild wildcard type
    | u unit type
    | t1 , t2 sequence, t1 followed by t2
    | t1 & t2 interleaved product, nodes in t1 interleaved with nodes in t2
    | t1 | t2 choice, t1 or t2
    | t {m, n} repetition of t between m and n times
    | {t} unordered forest of nodes in t
    | pic t t interleaved with comments and processing instructions
    | mixed t t interleaved with strings
bound m, n ::= natural number or *
unit type u ::= p atomic datatype
    | @Wild[t] attribute with name in Wild and content in t
    | Wild[t] element with name in Wild and content in t
    | PI processing instruction
    | Comment comment
    | ref (t) reference to t
prime type q ::= u
    | q | q


Figure 1: Abstract Syntax for Types

The Wild type is necessary, because the Algebra's syntax permits the construction of name expressions (See NameExp in [3.6 Expressions]), whose type is Wild.

A unit type is is either an atomic type, an attribute or element type with a constant or wildcard name, a comment, or a processing instruction. A prime type is a unit type or a disjunction of prime types. Unit and prime types appear in several typing rules in [4 Static Semantics : Type-Inference Rules].

The empty sequence matches only the empty document; it is an identity for sequence and all. The empty choice matches no document; it is an identity for choice.

() , t = t = t , ()
() & t = t = t & ()
Ø | t = t = t | Ø

The interleaved product (&, also known as the shuffle product) is a generalization of XML Schema's [XML Schema Part 1] allgroups. t1 & t2 matches any sequence that is an interleaving of a sequence that matches t1 and a sequence that matches t2. For example,

     (a[],b[]) & c[] =
	      a[],b[],c[]
	  |   a[],c[],b[]
	  |   c[],a[],b[]

As another example, a[]{0,*}&b[] matches any sequence of a[] and b[] that has exactly one b[].

Allgroups in XML Schema may only consist of global or local element declarations with lower bound 0 or 1, and upper bound 1. With these restrictions, an allgroup in XML Schema is equivalent to p1{m1,1} & ... & pn{mn,1}, where pi are element declarations and 0 <= mi <= 1.

The repetition t{m,n} denotes a sequence of at least m and at most n t. The bounds on a repetition type will be either a natural number (that is, either a positive integer or zero) or the special value *, meaning unbounded. We extend arithmetic to include * in the obvious way:

m + * = * + m = *
0 · * = * · 0 = 0
m · * = * · m = * if m ¹ 0
m min * = * min m = m
m max * = * max m = *
m <= * = true
* < m = false

For technical reasons, we allow the lower bound of a repetition to be *. A repetition t {m, n} is equivalent to the empty choice Ø if m > n or if m is *.

The bag type {t} ignores the order of nodes in t. Note that this semantics differs from the conventional one. {t} does not denote a bag of nodes, each with type t, but a bag of the nodes in the list with type t.

The type expression pic t is a convenient shorthand for the type (PI | Comment){0,*} & t, that is, a type in which PIs and comments may be interleaved with nodes in type t. Similarly, the type expression mixed t is a convenient shorthand for the type String{0,*} & t.

The Algebra's external type system, that is, the type definitions associated with input and output documents, is XML Schema. The internal types are in some ways more expressive than XML Schema, for example, XML Schema has no type corresponding to {*}*[t] where t is some type other than AnyComplexType. In general, mapping XML Schema types into internal types will not lose information, however, mapping internal types into XML Schema may lose information.

3.5 Types : concrete syntax

It is useful to define a concrete syntax for the Algebra's types using syntactic categories that specify various subsets of types. Syntactic classes can indicate, for example, that the content of an attribute should be a simple type and that the content of an element should consist of attributes followed by elements. These restrictions guarantee that errors in type construction can be caught during parsing of a query.

In the concrete syntax for types, we capitalize non-terminal symbols. We first distinguish between type variables used to name attribute groups, element groups, and the content types of elements.

TypeVar ::= AttrGroupVar
| ElemGroupVar
| ContentTypeVar

A simple type is either an atomic type, a list of atomic types, or a choice of atomic types, which allows a choice of atomic lists.

SimpleType ::= p atomic
| p{m,n} list of atomic
| SimpleType | SimpleType choice of atomic

An attribute group defines the syntactic form of attributes: their content may only be a simple type and they are combined only with the interleaving operator, but not the sequence operator.

AttrGroup ::= @Wild[SimpleType]
| @Wild[SimpleType]{0,1} optional attribute
| AttrGroup & AttrGroup
| AttrGroupVar
| (AttrGroup)
| ()

An element group contains elements with constant or wildcard names and they are combined in sequences, choices, interleavings, and repetitions.

ElemGroup ::= Wild[ContentType] element
| ElemGroup , ElemGroup
| ElemGroup | ElemGroup
| ElemGroup & ElemGroup
| ElemGroup{m,n}
| ElemGroupVar
| pic ElemGroup
| mixed ElemGroup
| (ElemGroup)
| ()
| Ø

The content type of an element is either an attribute group, an element group, a sequence of an attribute group followed by an element group, or a content-type variable.

ContentType ::= AttrGroup
| ElemGroup
| AttrGroup , ElemGroup
| ContentTypeVar content-type variable

Finally, a type in an algebraic expression may be an attribute group, an element group, or a content type:

Type ::= AttrGroup | ElemGroup | ContentType

Ed. Note: MF (Nov-16-2000): This needs work.

3.6 Expressions

[Figure 2]  contains the concrete syntax for the Algebra's ``core'' expressions. We define the Algebra's typing rules on these expressions. Typing rules for the Algebra are defined in [4 Static Semantics : Type-Inference Rules]. We have seen examples of most of the expressions, so we will only point out details here.


name Name
function FuncName
variable Var
constant Const atomic-datatype constant
equality/inequality operator Opeq ::= = | == | != | < | <= | > | >=
arithmetic operator Oparith ::= + | - | * | / | mod | div
boolean operator Opbool ::= and | or
binary operator BinaryOp ::= Opeq | Oparith | Opbool
unary operator UnaryOp ::= + | - | not
name expression NameExp ::= Name
| {Exp}Exp computed name
expression Exp ::= Const atomic constant
    | NameExp name expression
    | Var variable
    | @NameExp[Exp] attribute
    | NameExp[Exp] element
    | Exp, Exp sequence or union
    | Exp - Exp difference
    | Exp /\ Exp intersection
    | () empty sequence
    | if Exp then Exp else Exp conditional
    | let Var = Exp do Exp local binding
    | FuncName (Exp ;...; Exp) function application
    | Exp : Type explicit type
    | error error
    | Exp BinaryOp Exp binary operator
    | UnaryOp Exp unary operator
    | sort v in Exp by Exp sort expression
    | for Var in Exp do Exp iteration
    | match Exp CaseRules match
case rules CaseRules ::= case Var : Type do Exp CaseRules
    | else Exp
query item Query ::= type TypeVar = Type type declaration
    | fun FuncName (Var : Type ; ... ; Var : Type): Type =
        Exp function declaration
    | let Var : Type = Exp global variable declaration
    | query Exp query expression


Figure 2: Core Algebra Expression Syntax

Many of the expressions that appear in the examples, for example Exp/*, do not appear in [Figure 2], because they are reducible to expressions in the core syntax. [Figure 3] lists these reducible expressions. In [A Equivalences], we give the laws that relate a reducible expression with its equivalent expression in the core algebra.


attributes (Exp) attributes
children (Exp) children
Exp / @Wild attribute projection
Exp / Wild element projection
Exp / data() atomic projection
Exp / comment() comment projection
Exp / processing_instruction() processing instruction projection
Exp / deref() dereference projection
where Exp do Exp where conditional
empty (Exp) emptiness predicate
cast Exp : Type run-time type cast
let Var : Type = Exp do Exp local variable with type


Figure 3: Reducible Algebra Expressions

3.7 Operators

In addition to the core grammar, the Algebra has a set of operators and built-in functions. The binary and unary operators are enumerated in [Figure 2]. They include two equality operators, == and =, which are also required by the [XML Query Data Model], and five inequality operators, <=, <, >=, >, and !=. The < operator is overloaded: when applied to two document nodes, it compares their global document order. We have not defined the semantics of all the binary operators in the Algebra. In particular, it might be useful to define more than one type of equality over scalar and element values, or to define implicit coercions between values of related types. A joint task force on operators with members from the [XSLT 99], XML Schema, and XML Query working groups is chartered to define operators. The Algebra will adopt the decisions of that group (See [Issue-0056:  Operators on Simple Types]).

3.8 Built-in functions

Some of the built-in functions are constructors and accessors defined in the [XML Query Data Model]. They are listed in [Figure 4]. The remaining built-in functions are listed in [Figure 5]. One benefit of having built-in functions is that more precise types can be given to these functions than to user-defined functions. The type rules for these functions appear in [4 Static Semantics : Type-Inference Rules].


bagtolist(Exp) Sorts unordered forest by document order.
comment(Exp) Constructs a comment.
deref(Exp) Dereferences a node reference.
listtobag(Exp) Disregards order of nodes in argument.
localname(Exp) Extracts local NCName from a QName value.
name(Exp) Returns element or attribute's tag name.
namespace(Exp) Extracts URI namespace from a QName value.
nodes(Exp) Returns nodes in content of element or attribute.
parent(Exp) Returns the parent of a document node.
processing_instruction(Exp, Exp) Constructs a processing instruction.
ref(Exp) Constructs a node reference.
string(Exp) Returns string value of given node, as defined in [XPath].
target(Exp) Returns the target name of a processing instruction.
value(Exp) Returns the value of a comment or processing instruction.


Figure 4: Data Model Constructor and Accessor Functions


agg(Exp) Aggregation functions, where agg is one of avg, count, min, max, sum.
distinct_node(Exp) Removes duplicate nodes from a forest.
distinct_value(Exp) Removes duplicate values from a forest.
index(Exp) Pairs each element of an ordered forest with integer index.


Figure 5: Built-In Functions

4 Static Semantics : Type-Inference Rules

The Algebra's static semantics is presented as type inference rules, which relate Algebra expressions to types and and specify under what conditions an expression is well typed. The rules for typing expressions are described with an inference rule notation, which is now widely used for describing type systems and semantics of programming languages. For a textbook introduction to type systems, see, for example, [Mitchell].

In inference notation, when all judgements above the line hold, then the judgement below the line holds as well. Here is an example of a rule used later on:

|- Data1: t1      |- Data2: t2

|- (Data1, Data2): (t1, t2)

Data is the subset of expressions that consists only of scalar constant, attribute, element, sequence, and empty-sequence expressions.

Data ::= Const atomic constant
    | @Name[Data] attribute
    | Name[Data] element
    | Data, Data sequence
    | () empty sequence

The judgement "|- Data : t" is read as "in the empty environment, the value Data has type t." The symbol |- is called a turnstyle, and is usually preceded by an environment symbol, which represents a mapping from variables to values. In this example, there is no environment symbol, which means the judgement holds in the empty environment. In [4.4 Expressions], we will see examples of rules that have non-empty environments. The rule states that if both Data1 : t1 and Data2 : t2 hold, then (Data1, Data2): (t1, t2) holds as well. For instance, take Data1 = a[1], Data2 = b["two"], t1 = a[Integer], and t2 = b[String].

Then since both a[1] : a[Integer] and b["two"] : b[String] hold, we may conclude that (a[1], b["two"]) : (a[Integer], b[String]) holds.

4.1 Relating data to types

The following type rules relate Data values to types. The next rule states that for any constant whose lexical form defines a value of type p, the constant has type p:


|- Constp: p

The next rule states that any constant also has type AnyScalar:


|- Const : AnyScalar

The next two rules are for attribute and element construction. The first rule states that if Data has type t, then the attribute expression @Name[Data] has type @Name[t]. The subsequent rule is analogous for element expressions.

|- Data : t

|- @Name[Data]: @Name[t]

|- Data : t

|- Name[Data]: Name[t]

The next rule is for typing sequences and was already described at the beginning of [4 Static Semantics : Type-Inference Rules].

|- Data1: t1      |- Data2: t2

|- (Data1, Data2): (t1, t2)

The next rule states that the empty sequence value always has the empty sequence type:


|- (): ()

The rules above associate the most specific type possible with a Data value. The remaining rules in this section associate more general types with a Data value, which are necessary when the type system must determine whether a Data value with most specific type t is permissible when a value of type t' is expected. This occurs during type checking of a query.

The next two rules are also for attribute and element construction, but these rules specify more general types. The first rule states that if Exp has type t, and Name is in the set of names defined by the wildcard expression Wild, then the given attribute expression also has type @Wild[t]. The subsequent rule is analogous for element expressions.

|- Data : t      Name <:Wild Wild

|- @Name[Data] : @Wild[t]

|- Data : t

|- Name[Data] : Wild[t]

The next rule states that if the value Data has type t1, then it also has type t1| t2. The subsequent rule is analogous and states that the choice operator is associative.

|- Data : t1

|- Data : (t1| t2)

|- Data : t2

|- Data : (t1| t2)

The next rule states that the sequence of one value with type t followed by a value with repetition type t{m,n} has repetition type t{m+1,n +1}.

|- Data1: t      |- Data2: t {m, n}

|- (Data1, Data2): t {m +1, n +1}

The next rule states that the sequence of one value with type t followed by a value with repetition type t{0,n} has repetition type t{0,n +1}.

|- Data1: t      |- Data2: t {0, n}

|- (Data1, Data2): t {0, n +1}

The next rule states that the empty sequence is a repetition of any type with lower bound 0:

 

|- (): t {0, n}

4.2 Equivalences and subtyping

The symbol <: denotes the subtype relation. We write t1<: t2 if for every data Data such that |- Data : t1, it is also the case that |- Data : t2, i.e., is t1 is a subtype of t2. The subtyping relation is used in many of the type rules that follow. It is easy to see that the subtype relation <: is a partial order, i.e., it is reflexive, t <: t, and it is transitive, if t1<: t2and t2<: t3then t1<: t3. Here are some of the inequalities that hold:

Ø <: t
t <: AnyType
p <: AnyScalar
t1 <: t1 | t2
t2 <: t1 | t2

Further, if Name <:wild Wild and t <: t', then

Name[t] <: Wild[t']

If t <: t' and m >= m' and n <= n' then

t {m, n} <: t' {m', n'}

If t <: t' then

t <: {t'}

If t1<: t1' and t2<: t2' then

t1, t2 <: t1', t2'
t1, t2 <: t1' & t2'
t1 & t2 <: t1' & t2'

We write t1= t2 if t1<:t2 and t2<: t1. Here are some of the equalities that hold:

AnyScalar = Integer | String | Boolean, etc.
(t1, t2), t3 = t1, (t2, t3)
t, () = t
(), t = t
t, Ø = Ø
Ø, t = Ø
t1| t2 = t2| t1
(t1| t2) | t3 = t1 | (t2| t3)
t | Ø = t
Ø | t = t
t1, (t2| t3) = (t1, t2) | (t1, t3)
(t1| t2), t3 = (t1, t3) | (t2, t3)
(t1 & t2) & t3 = t1 & (t2 & t3)
t1 & t2 = t2 & t1
t & () = t
() & t = t
t & Ø = Ø
Ø & t = Ø
Wild[t1| t2] = Wild[t1] | Wild[t2]
Wild[t] | {*}*[t] = {*}*[t]
t {m +1, n +1} = t, (t {m, n})
t {0, n +1} = () | t, (t {0, n})
t {0, 0} = ()
t {m, n} = Ø, if m > n or m = *
{{t}} = {t}
{p} = p
{()} = ()
{Ø} = Ø
{t}{m , n} := {t{m , n}}
{t1} | {t2} := {t1 | t2}

Ed. Note: PF, Dec 11 2000: The last two equivalence rules are necessary to reuse the auxiliary typing rules for iteration by for for iteration over unordered forests (see [4.8 Iteration expressions].)

Ed. Note: MF, Jan 09 2000: Peter, do you mean := above or =?

Ed. Note: PF, Jan 29 2000: yes, these equivalences are by definition

We also have that t1<: t2 if and only if t1| t2= t2.

We define the intersection t1 /\ t2 of two types t1 and t2 to be the largest type t that is smaller than both t1 and t2. That is, t = t1 /\ t2 if t <: t1 and t <: t2 and if for any t' such that t' <: t1 and t' <: t2, we have t' <: t.

4.3 Environments

The type rules use an environment that specifies the types of variables and functions. The type environment is denoted by G, and is composed of a comma-separated list of variable types, Var : t or function types, FuncName :(t1 ;  ... ;  tn) : t. We retrieve type information from the environment by writing (Var : t) in G to look up a variable, or by writing (FuncName : (t1;...; tn) : t) in G to look up a function.

The type checking starts with an environment that contains all the types declared for functions and global variables. For instance, before typing the first query of [2.2 Projection], the environment contains:

G = bib0 : Bib, book0 : Book 

While doing the type-checking, new variables will be added in the environment. For instance, when typing the query of [2.4 Iteration], variable b will be typed with Book, and added in the environment. This will result in a new environment:

G' = G, b : Book

4.4 Expressions

We write G |- Exp : t if in environment G the expression Exp has type t. Below are all the rules except those for for and match expressions, which are discussed in later subsections.

The next rule states that in any environment G, a constant whose lexical form defines a value of type p has type p. So, for example, G |- 1 : Integer and G |- "two" : String.


G |- Constp: p

The next five rules are for typing expressions that define names.

 

|- LocalName: {}LocalName

 

|- {Namespace}LocalName: {Namespace}LocalName

G |- Exp : NCName

G |- {Namespace}Exp: {Namespace}*

G |- Exp : URI

G |- {Exp}LocalName: {*}LocalName

G |- Exp1: URI      G |- Exp2: NCName

G |- {Exp1}Exp2: {*}*

The next rule is a trivial definition of the variable Var having type t in environment G:

(Var : t) in G

G |- Var : t

The next two rules are for attribute and element construction with a constant tag name. The first rule states that if Exp has type t, then the attribute expression @Name[Data] has type @Name[t]. The subsequent rule is analogous for element expressions.

G |- Exp : t

G |- @Name[Exp]: @Name[t]

G |- Exp : t

G |- Name[Exp]: Name[t]

The next two rules are for attribute and element construction in which the tag name is computed. The first rule states that if Exp1 is in the set of names defined by Wild, and Exp2 has type t, then the given computed attribute expression has type @Wild[t]. The subsequent rule is analogous for element expressions.

G |- Exp1: Wild      G |- Exp2: t

G |- @Exp1[ Exp2]: @Wild[t]

G |- Exp1: Wild      G |- Exp2: t

G |- Exp1[ Exp2]: Wild[t]

The following two rules are analogous to the sequence and empty sequence rules in [4.1 Relating data to types].

G |- Exp1: t1      G |- Exp2: t2

G |- Exp1, Exp2: t1, t2


G |- () : ()

Note that in the type rule for a conditional expression, the result type is the choice (t2|t3).

G |- Exp1: Boolean      G |- Exp2: t2      G |- Exp3: t3

G |- if Exp1 then Exp2 else Exp3: (t2 | t3)

A let expression extends the current environment Gamma; with the variable Var with type t. Note that Exp2, the body of the let expression, is typed in the extended environment, and the type of the entire let expression is t2.

G |- Exp1: t1      G, Var : t1 |- Exp2: t2

G |- let Var = Exp1 do Exp2: t2

The next rule is for function application. In a function application, the type of each actual argument to the function must be a subtype of the corresponding formal argument to the function, i.e., it is not necessary for the actual and formal types to be equal.

(FuncName : (t1;...; tn) : t) in G
G |- Exp1: t'1      t'1 <: t1
···
G |- Expn: t'n      t'n <: tn

G |- FuncName (Exp1;...; Expn) : t

The next rule states that is always permissible to explicitly type an expression with a type t' that is a supertype of the expression's type t. In programming-language terminology, this operation is sometimes called an ``upcast''.

G |- Exp : t'      t' <: t

G |- (Exp : t) : t

The error expression always has the empty choice type.


G |- error : Ø

4.5 Operators

The complete set of operators are not yet defined, so it is not possible to give the typing rules for operators. A joint task force on operators with members from the XSLT, XML Schema, and XML Query working groups is chartered to define operators. The Algebra will adopt the operators defined by that group (See [Issue-0056:  Operators on Simple Types]). In general, however, arithmetic operators will have a type rule such as the following, in which t1 and t2 are numeric types and appropriate conversion exist between the two:

G |- Exp1 : p1      G |- Exp2 : p2

G |- Exp1 Oparith Exp2 : t

Equality and inequality operators are typed similarly.

G |- Exp1 : t1      G |- Exp2 : t2

G |- Exp1 Opeq Exp2 : Boolean

Boolean operators require that their subexpressions be boolean expressions.

G |- Exp1 : Boolean      G |- Exp2 : Boolean

G |- Exp1 Opbool Exp2 : Boolean

4.6 Built-in functions

The following type rules define the input and output types for built-in functions.

G |- Exp : t

G |- count( Exp) : Integer

The next two rules are for comment and processing instruction constructors.

G |- Exp : String

G |- comment(Exp) : Comment

G |- Exp1 : NCName      G |- Exp2 : String

G |- processing_instruction(Exp1; Exp2) : PI

The next two rules extract the type of an attribute or element's name from its type.

G |- Exp : @Wild[t]

G |- name(Exp) : Wild

G |- Exp : Wild[t]

G |- name(Exp) : Wild

The nodes operator only applies to values with attribute or element type.

G |- Exp : @Wild[t]

G |- nodes(Exp) : t

G |- Exp : Wild[t]

G |- nodes(Exp) : t

The next two rules are for name expressions: they extract the constituent parts of a name.

G |- Exp : Wild

G |- namespace(Exp) : URI

G |- Exp : Wild

G |- localname(Exp) : NCName

The next rule is for the parent function, which may be applied to any unit type:

G |- Exp : u

G |- parent(Exp) : AnyElement{0,1}

The next rule two rules are for the reference constructor and dereference function. Note that ref requires that its argument have a unit type.

G |- Exp : u

G |- ref(Exp) : ref(u)

G |- Exp : ref(t)

G |- deref(Exp) : t

The next rule two rules are for the target and value accessor function on comments and processing instructions.

G |- Exp : PI

G |- target(Exp) : NCName

G |- Exp : (PI | Comment)

G |- value(Exp) : String

4.7 Typing unordered forests

Many of the operators in the algebra disregard the relative order of components in their input type. This includes the built-in operators index, sort, bag.

For example, consider the following.

    index (children (book0))
==> pair [ fst [ 1 ], snd [ title [ "Data on the Web" ] ], 
    pair [ fst [ 2 ], snd [ author [ "Abiteboul" ] ] ], 
    pair [ fst [ 3 ], snd [ author [ "Buneman" ] ] ], 
    pair [ fst [ 4 ], snd [ author [ "Suciu" ] ] ], 
    pair [ fst [ 5 ], snd [ year [ 1999 ] ]

How should we describe the type of the result? By nesting the children of book0 under snd the original sequence of title, author+, year gets lost. snd can contain either a title, an author, or a year. More formally, we need to find q, m, and n such that

 children(book0) : q{m, n} 

and then the type is given by

 index(children(book0)) : pair [ fst [ String ], snd [q] ]{m, n} 

In the case of books, the values of q are:

 title [ String ] | author [ String ] | year [ Integer] 

the value of m is 3 (because there will be one title, at least one author, and one year element) and the value of n is * (because there may be any number of author elements).

We call a type like q a prime type. In general, it may contain scalar, attribute, element, choice, and empty choice types, but it will not contain repetition, sequence, or empty sequence types (except, perhaps, within an element or attribute type). The definition of prime types appears in [Figure 1].


factor (p) = p {1, 1}
factor (Wild[t]) = Wild[t]{1, 1}
factor (@Wild[t]) = @Wild[t]{1, 1}
factor (t1 , t2) = (q1| q2){m1+ m2, n1+ n2} where qi{mi, ni} = factor (ti)
factor (t1 & t2) = (q1| q2){m1+ m2, n1+ n2} where qi{mi, ni} = factor (ti)
factor (t1| t2) = (q1| q2){m1min m2, n1max n2} where qi{mi, ni} = factor (ti)
factor (t {m, n}) = q {m' · m, n' · n} where q {m', n'} = factor (t)
factor (()) = Ø{0, 0}
factor (Ø) = Ø{*, 0}
 


Figure 6: Definition of factoring

factor, as shown in [Figure 6], converts any type t to a type of the form q {m, n}, where t <: q {m, n}, so that any value that has type t also has type q {m, n}. For example,

factor ( title[String], author[String]{1, *}, year[Integer] )
  = (title[String] | author[String] | year[Integer]){3, *}

We can see here that the factored type is less specific than the unfactored type. For mnemonic convenience we write q {m, n} = factor (t), but one should actually think of the function as returning a triple consisting of a prime type q and two bounds m and n.

Just as factoring a number yields a product of prime numbers, factoring a type yields a repetition of prime types. Further, the result yielded by factoring is in some sense optimal. If q {m, n} = factor (t) then t <: q {m, n} and for any q', m', and n' such that t <: q'{m', n'} we have that q <: q' and m >= m' and n <= n'. Also, if t = t', then factor (t) = factor (t'). In particular, the choice of the lower bound * for factor (Ø) guarantees that factor (t) = factor (t | Ø), since m min * = m.

Using factor we can type index, sort and the operations on unordered forests bag, union, difference, and intersection as follows. Note that factor is only used by the type inference. rules, and thus is not part of the algebra expressions.

The type rule for index requires that its argument be a factored type. The second expression above the judgement line converts t into a factored type.

G |- Exp : t      q {m, n} = factor (t)

G |- index(Exp) : pair [ fst [ Integer ], snd [ref(q)]]{m, n}

The types of aggregated expressions must be factored, and their prime type must be a numeric type.

G |- Exp : t      q{m, n} = factor (t)      q <: Integer | Float | Decimal, etc.

G |- agg Exp : Integer
agg is one of avg, max, min, sum.

bag transforms an ordered forest to an unordered forest, and accordingly transforms the input type to its factor.

G |- Exp : t      q{m,n} = factor(t)

G |- bag(Exp) : {q{m,n}}

sort returns the factor of its input type. We distinguish between two kinds of sorts. Stable sort operates on ordered forests, unstable sort operates on unordered forests. In both cases, the result is an ordered forest. The result type for the unstable sort needs not be explicitly factored, because the input type is already factored. The key expression Exp2 is typed in the extended environment G |- Var : q.

G |- Exp1 : t1      q{m,n} = factor(t1)      G, Var : q |- Exp2 : t2

G |- sort Var in Exp1 by Exp2: q{m,n}

G |- Exp1 : {q{m,n}}      G, Var : q |- Exp2 : t2

G |- sort Var in Exp1 by Exp2: q{m,n}

Ed. Note: MF, Oct 23/2000: This definition assumes that the equality operator on t2 is defined. An alternative is requiring Exp2 to have AnyScalar, but that seems too restrictive.

distinct_node removes all duplicate nodes from its input. Therefore, the lower bound of the cardinality of the factor type can be at most 1. As with sort, we distinguish two kinds of distinct_node. Stable distinct_node operates on an ordered forest, and returns an ordered forest that preserves the relative order of nodes in the original input, and unstable distinct_node operates on and returns an unordered forest and, naturally, does not preserve input order.

G |- Exp : t      q{m,n} = factor(t)

G |- distinct_node(Exp) : q{min 1 m, n}

G |- Exp : {q{m,n}}

G |- distinct_node(Exp) : {q{min 1 m,n}}

Ed. Note: PF (Jan 29, 2000): a more accurate lower bound may be derived by counting the disjoint constituent types of q. But this is probably too complex.

distinct_value removes all duplicate values from its input. The typing rules are analogous to the typing rules of distinct_node.

G |- Exp : t      q{m,n} = factor(t)

G |- distinct_value(Exp) : q{min 1 m, n}

G |- Exp : {q{m,n}}

G |- distinct_value(Exp) : {q{min 1 m,n}}

Sequence "," on unordered forests is interpreted as disjoint union.

G |- Exp1 : {q1{m1, n1}}      G |- Exp2 : {q2{m2, n2}}

G |- Exp1 , Exp2 : {(q1 | q2){m1 + m2, n1 + n2}}

Similar to distinct_node, intersection (/\) and difference (-) both need to set the lower bound of the cardinality of the input type to 0. The upper bound of the cardinality of the intersection of two types can be at most the minimum of upper bounds of their individual cardinalities.

G |- Exp1 : {q1{m1, n1}}      G |- Exp2 : {q2{m2, n2}}

G |- Exp1 /\ Exp2 : {(q1 /\ q2){0, min n1 n2}}

G |- Exp1 : {q1{m1, n1}}      G |- Exp2 : {q2{m2, n2}}

G |- Exp1 - Exp2 : {q1{0,n1}}

Note that for Exp1 /\ Exp2 : t and Exp1, Exp2 - (Exp1 - Exp2) - (Exp2 - Exp1) : t', t <: t'. Thus, although the operational semantics of these two expressions is equivalent, their static semantics is not necessarily equivalent.

Ed. Note: PF, Jan 29 2000: If this is considered a problem, we can treat /\ as a derived operator, losing some precision at type level.

4.8 Iteration expressions

The typing of for expressions is rather subtle. We give an intuitive explanation first and then the detailed typing rules below.

A unit type is defined in [Figure 1]; it is either an atomic type, an attribute or element type with a constant or wildcard name, a comment, or a processing instruction. A for expression

for Var in Exp1 do Exp2

is typed as follows. First, one finds the type of expression Exp1. Next, for each unit type in this type one assumes the variable Var has the unit type and one types the body Exp2. Note that this means we may type the body of Exp2 several times, once for each unit type in the type of Exp1. Finally, the types of the body Exp2 are combined, according to how the types were combined in Exp1. That is, if the type of Exp1 is formed with sequencing, then sequencing is used to combine the types of Exp2, and similarly for choice or repetition.

For example, consider the following expression, which selects all author elements from a book.

    for c in nodes(book0) do
      match c
        case a : author[AnyType] do a
        else ()

The type of nodes(book0) is

    title[String], year[Integer], author[String]{1,*}

This is composed of three unit types, and so the body is typed three times.

assuming c has type title[String] the body has type ()
" year[Integer] " ()
" author[String] " author[String]

The three result types are then combined in the same way the original unit types were, using sequencing and iteration.

    (), (), author[String]{1,*}

as the type of the iteration, and simplifying yields

    author[String]{1,*}

as the final type.

As a second example, consider the following expression, which selects all title and author elements from a book, and renames them.

    for c in book0/* do
      match c
        case t : title[String]  do titl [ t/data() ]
        case y : year[Integer]  do ()
        case a : author[String] do auth [ a/data() ]
        else error()

Again, the type of book0/* is

    title[String], year[Integer], author[String]{1,*}

This is composed of three unit types, and so the body is typed three times.

assuming c has type title[String] the body has type titl[String]
" year[Integer] " ()
" author[String] " auth[String]

The three result types are then combined in the same way the original unit types were, using sequencing and iteration. This yields

    titl[String], (), auth[String]{1,*}

as the type of the iteration, and simplifying yields

    titl[String], auth[String]{1,*}

as the final type. Note that the title occurs just once and the author occurs one or more times, as one would expect.

As a third example, consider the following expression, which selects all basic parts from a sequence of parts.

    for p in part0/subparts/* do
      match p
        case b : Basic     do b
        case c : Composite do ()
        else error()

The type of part0/subparts/* is

    (Basic | Composite){1,*}

This is composed of two unit types, and so the body is typed two times.

assuming p has type Basic the body has type Basic
" Composite " ()

The two result types are then combined in the same way the original unit types were, using sequencing and iteration. This yields

    (Basic | ()){1,*}

as the type of the iteration, and simplifying yields

    Basic{0,*}

as the final type. Note that although the original type involves repetition one or more times, the final result is a repetition zero or more times. This is what one would expect, since if all the parts are composite the final result will be an empty sequence.

In this way, we see that for expressions can be combined with match expressions to select and rename elements from a sequence, and that the result is given a sensible type.

In order for this approach to typing to be sensible, it is necessary that the unit types can be uniquely identified. However, the type system given here satisfies the following law.

 a [t1 | t2] = a [t1] | a [t2] 

This has one unit type on the left, but two distinct unit types on the right, and so might cause trouble. Fortunately, our type system inherits an additional restriction from XML Schema: we insist that the regular expressions can be recognized by a top-down deterministic automaton. In that case, the regular expression must have the form on the left, the form on the right is outlawed because it requires a non-deterministic recognizer. With this additional restriction, there is no problem.

The method of translating projection to iteration described in the previous section combined with the typing rules given here yield optimal types for projections, in the following sense. Say that variable x has type t, and the projection x / a has type t'. The type assignment is sound if for every value of type t, the value of x / a has type t'. The type assignment is complete if for every value y of type t' there is a value x of type t such that x / a = y. In symbols, we can see that these conditions are complementary.

sound: forall x in t. exist y in t'. x / a = y
complete: forall y in t'. exist x in t. x / a = y

Any sensible type system must be sound, but it is rare for a type system to be complete. But, remarkably, the type assignment given by the above approach is both sound and complete.

The type rule for for expressions uses the following auxiliary judgement. We write G |- for Var : t do Exp : t', if in environment G when the bound variable Var of an iteration has type t then the body Exp of the iteration has type t'.

G, Var : t |- Exp : t'

G |- for Var : t do Exp : t'


G |- for Var : () do Exp : ()

G |- for Var : t1 Exp : t'1      G |- for Var : t2 Exp : t'2

G |- for Var : t1, t2 do Exp : t'1, t'2

G |- for Var : t1 Exp : t'1      G |- for Var : t2 Exp : t'2

G |- for Var : t1 & t2 do Exp : t'1 & t'2


G |- for Var : Ø do Exp : Ø

G |- for Var : t1 Exp : t'1      G |- for Var : t2 Exp : t'2

G |- for Var : t1 | t2 do Exp : t'1 | t'2

G |- for Var : t Exp : t'

G |- for Var : t{m,n} do Exp : t'{m,n}

Given the above rules, the type rules for for expressions are immediate. Again, we need to distinguish between ordered and unordered forests.

G |- Exp1 : t1      G |- for Var : t1 do Exp2 : t2
t1 ¹ {f}

G |- for Var in Exp1 do Exp2 : t2

G |- Exp1 : {f1}      G |- for Var : f1 do Exp2 : {f2}

G |- for Var in Exp1 do Exp2 : {f2}

Iteration over an ordered forests produces an ordered forest. Iteration over an unordered forest produces an unordered forest.

4.9 Match expressions

The typing of match expressions is closely related to the typing of for expressions. Due to the typing rules of for expressions, it is possible that the body of an iteration is checked many times. Thus, when a match expression is checked, it is possible that quite a lot is known about the type of the expression being matched, and one can determine that only some of the clauses of the match apply. The definition of match uses the auxiliary judgments to check whether a given clause is applicable.

We write G |- case Var : t do Exp : t', if in environment G, the bound variable Var of the case has type t, then the body Exp of case has type t'. Note the type of the body is irrelevant if t = Ø.

¬(t = Ø)     G, Var : t |- Exp : t'

G |- case Var : t do Exp : t'


G |- case Var : Ø do Exp : Ø

We write G |- t <: t' else Exp: t'' if in environment G when t <: t' does not hold, then the body Exp of the match expression's else clause has type t''. Note that the type of the body is irrelevant if t < t'.

t <: t'

G |- t <: t' else Exp : Ø

¬ t <: t'      G |- Exp : t''

G |- t <: t' else Exp : t''

Given the above, it is straightforward to construct the typing rule for a match expression. Recall that we write t /\ t' for the intersection of two types.

G |- Exp0 : t0
G |- case Var1 : t0 /\ t1 do Exp1 : t'1
···
G |- case Varn : t0 /\ tn do Expn : t'n
G |- t0 < t1 | ... | tn else Expn+1 : t'n+1

G |-  
(match Exp0
    case Var1 : t1 do Exp1
    ···
    case Varn : tn do Expn
    else Expn+1)
 : t'1 | ... | t'n+1

4.10 Top-level expressions

We write G |- Query if in environment G, the query item Query is well-typed. A type declaration is always well typed:

 

G |- type x = t

A function declaration is well-typed if in the environment extended with the type assignments for its formal variables, its body is well-typed.

G, Var1 : t1, ..., Varn : tn |- Exp : t'      t' <: t

G |- fun FuncName (Var1 : t1; ...; Varn : tn) : t = Exp

A top-level let expression is well-typed if the type of its body, t', is a subtype of the bound variable's type.

G |- Exp : t'      t' <: t

G |- let Var : t = Exp

Finally, a top-level query expression is well-typed if its body is well-typed.

G |- Exp : t

G |- query Exp

We extract the relevant component of a type environment from a query item Query with the function environment (Query).

environment ( type x = t) = ()
environment ( fun FuncName (Var1: t1; ...; Varn: tn) : t) = FuncName (Var1: t1; ...; Varn: tn) : t
environment ( let Var : t = Exp) = Var : t

We write |- Query1 ... Queryn if the sequence of query items Query1 ... Queryn is well typed.

G |- environment (Query1),..., environment (Queryn)
G |- Query1      ...      G |- Queryn

|- Query1... Queryn

5 Dynamic Semantics : Value-Inference Rules

Ed. Note: MF : 02-Feb-2001: This material is not yet stable.

The Algebra's dynamic, or operational, semantics is presented as value inference rules. The value inference rules are similar to the type inference rules in [4 Static Semantics : Type-Inference Rules], but they relate expressions to values or semantic objects. The Algebra's semantic objects are defined in the [XML Query Data Model]. An operational semantics specifies the order in which an Algebra expression is evaluated and guarantees that every expression can be reduced to a simple semantic object. The Algebra's dynamic semantics is modeled on the dynamic semantics presented in [Milner].

5.1 Semantic objects


error
ainAtomicValues
ninNodeValues
uinUnitValues = AtomicValues U NodeValues
linOrderedForestValues = [ UnitValues ]
binUnorderedForestValues = { UnitValues }
dinDataModelValues = UnitValues U OrderedForestValues U UnorderedForestValues
vinValues = DataModelValues U {error}
finFunctionClosure = Exp X Env X Env


Figure 7: Semantic objects

There is a single error value, named error. AtomicValues is the class of values denoted by the Const expressions, i.e., the values contained in the XML Schema atomic datatypes. NodeValues is the class of Node values defined in the [XML Query Data Model]. UnitValues is the union of all values in AtomicValues and NodeValues; they are instances of the unit type [Figure 1]. DataModelValues is the class of values defined in the XML Query Data model, which includes UnitValues and ordered and unordered forests. Values is the class of values that includes DataModelValues and error. A FunctionClosure is the class of triples each containing an Algebra expression and two environments. All the above classes, except error, are infinite sets.

[Figure 8] lists several basic functions used in the dynamic semantics.


val : Const -> AtomicValue
def : Type-> Def_T
Defined in [XML Query Data Model]:
[]Construct an ordered forest
{}Construct an unordered forest
append Appends two ordered forests
head Returns the first node in a non-empty, ordered forest
tailReturns items in a ordered forest excluding its first item.
union Returns the union of two unordered forests
diff Returns the difference of two unordered forests
intersect Returns the intersection of two unordered forests


Figure 8: Functions used in dynamic semantics

The function val takes a constant expression Const and returns the atomic value it denotes. In [XML Query Data Model], Def_Type denotes a reference to the data-model representation of a schema type Type. The function def takes a type expression Type and returns the reference node that represents the given type. The remaining functions used in the operational semantics are defined in [XML Query Data Model].

Ed. Note: MF : (Jan-15-2001) To define the sort expression completely, we need to specify what basic sort operators are available.

5.2 Environments

The value inference rules use an environment comprised of a type environment, a value environment, and a function environment, defined in [Figure 9]


GType environment
VE in ValueEnv = Var -> Values Value environment
FE inFuncEnv = FuncName -> FunctionClosuresFunction environment
E in Env = TypeEnv U ValueEnv U FuncEnv


Figure 9: Environments

The type environment G is defined by the static type rules [4 Static Semantics : Type-Inference Rules]. The value environment VE is a finite map from variables to values. The function environment FE is a finite map from function names to function closures. An environment E is a tuple containing a type enviroment, a value environment, and a function environment.

To select the type-environment component of an environment, we use the notation G of E; similarly, we use VE of E to select the value environment and FE of E for the function environment.

We look up the type of a variable by writing (G of E)(Var) = t, and we write (VE of E)(Var) = v to look up the value of a variable or (FE of E)(FuncName) = f to look up a function.

It is often necessary to modify an environment, for example, when defining variables or functions. The modification of one environment E by another E' is written E, E' and it denotes:
E, E' = ((G of E), (G' of E'); (VE of E), (VE' of E'); (FE of E), (FE' of E'))

The finite map f, g is the finite map with domain Dom f U Dom g, and values

  (f, g)(a) = if a in Dom g then g(a) else f(a)
This definition means that when ``looking up'' a variable in the environment E, E', the environment E' will always be searched first.

5.3 Expressions

Each rule of the dynamic semantics are inferences among sentences of the form: E |- phrase => v, where E is an environment, phrase is a phrase in the core syntax [Figure 2], and v is a semantic object.

As in [4 Static Semantics : Type-Inference Rules], when all judgements above the line of an inference rule hold, then the judgement below the line holds as well. The folowing rule states that in any environment, a constant expression reduces to the value it denotes:


E |- Const => val(Const)

The following rule states that an unqualified LocalName denotes a QName value with a null-valued URI reference. The function qnameValue is a constructor in the [XML Query Data Model]; it constructs a new QName value.

E |- LocalName => v

E |- LocalName => qnameValue(null, v, def(QName))

The next rule constructs a fully qualified QName by evaluating its literal namespace and local-name components and then applying the QName constructor to the two values:

E |- Namespace => v1      E |- LocalName => v2

E |- {Namespace}LocalName => qnameValue(v1, v2, def(QName))

The next rule constructs a fully qualified QName by evaluating its computed namespace and local-name components and then applying the QName constructor to the two values:

E |- Exp1 => v1      E |- Exp2 => v2

E |- {Exp1}Exp2 => qnameValue(v1, v2, def(QName))

The next rule determines the value of a variable, which is simply the variable's value in the current value environment VE

(VE of E)(Var) = v

E |- Var => v

The next rule constructs an attribute node from its name and value sub-expressions.

E |- NameExpr => v1      E |- Exp => v2

E |- @NameExp[Exp] => attrNode(v1, v2)

The next rule constructs an element node from its name and children sub-expressions. Before constructing the element node, a ``deep'' copy of the children is created; this enforces the data-model constraint that the new node be the parent of all of its children. This is the first rule that uses the type environment: the run-time type def(t) of the element is obtained from the type environment G.

E |- NameExpr => v1      E |- Exp => v2      G of E |- NameExp[Exp] : t

E |- NameExp[Exp] => elemNode(v1, {}, copy([v2]), def(t))

Ed. Note: MF : (Jan-08-2001) NB: 1. The data model now has one elemNode constructor that takes a list of nodes containing attributes and children. 2. When creating a copy of the children, each new child should have the new element defined as its parent.

The next rule constructs a sequence. It requires that Exp1 and Exp2 be "ordered" values, i.e., values with prime type or not bags. The side conditions that check the expressions' static type are necessary because the ',' operator is overloaded : it operates on both ordered and unordered collections.

E |- Exp1 => l1      G of E |- Exp1 : t1      t1 = q or ¬ (t1 = { t1 })
E |- Exp2 => l2      G of E |- Exp2 : t2      t2 = q or ¬ (t2 = { t2 })

E |- Exp1, Exp2 => append([l1], [l2])

We note here that in the [XML Query Data Model], the ordered forest constructor [] when applied to an ordered forest is the identity function, i.e., it returns the original forest. Similarly, the unordered forest constructor {} is the identity function on unordered forests.

The next rule constructs the union of two bags. It requires that Exp1 and Exp2 be "unordered" values, i.e., values with prime type or bags, because the ',' operator is overloaded.

E |- Exp1 => b1      G of E |- Exp1 : t1      t1 = q or t1 = { t1 }
E |- Exp2 => b2      G of E |- Exp2 : t2      t2 = q or t2 = { t2 }

E |- Exp1, Exp2 => union({b1}, {b2})

The next two rules construct the difference, and intersection of two bags. They do not require any side conditions because the operators are not overloaded.

E |- Exp1 => v1      E |- Exp2 => v2

E |- Exp1 - Exp2 => diff({v1}, {v2})

E |- Exp1 => v1      E |- Exp2 => v2

E |- Exp1 /\ Exp2 => intersect({v1}, {v2})

The next rule constructs the empty sequence, which is always an ordered forest.


E |- () => []

The next rule evaluates a conditional expression. If the conditional's boolean expression Exp1 evaluates to true, Exp2 is evaluated and its value is produced. If the conditional's boolean expression evaluates to false, Exp3 is evaluated and its value is produced.

E |- Exp1 => true      E |- Exp2 => v2

E |- if Exp1 then Exp2 else Exp3 => v2

E |- Exp1 => false      E |- Exp3 => v3

E |- if Exp1 then Exp2 else Exp3 => v3

The next rule evaluates a local binding expression: it evaluates the body of the expression Exp2 in the environment E extended with the variable Var bound to the value v.

E |- Exp1 => v      E, { Var |-> v } |- Exp2 => v2

E |- let Var = Exp1 in Exp2 => v2

The next rule evaluates a function application. It evaluates all the function's arguments in the current environment, then extracts the definition of the function's closure from the function environment. It then evaluates the body of the function applied to its arguments in the environment E' provided in its closure.

E |- Exp1 => v1      ···     E |- Expn => vn
(FE of E)(FuncName) = (f, E', VE)      E', Rec VE |- f v1 · · · vn => v

E |- FuncName( Exp1; · · · ;Expn) => v

Ed. Note: MF (Jan-08-2001) Define and explain the Rec function on environments and why its needed for (possibly) recursive function applications.

The next rule states that in any environment, the error expression evalutes to the error value.


E |- error => error

5.4 Operators

The Algebra's two equality operators, '=' and '==', are defined in [XML Query Data Model].

E |- Exp1 => v1      E |- Exp2 => v2      v3 = v1 Opeq v2

E |- Exp1 Opeq Exp2 => v3

The next two rule define the semantics of the boolean operators and, or, and not.

E |- Exp1 => v1      E |- Exp2 => v2      v3 = apply(Opbool, v1, v2)

E |- Exp1 Opbool Exp2 => v3

E |- Exp1 => v1      E |- Exp2 => v2      v3 = apply(not, v1, v2)

E |- Exp1 not Exp2 => v3

For the boolean operators and, or, and not the apply function is defined as expected:

  apply(and, true, true)   = true
| apply(and, true, false)  = false
| apply(and, false, true)  = false
| apply(and, false, false) = false
| apply(or, true, true)    = true
| apply(and, true, false)  = true
| apply(and, false, true)  = true
| apply(and, false, false) = false
| apply(not, true)         = false
| apply(not, false)        = true

We have not defined the semantics of all the arithmetic operators in the Algebra. A joint task force on operators with members from the [XSLT 99], XML Schema, and XML Query working groups is chartered to define arithmetic operators. The Algebra will adopt the decisions of that group (See [Issue-0056:  Operators on Simple Types]). In the following two rules, we assume that the binary and unary operators are implemented by the apply function whose semantics are defined by the operator task force.

E |- Exp1 => v1      E |- Exp2 => v2      v3 = apply(Oparith, v1, v2)

E |- Exp1 Oparith Exp2 => v3

E |- Exp1 => v1      v2 = apply(Op, v1)

E |- Op Exp1 => v2

5.5 Built-in functions

There is a near one-to-one correspondence between the Algebra expressions for constructing and accessing data model values [Figure 4] and the data model constructors and accessors.

The next three rules construct comment, processing instruction, and reference nodes, respectively.

E |- Exp => v

E |- comment(Exp) => commentNode(v)

E |- Exp1 => v1      E |- Exp2 => v2

E |- processing_instruction(Exp1, Exp2) => piNode(v1, v2)

E |- Exp => v

E |- ref(Exp) => ref(v)

A single rule defines all of the accessor and other data-model functions. In the following rule, the function name f denotes any one of the following accessors: bagtolist, deref, listtobag, localname, namespace, name, nodes, parent, value, target.

E |- Exp => v

E |- f(Exp) => f(v)

The sort expression can return any possible permutation of its input. Stable sort operates on ordered forests, unstable sort operates on unordered forests. In both cases, the result is an ordered forest.

The easiest way to explain the dynamic semantics of sort is by assuming the existence of two functions: stable_sort_pairs defined on ordered forests and unstable_sort_pairs defined on unordered forests. Each function takes a forest of pair elements that contain pairs of values and sorts in increasing order the pair elements based on the value in the fst subelement. The purpose of is stable_sort_pairs and unstable_sort_pairs is only to explain the semantics of the sort expression. An implementation of the Algebra is free to implement the sort expression as it chooses, as long as it guarantees stability on ordered forests.

E |- Exp1 => v1      G of E |- Exp1 : t1      t1 = q or ¬ (t1 = { t1 })
E |- for Var in v1 do pair [ fst [ Exp2 ], snd [ Var ] ] => v2
E |- for Var' in stable_sort_pairs(v2) do Var'/snd/* => v3

G |- sort Var in Exp1 by Exp2 => v3

E |- Exp1 => v1      G of E |- Exp1 : t1      t1 = q or (t1 = { t1 })
E |- for Var in v1 do pair [ fst [ Exp2 ], snd [ Var ] ] => v2
E |- for Var' in unstable_sort_pairs(v2) do Var'/snd/* => v3

G |- sort Var in Exp1 by Exp2 => v3

Ed. Note: MF (Jan-09-2001) Need definition for built-in function string(Exp)

5.6 Iteration expressions

The next two rules evaluate iteration expressions. If the iteration expression evaluates to the empty ordered forest, then the entire expression evalutes to the empty ordered forest.

E |- Exp1 => []

E |- for Var in Exp1 do Exp2 => []

The next rule first evaluates the iteration expression Exp1, which produces the value v. It evaluates the body of the iteration expression in the environment E extended with the variable Var bound to the head of the ordered forest v, which produces v2. It evaluates the entire iteration expression applied to the tail of the ordered forest v, which produces v3. Since the values v2 and v3 may be of any type, the sequence/union operator is applied to produce the value of the complete iteration expression.

E |- Exp1 => v      E, { Var |-> head([v]) } |- Exp2 => v2
E |- for Var in tail([v]) do Exp2 => v3      E |- v2, v3 => v4

E |- for Var in Exp1 do Exp2 => v4

Ed. Note: MF : Jan-09-2001. The rule above works for lists, but not for bags, because head/tail only defined on lists. Need a second rule for bags.

5.7 Match expressions

When evaluating a match expression, the value v to match occurs on the left of the turnstile |-. Each case rule of a match expression is always evaluated against this value. Alternative case rules are tried from left to right. The rule for the match expression evaluates its expression and sets up the appropriate environment for the case rules:

E |- Exp => v      E, v |- CaseRules => v1

E, v |- match Exp CaseRules => v1

The next rule evaluates the body of a case rule if the intersection of the dynamic or run-time type of v and the type expression Type is not empty. It extends the environment by binding the variable Var to v.

type(v) /\ Type != Ø      E, { Var |-> v } |- Exp => v2

E, v |- case Var : Type do Exp CaseRules => v2

The next rule evaluates the case rules following the current one if the intersection of dynamic or run-time type of v and the type expression Type is empty. The body of the given case rule is not evaluated if the intersection is empty.

(type(v) /\ Type) = Ø      E, v |- CaseRules => v2

E, v |- case Var : Type do Exp CaseRules => v2

The next rule states that the else branch of a match expression always evaluates to its given expression.

E |- Exp => v1

E, v |- else Exp => v1

5.8 Top-level expressions

The top-level type declaration has no effect on the dynamic environment. It is only used during static type checking.

The function declaration and top-level let expressions extend the global environment E.


E |- fun FuncName (Var1 : Type1; · · · Varn : Typen) : Type = Exp => E, { FuncName |-> (Exp, E, {}) }

E |- Exp => v

E |- let Var = Exp => E, { Var |-> v }

The top-level query expression always evaluates to a value, which is returned to the programming environment in which the Algebra expression is evaluated. It has no effect on the global environment E

E |- Exp => v

E |- query Exp => v, E

A sequence of query expressions evaluates to a value and a global environment.

|- Query1 => v1, E1      E1 |- Query2 => v2, E2      · · ·      En-1 |- Queryn => vn, En

Query1 · · · Queryn -> vn, En

Ed. Note: MF : Jan-09-2000 This definition assumes that the order of the global declarations is significant, that is, a global variable declaration is in scope for those query expressions that follow it, but not for those that precede it. For mutually recursive declarations, this needs to be changed.

6 References

BFS00
P. Buneman, M. Fernandez, D. Suciu. UnQL: A query language and algebra for semistructured data based on structural recursion. VLDB Journal, April, 2000, Vol 9, Number 1.
BK93
Catriel Beeri and Yoram Kornatzky. Algebraic Optimization of Object-Oriented Query Languages. Theoretical Computer Science 116(1&2):59--94, August 1993.
BKD90
Francois Bancilhon, Paris Kanellakis, Claude Delobel. Building an Object-Oriented Database System. Morgan Kaufmann, 1990.
BNTW95
Peter Buneman, Shamim Naqvi, Val Tannen, Limsoon Wong. Principles of programming with complex object and collection types. Theoretical Computer Science 149(1):3--48, 1995.
Quilt
Don Chamberlin, Jonathan Robie, and Daniela Florescu. Quilt: An XML Query Language for Heterogeneous Data Sources. International Workshop on the Web and Databases (WebDB'2000), Dallas, Texas, May 2000.
CM93
S. Cluet and G. Moerkotte. Nested queries in object bases. Workshop on Database Programming Languages, pages 226--242, New York, August 1993.
YAT99
S. Cluet, S. Jacqmin and J. Siméon The New YaTL: Design and Specifications. Technical Report, INRIA, 1999.
Col90
L. S. Colby. A recursive algebra for nested relations. Information Systems 15(5):567-582, 1990.
Date97
Hugh Darwen (Contributor) and Chris J. Date. Guide to the SQL Standard : A User's Guide to the Standard Database Language SQL Addison-Wesley, 1997.
XMLQL99
A. Deutsch, M. Fernandez, D. Florescu, A. Levy, and D. Suciu. A query language for XML. In International World Wide Web Conference, 1999.
Graefe93
Goetz Graefe, Query Evaluation Techniques for Large Databases. In ACM Computing Surveys, 25(2):73--170, 1993.
LW97
Leonid Libkin and Limsoon Wong. Query languages for bags and aggregate functions. Journal of Computer and Systems Sciences, 55(2):241--272, October 1997.
HP2000
Haruio Hosoya, Benjamin Pierce, XDuce : A Typed XML Processing Language (Preliminary Report) WebDB Workshop 2000.
LMW96
Leonid Libkin, Rona Machlin, and Limsoon Wong. A query language for multi-dimensional arrays: Design, implementation, and optimization techniques. SIGMOD 1996.
Milner
R. Milner, M. Tofte, R. Harper, D. MacQueen The Definition of Standard ML (Revise). MIT Press, 1997.
Mitchell
John C. Mitchell Foundations for Programming Languages. MIT Press, 1998.
Mog89
E. Moggi, Computational lambda-calculus and monads. In Symposium on Logic in Computer Science Asilomar, California, IEEE, June 1989.
Mog91
E. Moggi, Notions of computation and monads. Information and Computation, 93(1), 1991.
MSL
A. Brown, M. Fuchs, J. Robie, P. Wadler, editors, MSL : A Model for W3C XML Schema. October, 2000.
OCaml
The Caml Language. See http://pauillac.inria.fr/caml/.
XQL99
J. Robie, editor. XQL '99 Proposal, 1999. See http://www.ibiblio.org/xql/xql-proposal.html.
Wad92
Philip Wadler. Comprehending monads. Mathematical Structures in Computer Science, 2:461-493, 1992.
Wad93
P. Wadler, Monads for functional programming. In M. Broy, editor, Program Design Calculi, NATO ASI Series, Springer Verlag, 1993. Also in J. Jeuring and E. Meijer, editors, Advanced Functional Programming, LNCS 925, Springer Verlag, 1995.
Wad95
P. Wadler, How to declare an imperative. ACM Computing Surveys, 29(3):240--263, September 1997.
XML Names
World Wide Web Consortium. Namespaces in XML. W3C Recommendation. See http://www.w3.org/TR/REC-xml-names/ .
XML
World-Wide Web Consortium Extensible Markup Language (XML) 1.0 (Second edition), October, 2000 See http://www.w3.org/TR/REC-xml.
XSLT 99
World-Wide Web Consortium XSL Transformations (XSLT), Version 1.0. W3C Recommendation, November 1999. See http://www.w3.org/TR/xslt.
XML Query Data Model
World-Wide Web Consortium XML Query Data Model, Working Draft, May 2000. See http://www.w3.org/TR/query-datamodel/.
XPath
World-Wide Web Consortium XML Path Language (XPath) : Version 1.0. November, 1999. See http://www.w3.org/TR/xpath.html.
XML Schema Part 1
World-Wide Web Consortium XML Schema Part 1 : Structures, Working Draft. April 2000. See http://www.w3.org/TR/xmlschema-1/
XML Schema Part 2
World-Wide Web Consortium XML Schema Part 2 : Datatypes, Working Draft, April 2000. See http://www.w3.org/TR/xmlschema-2/.

A Equivalences

A.1 Relating projection to iteration

The examples use the / operator liberally, but in fact we use / as a convenient abbreviation for expressions built from lower-level operators: for expressions, the nodes function, and match expressions.

For example, the expression:

    book0/author

is equivalent to the expression:

    for v in nodes(book0) do
      match v 
        case a : author[AnyComplexType] do a
        else ()

Here the nodes() function returns a forest consisting of the content of the element book0, namely, a title element and three author elements (the order is preserved). The for expression binds the variable v successively to each of these elements. Then the match expression selects a branch based on the type of v. If it is an author element then the first branch is evaluated, otherwise the second branch. If the first branch is evaluated, it returns a. The variable a contains the same value as v, but the type of a is author[String], which is the intersection of the type of v and the type author[AnyComplexType]. If the second branch is evaluated, then the branch returns (), the empty sequence.

To compose several expressions using /, we again use for expressions. For example, the expression:

    bib0/book/author

is equivalent to the expression:

    for b in nodes(bib0) do
      match b
        case b : book[AnyComplexType] do
          for d in nodes(b) do
            match d
              case a : author[AnyComplexType] do a
              else ()
        else ()

The for expression iterates over all book elements in bib0 and binds the variable b to each such element. For each element bound to b, the inner expression returns all the author elements in b, and the resulting forests are concatenated together in order.

In general, an expression of the form e / a is converted to the form

    for v1 in e do
      for v2 in nodes(v1) do
        match v2
          case v3 : a[AnyComplexType] do v3
          else ()

where e is an expression, a is an element name, and v1, v2, and v3 are fresh variables (ones that do not appear in the expression being converted).

According to this rule, the expression bib0/book translates to

    for v1 in bib0 do
      for v2 in nodes(v1) do
        match v2
          case v3 : book[AnyComplexType] do v3 
          else ()

In [A Equivalences], we discuss laws which allow us to simplify this to the previous expression:

    for v2 in nodes(bib0) do
      match v2
        case v3 : book[AnyComplexType] do v3 
        else ()

Similarly, the expression bib0/book/author translates to

    for v4 in (for v2 in nodes(bib0) do
                 match v2
                   case v3 : book[AnyComplexType] do v3 
                   else ()) do
      for v5 in nodes(v4) do
        match v5
          case v6 : author[AnyComplexType] do v6
          else ()

Again, the laws will allow us to simplify this to the previous expression

    for v2 in nodes(bib0) do
      match v2
        case v3 : book[AnyComplexType] do
          for v5 in nodes(v3) do
            match v5
              case v6 : author[AnyComplexType] do v6
              else ()
        else ()

These examples illustrate an important feature of the Algebra: high-level operators may be defined in terms of low-level operators, and the low-level operators may be subject to algebraic laws that can be used to further simplify the expression.

A.2 Reducible expressions

[Figure 10]  contains the laws that relate the reducible expressions (i.e., those labeled with "*" in [Figure 2]) to equivalent expressions.

In Rule 1, the projection expression e / Wild is rewritten as a for expression, which binds v to each element in the forest e, and returns the children elements of v whose name is in Wild. Similarly, Rule 2 rewrites the attribute projection expression e / @Wild as a match embedded in a for expression.

Rule  3 rewrites the e / processing_instruction() expression by iterating over items in e. The nested match expression returns e's children if e is a PI. Otherwise, it returns the empty sequence. Rules 4 and 5 are analogous for comments and atomic data.

Rule 6 rewrites the e / deref() expression by iterating over items in e. The nested match expression returns e's children if e is ref(AnyTree). Otherwise, it returns the empty sequence.

Rule 7 simply rewrites a where expression by an if - then - else expression.

Rule  8 rewrites empty(e) as a match expression that returns true if the type of e is the empty sequence.

Rule 9 rewrites cast e : t as a match expression that returns the value of e with type t, if it has that type at runtime, otherwise it raises an error.

Rule 10 rewrites the let expression with a type as a let expression without a type by moving the type constraint into the expression.


e / Wild => for v1 in e do
  for v2 in nodes(v1) do
      match v2 case v3 : Wild[AnyComplexType] do v3 else () (1)
 
e / @Wild => for v1 in e do
  for v2 in nodes(v1) do
      match v2 case v3 : @Wild[AnySimpleType] do v3 else () (2)
 
e / data() => for v in nodes(e) do
match v
    case v1 : AnySimpleType do v1
    else ()
(3)
 
e / processing_instruction() => for v in nodes(e) do
match v
    case v1 : PI do v1
    else ()
(4)
 
e / comment() => for v in nodes(e) do
match v
    case v1 : Comment do v1
    else ()
(5)
 
e / deref() => for v in nodes(e) do
match v
    case v1 : ref(AnyTree) do deref(v1)
    else ()
(6)
 
where e1 do e2 => if e1 then e2 else () (7)
 
empty(e) => match e case v : () do true else false (8)
 
cast e : t => match e case v : t do v else error (9)
 
let v : Type = e1 do e2 => let v = (e1: Type) do e2 (10)


Figure 10: Definitions

A.3 Laws

In this section, we describe some laws that hold for the Algebra. These laws are important for defining rules that simplify algebraic expressions, such as eliminating unnecessary for or match expressions.

The iteration construct of the Algebra is closely related to an important mathematical object called a monad. A monad, among other things, generalizes set, bag, and list types. In functional languages, the comprehension construct is used to express iteration over set, bag, and list types [BNTW95], [LW97]. A comprehension corresponds directly to a monad [Wad92], [Wad93], [Wad95].

The correspondence between the Algebra's iteration construct and a monad is close, but not exact. Each monad is based on a unary type constructor, such as Set{t} or List{t}, representing a homogenous set or list where all elements are of type t. In the Algebra, we have more complex and heterogenous types, such as a forest consisting of a title, a year, and a sequence of one or more authors. Also, one important component of a monad is the unit operator, which converts an element to a set or list. If x has type t, then set{x} is a unit set of type Set{t} or [x] is a unit list of type List{t}. In the Algebra, we simply write, say, author["Buneman"], which stands for both a tree and for the unit forest containing that tree.

Monads satisfy three laws, and three corresponding laws are satisfied by the the Algebra's for expression.

First, iteration over a unit forest can be replaced by substition. This is called the left unit law.

for v in e1 do e2 = e2{v := e1}

provided that e1 is a unit type (e.g., is an element or a scalar constant). We write e2{v := e1} to denote the result of taking expression e2 and replacing occurrences of the variable v by the expression e1. For example,

for v in author["Buneman"] do auth[v/data()] = auth["Buneman"]

The iteration over a forest of one item can always be eliminated using variable substitution.

Second, an iteration that returns the iteration variable is equivalent to the identity. This is called the right unit law.

for v in e do v = e

For example

for v in book0 do v = book0

An important feature of the type system described here is that the left side of the above equation always has the same type as the right side.

Third, there are two ways of writing an iteration over an iteration, both of which are equivalent. This is called the associative law.

for v2 in (for v1 in e1 do e2) do e3
= for v1 in e1 do (for v2 in e2 do e3)

For example, a projection over a forest includes an implicit iteration, so e / a = for v in e do v / a. Say we define a forest of bibliographies, bib1 = bib0, bib0. Then bib1/book/author is equivalent to the first expression below, which in turn is equivalent to the second.

for b in (for a in bib1 do a/book) do b/author
= for a in bib1 do (for b in a/book do b/author)

With nested relational algebra, the monad laws play a key role in optimizing queries. Similarly, the monad laws can also be exploited for optimization in this context.

For example, if b is a book, the following finds all authors of the book that are not Buneman:

  for a in b do
    where a/data() != Buneman do
      a

If l is a list of authors, the following renames all author elements to auth elements:

  for a' in l do
    auth[ a'/data() ]

Combining these, we select all authors that are not Buneman, and rename the elements:

  for a' in (for a in b do
    	       where a/data() != Buneman do
    		 a) do
    auth[ a'/data() ]

Applying the associative law for a monad, we get:

  for a in b do
    for a' in (where a/data() != Buneman do a) do
      auth[ a'/data() ]

Expanding the where clause to a conditional, we get:

  for a in b do
    for a' in (if a/data() != Buneman then a else ()) do
      auth[ a'/data() ]

Applying a standard law for distributing loops over conditionals gives:

  for a in b do
    if a/data() != Buneman then
      for a' in a do
        auth[ a'/data() ]
    else ()

Applying the left unit law for a monad, we get:

  for a in b do
    if a/data() != Buneman then
      auth[ a/data() ]
    else ()

And replacing the conditional by a where clause, we get:

  for a in b do
    where a/data() != Buneman do
      auth[ a/data() ]

Thus, simple manipulations, including the monad laws, fuse the two loops.

[A.1 Relating projection to iteration] ended with two examples of simplification. Returning to these, we can now see that the simplifications are achieved by application of the left unit and associative monad laws.

[Figure 11]  contains a dozen algebraic simplification laws. In a relational query engine, algebraic simplifications are often applied by a query optimizer before a physical execution plan is generated; algebraic simplification can often reduce the size of the intermediate results computed by a query evaluator. The purpose of our laws is similar -- they eliminate unnecessary for or match expressions, or they enable other optimizations by reordering or distributing computations. The set of laws given is suggestive, rather than complete.


E ::= if [] then e1 else e2
  | let v = [] in e
  | for v in [] do e
  | match []
  case v : t do e
  ...
  case v : t do e
  else e
 
for v in () do e => () (8)
 
for v in (e1, e2) do e3 => (for v in e1 do e3), (for v in e2 do e3) (9)
 
for v in e1 do e2 => e2{e1/ v},  if e1 : u (10)
 
 
for v in e do v => e (11)
 
e1 : {t1}, e2 : {t2}, v1 free in e2
for v1 in e1 do for v2 in e2 do e3
=> for v2 in e2 do for v1 in e1 do e3 (12)
E[ if e1 then e2 else e3] => if e1 then E[e2] else E[e3] (13)
 
E[ let v = e1 do e2] => let v = e1 do E[e2] (14)
 
E[ for v in e1 do e2] => for v in e1 do E[e2] (15)
 
E[ match e1
case v : t do e2
casev : t do en-1
...
else en ]
=>
match e1
case v : t do E[ e2 ]
...
case v : t do E[ en-1 ]
else E[en ]
(16)
 


Figure 11: Optimization Laws

Rules 8, 9, and 10 simplify iterations. Rule 8 rewrites an iteration over the empty sequence as the empty sequence. Rule 9 distributes iteration through sequence: iterating over the sequence (e1, e2) is equivalent to the sequence of two iterations, one over e1 and one over e2. Rule 10 eliminates an iteration over a single element or scalar. If e1 is a unit type, then e1 can be substituted for occurrences of v in e2.

Ed. Note: MF (Oct-18-2000) The rules for eliminating trivial match expressions need to be written. They are more complex than those for the old case expressions.

Rule 11 eliminates an iteration when the result expression is simply the iteration variable v.

Rule 12 commutes a nested iteration over unordered forests. This equivalence does not hold for so called dependent joins, where the outer iteration variable v1 is bound in the inner expression e2.

Rules 13--16 are used to commute expressions. Each rule actually abbreviates a number of other rules, since the context variable E stands for a number of different expressions. The notation E[e] stands for one of the four expressions given with expression e replacing the hole [] that appears in each of the alternatives. For instance, one of the expansions of Rule 15 is the following, when e is taken to be for v in [] do e :

for v2 in (for v1 in e1 do e2) do e3 => for v1 in e1 do (for v2 in e2 do e3)

B Issues

B.1 Introduction

The issues in [B.2 Issues list] serve as a design history for this document. The ordering of issues is irrelevant. Each issue has a unique id of the form Issue-<dddd> (where d is a digit). This can be used for referring to the issue by <url-of-this-document>#Issue-<dddd>. Furthermore, each issue has a mnemonic header, a date, an optional description, and an optional resolution. For convenience, resolved issues are displayed in green. Some of the issues contain references to W3C internal archives. These are marked with "W3C-members only". Some of the descriptions of the resolved issues are obsolete w.r.t. to the current version of the document.

Ed. Note: PF (Aug-05-2000): For the sake of archival, there are some duplicate issues raised in multiple instances. Duplicate issues are marked as "resolved" with reference to the representative issue.

B.2 Issues list

Unless stated explicitly otherwise, [Issue-0001:  Attributes] through [Issue-0039:  Dereferencing semantics]  have been raised in http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Jul/0142.html (W3C-members only).

Issue-0001:  Attributes

Date: Jul-26-2000

One example of the need for support of [Issue-0049:  Unordered Collections], but also: Attributes need to be constrained to contain white space separated lists of simple types only.

Attributes are represented by @attribute-name[content]. See [3.4 Types : abstract syntax], [3.5 Types : concrete syntax] for the constraint on white space seperated lists of simple types, and [3.6 Expressions] for selecting and constructing attributes.

Issue-0002:  Namespaces

Date: Jul-26-2000

Namespaces are represented by {uri-of-namespace}localname. See [3.1 Expanded names] and [3.2 Wildcards].

Issue-0003:  Global Order

Date: Jul-26-2000

The data model and algebra do not define a global order on documents. Querying global order is often required in document-oriented queries.

See the thread starting at http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Sep/0179.html (W3C-members only).

Resolved by adding < operator defined on nodes in same document. See [2.12 Querying order]. See [Issue-0079:  Global order between nodes in different documents] for order between nodes in different documents.

Issue-0004:  References vs containment

Date: Jul-26-2000

The query-algebra datamodel currently does not explicitly model children-elements by references (other than the XML-Query Datamodel. This facilitates presentation, but may be an oversimplification with regard to [Issue-0005:  Element identity].

This issue is resolved by subsumption as follows: (1) As [Figure ] points out, all child-elements are (implicit) references to nodes. (2) Thus, having resolved [Issue-0005:  Element identity] this issue is resolved too.

Issue-0005:  Element identity

Date: Jul-26-2000

Do expressions preserve element identity or don't they? And does "=" and distinct use comparison by reference or comparison by value?

The first part of the question has been resolved by resolution of [Issue-0010:  Construct values by copy]. The second part raises a more specific issue [Issue-0066:  Shallow or Deep Equality?].

Issue-0006:  Source and join syntax instead of "for"

Date: Jul-26-2000

Another term for "source and join syntax" is "comprehension". See [A Equivalences] for a discussion of the relationship between iteration by for and comprehension syntax.

This issue is resolved by subsumption under [Issue-0021:  Syntax]. List comprehension is a syntactic alternative to "for v in e1 do e2", which has been favored by the WG in the resolution of [Issue-0021:  Syntax].

Issue-0007:  References: IDREFS, Keyrefs, Joins

Date: Jul-26-2000

Currently, the Algebra does not support reference values, such as IDREF, or Keyref (not to be mixed up with "node-references" - see [Issue-0005:  Element identity], which are defined in the XML Query Data Model. The Algebra's type system should be extended to support reference types and the data model operators ref, and deref should be supported (similar to id() in XPath).

Issue-0008:  Fixed point operator or recursive functions

Date: Jul-26-2000

It may be useful to add a fixed-point operator, which can be used in lieu of recursive functions to compute, for example, the transitive closure of a collection.

Currently, the Algebra does not guarantee termination of recursive expressions. In order to ensure termination, we might require that a recursive function take one argument that is a singleton element, and any recursive invocation should be on a descendant of that element; since any element has a finite number of descendants, this avoids infinite regress. (Ideally, we should have a simple syntactic rule that enforces this restriction, but we have not yet devised such a rule.)

Impacts optimization; hard to do static type inference; current algebra is first-order

See for the subproblem of typing "//" or desc() http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Sep/0187.html (W3C-members only).

Issue-0009:  Externally defined functions

Date: Jul-26-2000

There is no explicit support for externally defined functions.

The set of built-in functions may be extended to support other important operators.

See also http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Sep/0003.html (W3C-members only).

Issue-0010:  Construct values by copy

Date: Jul-26-2000

Need to be able to construct new types from bits of old types by reference and by copy. Related to [Issue-0005:  Element identity].

The WG wishes to support both: construction of values by copy, as well as references to original nodes ( http://www.w3.org/XML/Group/2000/09/ql/unedited-minutes-day1.txt (W3C-members only) (W3C-members only)) See also http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Sep/0155.html (W3C-members only) (W3C-members only). This needs some further investigation to sort out all technical difficulties (see [Issue-0062:  Open questions for constructing elements by reference]) so the change has not yet been reflected in the Algebra document.

Issue-0011:  XPath tumbler syntax instead of index?

Date: Jul-26-2000

XPath provides as a shorthand syntax [integer] to select child-elements by their position on the sibling axes, whereas the xml-query algebra uses a combination of a built-in function index() and iteration. See http://lists.w3.org/Archives/Member/w3c-archive/2000Sep/0168.html (W3C-members only) for a suggestion to to support indexed iteration in the form "for v sub i in e1 do e2", and to express index() as a function (or macro).

Addendum by JS (submitted by MF) Dec 19/2000: The typing of index is lossy : it produces a factored type. Jerome suggests the more precise range operator:

e : q{m,n}   n' - (m'-1) = r  m' >= m   n' <= n
-----------------------------------------------
            range(e;m';n') : q{r,r}

nth(e;n) == range(e;n;n)
The range operator takes a repetition of prime types and those values in the range m' to n'; if the repetition does not include that range, a run-time error is raised. The range and nth operators could also be defined in terms of head and tail and polymorphic recursive functions. In the absence of parameteric polymorphism, it is not possible to define range and nth with precise types.

Here are Peter's rules:

e : p{m,n}     n!=* 
-------------------------------------------------
range(e;m';n') : p{n'-max(m,m')+1,min(n',n)-m'+1}

For example:
let v1 = a[]{2,4}

range(v1;3;3): a[]{1,1}
range(v1;1;3): a[]{2,3}
range(v1;3;5): a[]{1,2}
range(v1;1;5): a[]{2,4}

e : p{m,*} 
-----------------------------
range(e;m';n') : p{0,n'-m'+1}

let v2 = a[]{0,*}

range(v2;1;3): a[]{0,2}

this follows the typical semantics for head() and tail():
head(()) = tail(()) = ()
and the semantics behind
range(e;m',n') = tail o ...(m' times)   ... o tail o head,
                 tail o ...(m'+1 times) ... o tail o head,
                 ...
                 tail o ...(n' times)   ... o tail o head

I would have no troubles in restricting ourselves to nth() instead of range() in the algebra (range can always be enumerated by nth()). Furthermore, we should consider whether m',n' can be computed numbers.

Issue-0012:  GroupBy - needs second order functions?

Date: Jul-26-2000

The type system is currently first order: it does not support function types nor higher-order functions. Higher-order functions are useful for specifying, for example, sorting and grouping operators, which take other functions as arguments.

The WG has decided to express groupBy by a combination of for and distinct (see also http://www.w3.org/XML/Group/2000/09/ql/unedited-minutes-day1.txt (W3C-members only) (W3C-members only) and [Issue-0042:  GroupBy]):. Thus w.r.t. to GroupBy this Issue is resolved. Because GroupBy is not the only use case for higher order functions, a new issue [Issue-0063:  Do we need (user defined) higher order functions?] is raised.

Issue-0013:  Collations

Date: Jul-26-2000

Collations identify the ordering to be applied for sorting strings. Currently, it is considered to have an (optional parameter) collation "name" as follows: "SORT variable IN exp BY +(expression {ASCENDING|DESCENDING} {COLLATION name}) (see http://www.w3.org/XML/Group/2000/09/ql/unedited-minutes-day1.txt (W3C-members only) (W3C-members only)). An alternative would be to model a collation as a simple type derived from string, and use type-level casting, i.e. expression :collationtype (which is already supported in the XML Query Algebra), for specifying the collation. That would make: "SORT variable IN exp BY +(expression:collationname {ASCENDING|DESCENDING}). But that requires some support from XML-Schema.

More generally, collations are important for any operator in the Algebra that involves string comparison, among them: sort, distinct, "=" and "<".

Issue-0014:  Polymorphic types

Date: Jul-26-2000

The type system is currently monomorphic: it does not permit the definition of a function over generalized types. Polymorphic functions are useful for factoring equivalent functions, each of which operate on a fixed type.

The current type system has already a built-in polymorphic type (lists) and is likely to have more (unordered collections). The question is, whether to allow for user-defined polymorphic types and user defined polymorphic functions.

See also thread around http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Sep/0111.html (W3C-members only) (W3C-members only).

Issue-0015:  3-valued logic to support NULLs

Date: Jul-26-2000

Issue-0016:  Mixed content

Date: Jul-26-2000

The XML-Query Algebra allows to generate elements with an arbitrary mixture of data (of simple type) and elements. XML-Schema only allows for a combination of strings interspersed with elements (aka mixed content). We need to figure out whether and how to constrain the XML-Query Algebra accordingly (e.g. by typing rules?)

The type system has been extended to support the interleaving operator & - see [3.4 Types : abstract syntax]. Mixed content is defined in terms of &.

Issue-0017:  Unordered content

Date: Jul-26-2000

All-groups in XML-Schema, not to be mixed up with [Issue-0049:  Unordered Collections]

The type system has been extended with the support of all-groups - see [3.4 Types : abstract syntax].

Issue-0018:  Align algebra types with schema

Date: Jul-26-2000

The Algebra's internal type system is the type system of XDuce. A potentially significant problem is that the Algebra's types may lose information when converted into XML Schema types, for example, when a result is serialized into an XML document and XML Schema.

James Clark points out : "The definition of AnyComplexType doesn't match the concrete syntax for types since it applies unbounded repetition to AnyTree and one alternative for AnyTree is AnyAttribute." This is another example of an alignment issue.

This issue comprises also issues [Issue-0016:  Mixed content], [Issue-0017:  Unordered content], [Issue-0053:  Global vs. local elements], [Issue-0054:  Global vs. local complex types], [Issue-0019:  Support derived types], substitution groups.

Issue-0019:  Support derived types

Date: Jul-26-2000

The current type system does not support user defined type hierarchies (by extension or by restriction).

Issue-0020:  Structural vs. name equivalence

Date: Jul-26-2000

The subtyping rules in [4.1 Relating data to types] only define structural subtyping. We need to extend this with support for subtyping via user defined type hierarchies - this is related to [Issue-0019:  Support derived types].

Issue-0021:  Syntax

Date: Jul-26-2000

(e.g. for.<-.in vs for.in.do)

The WG has voted for several syntax changes (see also http://www.w3.org/XML/Group/2000/09/ql/unedited-minutes-day1.txt(W3C-members only) (W3C-members only), [3.6 Expressions]: "for v in e do e", "let v = e do", "sort v in e by e ...", "distinct", "match case v:t e ... else e".

Issue-0022:  Indentation, Whitespaces

Date: Jul-26-2000

Is indentation significant?

The WG has consensus that indentation is not significant (see http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Aug/0043.html (W3C-members only) (W3C-members only)), i.e., all documents are white space normalized.

Issue-0023:  Catch exceptions and process in algebra?

Date: Jul-26-2000

Does the Algebra give explicit support for catching exceptions and processing them?

Subsumed by new issue [Issue-0064:  Error code handling in Query Algebra].

Issue-0024:  Value for empty forests

Date: Jul-26-2000

What does "value" do with empty forests?

The definition of value(e) has changed to:

value(e) = match children(e)
                     case v: AnyScalar do v
                     else()

Furthermore, the typing rules for "for v in e1 do e2" have been changed such that the variable v is typed-checked seperately for each unit-type occuring in expression e1.

Consequently the example in http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Jul/0138.html (W3C-members only) (W3C-members only) would be typed as follows:

query for b in b0/book do
      value(b/year): Integer{0,*}

rather than leading to an error.

Issue-0025:  Treatment of empty results at type level

Date: Jul-26-2000

According to http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Jul/0138.html(W3C-members only) (W3C-members only) this is closely related to [Issue-0024:  Value for empty forests].

Resolved by resolution of [Issue-0025:  Treatment of empty results at type level].

Issue-0026:  Project - one tag only

Date: Jul-26-2000

Project is only parameterized by one tag. How can we translate a0/(b | c)?

With the new syntax (and type system) a0/(b | c) can be translated to "for v in a0 do match case v1:b[AnyType] do v1 case v2:c[AnyType] do c else ()" - see also [A.1 Relating projection to iteration].

Issue-0027:  Case syntax

Date: Jul-26-2000

N-ary case can be realized by nested binary cases. For design alternatives of case see: http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Aug/0017.html (W3C-members only) (W3C-members only)

New (n-ary) case syntax is introduced in [3.6 Expressions].

Issue-0028:  Fusion

Date: Jul-26-2000

Does the Algebra support fusion as introduced by query languages such as LOREL? This is related to [Issue-0005:  Element identity], because fusion only makes sense with support of element identity.

Issue-0029:  Views

Date: Jul-26-2000

One of the problems in views: Can we undeclare/hide things in environment? For example, if we support element-identity, can we explicitly discard a parent, and/or children from an element in the result-set? Related to [Issue-0005:  Element identity]. See also description in http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Sep/0047.html (W3C-members only) (W3C-members only).

Issue-0030:  Automatic type coercion

Date: Jul-26-2000

What do we do if a value does not have a type or a different type from what is required? See also thread around http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Oct/0071.html (W3C-members only) (W3C-members only). This link also contains a recommendation, which has been agreed as the general direction to go in http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Oct/0196.html (W3C-members only) (W3C-members only):

We believe that the XML Query Language should specify default type coercions for mixed mode arithmetic should be performed according to a fixed precedence hierarchy of types, specifically integer to fixed decimal, fixed decimal to float, float to double. This policy has the advantage of simplicity, tradition, and static type inference. Programmers could explicitly specify alternative type coercions when desirable.

Issue-0031:  Recursive functions

Date: Jul-26-2000

subsumed by [Issue-0008:  Fixed point operator or recursive functions]

Issue-0032:  Full regular path expressions

Date: Jul-26-2000

Full regular path expressions allow to constrain recursive navigation along paths by means of regular expressions, e.g. a/b*/c denotes all paths starting with an a, proceeding with arbitrarily many b's and ending in a c. Currently the XML-Query Algebra can express this by means of (structurally) recursive functions. An alternative may be the introduction of a fixpoint operator [Issue-0008:  Fixed point operator or recursive functions].

Issue-0033:  Metadata Queries

Date: Jul-26-2000

Metadata queries are queries that require runtime access to type information. See also discussion starting at http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Oct/0087.html (W3C-members only) (W3C-members only).

Issue-0034:  Fusion

Date: Jul-26-2000

Identical with [Issue-0028:  Fusion]

Issue-0035:  Exception handling

Date: Jul-26-2000

Subsumed by [Issue-0023:  Catch exceptions and process in algebra?] and [Issue-0064:  Error code handling in Query Algebra].

Issue-0036:  Global-order based operators

Date: Jul-26-2000

Subsumed by [Issue-0003:  Global Order]

Issue-0037:  Copy vs identity semantics

Date: Jul-26-2000

subsumed by [Issue-0005:  Element identity]

Issue-0038:  Copy by reachability

Date: Jul-26-2000

Is it possible to copy children as well as IDREFs, Links, etc.? Related to [Issue-0005:  Element identity] and [Issue-0008:  Fixed point operator or recursive functions]

Resolved by addition of "deep" copy operator in [XML Query Data Model].

Issue-0039:  Dereferencing semantics

Date: Jul-26-2000

subsumed by [Issue-0005:  Element identity]

[Issue-0040:  Case Syntax] through [Issue-0047:  Attributes]  are raised in http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Aug/0010.html (W3C-members only) (W3C-members only)

Issue-0040:  Case Syntax

Date: Aug-01-2000

We suggest that the syntax for "case" be made more regular. At present, it takes only two branches, the first labelled with a tag-name and the second labelled with a variable. A more traditional syntax for "case" would have multiple branches and label them in a uniform way. If the algebra is intended only for semantic specification, "case" may not even be necessary.

subsumed by [Issue-0027:  Case syntax]

Issue-0041:  Sorting

Date: Aug-01-2000

We are not happy about the three-step sorting process in the Algebra. We would prefer a one-step sorting operator such as the one illustrated below, which handles multiple sort keys and mixed sorting directions: SORT emp <- employees BY emp/deptno ASCENDING emp/salary DESCENDING

The WG has decided to go for the above syntax, with an (optional) indication of COLLATION. (see http://www.w3.org/XML/Group/2000/09/ql/unedited-minutes-day1.txt (W3C-members only) (W3C-members only), [2.13 Sorting]).

Issue-0042:  GroupBy

Date: Aug-01-2000

We do not think the algebra needs an explicit grouping operator. Quilt and other high-level languages perform grouping by nested iteration. The algebra can do the same.

related to [Issue-0012:  GroupBy - needs second order functions?]

The WG has decided (see http://www.w3.org/XML/Group/2000/09/ql/unedited-minutes-day1.txt (W3C-members only) (W3C-members only)) to skip groupBy for the time being (see also revised [2.11 Restructuring and grouping] and raise [Issue-0069:  Organization of Document] for a possible future revision of this resolution.

Issue-0043:  Recursive Descent for XPath

Date: Aug-01-2000

The very important XPath operator "//" is supported in the Algebra only by writing a recursive function. This is adequate for a semantic specification, but if the Algebra is intended as an optimizable target language it will need better support for "//" (possibly in the form of a fix-point operator.)

Resolved by subsumption under [Issue-0043:  Recursive Descent for XPath] (see http://www.w3.org/XML/Group/2000/09/ql/unedited-minutes-day1.txt(W3C-members only) (W3C-members only)).

Issue-0044:  Keys and IDREF

Date: Aug-01-2000

We think the algebra needs some facility for dereferencing keys and IDREFs (exploiting information in the schema.)

Subsumed by [Issue-0007:  References: IDREFS, Keyrefs, Joins]

Issue-0045:  Global Order

Date: Aug-01-2000

We are concerned about absence of support for operators based on global document ordering such as BEFORE and AFTER.

subsumed by [Issue-0003:  Global Order]

Issue-0046:  FOR Syntax

Date: Aug-01-2000

We agree with comments made in the face-to-face meeting about the aesthetics of the Algebra's syntax for iteration. For example, the following syntax is relatively easy to understand: FOR x IN some_expr EVAL f(x) whereas we find the current algebra equivalent to be confusing and misleading: FOR x <- some_expr IN f(x) This syntax appears to assign the result of some_expr to variable x, and uses the word IN in a non-intuitive way.

subsumed by [Issue-0021:  Syntax]

Issue-0047:  Attributes

Date: Aug-01-2000

See [Issue-0001:  Attributes].

subsumed by [Issue-0001:  Attributes]

[Issue-0048:  Explicit Type Declarations] through [Issue-0050:  Recursive Descent for XPath]  are raised in http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Jul/0148.html (W3C-members only) (W3C-members only)

Issue-0048:  Explicit Type Declarations

Date: Jul-27-2000

Type Declaration for the results of a query: The issue is whether to auto construct the result type from a query or to pre-declare the type of the result from a query and check for correct type on the return value. Suggestion: Support for pre-declared result data type and as well as to coerce the output to a new type is desirable. Runtime or compile time type checking is to be resolved? Once you attach a name to a type, it is preserved during the query processing.

W.r.t. compile time type casts this is already possible with e:t (see [3.6 Expressions]). For run-time casts an issue has been raised in [Issue-0062:  Open questions for constructing elements by reference].

Issue-0049:  Unordered Collections

Date: Jul-27-2000

Currently, all forests in the data model are ordered. It may be useful to have unordered forests. The distinct_node function, for example, produces an inherently unordered forest. Unordered forests can benefit from many optimizations for the relational algebra, such as commutable joins.

Handling of collection of attributes is easy but the collection of elements is complex due to complex type support for the elements. It makes sense to allow casting from unordered to ordered collection and vice versa. It is not clear whether the new ordered or unordered collection is a new type or not. It affects function resolution, optimization.

See also thread around http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Oct/0135.html (W3C-members only) (W3C-members only).

Our request to Schema to represent insignificance of ordering at schema level has not been fulfilled - see http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Sep/0136.html (W3C-members only) (W3C-members only). Thus we need to be aware that this information may get lost, when mapping to schema.

Unordered collections are described by {t} see [3.4 Types : abstract syntax], some operators (sort, distinct_node, for, and sequence) are overloaded, and some operators (difference, intersection) are added). A new issue [Issue-0076:  Unordered types] is raised.

Issue-0050:  Recursive Descent for XPath

Date: Jul-27-2000

Suggestion: The group likes to add a support for fixed-point operator in the query language that will allow us to express the semantics of the // operator in an xpath expression. A path expression of the form a//b may be represented by a fixed-point operator fp(a, "/.")/b.

subsumed by [Issue-0043:  Recursive Descent for XPath]

Issue-0051:  Project redundant?

Date: Aug-05-2000

It appears that project a e could be reduced to sth. like

for v <- e in
  case  v of a[v1] => a[v1]
           | v2 => ()

... or would that generate a less precise type?

With the new type system and handling of the for operator, project is indeed redundant. See [A.1 Relating projection to iteration].

Issue-0052:  Axes of XPath

Date: Aug-05-2000

The current algebra makes navigation to parents difficult to impossible. With support of Element Identity [Issue-0005:  Element identity] and recursive functions [Issue-0008:  Fixed point operator or recursive functions] one can express parent() by a recursive function via the document root. More direct support needs to be investigated w.r.t its effect on the type system.

The WG wishes to support a built-in operator parent() (see http://www.w3.org/XML/Group/2000/09/ql/unedited-minutes-day1.txt (W3C-members only) (W3C-members only)). For the current state of affairs see thread around http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Sep/0074.html (W3C-members only) (W3C-members only). For some use-cases see http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Oct/0011.html (W3C-members only) (W3C-members only).

Issue-0053:  Global vs. local elements

Date: Aug-05-2000

The current type system cannot represent global element-declarations of XML-Schema. All element declarations are local.

Issue-0054:  Global vs. local complex types

Date: Aug-05-2000

The current type system does not distinguish between global and local types as XML-Schema does. All types appear to be fully nested (i.e. local types)

Issue-0055:  Types with non-wellformed instances

Date: Aug-05-2000

The type system and algebra allows for sequences of simple types, which can usually be not represented as a well-formed document. How shall we constrain this? Related to [Issue-0016:  Mixed content].

Issue-0056:  Operators on Simple Types

Date: Jul-15-2000

We intentionally did not define equality or relational operators on element and atomic type. These operators should be defined by consensus.

See also first designs for support of arithmetic operators http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Sep/0138.html (W3C-members only) (W3C-members only) and for support of operators for date/time http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Sep/0113.html (W3C-members only) (W3C-members only).

Ed. Note: MF, 15-Jan-2001 A joint task force on operators with members from the XSLT, XML Schema, and XML Query working groups is chartered to define arithmetic operators.

Issue-0057:  More precise type system; choice in path

Date: Aug-07-2000

(This subsumes [Issue-0051:  Project redundant?]). If the type system were more precise, then (project a e) could be replaced by:

for v <- e in
    case v of
      a[v1] => a[v1]
    | v2 => ()

One could also represent (e/(a|b)) directly in a similar style.

for v <- e in
    case v of
      a[v1] => a[v1]
    | v2 => case v2 of
	      b[v3] => b[v3]
            | v4 => ()

Currently, there is no way to represent (e/(a|b)) without loss of precision, so if we do not change the type system, we may need to have some way to represent (e/(a|b)) and similar terms without losing precision. (The LA team has a design for this more precise type system, but it is too large to fit in the margin of this web page!)

See resolution of [Issue-0051:  Project redundant?]

Issue-0058:  Downward Navigation only?

Date: Aug-07-2000

Related to [Issue-0052:  Axes of XPath]. The current type system (and the more precise system alluded to in [Issue-0057:  More precise type system; choice in path]) seems well suited for handling XPath children and descendant axes, but not parent, ancestor, sibling, preceding, or following axes. Is this limitation one we can live with?

Subsumed by [Issue-0052:  Axes of XPath]

Issue-0059:  Testing Subtyping

Date: Aug-07-2000

One operation required in the Algebra is to test whether XML type t1 is a subtype of XML type t2, indicated by writing t1 <: t2. There is a well-known algorithm for this, based on tree automata, which is a straightforward variant of the well-known algorithm for testing whether the language generated by one regular-expression is a subset of the language generated by another. (The algorithm involves generating deterministic automata for both regular expressions or types.)

However, the naive implementation of the algorithm for comparing XML types can be slow in practice, whereas the naive algorithm for regular expressions is tolerably fast. The only acceptably fast implementation of a comparison for XML types that the LA team knows of has been implemented by Haruo Hasoya, Jerome Voullion, and Benjamin Pierce at the University of Pennsylvania, for their implementation of Xduce. (Our implementation of the Algebra re-uses their code, with permission.)

So, should we adopt a simpler definition of subtyping which is easier to test? One possibility is to adopt the sibling restriction from Schema, which requires that any two elements which appear a siblings in the same content model must themselves have contents of the same type. Jerome Simeon and Philip Wadler discovered that adopting the sibling restriction reduces the problem of checking subtyping of XML types to that of checking regular languages for inclusion, so it may be worth adopting the restriction for that reason.

Issue-0060:  Internationalization aspects for strings

Date: Jun-26-2000

These issues are taken from the comments on the Requirements Document by I18N (http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Jun/0137.html (W3C-members only) (W3C-members only)).

Further information can be found at http://www.w3.org/TR/WD-charreq.

It is a goal of i18n that queries involving string matching ("select x where x='some_constant'") treat canonically equivalent strings (in the Unicode sense) as matching. If the query and the target are both XML, early normalization (as per the Character Model) is assumed and binary comparison ensures that the equivalence requirement is satisfied. However, if the target is originally a legacy database which logically has a layer that exports the data as XML, that XML must be exported in normalized form. The XML Query spec must impose the normalization requirement upon such layers.

Similarly, the query may come from a user-interface layer that creates the XML query. The XML Query spec must impose the normalization requirement upon such layers.

Provided that the query and the target are in normalized form C, the output of the query must itself be in normalized form C.

Queries involving string matching should support various kinds of loose matching (such as case-insensitivity, katakana-hiragana equivalence, accent-accentless equivalence, etc.)

If such features as case-insensitivity are present in queries involving string matching, these features must be properly internationalized (e.g. case folding works for accented letters) and language-dependence must be taken into account (e.g. Turkish dotless-i).

Queries involving character counting and indexing must take into account the Character Model. Specifically, they should follow Layer 3 (locale-independent graphemes). Additional details can be found in The Unicode Standard 3.0 and UTR#18. Queries involving word counting and indexing should similarly follow the recommendations in these references.

Issue-0061:  Model for References

Date: Aug-16-2000

Raised in: http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Aug/0063.html (W3C-members only) (W3C-members only). Related to a number of issues around [Issue-0005:  Element identity].

The following issues have been raised since Sep-25-2000.

Issue-0062:  Open questions for constructing elements by reference

Date: Sep-25-2000

(1) What is the value of parent() when constructing new elements with children refering to original nodes? See also discussion at http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Sep/0155.html (W3C-members only) (W3C-members only).

(2) Is an approach to either make copies for all children or provide references to all children, or should we allow for a more flexible combination of copies and references?

Operational semantics specifies that element node constructor creates copies of all its children. Addition of RefNode in [XML Query Data Model] supports explicit reference value.

Issue-0063:  Do we need (user defined) higher order functions?

Date: Oct-16-2000

The current XML-Query-Algebra does not allow functions to be parameters of another function - so called higher order functions. However, most of the Algebra operators are (built-in) higher functions, taking expressions as an argument ("sort", "for", "case" to name a few). Even a fixpoint operator, "fun f(x)=e, fix f(x) in e" (see also [Issue-0008:  Fixed point operator or recursive functions]), would be a built-in higher order function.

As agreed in http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Oct/0196.html (W3C-members only) (W3C-members only) the XML Query Algebra will not support user defined higher order functions. It does support a number of built-in higher order functions.

Issue-0064:  Error code handling in Query Algebra

Date: Oct-04-2000

How do we return an error code from a function defined in current Query algebra. Do we need to create an array (or a structure) to merge the return value and error code to do this. If that is true, it may be inefficient to implement. In order for cleaner and efficient implementation, it may be necessary to allow a function declaration to take a parameter of type "output" and allow it to return an error code as part of the function definition. See also thread starting with http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Oct/0051.html(W3C-members only) (W3C-members only).

One does not need to create a structure to combine return values with error codes, provided each operator or function /either/ returns a value /or/ raises an error. The XML-Query Algebra supports means to raise errors, but does not define standard means to catch errors. Raising errors is accomplished by the expression "error" of type Ø (empty choice). Because Ø | t = t, such runtype errors do not influence static typing. The surface syntax and/or detailed specification of operators on simple types (see [Issue-0056:  Operators on Simple Types]) may choose to differentiate errors into several error-codes.

Issue-0065:  Built-In GroupBy?

Date: Oct-16-2000

As discussed in http://www.w3.org/XML/Group/2000/09/ql/unedited-minutes-day1.txt (W3C-members only) (W3C-members only), we may revisit the resolution of [Issue-0042:  GroupBy] and reintroduce GroupBy along the lines of sort: "group v in e1 by [e2 {collation}]". One reason for this may be that this allows to use collation for deciding about the equality of strings.

In http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Oct/0196.html (W3C-members only) (W3C-members only) the WG has decided to close this issue, and for the time being not consider GroupBy as a built-in operator. Furthermore, [Issue-0013:  Collations] is ammended to deal with collations for all operators involving a comparison of strings.

Issue-0066:  Shallow or Deep Equality?

Date: Oct-16-2000

What is the meaning of "=" and "distinct"? Equality of references to nodes or deep equality of data?

[XML Query Data Model] defines "=" (value equality) and "==" (identity equality) operators. Description of distinct states that it uses "==".

Issue-0067:  Runtime Casts

Date: Sep-21-2000

In some contexts it may be desirable to cast values at runtime. Such runtime casts lead to an error if a value cannot be cast to a given type. See also http://www.w3.org/XML/Group/2000/09/ql/unedited-minutes-day1.txt (W3C-members only) (W3C-members only), where the Algebra team has been put in charge of introducing run-time casts into the Algebra.

cast e : t has been introduced as a reducible operator expressed in terms of match (see [A.2 Reducible expressions]).

Issue-0068:  Document Collections

Date: Oct-16-2000

Per our requirements document we are chartered to support document collections. The current XML-Query Algebra deals with single documents only. There are a number of subissues:

(a) Do we need a more elaborate notion of node-references? E.g. pair of (URI of root-node, local node-ref)

(b) Does the namespace mechanism suffice to type collections of nodes from different documents? Probably yes.

(c) Provided (a) and (b) can be settled, will the approach taken for [Issue-0049:  Unordered Collections] do the rest?

Issue-0069:  Organization of Document

Date: Oct-16-2000

The current document belongs more to the genre (scientific) paper than to the genre specification. One may consider the following modifications: (a) reorganize intro to give a short overview and then state the purpose (strongly typed, neutral syntax with formal semantics as a basis for possibly multiple syntaxes, etc.) (compared to version Aug-23, this version has already gone a good deal in this direction). (b) Equip various definitions and type rules with id's. (c) Elaborate appendices on mapping XML-Query-Algebra Model vs. XML-Query-Datamodel, XML-Query-Type System vs. XML-Schema-Type System. (d) Maybe add an appendix on use-case-solutions. The problem is of course: Part of this is a lot of work, and we may not achieve all for the first release.

At http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Oct/0196.html (W3C-members only) (W3C-members only) the WG decided to dispose of this issue. The current overall organization of the document is quite adequate, but of course editorial decisions will have to made all the time.

Issue-0070:  Stable vs. Unstable Sort/Distinct

Date: Oct-02-2000

Should sort (and distinct) be stable on ordered collections, i.e. lists, and unstable on unordered collections (see [Issue-0049:  Unordered Collections])? For more details see thread around http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Oct/0007.html (W3C-members only) (W3C-members only).

sort and distinct are stable on ordered collections, and unstable on unordered collections - see [4.7 Typing unordered forests].

Issue-0071:  Alignment with the XML Query Datamodel

Date: Sep-26-2000

Currently, the XML Query Algebra Datamodel does not model PI's and comments. For more details see thread starting with http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Sep/0167.html (W3C-members only) (W3C-members only).

Addition of operational semantics defines relationship of Algebra to Data Model.

Issue-0072:  Facet value access in Query Algebra

Date: Oct-04-2000

Each of the date-time data types have facet values as defined by the schema data types draft spec. This problem is general enough to be applied to other atomic data types.

The question is : Should we provide access to these facet values on an instance of a particular data types? If so, what type of access? My take is the facets are to be treated like read-only attributes of a data instance and one should have a read access to them. See also thread starting at http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Oct/0044.html (W3C-members only) (W3C-members only).

Issue-0073:  Facets for simple types and their role for typechecking

Date: Oct-16-2000

XML-Schema introduces a number of constraining facets http://www.w3.org/TR/xmlschema-2/#dc-facets for simple types (among them: length, pattern, enumeration, ...). We need to figure out whether and how to use these constraining facets for type-checking. See also thread starting at http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2000Oct/0146.html(W3C-members only) (W3C-members only).

Issue-0074:  Operational semantics for expressions

Date: Nov-16-2000

It is necessary to add an operational semantics that formally defines each operator in the Algebra.

Issue-0075:  Overloading user defined functions

Date: Nov-17-2000

User defined functions can not be overloaded in the XML Query Algebra, i.e., a function is exclusively identified by its name, and not by its signature. Should this restriction be relaxed and if so - to which extent?

Issue-0076:  Unordered types

Date: Dec-11-2000

Currently unorderedness is represented at type level by {t}, and some (built-in) operators are overloaded such they have different semantics (and potentially different return type) depending on their input type. An alternative is to not represent unorderedness at type level, but rather support unordered for, unordered (unstable) sort, unordered (unstable) distinct.

Issue-0077:  Interleaved repetition and closure

Date: Dec-12-2000

Regular Languages are closed w.r.t. to the interleaved product. However, they are not closed w.r.t. to interleaved repetition, which can (e.g) generate the 1 degree Dyck language D[1] = () | a D[1] b | D[1] D[1] = (a,b)^{0,*}, and more generally, any language that coordinates cardinalities of individual members from an alphabeth: E.g. (a ^ b)^{0,*} = all strings with equally many a's and b's. These are beyond regular languages. Should we thus try to do without interleaved repetition?

Issue-0078:  Generation of ambiguous types

Date: Dec-12-2000

Unambiguous content-models in XML 1.0 and XML Schema are not closed w.r.t. union. It appears that the XML Query-Algebra can generate result types which can not be transformed to an unambiguous content-model.

Issue-0079:  Global order between nodes in different documents

Date: Dec-16-2000

The global order operator < is defined on nodes in the same document, but not between nodes in different documents.

Issue-0080:  Typing of parent

Date: Dec-16-2000

Currently, the parent operator yields an imprecise type : AnyElement{0,1}. It might be possible to type parent more precisely, for example, by using the normalized names in MSL, which encode containment of types.

Issue-0081:  Lexical representation of Schema simple types

Date: Jan-17-2001

Schema simple types must be defined for the Algebra and XQuery.

Issue-0082:  Type and expression operator precedence

Date: Jan-17-2001

The precedence of expression operators and type operators is not defined.

Issue-0083:  Expressive power and complexity of match expression

Date: Jan-17-2001

When processing an XML document without schema information, i.e., the type of the document is AnyComplexType, then match expressions may be very expensive to evaluate:

  match x
  case t1 : AnyTree do 1
  case t2 : AnyTree{0,2} do 2
  case t3 : *[*[*[*[* ... [AnyAttribute] ]]]] do 3 
  else error
  
match itself is not the issue. The real problem is having very liberal type patterns. We could restrict the kinds of type patterns that we permit.

Issue-0084:  Execution model

Date: Jan-17-2001

Need prose describing execution model scenarios : interpretor vs. compile/runtime vs. translation into another query language. Explain relationship between static and dynamic semantics.

Issue-0085:  Semantics of Wildcard type

Date: Jan-17-2001

Cite: wildcard types cannot be implemented ( Section 2.12: Expanded names, paragraph 11 http://www.w3.org/XML/Group/2000/12/xmlquery-algebra20001204.html; critical, core) If x!y means any name in x except names in y, what does x!y!z mean? In general, how do ! and | operate (precedence, associativity)? Parentheses are required to force the desired grouping of these two operators. Also, what does x!* mean? (There's an infinite family of such examples.)

Issue-0086:  Syntactic rules

Date: Jan-17-2001

Need rules for specifying syntactic correctness of query: symbol spaces; variable def'ns precede uses; list of keywords, etc.

Issue-0087:  More examples of Joins

Date: Jan-17-2001

Cite: no join operator; wants example of many-to-many joins, inner join, left and full outer joins.

B.3 Alphabetic list of issues

B.3.1 Open Issues

Number: 40

[Issue-0015:  3-valued logic to support NULLs]
[Issue-0018:  Align algebra types with schema]
[Issue-0030:  Automatic type coercion]
[Issue-0052:  Axes of XPath]
[Issue-0013:  Collations]
[Issue-0068:  Document Collections]
[Issue-0084:  Execution model]
[Issue-0083:  Expressive power and complexity of match expression]
[Issue-0009:  Externally defined functions]
[Issue-0073:  Facets for simple types and their role for typechecking]
[Issue-0072:  Facet value access in Query Algebra]
[Issue-0008:  Fixed point operator or recursive functions]
[Issue-0032:  Full regular path expressions]
[Issue-0028:  Fusion]
[Issue-0078:  Generation of ambiguous types ]
[Issue-0079:  Global order between nodes in different documents]
[Issue-0054:  Global vs. local complex types]
[Issue-0053:  Global vs. local elements]
[Issue-0077:  Interleaved repetition and closure]
[Issue-0060:  Internationalization aspects for strings]
[Issue-0081:  Lexical representation of Schema simple types]
[Issue-0033:  Metadata Queries]
[Issue-0061:  Model for References]
[Issue-0087:  More examples of Joins]
[Issue-0074:  Operational semantics for expressions]
[Issue-0056:  Operators on Simple Types]
[Issue-0075:  Overloading user defined functions]
[Issue-0014:  Polymorphic types]
[Issue-0007:  References: IDREFS, Keyrefs, Joins]
[Issue-0085:  Semantics of Wildcard type]
[Issue-0020:  Structural vs. name equivalence]
[Issue-0019:  Support derived types]
[Issue-0086:  Syntactic rules]
[Issue-0059:  Testing Subtyping]
[Issue-0082:  Type and expression operator precedence]
[Issue-0055:  Types with non-wellformed instances]
[Issue-0080:  Typing of parent]
[Issue-0076:  Unordered types]
[Issue-0029:  Views]
[Issue-0011:  XPath tumbler syntax instead of index?]

B.3.2 Resolved (or redundant) Issues

Number: 47

[Issue-0071:  Alignment with the XML Query Datamodel]
[Issue-0001:  Attributes]
[Issue-0047:  Attributes]
[Issue-0065:  Built-In GroupBy?]
[Issue-0027:  Case syntax]
[Issue-0040:  Case Syntax]
[Issue-0023:  Catch exceptions and process in algebra?]
[Issue-0010:  Construct values by copy]
[Issue-0038:  Copy by reachability]
[Issue-0037:  Copy vs identity semantics]
[Issue-0039:  Dereferencing semantics]
[Issue-0063:  Do we need (user defined) higher order functions?]
[Issue-0058:  Downward Navigation only?]
[Issue-0005:  Element identity]
[Issue-0064:  Error code handling in Query Algebra]
[Issue-0035:  Exception handling]
[Issue-0048:  Explicit Type Declarations]
[Issue-0046:  FOR Syntax]
[Issue-0034:  Fusion]
[Issue-0003:  Global Order]
[Issue-0045:  Global Order]
[Issue-0036:  Global-order based operators]
[Issue-0042:  GroupBy]
[Issue-0012:  GroupBy - needs second order functions?]
[Issue-0022:  Indentation, Whitespaces]
[Issue-0044:  Keys and IDREF]
[Issue-0016:  Mixed content]
[Issue-0057:  More precise type system; choice in path]
[Issue-0002:  Namespaces]
[Issue-0062:  Open questions for constructing elements by reference]
[Issue-0069:  Organization of Document]
[Issue-0026:  Project - one tag only]
[Issue-0051:  Project redundant?]
[Issue-0043:  Recursive Descent for XPath]
[Issue-0050:  Recursive Descent for XPath]
[Issue-0031:  Recursive functions]
[Issue-0004:  References vs containment]
[Issue-0067:  Runtime Casts]
[Issue-0066:  Shallow or Deep Equality?]
[Issue-0041:  Sorting]
[Issue-0006:  Source and join syntax instead of "for"]
[Issue-0070:  Stable vs. Unstable Sort/Distinct]
[Issue-0021:  Syntax]
[Issue-0025:  Treatment of empty results at type level]
[Issue-0049:  Unordered Collections]
[Issue-0017:  Unordered content]
[Issue-0024:  Value for empty forests]