XQuery: A unified syntax for linking and querying general XML documents

By Steven J. DeRose, Chief Scientist, Inso Corporation and the Brown Univ. Scholarly Technology Group.

Written September 24, 1998. Last updated October 8, 1998

Introduction

This document proposes a query language syntax for XML documents, called ZQuery. Such a query language has quite different requirements than traditional languages; much more different than is commonly appreciated. Many past proposals have taken a basically relational query language (typically SQL), and modified it by the addition of a few constructs: typically a "contains" operator and some features for matching strings within text chunks against regexes, or against word-roots. Such features, while needed, are not enough. The hard problem arises because the most basic design principles of relational databses, do not hold for XML documents. These include:

Data is organized in a 3-level hierarchy: table, record, and field.
The order of records in a table, and that of fields in a record, is not part of the information, and may be modified or discarded by the database at any time.
Fields have no internal structure that the database knows about: they are atomic units, typically of very concrete types such as integer, date, and string (plus perhaps blobs that the database can maintain, though not understand.
Strings that occur in separate fields are separate.
There is no notion of "distance" between data objects.
The context or location of data is not generally significant (this relates to the issue of order and hierarchy): an identical tuple is still itself if you copy it (clearly not true for natural-language documents such as XML must be able to represent).
Linking is far more often part of the data, not part of the schema used to represent the data.

Query language for such data are designed to take good advantage of these characteristics, and derive much of their power from that. However, for exactly that reason they do not work well for queries on XML, where data is richly hierarchical, ordered, and filled to the brim with aggregate objects with much insternal structure: in keeping with the human-language objects they represent.

This document builds directly on my earlier Queries on Links and Hierarchies, which lays out the concrete requirements for operations to be supported in a full-powered XML query language, but provides no syntax. I consider that the correct order of design: If you have not laid out in detail what it is you plan to build, you cannot even tell whether your concrete plan makes sense or not.

The syntax here builds directly on XPointer, for several reasons.

First, if possible it would be nice to have one fewer syntax in the world: less implementation work, less teaching and learning work, fewer confusions, and so on. I did not, however, start this work with the assumption that XPointer and a full-blown query language must share syntax (though they obviously must share some semantics since they operate on the same kind of data objects).
Second, XPointer is specifically designed for XML, and contains a set of navigational-access operators that is clearly required for dealing with ordered trees, but is lacking in most other models that could be found (OQL does have a similar mechanims in terms of paths).
Third, XPointer has proven extremely reliable and robust over a long period. It is not widely known, but XPointer is very closely modeled on Text Encoding Initiative extended pointers, which were designed in an open, international body with representation from many fields. They have also been widely implemented; even some HyTime implementations in fact store all their links underneath, as TEI extended pointers, and translate in and out!
XPointer is so tiny a step from TEI extended pointers that we are justified in being confident that it will work equally well -- about all that has changed is the subsitution of other punctuation marks for space (since spaces do not fit well in URLs), and the deletion of a few of the fancier constructs. This puts Xpointer in the honorable and successful tradition of XML, which subsetted and very slightly enhanced an established, proven standard.
XPointer has been out there long enough (even though in Working Draft form), that several implementations have been done. Implementors have noted several minor errors in the prose of the working draft, but found no substantive problems in implementation, and no significant gaps in functionality for the purposes stated.

Datatypes

XQUery is aware of data objects of several atomic datatypes, like most other query languages. It also has a well-developed notion of aggregate types, specifically designed for XML, where documents and their portions are ordered hierarchies of typed objects, each with properties (XML calls the properties "attributes", a slightly different meaning of the term than in database theory!). All the datatypes used in XQuery are defined here, and given symbolic names for convenient later reference:

Atomic datatypes, along with their format when appearing as constants in XQuery, include:

Boolean values (B): represented by %T and %F.
Integer values (I): represented in various bases. Their ranges, or the size of their representation on any particular machine, is not normally of interest to the query language. However, in the interest of interoperability implementations are required to support at least the range of 32-bit 2's complement signed integers. The literal formats allowed for integer include:
- Decimal (999)
- Hexadecimal (x'DEADBEEF), with case of the x and a-f ignored.
- Octal (o'377), with, with case of the o ignored.
Integer ranges (I*): are a special aggregate, used mainly to pick particular subsets out of a set, or to compare an integer against a variety of acceptable values and/or ranges. They only occur as literals. It is easier to define them by example than by prose; the grammar is also available.... These are examples of acceptable integer ranges:
```
   {1}
   {1,3,5}
   {1, 3-5, 7, 9-11}
   {...1, 3-99, 100...}
```
As is surely obvious, one can easily specify quite complex choices if needed. In practice, most ranges will probably have only a single integer or range.
Real values (R): represented in scientific notation. Their ranges, or the size of their representation on any particular machine, is not normally of interest to the query language. However, in the interest of interoperability implementations are required to support at least the dynamic range of IEEE floating-point doubles. The literal format is that of typical programming langauges, in particular C: -3.1415926E+32 (with the exponent suffix and both signs optional, and the 'e' ignoring case distinctions.
Date (D): A calendar date, in the ISO 2014/R form: yyyy-mm-dd (we won't worry for now about the "year 10,000" problem). Conversion of dates that may occur in XML documents in other formats is outside the scope of this proposal, because such formats vary too widely to be incorporated in a manageable standard. Applications requiring other forms can add their own support for the particular forms they need at another level.
Clock Time (C): A clock time in the ISO ____ form: hh:mm:ss.nnnnn+5. XQuery applications must support times accurate to at least one millisecond. The optional second field, prefixed by a mandatory plus or minus, indicates time zones other than GMT (plus traveling East).
Date-Time (T): The combination of date and time is provide for convenience in comparing: if one wishes to compare absolute times (even as simple as email time-stamps), it is much easier to do so, than to compare dates and then, if equal, compare times in some complex conditional expression. A date-time literal has simply the syntax of a date, an at-sign "@", and a time.
Unicode strings (S): represented in any character encoding supported by XML. Literal strings always appear quoted, and may be quoted by single or double straight quotes, or by Unicode curly quotes. The means by which users or applications specify the encodings to be used in a given query, are beyond the scope of this proposal.
XML names (N): which are strings with syntactic constraints defined by XML, and used to refer to element types, attribute names and values, and other constructs defined by XML. Literal names are string but appear without being quoted. They can represent attribute names, element type names, or names of XQuery constructs; the syntax prevents ambiguity.

NOTE: DO we need special syntax to mark GIs, or date-times? It would be nice to be able to just drop a GI in bare as a predicate. It should be unambiguous to treat a name as a GI when it occurs in a Boolean context; but the grammar needs to be thoroughly checked to be sure there are no ambiguous cases. If there is some insoluble ambiguity, the easiest solution would be to add some prefix operator (maybe "*") to mark names used as GIs. The idea is to keep it very lightweight because GIs will likely be used so very often.

Data objects types include:

Node (ND): any node of the tree representing the structure of an XML document once parsed. The full detailas of what tree results from a given XML document are being finalized by DOM, the XML Information Set Working Group, and perhaps others. However, it is clear that the tree consists largely of an ordered n-ary tree. Each node of the tree may have attributes, and is of some major sub-type such as listed below. All these, however, are sub-types of node, and represent the locations of information in XML, not the information in isolation from its context. Most operations can apply equally to all sub-types of node.
- Element (EL): Only elements can have descendants other than strings, or have attributes (again, I use the term "attribute" strictly in the XML sense here, not the databse sense).Elements also have a type, named by an XML name, which receives privileged treatement in XQuery.
Processing instruction (PI): Contains only text. Some proposals have been made to reveal internal structure, such as the target name. Also, many PIs tend to look like a set of attribute assignments from an XML start-tag, and it has been proposed that in such cases the (pseudo-) attributes be accessible from the PI node in precisely the manner that true attributes are accessible from element nodes. This extension would require only trivial changes to this proposal.
Comment (CM): Contains only text. These may not occur at all, depending on the work of DOM and XML-IS.
Text (TX): Most XML documents eventually get down to Unicode text content at the bottom. This node-type represents such data. It differs from String in that it is a sub-type of node: a location, not just a sequence of bytes in isolation from where they occurred in a document context.

Finally, there is a special type for attributes, since they are ancillary to the basic ordered tree structure (they are not "children" or "ancestors" in the tree, and they are not ordered). This could be considered a sub-type of node, but it is probably treated better as a separate type, since it does not support many methods that all other nodes do, such as ordering comparisons; and lack some relationships that other nodes have, such as ancetors (they have an owning element, which is not quite the same thing). Thus the type:

Attribute (AT): This object hangs off of element nodes (only), and each one has a name by which it can be retrieved from the element node. There is by definition in XQuery, no ordering of attributes, and no way to enumerate all the attributes on a given element (this, of course, must be provided in appliations other than querying, and DOM and XML-Is should do so).

These types cover all the data that occurs in documents.

The type for a returned location (often called a search hit), is called a location. These include references to any node or attribute, or to any selection (a single contiguous span, typically denoted by its starting and ending point).

Locations arise from queries, and come in sets. Singleton locations do not figure directly in the query language, since any query returns a set even if the set is of size 0 or 1. Location sets are always ordered, in the order their members start in the document (thus, a node's parent element precedes it in this order. Nested location sets are not allowed; that is, a location set cannot contain another location set as a member (it can certainly contain members that refer to data objects that are themselves aggregates). Recall that locations and location sets are not data primitives; they are the results of queries.

XQuery provides this special type for the results of queries, which is broader than any of the other types. This is a very precisely designed extension to the traditional notion of closure in database theory, which is necessary for several reasons, among them:

The most obvious reason is simply that the objects returned from a query do not constitute an XML document. If you retrieve all the paragraphs, they do not have a single root element, and therefore are not XML.
If you tried to make a result into an XML document by adding an artificial root element, you have well-formed XML, but you have modified the data -- not something a query ought to do lightly. Also, the result is not likely to be valid, at least under the schema in which it originated.
XQuery must also be able to return things that are not database objects at all, such as the range (sometimes called "span") from one character to a later one not in the same element, or a group of elements that do not exhaust the contents of any common ancestor. This is so basic in working with documents that it seems invisible: but such odd structures are what users create any time they drag a selection with their mouse.

Basic XQuerying

An XQuery consists of a sequence of steps, using the "location terms" of XPointer. Each step selects nodes, either in absolute terms or based on the prior set (initially, relative to the root of the document(s) being queried). This iterative approach enables easy queries to be handled with very little overhead, while permitting arbitrarily complex queries to be specified clearly as well.

In effect, a step maps a prior location set (the "location source"), into a (generally different) location set. A step does this by applying some process to the input set; the particular processes available are built into XQuery. All of the XPointer location terms are identically available in XQuery; the most relevant ones here are:

id	selects the element with the given id	?
child	selects from among the direct children of the current location	/ (or better, simply use as the default)
descendant	selects from among all nodes in the subtree headed at the current location	//
ancestor	selects from among the containers of the current location	^
psibling	selects from among preceding siblings	<
fsibling	selects from among following siblings	>
preceding	selects from among all preceding nodes	<<
following	selects from among all following nodes	>>
attr	selects from among the attributes of an element	@
string	selects a range by matching a string	$
span	selects a range spanning two (embedded) XPointers

It may be useful to provide shorter names as alternatives, such as abbreviations or special characters, to specify each of these functions. With punctuation instead of names the dot may also be left out without ambiguity. However, the desire for brevity should be balanced by the need for clarity, implementability of the grammar, and by the limited set of characters permitted within URLs and the need for them for other purposes.

XQuery adds a few more location terms, and a few more ways to combine the existing ones.

The most usual operation a step does is to iterate over the members of the location source set, and for each one, either

evaluate a Boolean expression to determine whether to keep the member in the result set (called "filtering"), or
evaluate a location expression to generate any number of members for the result set (called "mapping".

This process is repeated for each step; steps are joined by ".", much like "path expressions" in OQL. This is because they typically traverse paths through the XML structure to a desired set of final data objects.

The initial location set is one or more documents. If more than one, they must be specified in some order, and that relative order will be maintained in any location sets that contain members from more than one document.

For example, in the document(s) queried this XQuery finds all FOOTNOTE elements on which the TYPE attribute has the value 'CITATION', and then iterates over them to find all their REF children):

   Descendant(FOOTNOTE & TYPE='CITATION').(REF)

Assume a schema in which an individual FOOTNOTE element may consolidate a number of references, not just one. In this context, a bare name in a Boolean context represents a test against the element type of the current candidate (as for FOOTNOTE above), while a bare name in a string context represents the value of the named attribute of the current candidate (as for TYPE above).

This is precisely XPointer syntax, except that it assumes approval of some syntax details already proposed: changing the sequence-of-parameters for expressing "and", to an actual operator (&); changing the default location term from "last used" to "child"; and moving the instance-number parameter out (the advantages of this are discussed below).

A location term determines what candidate nodes are to be examined at all, relative to each location source member. In this case, the use of Descendant indicates that (for each member of the location source) all nodes in the subtree are to be examined.

Obviously the predicate used to filter such candidates can be simple or arbitrarily complex; it is merely a Boolean expression. The simplest case is the empty predicate, "()", which counts as true and so returns all the candidates the location term type allows. For example, since "child" is the defailt location term, either these returns all children of the location source:

   child()
   ()

One particularly useful case is that an entire XQuery can be embedded: It is evaluated (like the rest of the predicate) for each candidate, and counts as true if it returns anything: that is, if the result set it produces is non-null. So this finds all SEC elements that have at least one descendant (however deep) with attribute LEVEL='SECRET':

   descendant(SEC & descendant(LEVEL='SECRET')

TECHNOID NOTE: The evaluation of such an embedded query is identical to what would happen if it appeared as the next step (that is, after the dot instead of inside the parenthesized predicate). However, the result when used within a predicate is only tested for whether it is empty. A smart implementation will therefore halt as soon as it finds even one result for an embedded sub-query.

The next query does the same thing as the first, but picks the first two REFs from each FOOTNOTE, and iterates over them to findany AUTHOR elements which each refers to (via XPointers).

   Descendant(FOOTNOTE & TYPE='CITATION').(REF){1-2}.link(role=AUTHOR)

Notice that the range specification is not separated from (REF) by a dot -- it is therefore part of the same step. The significance of this is that it applies during the evaluation of that step, and picks the first two REFs found for each member of the prior location source -- that is, it will attempt to find twice as many REFs as the number of FOOTNOTEs found in the prior step.

One might instead want to set an absolute maximum on the number of REFs to be found, regardless of how many or few FOOTNOTES it take to accumulate them. To do this simply delay the range until the next step, when all the REFs found are in a single location set:

   Descendant(FOOTNOTE & TYPE='CITATION').(REF).{1-2}

Nothing prevents doing both when that is the desired behavior.

Any XPointer location terms may be used in this manner. An embedded XPointer results in a location, not a string; but one can test the actual text content of that location merely by comparing the location against a string (an implicit cast). The content can be the concatenated text of an entire element, subtree, or span; the value of an attribute; the contents of a PI, comment, or other node. The resulting string can then be compared, matched, or otherwise tested to resolve the predicate. For example:

   Descendant(FOOTNOTE & TYPE = ancestor(CHAPTER).attr(TYPE))

Comparing two locations without casting to strings, determines whether they are indeed the very same location. To compare the content of two locations, they must be cast from locations to their content strings explicitly, using the "*" operator (by analogy to C). For example, to locate FOOTNOTEs where the TYPE of their AUTHOR child matches the AUTHOR of the containing CHAPTER, you could do this:

   Descendant(FOOTNOTE & (child(AUTHOR).attr(TYPE) =
*(ancestor(CHAPTER).attr(AUTHOR))

NOTE: Technically, a bare attribute name in a string comparison is merely shorthand for "*attr(name)". But it's awfully useful shorthand. Some might prefer "@name" because it makes the distinction between element types and attributes more obvious; this would not be objectionable.

NOTE: A means may also be desirable for comparing values across candidates: finding sequences of 2 (or n) FOOTNOTEs which all share the same TYPE, for example. I have not addressed this yet.

Links in queries

Following Abiteboul (1997), one can integrate links into queries; indeed failing to do this for an XMl query language seems absurd. XQuery expresses this by adding a new location term beyond those provided already by XPointer: link().

Link is just like other location terms in that, given a location source it generates a list of candidate locations. However, those locations are not related to the location source members by the "genetics" of an XML document, but rather by the XPointer linkages known to the document. One can think of this as just another kind of relationship. Thus, this XQuery does what the one above does, but for each of the REFs that is eventually found, all locations it links to are returned:

   Descendant(FOOTNOTE & TYPE='CITATION').(REF).{1-2}.link()

This example locates document portions based on non-link information, but then follows the link to determine what to return. Links can also be used in a different way, as part of predicates just like any other location terms. Probable uses for them would include filtering the links by role or other properties defined in XLink.

TECHNICAL NOTE: David Durand has correctly pointed out the need to have queries find either links per se, or data linked to. This distinction can be made either by using normal XQUery constructs to point to the nodes consistituting one or the other, or by building the distinction directly into the linking support (either by additional location terms for linking, or by a parameter).

For symmetry in the face of XLink's bidirectional linking capabilities, a location term linked() is added: it returns all those nodes that link to the location source member.

NOTE: The link() and linked() terms need some parameters in addition to a predicate. In particular, support is needed for indirect linkage (hopefully of limited levels), and for distinguishing cases where the link (at least in the reverse case) is partial: should one include nodes that link to part of the location source member, but not all of it? XQuery does not yet address this.

Operators and Expressions

As in any logical language, constants and data objects can be compared, and comparisons can be combined via logical operators and parentheses.

XQuery provides the normal comparison operators, <, <=, =, <>, >, and >=. These can be applied to integers, booleans, dates, times, and date-times, and return a Boolean value. They can also apply to nodes and to singleton locations, in which case they constitute a comparison of order (only the same actual node, or locations that refer to the same actual node, compare equal). the operands of a comparison must be of the same type, or be castable to the same type under rules defined below.

TECHNOID NOTE: Node locations form a full ordering based on start points, since no two distinct nodes have the same starting point. However, selections cannot be fully ordered based solely on starting point: two selections may start at precisely the same point but have different extents. In such a case, the span that ends sooner compares as less than the other.

XQuery provides the normal logical operators, &, |, and unary ~ for not. They take and return Boolean values. However, most types can be cast to a Boolean, as follows:

Boolean: no cast necessary
Integer and Real: non-zero values are true
Strings: strings of length > 0 are true
Date, Time, and Date-time: no cast allowed
Name: can be automatically cast to Boolean when the expression is in a context, such as a predicate, where there is a notion of "the location being examined". In such cases (such as when it constitute the entire predicate or an operand of a Boolean operator), a name can be cast to a Boolean. It is considered true if the node is an element and its element type name is equal to the name.
Node, Attribute: no cast allowed
Location: no cast allowed
Loation set: a non-empty location set counts as true

Set operations

XQuery provides the usual operations on sets; their arguments are location sets, and so are generally embedded queries. The operators are:

union(set, set,...): Combines the results of two sub-queries into a single location set (with, as always, no duplicates). Since locations are fully ordered, the order of the result is determined. So, to locate the elements with ids foo and bar, as well as all descendants of type SEC:
```
   union( id(foo), id(bar), descendant(SEC) )
```
intersection(set,...): Combines the results of two sub-queries into a single location set. Since locations are fully ordered, the order of the result is determined. Thus, to find all ITEM that contain both 'dog' and 'cat':
```
   intersection( descendant(ITEM & string('dog')),
                 descendant(ITEM & string('cat')))
```
The same result can sometimes be achieved through ANDing several conditions (such as the two string() terms in this example).
difference(set1, set2): Returns a set contaning all members that are in set1, but not in set 2. In effect, subtracts. Some set differences can also be expressed by more complex constraints. This finds all following siblings that are DIVs, except any that in which all descendants with LEVEL='SECRET':
```
   difference(fsibling(DIV),
              ID(SECRET))
```
complement(set): Returns a set contaning all nodes in the query's initial search domain that are not in the specified set.

NOTE: It may be reasonable to express union() via the "+" operator applied to sets; difference() as "-", and intersection as "*". In keeping with the international focus of XML, the Unicode characters for union, intersection, and difference are allowed as infix operators.

Filtering, mapping, and set-testing

The mechanisms already desribe enable XQuery to filter a location set, or to map each member of it to zero or more other locations.

NOTE: Filtering is done by the predicate in each location term, and so there is no need for a separate select(set, predicate). If one wishes to do incremental selections it is a trivial addition, in the form just shown: a location term that makes each member of the location source a candidate.
Mapping is done by each location term, in accordance with its specific rules for how to choose candidate nodes relative to the members of the location source. If no constraint predicate is supplied, all candidates survive.

XQuery also provides access to basic information about any set, such as its size:

size(set): Returns the number of members as an integer
all(set, predicate): Returns True if the predicate evaluates to true for every member of the location set provided. If the location set is empty, all() always returns true.
any(set, predicate): Returns True if the predicate evaluates to true for at least one member of the location set provided. If the location set is empty, any() always returns false.

Node regular expressions

A key part of XQuery's ability to deal with ordered hierarchical data, is the ability to treat the actual sequences of data objects as information accessible to the query language. This is a problem that simply does not arise in relational databases, because there are no sequences inherent in the data; at best one can add sequence-number fields in a given table, do explicit sorts every time an order is needed, or some similar workaround. By definition, RDB tools do not regard sequence as an inherent part of the data (and they base a lot of optimization on this reliable assumption).

Many query systems (consequently?) do not support sequence and order constraints, or do so very weakly. But because sequence is a central way in which XML differs from traditional databases is that data occurs in ordered hierarchices, languages for unordered non-hiearchies will provide no quidance for this class of XML querying requirements.

What language, then, is already around that's good at imposing patterns on sequences of vocabulary items? The obvious answer is regular expressions. These are ubiquitous for matching sequences of characters, against desired patterns. For example, one can find start-tags for HTML headings, regardless of whether they have attributes, with a pattern such as

   <[hH][1-6]\>[!>]*>

This (purposely imposing but compact) example finds a literal less-than sign, an H or h, any digit from 1 to 6, a token boundary (such as a non-alphabetic character), and then any sequence of characters other than greater-than, followed by greater-than. While a bit painful to write, they have become ubiquitous, probably because they provide so much bang for a moderate buck and because they can be learned in stages: easy things are easy, medium things are medium, and complicated things are complicated. Nearly every tool from word-processors to Perl uses them. They are extremely well-understood technology, and easy to implement. Perhaps most relevant of all, SGML and XML use a variation on regular expressions to constrain patterns of sub-elements: content models.

XQuery includes a notion of "node regular expressions" based on one small insight: There is no reason that the sequences manipulated by regular expressions must consist of characters. They can just as well be node or element type names. The syntax must be very slightly changed to mark where one name ends and another begin; for example separating names with spaces. That done, we can easily create an XML regular expression matcher that operates at the level of node sequence. For example, to match a sequence of a TI element, followed by any of ABSTRACT, SUMMARY, or INTRO, followed by any number of Ps, this is all that is needed:

   TI [ABSTRACT SUMMARY INTRO]  P*

The simplest application of this concept is for sequences of siblings. Current research has shown ways of extending the notion to match patterns directly on multiple levels of a tree structure at once, but XQuery does not include this at this time. Thus, the pattern above must be fulfilled by a sequence of siblings, and can be fulfilled regardless of what those siblings may contain.

Such a node regex tool can do everything that SGML or XML content models can do. Most cases are trivial to convert anyway (content model "," goes to space; or-group goes to bracketed group, and so on); content model AND is harder, but not available in XML anyway.

Most of the typical regex features are potentially useful; some extremely so for XQUery, such as the () for assigning the results of subexpressions. All these have clear applications:

[] and ! (or ~, depending on your culture)
?, *, and +
.
()
^ and $
{}

Conveniently, none of the meta-characters commonly used in regular expressions is allowed in XML element type names, and so everything means about what it normally means; this also greatly reduces the need for escaping in node regular expressions as compared to the more familiar character-level ones (where all the meta-characters may also be needed as literals). ^ and $ match starts and ends of a sibling set. < and > probably don't apply, since they are really just shorthand for a bracketed set of whitespace characters or start or end of line.

Text in node regular expressions

Text can be included in a node regular expression by quoting it. Within the quotes a character-level regular expression is supported. For example, one could find whole names in a limited number of forms, even though only the SURNAME is tagged, by this expression:

   '[A-Z][a-z]+ ([A-Z]\. )*' SURNAME

These character regular expressions use precisely the same set of metacharacters, as well as \< and \> to represent token boundaries. A token boundary is defined by the transition between text characters (members of alphabets, syllabaries, and ideographic systems), as opposed to whitespace, punctuation, or complete ends of a range. Character regular expressions also need to support "\" in the traditional fashion, since any metacharacters can also be needed as literals.

A more complex example involves finding an element whose children are exactly a TI, any number of Ps, from 1 to 3 lists (UL/OL/DL), and then a SUMMARY. This is the corresponding node regular expression:

   ^ TI P* [^UL OL DL] {1,3} SUMMARY $

Node regular expressions appear in queries as a special location term, match(). So, the same match could be included in a query like this, to find all elements that it matches:

   Descendant().match(^ TI P* ([^UL OL DL] {1,3}) SUMMARY $)

When used as a location term, match() returns the locations matched by all parenthesized sub-expressions; if there are none, then the entire match is returned as a single location (which may be a span). Thus, the query will return spans of from 1 to 3 adjacent HTML lists, so long as they appear in the stated context.

When used as a predicate, what match() returns is irrelevant, since only the fact of matching affects the result.

Node regular expressions containing predicates

A second insight leads to a powerful logical extension: Specifying an element type name such as TI, is no more than a simple predicate: "is the GI 'TI'?" And indeed, XQuery already uses that fact by allowing element type names to be embedded directly in predicates, instead of requiring an explicit reference and comparison. Given that, it is logical for a node sequence pattern to permit any predicate expression anywhere a GI could go.

This allows a node sequence pattern such as this, which finds all nodes which have a TI after which every sibling contains a SECRET descendant:

   ^ TI descendant(SECRET) $

I have found no syntax conflicts with merely embedding these, so long as location terms always have their "(" immediately following with no white space. But this seems too restrictive, in which case we need to introduce something to mark predicate expressions as such. My preference is for back-quote or perhaps caret (but any character not already used in element type names or as a node regular expression meta-character would work):

   ^ TI `descendant(SECRET)` $

This is an extraordinarily powerful generalization, and requires near-trivial changes to the syntax specification. It also does not interfere with beginning users, since they can just ignore it.

This could also prove to be a handy universal solvent for XML node patterns in other contexts. And Makoto showed me a reduction of trees to specially-marked lists that allows you to use regexes to express generalized tree patterns as well. I didn't follow the proof, but the implications are potentially huge -- something called "monoids" that I ought to remember from grad school, but don't... Seongbing Park's paper in the HT98 proceedings fits in with this too, I think.

The discussion of node sequence patterns above is based on a proposal I floated to some query language experts Spetember 1, 1998. I then learned that Subramanian et al. (1995) propose a very similar notion. Murats Makoto's work with forest automata os also highly relevant, as is Seongbing Park's work reported in the Proceedings of Hypertext '98.

Conversions between types

It often proves useful to be able to use one type of data, in a context where another is required. A full table of possible conversions will be added. XQuery takes the view that where they make semantic sense, casts should be allowed.

Conversions to Boolean are listed above, including the frequently useful conversion from location set to Boolean, for testing existence.

Operator precedence

The operators have the following precedence:

unary *, - and ~
functions (such as loation terms and set operations) and {}
( )
* /
+ -
comparisons
& and |

Within each level, evaluation is strictly left-to-right.

Formal grammar

Whitespace never has any meaning in XQuery except (a) for separating tokens, for example in node regular expressions; and (b) within quoted literal strings. Each function has a particular list of parameters it may take, with particular datatypes for each. These are not distinguished in the syntax since they all have the same syntactic form.

xquery     := step
            | step "." step
step       := func          // function returning location set
            | pick
            | func pick

Functions

Instance specifications ("picks") and number

pick       := "{" ranges "}"
ranges     := range
            | range "," ranges
range      := signed
            | unsigned "-" unsigned

Node regular expressions

noderegex  := "^"? nregroups "$"?
nregroups  := repgroup
            | repgroup" "* repgroup
repgroup   := nrepiece repeatop
nrepiece   := gi
            | "."
            | "[" "^"? names "]"
            | "'" charregex "'"
            | "`" predicate "`"
            | "(" nregroup ")"
repeatop   := "*" | "?" | "+" | pick
gi         := name

(charregex is not expanded here)

Expressions (this portion of grammar is incomplete)

expr       := boolexp | compexp | valexp | numexp
predicate  := boolexp
boolexp    := boolconst
            | "(" boolexp ")"
            | boolexp [&|] boolexp
            | "~" boolexp
            | "()"            // Empty boolean evaluates to TRUE
            | compexp
            | gi
            | attrval "=" attrval
            | xquery          // XQuery is TRUE if finds anything
            | func            // Returning Boolean
attrval    := name | strconst | charregex

compexp    := numexp compop numexp
            | timeexp compop timeexp
compop     := '<' | '<=' | '=' | '<>' | '>=' | '>'
gi     := name

numexp     := intconst | realconst
            | numexp numop numexp
numop      := "+" | "-" | "*" | "/" | "%"

timeexp    := dateconst | timeconst | datetime
            | timeexp timeop timeexp
timeop     := "=" | "-"

Constants of each type

boolconst  := "%T" | "%F"

intconst   := signed | unsigned | hex | octal
realconst  := intconst
            | signed "." unsigned
            | signed "." unsigned "E" signed
signed     := "+" unsigned | "-" unsigned
unsigned   := [0-9]+
hex        := "x'" [0-9a-fA-F]+ "'"
octal      := "o'" [0-9a-fA-F]+ "'"

dateconst  := "d'" unsigned "-" unsigned "-" unsigned "'"
timeconst  := "t'" unsigned ":" unsigned ":" unsigned fraction? (" " zone)? "'"
datetime   := dateconst timeconst
fraction   := "." unsigned
zone       := signed

strconst   := "'" [!']* "'"
            | "\"" [!"]* "\""
            | """ [!"]* """   // open and close double quotes
            | "'" [!']* "'"   // open and close single quotes
name       :=                 // name characters exactly as in XML

Notes

The grammer is only approximate at this time. In particular, it has not been thoroughly examined for S/R conflicts, and it has not addressed escaping within string literals, or whitespace details. Nevertheless, it should be helpful for the more technically inclined reader to grasp the language.

It may also be desirable to provide explicit typecasts for all datatype, such as "typename(...)". This provides a logical way to force comparisons as strings vs. numbers, or truncating reals for precise comparison, and so on. It also provides an obvious way to provide case-less, accent-less, and other sophisticated string comparisons, if we introduce a sub-type(s) for string aware of such operations: caseless(string).

Character regexes can be used to achieve character-level proximity searching, such as "foo(.*){1-20}bar" to find foo within 20 characters of bar. But this is awkward and does not extend gracefully to other proximity types, or to unordered proximity. Better, is to add a wordier but more powerful proximity() operator that can specify the level of proximity (character, token, element), the maximum (and why not minimum?) distance, whether order of occurrence matters, and the actual strings of sub-queries to be correlated.

Bibliography

Abiteboul, Serge et al. 1997. "Querying Documents in Object Databases." In International Journal on Digital Libraries 1(1): 5-19.

Agosti, Maristelle and Alan Smeaton. 1996. Information Retrieval and Hypertext. Boston: Kluwer Academic Publishers. ISBN 0-7923-9710-X.

Bishop, Ann Peterson. 1997. "Digital Libraries and the Disaggregation of Knowledge: An Investigation of the Use of Journal Article Components by Researchers." National Synchronization Meeting. Digital Library Initiative, Pittsburgh PA, June 5.

Bray, Tim, Jean Paoli, and C. M. Sperberg-McQueen. 1998. Extensible Markup Language (XML) 1.0. W3C Recommendation, 10-February-1998 http://www.w3.org/TR/1998/REC-xml-19980210; (also with extensions xml, html, pdf, and ps).

Brooks, Kenneth P. 1988. "A Two-view Document Editor with User-definable Document Structure." Dissertation, Stanford University Department of Computer Science. Reprinted as Technical Report #33 by Digital Systems Research Center, www.research.digital.com/SRC/publications.

Burkowski, Forbes J. 1991. "An Algebra for Hierarchically Organized Text-Dominated Databases." Waterloo, Ontario, Canada: Department of Computer Science, University of Waterloo. Manuscript: Portions "appeared as part of a paper presented at RIAO '91: Intelligent Text and Image Handling, Barcelona, Spain, Apr. 1991."

DeRose, Steven J. 1989. "Expanding the Notion of Links." In Proceedings of Hypertext '89, Pittsburgh, PA. Baltimore, MD: Association for Computing Machinery Press.

DeRose, Steven J. 1997. The SGML FAQ Book: Understanding the Foundations of SGML and XML. Boston: Kluwer Academic Publishers. ISBN 0-7923-9943-9.

DeRose Steven J. and David G. Durand. 1994. Making HyperMedia Work: A User's Guide to HyTime. Boston: Kluwer Academic Publishers. ISBN 0-7923-9432-1.

DeRose, Steven J. and David G. Durand. 1995. "The TEI Hypertext Guidelines." In Text Encoding Initiative: Background and Context. Boston: Kluwer Academic Publishers. ISBN 0-7923-3689-5.

DeRose Steven J., David G. Durand, Elli Mylonas, and Allen H. Renear. 1990. "What is Text, Really?" Journal of Computing in Higher Education 1 (2): 3-26. Reprinted with commentary as special issue 21(3) of The Journal of Computer Documentation, August, 1997. New York: ACM Press.

DeRose, Steven and Eve Maler (eds). 1998. "XML Linking Language (XLink)." World Wide Web Consortium Working Draft. March 1998. http://www.w3.org/TR/1998/WD-xlink-19980303.

DeRose, Steven and Eve Maler (eds). 1998. "XML Pointer Language (XPointer)." World Wide Web Consortium Working Draft. March 1998. http://www.w3.org/TR/1998/WD-xpointer-19980303.

Kahn, Paul. 1989. "Webs, Trees, and Stacks: How Hypermedia System Design Affects Hypermedia Content." In Proceedings of the Third International Conference on Human-Computer Interaction, Boston, MA, September 18-22, 1989.

Kernighan, Brian. 1990. "Issues and Tradeoffs in Document Preparation Systems". In EP90: Proceedings of the International Conference on Electronic Publishing, Document Manipulation & Typography, Gaithersburg, Maryland, September. Richard Furuta (ed.). Cambridge: Cambridge University Press: 1-16. ISBN 0521402468.

Marshall, Catherine C. and Frank M. Shipman, III. 1993. "Searching for the Missing Link: Discovering Implicit Structure in Spatial Hypertext," In Proceedings of Hypertext '93. ACM.

Park, Seongbing. 1998. "...". In Proceedings of Hypertext '98. ACM Press.

Salton, Gerald and C. Buckley. 1988. "Term Weighting Approaches in Automatic Text Retrieval. In Information Processing and Management 24 (5): 513-523.

Salton, Gerald. 1983. Introduction to Modern Information Retrieval. New York: McGraw Hill Book Co.

Smith, Karen E., Stanley B. Zdonik. 1987. "Intermedia: A Case Study of the Differences Between Relational and Object-Oriented Database Systems. In OOPSLA '87 Proceedings, October 4-8.

Sperberg-McQueen, C. Michael and Lou Burnard (eds). 1994. Guidelines for Electronic Text Encoding and Interchange. Chicago, Oxford: Text Encoding Initiative. Also available online from ftp://ftp-tei.uic.edu/pub/tei and many other places, most of which are pointed to from www.sil.org/sgml/acadapps.html#tei.

Subramanian, Bharathi, Theodore W. Leung, Scott L. Vandenberg, and Stanley B. Zdonik. 1995. "The AQUA Approach to Querying Lists and Trees in Object-Oriented Databases." Presented at the International Conference on Data Engineering, Taipei, Taiwan. Available from the authors.

Yu, Clement T. and Weiyi Meng. 1988. Principles of Database Query Processing for Advanced Applications. San Francisco: Morgan Kaufmann Publishers. ISBN 1-55860-434-0.

Notes:

X directly links to Y
X indirectly links to Y
X indirectly links to Y at/up to n levels
X and Y are directly linked (either direction)
X and Y are indirectly linked (either direction)
X and Y are indirectly linked at/up to n levels
Part of X directly links Y
Part of X indirectly links Y
Part of X in directly links Y at/up to n levels
X directly links part of Y
X indirectly links part of Y
X indirectly links part of Y at/up to n levels
X and Y are partially directly linked
X and Y are partially indirectly linked
X and Y are partially indirectly linked at/up to n levels
X has attribute named N
X is of same namespace throughout descendants
Get node class of X (element, pi, comment, etc)
(could also have is-a predicate for each)
Get originating entity of X
Get element type of X
Get namespace of X
Get all namespaces used within X
Get text content
Get 1st/last/nth match of regex R
Match on word roots
Hyphenated words
Find (string or regex) match with Proximity in characters
(all proximity operators may be ordered or unordered)
Find match with Proximity in words
Find match with Proximity in sentences
Find match with Proximity in elements
Truncation
Absolute value
Is-in-set
Is-subset-of
Is-proper-subset
Is-empty
X is a set with all members structurally identical to those in Y
Map each member M to expr(M) (apply)
Delete nodes with ancestors also in list
Delete nodes with children also in list
Delete nodes with descendants also in list
Delete nodes with all ancestors also in list
Delete nodes with all children also in list
Delete nodes with all descendants also in list
Replace sibling sequences by parents

Measuring sibling/anc distance and cardinality. Can do via sibling-distance and element-distance (latter should take a predicate for what things to count, so you can leave out digressions, like PIs, comments, FOOTNOTEs, etc, nomatter what predicate is required to characterize them.

Constraining sibling distance -- just test the same things. But, you may need to be able to constrain between adjacent candidates as the expression gets iterated over a location set. Perhaps add previous-loc/next-loc to be able to get at them? Hmmmm.

Way to relate successive hits in a list -- such as selecting the span from one MILESTONE or PAGEBREAK or HR to the next, or testing that adjacent instances of something share some characteristics.

Let() to bind multiple variables for a sub-query, and then a way to refer to them (maybe just have a new absolute location term that names the variable, which then becomes a location source.