XQuery: A unified syntax for linking and querying general XML documents

By Steven J. DeRose, Chief Scientist, Inso Corporation and the Brown Univ. Scholarly Technology Group.

Written September 24, 1998. Last updated October 8, 1998

Introduction

This document proposes a query language syntax for XML documents, called ZQuery. Such a query language has quite different requirements than traditional languages; much more different than is commonly appreciated. Many past proposals have taken a basically relational query language (typically SQL), and modified it by the addition of a few constructs: typically a "contains" operator and some features for matching strings within text chunks against regexes, or against word-roots. Such features, while needed, are not enough. The hard problem arises because the most basic design principles of relational databses, do not hold for XML documents. These include:

Query language for such data are designed to take good advantage of these characteristics, and derive much of their power from that. However, for exactly that reason they do not work well for queries on XML, where data is richly hierarchical, ordered, and filled to the brim with aggregate objects with much insternal structure: in keeping with the human-language objects they represent.

This document builds directly on my earlier Queries on Links and Hierarchies, which lays out the concrete requirements for operations to be supported in a full-powered XML query language, but provides no syntax. I consider that the correct order of design: If you have not laid out in detail what it is you plan to build, you cannot even tell whether your concrete plan makes sense or not.

The syntax here builds directly on XPointer, for several reasons.

Datatypes

XQUery is aware of data objects of several atomic datatypes, like most other query languages. It also has a well-developed notion of aggregate types, specifically designed for XML, where documents and their portions are ordered hierarchies of typed objects, each with properties (XML calls the properties "attributes", a slightly different meaning of the term than in database theory!). All the datatypes used in XQuery are defined here, and given symbolic names for convenient later reference:

Atomic datatypes, along with their format when appearing as constants in XQuery, include:

NOTE: DO we need special syntax to mark GIs, or date-times? It would be nice to be able to just drop a GI in bare as a predicate. It should be unambiguous to treat a name as a GI when it occurs in a Boolean context; but the grammar needs to be thoroughly checked to be sure there are no ambiguous cases. If there is some insoluble ambiguity, the easiest solution would be to add some prefix operator (maybe "*") to mark names used as GIs. The idea is to keep it very lightweight because GIs will likely be used so very often.

Data objects types include:

Finally, there is a special type for attributes, since they are ancillary to the basic ordered tree structure (they are not "children" or "ancestors" in the tree, and they are not ordered). This could be considered a sub-type of node, but it is probably treated better as a separate type, since it does not support many methods that all other nodes do, such as ordering comparisons; and lack some relationships that other nodes have, such as ancetors (they have an owning element, which is not quite the same thing). Thus the type:

These types cover all the data that occurs in documents.

The type for a returned location (often called a search hit), is called a location. These include references to any node or attribute, or to any selection (a single contiguous span, typically denoted by its starting and ending point).

Locations arise from queries, and come in sets. Singleton locations do not figure directly in the query language, since any query returns a set even if the set is of size 0 or 1. Location sets are always ordered, in the order their members start in the document (thus, a node's parent element precedes it in this order. Nested location sets are not allowed; that is, a location set cannot contain another location set as a member (it can certainly contain members that refer to data objects that are themselves aggregates). Recall that locations and location sets are not data primitives; they are the results of queries.

XQuery provides this special type for the results of queries, which is broader than any of the other types. This is a very precisely designed extension to the traditional notion of closure in database theory, which is necessary for several reasons, among them:

Basic XQuerying

An XQuery consists of a sequence of steps, using the "location terms" of XPointer. Each step selects nodes, either in absolute terms or based on the prior set (initially, relative to the root of the document(s) being queried). This iterative approach enables easy queries to be handled with very little overhead, while permitting arbitrarily complex queries to be specified clearly as well.

In effect, a step maps a prior location set (the "location source"), into a (generally different) location set. A step does this by applying some process to the input set; the particular processes available are built into XQuery. All of the XPointer location terms are identically available in XQuery; the most relevant ones here are:

id selects the element with the given id ?
child selects from among the direct children of the current location / (or better, simply use as the default)
descendant selects from among all nodes in the subtree headed at the current location //
ancestor selects from among the containers of the current location^
psibling selects from among preceding siblings <
fsibling selects from among following siblings >
preceding selects from among all preceding nodes <<
following selects from among all following nodes >>
attr selects from among the attributes of an element @
string selects a range by matching a string $
span selects a range spanning two (embedded) XPointers

It may be useful to provide shorter names as alternatives, such as abbreviations or special characters, to specify each of these functions. With punctuation instead of names the dot may also be left out without ambiguity. However, the desire for brevity should be balanced by the need for clarity, implementability of the grammar, and by the limited set of characters permitted within URLs and the need for them for other purposes.

XQuery adds a few more location terms, and a few more ways to combine the existing ones.

The most usual operation a step does is to iterate over the members of the location source set, and for each one, either

This process is repeated for each step; steps are joined by ".", much like "path expressions" in OQL. This is because they typically traverse paths through the XML structure to a desired set of final data objects.

The initial location set is one or more documents. If more than one, they must be specified in some order, and that relative order will be maintained in any location sets that contain members from more than one document.

For example, in the document(s) queried this XQuery finds all FOOTNOTE elements on which the TYPE attribute has the value 'CITATION', and then iterates over them to find all their REF children):

   Descendant(FOOTNOTE & TYPE='CITATION').(REF)

Assume a schema in which an individual FOOTNOTE element may consolidate a number of references, not just one. In this context, a bare name in a Boolean context represents a test against the element type of the current candidate (as for FOOTNOTE above), while a bare name in a string context represents the value of the named attribute of the current candidate (as for TYPE above).

This is precisely XPointer syntax, except that it assumes approval of some syntax details already proposed: changing the sequence-of-parameters for expressing "and", to an actual operator (&); changing the default location term from "last used" to "child"; and moving the instance-number parameter out (the advantages of this are discussed below).

A location term determines what candidate nodes are to be examined at all, relative to each location source member. In this case, the use of Descendant indicates that (for each member of the location source) all nodes in the subtree are to be examined.

Obviously the predicate used to filter such candidates can be simple or arbitrarily complex; it is merely a Boolean expression. The simplest case is the empty predicate, "()", which counts as true and so returns all the candidates the location term type allows. For example, since "child" is the defailt location term, either these returns all children of the location source:

   child()
   ()

One particularly useful case is that an entire XQuery can be embedded: It is evaluated (like the rest of the predicate) for each candidate, and counts as true if it returns anything: that is, if the result set it produces is non-null. So this finds all SEC elements that have at least one descendant (however deep) with attribute LEVEL='SECRET':

   descendant(SEC & descendant(LEVEL='SECRET')

TECHNOID NOTE: The evaluation of such an embedded query is identical to what would happen if it appeared as the next step (that is, after the dot instead of inside the parenthesized predicate). However, the result when used within a predicate is only tested for whether it is empty. A smart implementation will therefore halt as soon as it finds even one result for an embedded sub-query.

The next query does the same thing as the first, but picks the first two REFs from each FOOTNOTE, and iterates over them to findany AUTHOR elements which each refers to (via XPointers).

   Descendant(FOOTNOTE & TYPE='CITATION').(REF){1-2}.link(role=AUTHOR)

Notice that the range specification is not separated from (REF) by a dot -- it is therefore part of the same step. The significance of this is that it applies during the evaluation of that step, and picks the first two REFs found for each member of the prior location source -- that is, it will attempt to find twice as many REFs as the number of FOOTNOTEs found in the prior step.

One might instead want to set an absolute maximum on the number of REFs to be found, regardless of how many or few FOOTNOTES it take to accumulate them. To do this simply delay the range until the next step, when all the REFs found are in a single location set:

   Descendant(FOOTNOTE & TYPE='CITATION').(REF).{1-2}

Nothing prevents doing both when that is the desired behavior.

Any XPointer location terms may be used in this manner. An embedded XPointer results in a location, not a string; but one can test the actual text content of that location merely by comparing the location against a string (an implicit cast). The content can be the concatenated text of an entire element, subtree, or span; the value of an attribute; the contents of a PI, comment, or other node. The resulting string can then be compared, matched, or otherwise tested to resolve the predicate. For example:

   Descendant(FOOTNOTE & TYPE = ancestor(CHAPTER).attr(TYPE))

Comparing two locations without casting to strings, determines whether they are indeed the very same location. To compare the content of two locations, they must be cast from locations to their content strings explicitly, using the "*" operator (by analogy to C). For example, to locate FOOTNOTEs where the TYPE of their AUTHOR child matches the AUTHOR of the containing CHAPTER, you could do this:

   Descendant(FOOTNOTE & (child(AUTHOR).attr(TYPE) =
*(ancestor(CHAPTER).attr(AUTHOR))

NOTE: Technically, a bare attribute name in a string comparison is merely shorthand for "*attr(name)". But it's awfully useful shorthand. Some might prefer "@name" because it makes the distinction between element types and attributes more obvious; this would not be objectionable.

NOTE: A means may also be desirable for comparing values across candidates: finding sequences of 2 (or n) FOOTNOTEs which all share the same TYPE, for example. I have not addressed this yet.

Links in queries

Following Abiteboul (1997), one can integrate links into queries; indeed failing to do this for an XMl query language seems absurd. XQuery expresses this by adding a new location term beyond those provided already by XPointer: link().

Link is just like other location terms in that, given a location source it generates a list of candidate locations. However, those locations are not related to the location source members by the "genetics" of an XML document, but rather by the XPointer linkages known to the document. One can think of this as just another kind of relationship. Thus, this XQuery does what the one above does, but for each of the REFs that is eventually found, all locations it links to are returned:

   Descendant(FOOTNOTE & TYPE='CITATION').(REF).{1-2}.link()

This example locates document portions based on non-link information, but then follows the link to determine what to return. Links can also be used in a different way, as part of predicates just like any other location terms. Probable uses for them would include filtering the links by role or other properties defined in XLink.

TECHNICAL NOTE: David Durand has correctly pointed out the need to have queries find either links per se, or data linked to. This distinction can be made either by using normal XQUery constructs to point to the nodes consistituting one or the other, or by building the distinction directly into the linking support (either by additional location terms for linking, or by a parameter).

For symmetry in the face of XLink's bidirectional linking capabilities, a location term linked() is added: it returns all those nodes that link to the location source member.

NOTE: The link() and linked() terms need some parameters in addition to a predicate. In particular, support is needed for indirect linkage (hopefully of limited levels), and for distinguishing cases where the link (at least in the reverse case) is partial: should one include nodes that link to part of the location source member, but not all of it? XQuery does not yet address this.

Operators and Expressions

As in any logical language, constants and data objects can be compared, and comparisons can be combined via logical operators and parentheses.

XQuery provides the normal comparison operators, <, <=, =, <>, >, and >=. These can be applied to integers, booleans, dates, times, and date-times, and return a Boolean value. They can also apply to nodes and to singleton locations, in which case they constitute a comparison of order (only the same actual node, or locations that refer to the same actual node, compare equal). the operands of a comparison must be of the same type, or be castable to the same type under rules defined below.

TECHNOID NOTE: Node locations form a full ordering based on start points, since no two distinct nodes have the same starting point. However, selections cannot be fully ordered based solely on starting point: two selections may start at precisely the same point but have different extents. In such a case, the span that ends sooner compares as less than the other.

XQuery provides the normal logical operators, &, |, and unary ~ for not. They take and return Boolean values. However, most types can be cast to a Boolean, as follows:

Set operations

XQuery provides the usual operations on sets; their arguments are location sets, and so are generally embedded queries. The operators are:

NOTE: It may be reasonable to express union() via the "+" operator applied to sets; difference() as "-", and intersection as "*". In keeping with the international focus of XML, the Unicode characters for union, intersection, and difference are allowed as infix operators.

Filtering, mapping, and set-testing

The mechanisms already desribe enable XQuery to filter a location set, or to map each member of it to zero or more other locations.

XQuery also provides access to basic information about any set, such as its size:

Node regular expressions

A key part of XQuery's ability to deal with ordered hierarchical data, is the ability to treat the actual sequences of data objects as information accessible to the query language. This is a problem that simply does not arise in relational databases, because there are no sequences inherent in the data; at best one can add sequence-number fields in a given table, do explicit sorts every time an order is needed, or some similar workaround. By definition, RDB tools do not regard sequence as an inherent part of the data (and they base a lot of optimization on this reliable assumption).

Many query systems (consequently?) do not support sequence and order constraints, or do so very weakly. But because sequence is a central way in which XML differs from traditional databases is that data occurs in ordered hierarchices, languages for unordered non-hiearchies will provide no quidance for this class of XML querying requirements.

What language, then, is already around that's good at imposing patterns on sequences of vocabulary items? The obvious answer is regular expressions. These are ubiquitous for matching sequences of characters, against desired patterns. For example, one can find start-tags for HTML headings, regardless of whether they have attributes, with a pattern such as

   <[hH][1-6]\>[!>]*>

This (purposely imposing but compact) example finds a literal less-than sign, an H or h, any digit from 1 to 6, a token boundary (such as a non-alphabetic character), and then any sequence of characters other than greater-than, followed by greater-than. While a bit painful to write, they have become ubiquitous, probably because they provide so much bang for a moderate buck and because they can be learned in stages: easy things are easy, medium things are medium, and complicated things are complicated. Nearly every tool from word-processors to Perl uses them. They are extremely well-understood technology, and easy to implement. Perhaps most relevant of all, SGML and XML use a variation on regular expressions to constrain patterns of sub-elements: content models.

XQuery includes a notion of "node regular expressions" based on one small insight: There is no reason that the sequences manipulated by regular expressions must consist of characters. They can just as well be node or element type names. The syntax must be very slightly changed to mark where one name ends and another begin; for example separating names with spaces. That done, we can easily create an XML regular expression matcher that operates at the level of node sequence. For example, to match a sequence of a TI element, followed by any of ABSTRACT, SUMMARY, or INTRO, followed by any number of Ps, this is all that is needed:

   TI [ABSTRACT SUMMARY INTRO]  P*

The simplest application of this concept is for sequences of siblings. Current research has shown ways of extending the notion to match patterns directly on multiple levels of a tree structure at once, but XQuery does not include this at this time. Thus, the pattern above must be fulfilled by a sequence of siblings, and can be fulfilled regardless of what those siblings may contain.

Such a node regex tool can do everything that SGML or XML content models can do. Most cases are trivial to convert anyway (content model "," goes to space; or-group goes to bracketed group, and so on); content model AND is harder, but not available in XML anyway.

Most of the typical regex features are potentially useful; some extremely so for XQUery, such as the () for assigning the results of subexpressions. All these have clear applications:

Conveniently, none of the meta-characters commonly used in regular expressions is allowed in XML element type names, and so everything means about what it normally means; this also greatly reduces the need for escaping in node regular expressions as compared to the more familiar character-level ones (where all the meta-characters may also be needed as literals). ^ and $ match starts and ends of a sibling set. < and > probably don't apply, since they are really just shorthand for a bracketed set of whitespace characters or start or end of line.

Text in node regular expressions

Text can be included in a node regular expression by quoting it. Within the quotes a character-level regular expression is supported. For example, one could find whole names in a limited number of forms, even though only the SURNAME is tagged, by this expression:

   '[A-Z][a-z]+ ([A-Z]\. )*' SURNAME

These character regular expressions use precisely the same set of metacharacters, as well as \< and \> to represent token boundaries. A token boundary is defined by the transition between text characters (members of alphabets, syllabaries, and ideographic systems), as opposed to whitespace, punctuation, or complete ends of a range. Character regular expressions also need to support "\" in the traditional fashion, since any metacharacters can also be needed as literals.

A more complex example involves finding an element whose children are exactly a TI, any number of Ps, from 1 to 3 lists (UL/OL/DL), and then a SUMMARY. This is the corresponding node regular expression:

   ^ TI P* [^UL OL DL] {1,3} SUMMARY $

Node regular expressions appear in queries as a special location term, match(). So, the same match could be included in a query like this, to find all elements that it matches:

   Descendant().match(^ TI P* ([^UL OL DL] {1,3}) SUMMARY $)

When used as a location term, match() returns the locations matched by all parenthesized sub-expressions; if there are none, then the entire match is returned as a single location (which may be a span). Thus, the query will return spans of from 1 to 3 adjacent HTML lists, so long as they appear in the stated context.

When used as a predicate, what match() returns is irrelevant, since only the fact of matching affects the result.

Node regular expressions containing predicates

A second insight leads to a powerful logical extension: Specifying an element type name such as TI, is no more than a simple predicate: "is the GI 'TI'?" And indeed, XQuery already uses that fact by allowing element type names to be embedded directly in predicates, instead of requiring an explicit reference and comparison. Given that, it is logical for a node sequence pattern to permit any predicate expression anywhere a GI could go.

This allows a node sequence pattern such as this, which finds all nodes which have a TI after which every sibling contains a SECRET descendant:

   ^ TI descendant(SECRET) $

I have found no syntax conflicts with merely embedding these, so long as location terms always have their "(" immediately following with no white space. But this seems too restrictive, in which case we need to introduce something to mark predicate expressions as such. My preference is for back-quote or perhaps caret (but any character not already used in element type names or as a node regular expression meta-character would work):

   ^ TI `descendant(SECRET)` $

This is an extraordinarily powerful generalization, and requires near-trivial changes to the syntax specification. It also does not interfere with beginning users, since they can just ignore it.

This could also prove to be a handy universal solvent for XML node patterns in other contexts. And Makoto showed me a reduction of trees to specially-marked lists that allows you to use regexes to express generalized tree patterns as well. I didn't follow the proof, but the implications are potentially huge -- something called "monoids" that I ought to remember from grad school, but don't... Seongbing Park's paper in the HT98 proceedings fits in with this too, I think.

The discussion of node sequence patterns above is based on a proposal I floated to some query language experts Spetember 1, 1998. I then learned that Subramanian et al. (1995) propose a very similar notion. Murats Makoto's work with forest automata os also highly relevant, as is Seongbing Park's work reported in the Proceedings of Hypertext '98.

Conversions between types

It often proves useful to be able to use one type of data, in a context where another is required. A full table of possible conversions will be added. XQuery takes the view that where they make semantic sense, casts should be allowed.

Conversions to Boolean are listed above, including the frequently useful conversion from location set to Boolean, for testing existence.

Operator precedence

The operators have the following precedence:
unary *, - and ~
functions (such as loation terms and set operations) and {}
( )
* /
+ -
comparisons
& and |
Within each level, evaluation is strictly left-to-right.

Formal grammar

Whitespace never has any meaning in XQuery except (a) for separating tokens, for example in node regular expressions; and (b) within quoted literal strings. Each function has a particular list of parameters it may take, with particular datatypes for each. These are not distinguished in the syntax since they all have the same syntactic form.

xquery     := step
            | step "." step
step       := func          // function returning location set
            | pick
            | func pick

Functions

func := funcname "(" parms ")" | "()" // shorthand for child() parms := expr | expr "," expr funcname := numfunc | boolfunc | setfunc | locfunc numfunc := "size" // takes locset boolfunc := "all" // takes locset+ | "any" // takes locset+ setfunc := // takes locset+ | "union" | "intersection" | "difference" | "complement" locfunc := relterm | miscterm | added relterm := // takes predicate or empty | "child" | "/" | "descendant" | "//" | "ancestor" | "^" | "psibling" | "<" | "fsibling" | ">" | "preceding" | "<<" | "following" | ">>" miscterm := "id" // takes name | "string" // takes string, signed int, unsigned int | "attr" // takes name | "$" // = "string" | "@" // = "attr" | "span" // takes locset, locset added := | "match" // takes noderegex as a string | "link" // takes boolean

Instance specifications ("picks") and number

pick       := "{" ranges "}"
ranges     := range
            | range "," ranges
range      := signed
            | unsigned "-" unsigned

Node regular expressions

noderegex  := "^"? nregroups "$"?
nregroups  := repgroup
            | repgroup" "* repgroup
repgroup   := nrepiece repeatop
nrepiece   := gi
            | "."
            | "[" "^"? names "]"
            | "'" charregex "'"
            | "`" predicate "`"
            | "(" nregroup ")"
repeatop   := "*" | "?" | "+" | pick
gi         := name

(charregex is not expanded here)

Expressions (this portion of grammar is incomplete)

expr       := boolexp | compexp | valexp | numexp
predicate  := boolexp
boolexp    := boolconst
            | "(" boolexp ")"
            | boolexp [&|] boolexp
            | "~" boolexp
            | "()"            // Empty boolean evaluates to TRUE
            | compexp
            | gi
            | attrval "=" attrval
            | xquery          // XQuery is TRUE if finds anything
            | func            // Returning Boolean
attrval    := name | strconst | charregex

compexp    := numexp compop numexp
            | timeexp compop timeexp
compop     := '<' | '<=' | '=' | '<>' | '>=' | '>'
gi     := name

numexp     := intconst | realconst
            | numexp numop numexp
numop      := "+" | "-" | "*" | "/" | "%"

timeexp    := dateconst | timeconst | datetime
            | timeexp timeop timeexp
timeop     := "=" | "-"

Constants of each type

boolconst  := "%T" | "%F"

intconst   := signed | unsigned | hex | octal
realconst  := intconst
            | signed "." unsigned
            | signed "." unsigned "E" signed
signed     := "+" unsigned | "-" unsigned
unsigned   := [0-9]+
hex        := "x'" [0-9a-fA-F]+ "'"
octal      := "o'" [0-9a-fA-F]+ "'"

dateconst  := "d'" unsigned "-" unsigned "-" unsigned "'"
timeconst  := "t'" unsigned ":" unsigned ":" unsigned fraction? (" " zone)? "'"
datetime   := dateconst timeconst
fraction   := "." unsigned
zone       := signed

strconst   := "'" [!']* "'"
            | "\"" [!"]* "\""
            | """ [!"]* """   // open and close double quotes
            | "'" [!']* "'"   // open and close single quotes
name       :=                 // name characters exactly as in XML

Notes

The grammer is only approximate at this time. In particular, it has not been thoroughly examined for S/R conflicts, and it has not addressed escaping within string literals, or whitespace details. Nevertheless, it should be helpful for the more technically inclined reader to grasp the language.

It may also be desirable to provide explicit typecasts for all datatype, such as "typename(...)". This provides a logical way to force comparisons as strings vs. numbers, or truncating reals for precise comparison, and so on. It also provides an obvious way to provide case-less, accent-less, and other sophisticated string comparisons, if we introduce a sub-type(s) for string aware of such operations: caseless(string).

Character regexes can be used to achieve character-level proximity searching, such as "foo(.*){1-20}bar" to find foo within 20 characters of bar. But this is awkward and does not extend gracefully to other proximity types, or to unordered proximity. Better, is to add a wordier but more powerful proximity() operator that can specify the level of proximity (character, token, element), the maximum (and why not minimum?) distance, whether order of occurrence matters, and the actual strings of sub-queries to be correlated.

Glossary

candidate

cast

filter

location

location source

location set

location term

map

node

node regular expression

pick

predicate

Bibliography

Abiteboul, Serge et al. 1997. "Querying Documents in Object Databases." In International Journal on Digital Libraries 1(1): 5-19.

Agosti, Maristelle and Alan Smeaton. 1996. Information Retrieval and Hypertext. Boston: Kluwer Academic Publishers. ISBN 0-7923-9710-X.

Bishop, Ann Peterson. 1997. "Digital Libraries and the Disaggregation of Knowledge: An Investigation of the Use of Journal Article Components by Researchers." National Synchronization Meeting. Digital Library Initiative, Pittsburgh PA, June 5.

Bray, Tim, Jean Paoli, and C. M. Sperberg-McQueen. 1998. Extensible Markup Language (XML) 1.0. W3C Recommendation, 10-February-1998 http://www.w3.org/TR/1998/REC-xml-19980210; (also with extensions xml, html, pdf, and ps).

Brooks, Kenneth P. 1988. "A Two-view Document Editor with User-definable Document Structure." Dissertation, Stanford University Department of Computer Science. Reprinted as Technical Report #33 by Digital Systems Research Center, www.research.digital.com/SRC/publications.

Burkowski, Forbes J. 1991. "An Algebra for Hierarchically Organized Text-Dominated Databases." Waterloo, Ontario, Canada: Department of Computer Science, University of Waterloo. Manuscript: Portions "appeared as part of a paper presented at RIAO '91: Intelligent Text and Image Handling, Barcelona, Spain, Apr. 1991."

DeRose, Steven J. 1989. "Expanding the Notion of Links." In Proceedings of Hypertext '89, Pittsburgh, PA. Baltimore, MD: Association for Computing Machinery Press.

DeRose, Steven J. 1997. The SGML FAQ Book: Understanding the Foundations of SGML and XML. Boston: Kluwer Academic Publishers. ISBN 0-7923-9943-9.

DeRose Steven J. and David G. Durand. 1994. Making HyperMedia Work: A User's Guide to HyTime. Boston: Kluwer Academic Publishers. ISBN 0-7923-9432-1.

DeRose, Steven J. and David G. Durand. 1995. "The TEI Hypertext Guidelines." In Text Encoding Initiative: Background and Context. Boston: Kluwer Academic Publishers. ISBN 0-7923-3689-5.

DeRose Steven J., David G. Durand, Elli Mylonas, and Allen H. Renear. 1990. "What is Text, Really?" Journal of Computing in Higher Education 1 (2): 3-26. Reprinted with commentary as special issue 21(3) of The Journal of Computer Documentation, August, 1997. New York: ACM Press.

DeRose, Steven and Eve Maler (eds). 1998. "XML Linking Language (XLink)." World Wide Web Consortium Working Draft. March 1998. http://www.w3.org/TR/1998/WD-xlink-19980303.

DeRose, Steven and Eve Maler (eds). 1998. "XML Pointer Language (XPointer)." World Wide Web Consortium Working Draft. March 1998. http://www.w3.org/TR/1998/WD-xpointer-19980303.

Kahn, Paul. 1989. "Webs, Trees, and Stacks: How Hypermedia System Design Affects Hypermedia Content." In Proceedings of the Third International Conference on Human-Computer Interaction, Boston, MA, September 18-22, 1989.

Kernighan, Brian. 1990. "Issues and Tradeoffs in Document Preparation Systems". In EP90: Proceedings of the International Conference on Electronic Publishing, Document Manipulation & Typography, Gaithersburg, Maryland, September. Richard Furuta (ed.). Cambridge: Cambridge University Press: 1-16. ISBN 0521402468.

Marshall, Catherine C. and Frank M. Shipman, III. 1993. "Searching for the Missing Link: Discovering Implicit Structure in Spatial Hypertext," In Proceedings of Hypertext '93. ACM.

Park, Seongbing. 1998. "...". In Proceedings of Hypertext '98. ACM Press.

Salton, Gerald and C. Buckley. 1988. "Term Weighting Approaches in Automatic Text Retrieval. In Information Processing and Management 24 (5): 513-523.

Salton, Gerald. 1983. Introduction to Modern Information Retrieval. New York: McGraw Hill Book Co.

Smith, Karen E., Stanley B. Zdonik. 1987. "Intermedia: A Case Study of the Differences Between Relational and Object-Oriented Database Systems. In OOPSLA '87 Proceedings, October 4-8.

Sperberg-McQueen, C. Michael and Lou Burnard (eds). 1994. Guidelines for Electronic Text Encoding and Interchange. Chicago, Oxford: Text Encoding Initiative. Also available online from ftp://ftp-tei.uic.edu/pub/tei and many other places, most of which are pointed to from www.sil.org/sgml/acadapps.html#tei.

Subramanian, Bharathi, Theodore W. Leung, Scott L. Vandenberg, and Stanley B. Zdonik. 1995. "The AQUA Approach to Querying Lists and Trees in Object-Oriented Databases." Presented at the International Conference on Data Engineering, Taipei, Taiwan. Available from the authors.

Yu, Clement T. and Weiyi Meng. 1988. Principles of Database Query Processing for Advanced Applications. San Francisco: Morgan Kaufmann Publishers. ISBN 1-55860-434-0.


Notes:

Measuring sibling/anc distance and cardinality. Can do via sibling-distance and element-distance (latter should take a predicate for what things to count, so you can leave out digressions, like PIs, comments, FOOTNOTEs, etc, nomatter what predicate is required to characterize them.

Constraining sibling distance -- just test the same things. But, you may need to be able to constrain between adjacent candidates as the expression gets iterated over a location set. Perhaps add previous-loc/next-loc to be able to get at them? Hmmmm.

Way to relate successive hits in a list -- such as selecting the span from one MILESTONE or PAGEBREAK or HR to the next, or testing that adjacent instances of something share some characteristics.

Let() to bind multiple variables for a sub-query, and then a way to refer to them (maybe just have a new absolute location term that names the variable, which then becomes a location source.