Queries on Links and Hierarchies

Steven J. DeRose, Inso and Brown University

C. M. Sperberg-McQueen, W3C and University of Illinois at Chicago

Bill Smith, Sun Microsystems

Last updated November 18, 1998.

Written September 24, 1998. This paper builds on prior work in "Generalized query desiderata" (April 22, 1998), "XQuery notes" (May 5, 1998), "Query language feature table" (August, 1998), and other collected notes.

Table of Contents

1 Introduction

This document briefly sets out functional requirements for languages to characterize and retrieve structure and content in linked hierarchies (such as XML documents). This kind of data poses more complex requirements than some others, because not only content but also structure and linkage must be available to queries. Queries on structured data are important for several domains: information retrieval, database searching, hypermedia link traversal, stylesheet rule selection, and document validation using XML schemas. Though commonly considered separate, all these domains have similar requirements for query languages. They can all be described in terms of a single abstract model of query semantics, and they all require similar sets of operators: access, comparison, testing, and retrieval of parts of XML, HTML, and SGML documents -- or more generally, searching or characterizing locations in multi-level hierarchies with links. The expression and implementation of query languages may -- and probably should -- vary among these domains, but they can benefit from a well-defined common set of primitive concepts.

This document contains three sections:

Conspicuously missing is any proposal for the syntax of a query language; until the required scope of the language is clarified, syntax design would be premature. Our examples are given in a pseudo-code which is emphatically not intended as a proposal for the syntax of a real query language.

1.1 Terminology and basic concepts

We use the term attribute here solely in the XML sense; the term has a quite different meaning in database theory. The term entity is also used differently in the two fields, but we don't use that term here.

The term structured document is used in a wide variety of different ways. Here, we use it (without loss of generality, we think) to mean XML documents, which are ordered trees of class instances called elements. This meaning is quite different from what is called `structured data' in the relational database world: it has properties of order, hierarchy, repetition, and heterogeneity of instances that RDBs do not share (more below).

What we here call general user querying is also commonly called structured information retrieval. It is intended to provide users who know something about the content and structure of a document or collection with the ability to use that knowledge to characterize and retrieve desired portions of it.

By schema validation, we mean the process of checking that a document conforms to a schema, in particular one expressed in the XML schema language now being developed by a W3C work group.

1.2 Locations, predicates and queries

We assume some notion of locations in a document; informally, one can think of a location as a portion of a document (e.g. an XML element or a string of characters in an element's content), or a position in it. We will treat locations as an abstract data type, and not describe further precisely how they might or should be implemented. Different query languages may define the idea of locations differently, with important results for the utility of the language.

A predicate is some expression that can be evaluated to true or false, with respect to some document location. Predicates are the main means of expressing what portions of data a query or other retrieval task should or should not match. For example, is an element of type P or occurs within a FORM are simple predicates that might apply to locations within a document.

For any predicate, there may be zero or more locations in a given set of documents which satisfy it. For any given location in a document, there will be (in any interesting query language) a large (perhaps infinite) set of predicates satisfied by it. There is thus a many-to-many mapping between predicates in a query language and locations in a set of documents.

The set of locations which satisfy a predicate is the extension of the predicate. The set of characteristics of a location by virtue of which it satisfies the predicate is the intension of the predicate. Query languages may vary with respect to both the extensions and the intensions of their predicates.

A query is a request for information about a predicate-location mapping; in practice, it will take very different forms depending on the domain. See below.

2 Relational and document querying

Documents impose requirements on querying very different from those of relational databases. The basic characteristics of tables and hierarchies are different, and those basic characteristics lead to differences in many of the design decisions for any query language.

2.1 Ordering

In RDBs, data are inherently unordered: there is by definition no relative order to the fields in a record, the records in a table, or the tables in a schema. Obviously things may be stored in some order, but this ordering is not significant (not information) as far as the database is concerned. The user can sort while making reports, but these orders are not preserved. On the contrary: database management systems are free to shuffle things around as they see fit: that is a source of many powerful optimizations in RDBMS, which should not be impeded.

In documents, the order of words of a paragraph, the order of items of a list, and the order of sections of a chapter are part of the information: the order of all must be preserved, and made accessible to the user. A query language that gave no access to information about the order of elements, or which gave only inconvenient access to information about sequence, would not be adequate for XML data.

It is occasionallly necessary to preserve or impose a given order of records in RDBMS; in such cases, one must add serial numbers, sort the records in a result (possibly repeatedly), and possibly perform other expensive tasks in order to impose or preserve the ordering. Likewise, there are cases of unordered data in documents (such as the components of a bibliography entry), where order is imposed by layout and style constraints, not by the meaning of the data itself. It is worth noting that such unusual boundary cases are handled only poorly by tools on each side of the relational/documentary divide.

2.2 Hierarchy

By definition, RDB tables do not contain other tables or fields, records do not contain tables, and each field contains only an atomic value. Fields may of course have as their value a key or other data that can be used to retrieve another record. This is one special case of linking, but it is not a case of document structuring (see DeRose 1989 for more on the distinction). Relational queries do not have operators for containment (particularly recursive containment, which is the natural form for a containment query against an XML document), they do not support text-matches that reach across fields, and so on. Simulating these very common document requirements is extraordinarily hard in relational systems.

The XML document model is crucially hierarchical: a typical XML document type may define divisions, sections, nested lists, and hundreds of other text objects with internal structure.[1] Useful document queries often require knowledge of the hierarchical structure: the query find chapters with titles containing the word "Syntax" in books with titles containing the phrase "programming language" is likely to find discussions of programming-language syntax and formal grammars for C, Java, Perl, etc. The query find chapters with titles containing the phrase "programming language(s)" within books with titles containing the word "syntax", in contrast, is likely to find discussions of programming languages which are useful in the study of natural-language syntax.[2]

2.3 Recursive hierarchies

Probably the most common query against documents is finding instances of a word or element type `inside' another particular type of element, regardless of whether other element types intervene. For example, finding footnotes (<fn>) that are inside chapters (<chap>), even if there are <p>s, <list>s, and so on (vertically) in between. In a naive and non-redundant relational representation of a document structure,[3] this relationship cannot be expressed at all. At best, one can write a query for direct containment, or a query twice as long to handle cases with one intervening node, or a query three times as long to handle two levels, and so on.[4] For the default SGML limit on nesting levels (28), such a query takes 25 pages of SQL (and is apt to be slow with typical query optimizers). For XML there is no level limit, so an accurate SQL query against the naive representation of a tree in third normal form is not inconvenient but impossible.

In practice, one can solve this problem most conveniently by including redundant non-normalized data in the tables: the elements table might record not only the parent but also all the ancestors of each element, either as tokens in a string field, or as a host of ancestor fields, or ancestors might appear as rows in an ancillary table of descendant-ancestor-pairs. The 25-page recursive query is now dramatically simpler,[5] though the cost in space and in complexity of query evaluation may be high.

2.4 Repetition

In RDBs, by definition no field can be repeated within a record. One may define multiple fields with similar names or purposes, which users are invited to think of as repetitions (address1 and address2), but the database has no knowledge that they are related. Likewise, one may pack repetitions into a single (string) field ("zip 02906 12345"), but this makes data verification, retrieval, and many other required operations difficult or impossible. Otherwise, it is necessary to create secondary tables and link them via key fields; this imposes significant costs in size and (for typical RDMS implementations) speed.

In XML documents it is not only possible for an element to have many sub-elements of the same name; it is the norm. It would be absurd to require elements to contain only one paragraph each, or to create a separate element type for each paragraph sub-element. It would be similarly absurd to prohibit re-use of the same content in another place (as is very common for boilerplate text, warnings, etc.).

2.5 Datatypes

Datatypes in traditional databases are generally limited to atomic types: INTEGER, REAL, BOOLEAN, DATE, CHARACTER, and STRING (often fixed-length). Anything else is a `blob' about which the database knows nothing (it cannot version-manage, display, edit, search, or otherwise process it other than keeping it around and producing it on demand). Aggregates exist only in the form of (rows in) tables.

Datatypes in XML documents include mainly aggregates, such as elements. XML 1.0 does not define anything like a useful set of atomic datatypes; when integers, dates, etc. appear in XML documents, they are normally encoded as strings. (If not, then the document is probably not actually XML, though it may be a very fine internal format to use with XML). This affects the most natural sizes of numbers (powers of two are not significant to XML), as well as the ever-present need to derive other data types (such as dates) from strings (such as "9/23/98" or "Jueves, 24/9/1998", not to mention "probably late in the reign of Nero").

The development of a data type system for XML will bring the problems of relational querying and XML querying closer together in the future.

2.6 Linking

In XML, linking is expected to be a very important way of relating data. Links are perhaps most important for relating parts of data that are not near each other in terms of prose flow. The Web's current requirement to break documents into small parts makes this even more important: Such parts are attached by links that are generally indistinguishable from other links; distinguishing link types would be somewhat easier if the rel and rev attributes were used consistently, but they are not. Work such as Gibson et al. (1998) shows how valuable such relationships can be for Web-scale search, even when they must be inferred through sophisticated (i.e. expensive) methods.

But in addition to this widespread need, it is worth noting that any structure that can be represented by a subelement or attribute can also be represented instead by a link. For example, one can indicate that some portion of data is a poem or price-tag, or that it is to be formatted a certain way, by creating the claim elsewhere and linking it in.[6] On the Web, this will become extremely important as people begin to attach their own annotations, elements, or attributes to other documents for their own use, without modifying those documents. (XLink provides this capability.)

In such a context, the links between data objects must participate fully in querying, and queries must be able to reference them with full generality, test, count, and otherwise combine them into larger queries. For example, in order to search a document for words (tagged <w>) which a parser (human or automatic) has interpreted as nouns (using the <interp> element with a wordclass attribute to specify the word class and an instances attribute to point at the nouns of the text), one might issue a command like find all W elements whose ID value occurs as an IDREF in the INSTANCES attribute of INTERP elements with attribute WORDCLASS="NN". This query must be just as possible, and ideally just as simple to express, as a query which assumes that the part of speech information has been inserted into the wordclass attribute on the <w> element itself: find all W elements with attribute WORDCLASS="NN". The query find all U elements whose WHO attribute is the same as the ID attribute on a SPEAKER element with attribute SEX="M" (which finds the utterances -- <u> elements -- spoken by males) should not be harder to express than the query find all U elements with attribute SPEAKER-SEX="M".

In particular, indirect or typed linkages must be supported without resort to explosively long or convoluted query syntax. The examples above, for example, should not be required to specify whether the link occurs through an ID/IDREF link or through an extended link, possibly a chain of Xpointers; a preferable formalism would allow the expression of queries like find all W elements pointed to by INTERP elements with attribute WORDCLASS="NN" and find all U elements whose WHO attribute points to a SPEAKER element with attribute SEX="M".

2.7 Transformation

In addition to basic structural differences, typical query languages for relational data (such as SQL) devote considerable attention to data transformation: generating new tables or views, reports, and the like. This is also true of XSL (though XSL requires many quite different operations because it must produce ordered hierarchies, not tables), but it is quite different from the requirements on XPointer or generalized document retrieval queries.

2.8 Summary

None of these differences should be taken as a criticism of either RDBs or documents: Each model is highly valuable for the kind of data it was designed for. The point is simply that forcing either kind of data into the other model cannot work well. That's why XML models for tables are a traditional sore spot, just as supporting XML on top of RDBs is (other than as blobs). Of course in a Turing-complete system anything that can be done in one model can in principle be done in the other: at some level, it's all just bits. But the most natural operations and assumptions of the models are in stark contrast. Simulating either with the other is expensive to develop, hard to maintain, and slow.

Failure to address these fundamental differences has, in our view, seriously weakened many query languages proposed for documents. Some proposals are only weakly suitable for documents: they support only queries that would also make sense in tables. Using such query languages for XML documents is like editing XML with WordPad or vi: it can be done, but it is probably not the best way to do the job. Table-oriented query languages can easily be made to look good by careful selection of mostly tabular XML schemas and documents; evaluation of query languages for general use needs to be performed on a more realistic mix of document structures.

Typical approaches of the table-oriented sort add various operators such as CONTAINS, but have no way of managing text that is divided among several hierarchically-organized containers. Or, they may be able to retrieve elements that contain both an <abstract> and a <summary>, but not test what order those sub-elements occur in, or whether they are adjacent. No proposal of which we are aware does much at all with hyperlinked data (though see Abiteboul (1998) on the theory of incorporating links into querying). This makes perfect sense because notions such as before, after, and adjacent simply do not exist in relational databases, and links are present only in the form of foreign keys, without link-type or role information. But these notions are very commonly important in queries on document bases, and need to be design centers, not marginal or expensive extensions on the periphery; otherwise documents will be second-class citizens in the retrieval language.

Finally, it is worth noting that RDBs and documents are by no means the only data models or topologies around. Hierarchical file systems, for example, differ from both: directories are deeply and importantly hierarchical, but the members (files) of a directory are unordered. Also, where XML provides IDs that are unique across an entire document and across all element types, directories provide only locally unique filenames, and RDBs provide keys that are unique per key field per table. These represent quite different scopes, and therefore require different semantics for any operation dealing with unique identifiers. This places a roadblock in the path of those who adopt the otherwise appealing approach of extending directory-access languages to handle XML querying, or vice-versa.

3 Query requirements of XSL, XPointer, and general user querying

XSL, XPointer, and general user querying all share some characteristics. They are, after all, operating on the same ordered hierarchical document structures. This has led to repeated proposals to design a single query language that could be used for interactive general user querying, for XSL patterns governing style-rule selection, and for addressing in XPointer. Such a unified language is a worthwhile attempt, but is not necessarily the right approach just because there is some overlap. In particular note that a query takes rather different forms in different contexts:

Formally, this means that the functions actually needed may take any of the following forms:
Predicate X set of Doc -> Doc
Predicate X set of Doc -> set of Doc
Predicate X set of Doc -> list of Doc
Predicate X set of Doc -> list of Loc
Predicate X set of Doc -> set of Docfragment
Loc -> Predicate
Predicate -> set of Loc
Predicate -> < Doc, Handle >
Predicate X Loc -> Boolean

This section more specifically compares the general query-like facilities needed in these different systems. Note that the requirements for XSL and XPointer are reasonably well understood and documented; requirements for general querying and schema checking have not been worked out in similar detail, and in those columns we are relying on our own judgement about plausible requirements.

3.1 Predicate (test) types

The kinds of data characteristics that can be tested via a query are largely the same in all applications. This is because the XML information structure only contains certain information (the final definition is being worked out in the DOM and XML Information Set WGs), and that is just the information that must be accessible. Thus the first set of requirements is quite similar:

Test element types (GIs)YYYY
Test attribute valuesYYYY
Test ancestors’ GIs and attributesYYYproposed
Test order relative to siblingsYYYY
Test for place in a pattern of siblingsYYY?
Test descendants’ GIs and attributes?YYproposed

3.2 What can be found (to return)

The different applications have similar but not identical needs for the types of data portions that can be returned. For example, all applications should be able to return any node of an XML tree, but there is seldom need for general querying to find a single character or a point selection (while XPointer has this as a very important requirement). XSL needs to address non-elements for various purposes, such as adjusting a single character, formatting drop-initial capitals, and so on. If schema validators are to test the lexical structure of attribute values and the content of elements, they will also require addressing to the character level. Intermediate results may take forms which cannot be returned as the final result of the query.

Addressing charactersYYNY
Addressing strings not broken by markupYYY?
Addressing elementsYYYY
Addressing attributesNY?Y
Addressing processing instructions?YN?
Addressing selections (anything a user can drag)NYNN

3.3 Result-set size

Only in XPointer and some situations in schema checking is it an error to find no result. User querying often returns no hits (it is, after all, an exploratory activity), while an XSL stylesheet may well define styles for elements that do not occur in some of the documents the stylesheet is used on (such as the many optional elements in most DTDs). On the other hand, XPointer may have little need for pointing to sets of locations, while user querying surely must. Thus, in this area we get into substantive differences in requirements.

Finding no match, is an errorNoYesNoSometimes
Single returned location expectedN/AUsuallyNoSometimes
Support for multiple returned locationsNoNo(?)YesYes
Results sortedNoNo(?)OptionNo

3.4 General issues

These issues involve the way an entire application `thinks' of querying. The biggest difference here is that for XPointer and general user querying, a query expression is given, and data is searched for the match(es). But in XSL just the opposite happens: a piece of data is given, and the goal is to find a query that matches it. In particular, XSL must find the best matching query. If (as is typical) there is one style that applies for <p> in general and another for <p> within <footnote>, then the latter must be selected. In a schema language with CHECK statements analogous to SQL CHECK constraints, the CHECK statements might resemble pointers in some respects, and general user queries in others.

Query arbitration
(find best matching query given location)
Cross-document matchingNNYY
Target authenticationNYYY
Transform data before returnYN??

The feature set in any XML searching application should include ordinary operators such as Boolean operators, constants, variables, parentheses, and comparisons of the supported types. While there are many such operations in common, there are clearly significant differences as well. If a single language is to be devised that can handle all of these requirements, it must be done with great care to avoid compromising the specific requirements of each application.

As an example, if a language is designed with operations that can naturally return many hits, it may be awkward or confusing to subset it for handling only single hits: in such a language, whether multiple hits are returned depends not on the operators used, but on the data. Defining full double sets of singleton and multi-return operators would solve this question, but would not have a good effect on the clarity or unity of the language.

4 Functional requirements for a generalized query language for hierarchical, linked documents

This analysis is excerpted from a feature set the first author developed starting in April of 1998 based on a number of standard and proprietary languages for document and database querying. The feature set here tries to capture all the basic operations needed to find things in the structure, content, and linkage of hierarchical, linked document data. Many of these features can be built from others if the right core set is chosen and adequate ways to combine operations are also provided. We make no claim that this is the right way to package these capabilities -- it is not intended to be that at all. But a language that claims adequacy should be able to accomplish all these functions one way or another, and should not require ridiculously long workarounds to get them.

4.1 Datatypes used

Data objects occur in three places: in the data itself, in the expression of a query, and in the results generated by a query (intermediate and final query results). A query language must include atomic datatypes such as numbers, dates, truth values, and strings; operations for testing and combining data; as well as document parts such as (in XML) elements, attributes, PIs, and so on. Results are ordered (or potentially unordered) sets of locations in the data: handles to the particular data portions that match the query.

For example, find all chapters with attribute type='secret' shows most of these types in use:

The types needed include at least:

It is very important to specify clearly the relationship between the datatypes returned from queries and the datatypes in the documents being searched. It is a priori desirable to unify these two sets, and thus to achieve closure. However, in practice closure has proven frustratingly difficult to achieve in XML query languages. Some requirements may necessitate allowing more types in results than are strictly necessary in the data. For example, XPointer has the very basic requirement to model any user selection. This, after all, reflects the simplest, most basic interface for creating XPointers (select and create link), as well as many other user interactions. Yet only a small percentage of possible selections constitute single nodes in an XML document.

A language must also be very clear about what datatypes can be explicitly or implicitly converted to other types. Explicit conversions (or casts) should be available wherever they make sense for the datatypes involved. Implicit casts can, if carefully designed, make a language far easier to use, though they risk encouraging imprecision and uncertainty about the precise meaning of expressions.

For example, many queries can be vastly simplified if there is a rule that a query can be embedded in a larger query and treated implicitly as a Boolean value: if the sub-query returns anything, it is treated as true, otherwise as false. As one simple case, one might want to find all block quotes (<bq>) directly containing more than one paragraph (<p>). This can be done by looking for the second <p> child and seeing if it exists.

Find elements e where type is BQ and exists(e.child(2,P))
If we implicitly treat embedded queries as Boolean tests, we can simplify the query:
Find elements e where type is BQ and e.child(2,P)

The syntax here is for illustration. The is indicates a reserved relationship, namely that of an element to its element type name; while the dot indicates a relationship that must exist in the document structure relative to e. Such relationships are already well-handled by XPointer, and so we use XPointer syntax for such relationship predicates in most of the examples to follow.

We have not listed type conversions in the catalog of capabilities below, but they must be provided (whether explicitly or implicitly) in any actual language.

The means by which the language determines that a given portion of XML content or attribute value amounts to a date, a number (expressed in some numeric notation such as decimal, hex, or scientific notation), etc., must be solved as well. This is a slightly different problem than in RDBs, because all XML data starts out expressed as strings in the XML source, and because desired types for processing (such as integers) may be legitimately represented in multiple ways in the XML. Type declarations in XML schemas will solve the problem for some documents, but not for all.

4.2 Specific operators or predicates

We have tried to make the set of capabilities below symmetrical, since lack of symmetry reduces language functionality (often in surprising ways), and increases complexity. Asymmetry can result from designing a syntax too early, on the basis of only a few simple cases: the result is that the cases foreseen by the designers are simple, but unforeseen cases become impractically convoluted. Considering in advance the full range of requirements makes it easier to design a language where simple things are simple to express, moderate things are merely moderate, and only truly hard things are hard to express.

We have not separately addressed making the query language namespace-aware. In principle, any reference to an element type should be replaceable by a reference to a namespace-qualified name. This requires a uniform way to qualify any element type referenced in a query. Operators are also needed to query aspects of namespace use itself: find all documents using namespace N, that intermix elements from namespaces N and M, and so on.

Many of the functions listed can be useful either as a request ("get nth child of X") or as a predicate ("is this thing I've got, the nth child of X?"). We have not listed both variants unless their formulation seems non-obvious. A simple rule is used in many languages to save on such definitions: any request can be treated as a predicate: if the request in fact finds anything, it comes out true; otherwise, false. For, example, a request for the third child of type <P> can be treated as a test for whether there are at least three <P> elements in a given context. Testing for the total number of nodes fitting a given condition requires an explicit operator that returns that number (commonly called the cardinality operator).

Many of the operations listed can have mirrored operations: the operation X precedes Y has a mirror pair in Y follows X. We have not listed the extra forms, though in some cases it is convenient or necessary to provide them (such as means for getting to preceding siblings as well as following).

The operations are organized under these categories:

Structural information
Providing access to the information structure in terms of children, ancestors, siblings, links, attributes, etc. This is where most of the operations appear that are not common to query languages for non-nested, non-ordered data.
Content information
Access to the atomic datatypes expressed in XML by characters in content and attribute values. These operations are more familiar from traditional database and IR applications, since they can just as easily apply to flatter data objects.
General query functionality
Operations and predicates that combine or compare information about structure and content, allow users to express queries more clearly and briefly, constrain the location or quantity of search results, and so on.
Aggregate operations
Operations that deal with query results, organizing them into sets, ordered sets, or other types.

4.3 Structural information

(see also the notes following the table)

4.3.1 Structural predicates

Equivalence predicates
Location X is the very same place as Y
X has exactly the same content/structure as Y
Sibling (horizontal) predicates
X is a sibling of Y
X is the adjacent sibling of Y
X is preceding sibling of Y
X is immediately preceding sibling of Y
X fits location L in node pattern P:
Find \1 in "TI, \([P Q]+ \), UL, [^UL OL]* $"
(finds the sequence of <P> and <Q> elements after the <TI> element and before the <UL>)
Precedence predicates (global)
X precedes Y (start tag is earlier)
X directly-precedes Y
X directly-precedes Y up to/at n nodes
Global patterns (compare above under sibling predicates)
Containment (vertical) predicates
X is a child of Y
X is a descendant of Y
X is a descendant of Y up to/at n levels
Vertical (ancestry) patterns (compare above under sibling predicates)
Link predicates
X directly links to Y
X indirectly links to Y
X indirectly links to Y at/up to n levels
X and Y are directly linked (either direction)
X and Y are indirectly linked (either direction)
X and Y are indirectly linked at/up to n levels
Link overlap predicates
Part of X directly links Y
Part of X indirectly links Y
Part of X in directly links Y at/up to n levels
X directly links part of Y
X indirectly links part of Y
X indirectly links part of Y at/up to n levels
X and Y are partially directly linked
X and Y are partially indirectly linked
X and Y are partially indirectly linked at/up to n levels

4.3.2 Structural operations

Child operations
Get first child
Get last child
Get nth child (signed)
Path expressions (node.node.node to traverse tree, with choice predicate at each step -- see OQL, XPointer, etc.)
Descendant operations
Get first descendant
Get last descendant
Get nth child (signed)
Get number of levels of X below Y (or root)
Ancestor operations
Get first ancestor (=parent)
Get last ancestor (=root)
Get nth ancestor (signed)
Get-ancestor-or-self such-that
Sibling operations
Get first-sibling
Get last-sibling
Get nth-sibling (signed)
Get first-sibling-or-self
Get last-sibling-or-self
Get nth-sibling-or-self (signed)
Precedence operations (global)
Get first-node
Get last-node
Get nth-node (signed)
Link operations
Get node linked from X via link L as linkend E
(L and E may be predicates on links)

4.3.3 Notes

Many relationships have variant forms where the relationship is immediate, indirect of arbitrary distance, indirect but of limited distance, or indirect but of precise distance. There seems no principled reason to exclude these variants for some relations while excluding them for others; however, some are likely to be less used than others. In this list we show first, last, and n-th operations because they are so commonly needed.

Of course one can simulate first via a get place number operator and filtering out all returns for which it returns other than 1. However, doing so has problems: (a) It obscures the user's actual intent; (b) it is harder for users to figure out how to do a simple operation; and (c) it makes it harder to implement query optimizers, which have more opportunities with high-level specifications than with low-level ones.

A language must allow operations and predicates to be combined, so that users can, for example, ask for the first, last, or nth element such that it fulfills a given predicate (however simple or complex the predicate may be). In a reasonable syntax design this should not require extra operations, just the combination of more basic ones shown here.

XPointer has long included a simple way to combine these operations: An instance-number argument that takes signed values (or an ALL keyword): id(foo).child(1,P). Negative numbers count backwards from the end of the range involved, so -1 provides last functionality.

There have been proposals to extend this parameter to allow multiple comma-separated values, thus covering the range functionality. This is very simple to specify syntactically, and can be quite useful, especially when a range is needed. There is also a case for making the range function independently (say, as a location term or suffix instead of a parameter); this gains some additional power because it can be applied either before or after another selection operation. This is similar to the HyTime listloc approach.

Link predicates need to be able to be conditioned on link type: it is necessary to be able to say X is linked to Y but it is also necessary to be able to provide more information: X is linked to by a link of type "critical" as the linkend with role "reason". The more commonly hypertext links are used to enrich documents with more information, the greater will be the necessity of distinguishing among different hyperlinks by type, and among their link-ends by role.

The sibling-or-self and ancestor-or-self operations may seem surprising at first glance. They are included because experience with structural retrieval and characterization languages has shown they are highly useful. For example, a hyperlink may want to specify a particular sub-resource target, that might be a selection, paragraph, or other small data portion. But for display, application or even legal conventions may require that an entire <CHAP> be accessible as context (in concrete terms, there may be a requirement that the entire <CHAP> be accessible by scrollbar). The obvious way to do this is to specify the display scope as (in XPointer terms) ancestor(1, CHAP). But if the link happens to be to the entire <CHAP> to start with, that specification will fail (or, if the <CHAP> is within another <CHAP>, it will leap to the larger one). These operations permit the starting node to satisfy the predicate. In mathematical terms, they support closed intervals, not just open intervals. Like many of the other operations, they can be simulated.

4.4 Content information

Property predicates
X has attribute named N
X is of same namespace throughout descendants
Property operations
Get node class of X (element, pi, comment, etc)
(could also have is-a predicate for each)
Get originating entity of X
Get element type of X
Get namespace of X
Get all namespaces used within X
Get attribute named N of X
Get 1st/last/nth attribute of X (??)
Content operations
Get text content
Get chars n-m of text content
Get 1st/last/nth match of regex R
Natural-language operations
Match regardless of case distinctions
Match regardless of accents
Match ignoring punctuation distinctions (", " vs " ")
Match on word roots (stemming)
Hyphenated words
Find (string or regex) match with proximity in characters
(all proximity operators may be ordered or unordered)
Find match with Proximity in words
Find match with Proximity in sentences
Find match with Proximity in elements

4.5 Comparison operations

Comparison is among the most basic operations required of any query language. Equality comparison can be applied to any datatype (including aggregates), but natural language strings pose some complications, such as the need for comparison while ignoring case, accent, punctuation, transliteration, digraph, or other distinctions, and restricting a string to being a whole token. Although the full complexity of these problems may not be soluble in the short term, some of these capabilities are so basic as to be requirements. A query language incapable of searching for the and The simultaneously would be unacceptable to much of the world, as would a language that permitted it only by lengthy workarounds such as [tT][hH][eE].

Inequality comparisons are more complex when the values are not numeric; strings in particular have sort ordering rules that are enormously complex, and vary from one language, region, and discipline to another.

String-contains (case sensitivity?)
Contains-at start/end (=regex ^…$)
Contains-as-token (=regex \<…\>)

4.6 Boolean operations

Without these operations, of course the range of conditions that could be tested would be severely limited.
Xor (of marginal use)

4.7 Arithmetic operations

Arithmetic operations are useful since they enable a range of predicates that cannot otherwise be simulated. For example, they allow testing that two parts of data are consistent, such as for checking online orders:

   find elements e where e is ORDERED-ITEM
      and e.attr(price) * e.attr(quantity) <> e.attr(total)

These operations could be limited to integer and real datatypes; in some cases it may make sense to also support date, time, string, or other types for some operations (such as subtracting dates and times to find elapsed time).
Absolute value

4.8 Miscellaneous features

String, numeric, date, time, and Boolean constants
Macros or "let" binding operator

4.9 Aggregate operations

Aggregates comes in many types. The most important ones within the query language (not including the various containers found in the data itself) are:

Members are not ordered, and contains no duplicates
Ordered set
Ordered, and may not contain duplicates

A query language must support the intrinsic operations relevant to these, since any query returns such objects: sets or ordered sets of locations.

Aggregates permitting duplicates may also be needed; this seems unclear. If they are needed, then the consequent conversion operations between aggregate types are as well.

It is probably acceptable to avoid the complexity of nested aggregates by asserting that result sets (ordered or unordered) cannot contain other result sets as members; of course they can still contain members which specify locations that are container elements - quite a different thing. Thus, any operation that generates lists of locations, even if nested, finally collapses them to a single, ordered location list without duplicates.

Since nodes in a tree do not exist entirely independently, but have relationships such as containment, operations that respond to partial or complete containment are also useful. For example, if a query returns all the descendants of a given chapter, the user might want to just get the chapter instead; operators to express this are invaluable as the amount of data being queried scales up.

A key problem is that some applications wish to apply the same queries at both the intra-document level, where things are ordered and there can be duplicate names (element types); and at the inter-document level, where things are unordered and there cannot be duplicate names (filenames). Integrating these two quite different structures is not simple.
Set/Ordered set predicates
X is the same set as Y
X is a set with all members structurally identical to those in Y
Set/Ordered set operations
Construct set from items
Extract members that satisfy predicate P (selection)
Map each member M to expr(M) (apply)
Ordered set operations
Get first member
Get last member
Get nth member (signed)
Ordered set of tree-nodes operations
Delete nodes with ancestors also in list
Delete nodes with children also in list
Delete nodes with descendants also in list
Delete nodes with all ancestors also in list
Delete nodes with all children also in list
Delete nodes with all descendants also in list
Replace sibling sequences by parents

4.9.1 Notes:

Many of these operations can be built up from more primitive ones. For example, Replace sibling sequences by parents is extremely useful in practice, but can be done via a more general aggregate-mapping operation, that maps all members to their parent. Either way, would presumably automatically eliminate duplicates parents unless aggregate types supporting duplicates prove necessary (in which case an explicit remove-duplicates operation would be needed).



[1] Note that the optimization of XML for hierarchical data makes it less adept at tabular data. That is why tables often seem awkward in markup systems: there is no principled reason to make rows or columns more important, but to represent a table as a tree one must do so. Most DTDs privilege rows; for example, that is why HTML has no direct representation of table columns, and cannot center-justify a whole column. CALS tables can justify columns, but only at the expense of a block of empty column-definition elements elsewhere.
[return to text]

[2] The example is due to Makoto Nagao of Kyoto University.
[return to text]

[3] For example, a representation in third normal form in which elements appear as entries in an elements table with columns for elementId, gi, parent, eldestChild, rightSibling, etc.
[return to text]

[4] The one-level query might be expressed thus:

SELECT c.elementId
  FROM elements c, elements p
 WHERE c.gi = 'FN'
   AND p.gi = 'CHAP'
   AND c.parent = p.elementId
or thus, using a nested SELECT:
SELECT elementId
  FROM elements
 WHERE gi='FN'
   AND parent IN
       (SELECT elementID
          FROM elements
         WHERE gi='CHAP')
A query that handles one or two intervening elements is somewhat longer:
SELECT elementId
  FROM elements this, elements mother, elements grandma, elements ggma
 WHERE this.gi = 'FN'
   AND ((this.parent = mother.elementId
   AND mother.gi = 'CHAP')
    OR (this.parent = mother.elementId
   AND mother.parent = grandma.elementId
   AND grandma.gi = 'CHAP')
    OR (this.parent = mother.elementId
   AND mother.parent = grandma.elementId
   AND grandma.parent = ggma.elementId
   AND ggma.gi = 'CHAP'))
Further discussion may be found in Steven J. DeRose and David G. Durand (1994), p. 187. The query is simpler in languages with a primitive notion of indirect containment: the HyQ query language defined by the HyTime standard can seek all sections containing a footnote thus:
<HyQ qdomain=mydoc>
   Select( DOMTREE And(
      Eq(Proploc(CAND GI) "SECTION")
      Select( DOMTREE Eq(Proploc(CAND GI) "FOOTNOTE") ) ) )

[return to text]

[5] The SQL query might now be expressed thus:

SELECT d.elementId
  FROM elements d, elements a, ancestors da
 WHERE d.gi = 'FN'
   AND a.gi = 'CHAP'
   AND da.descendant = d.elementId
   AND da.ancestor = a.elementId

[return to text]

[6] SGML veterans may be reminded, not accidentally, of the SGML CONREF feature, which is too weak and too inconsistently implemented to be seriously useful in this way, but which did provide a limited form of the functionality described here.
[return to text]

HTML generated 18 Nov 1998