Observations on Structured Document Query Languages

Makoto Murata, Fuji Xerox Information Systems (murata@fxix.fujixerox.co.jp)

Jonathan Robie, Texcel Research (jonathan@texcel.no)

This paper briefly explores four topics related to queries:

Integrating structured and full-text queries
Transformations
Optimization, query processing, and indexes
Theoretical foundations for order-sensitive queries

One of the most valuable parts of this paper is the bibliography, which includes some significant articles that have not attracted as much attention as we believe they deserve. The first and third sections of this paper are primarily short overviews of the work to which we refer. The section on transformations explores whether transformations need to be performed in the query language itself, as opposed to doing transformations in an external environment. The final section, on theoretical foundations for order-sensitive queries, shows some preliminary ideas on how various formalisms might be applied in order to explore expressive power, complexity, and other attributes of structured document queries.

Integrating structured and full-text queries

Several researchers have pointed out that structured search and full-text search can be integrated (see, e.g., [BYN:96]). Many languages, including PAT, integrate searches in this way. The examples we show in this section are based on XQL syntax, but are not part of the current XQL proposal - our intent is merely to show the utility of this approach, not to propose a particular syntax. (In fact, there are still issues that must be resolved before this syntax could be added to the current form of XQL, but in this section we will ignore these issues.)

The following query searches for <author> elements that contain the string "DeToqueville":

author["DeToqueville"]

The following query searches for <author> elements that contain the string "DeToqueville" anywhere in their tree of descendents:

author[.//"DeToqueville"]

The following query searches for <author> elements containing both the string "Alexander" and the string "DeToqueville":

author["Alexander" and "DeToqueville"]

The following query searches for <author> elements where the string "Alexander" occurs before "DeToqueville":

author["Alexander" before "DeToqueville"]

In an environment where the results of queries can be returned as positions or ranges within documents, it can be useful to return the text itself. The following query searches for "DeToqueville" wherever it occurs in the current search context:

"DeToqueville"

The following query searches for "DeToqueville" wherever it occurs in a document:

//"DeToqueville"

Wildcards should also be supported:

//"DeToq*"

In addition, functions for proximity searching might be useful. The following returns <LINE> elements in which "rose*" and "sweet*" occur within 10 words of each other:

LINE[near("rose*", "sweet", 10)]

This would match lines like these:

<LINE>A rose by any other name would smell as sweet.</LINE>
<LINE>Sweet roses grew along the south side of the fence.</LINE>
<LINE>She rose and smiled sweetly at the purple dwarf under the bucket.</LINE>
<LINE>Say, has anybody seen my Sweet Gypsy Rose?</LINE>

Proximity searching requires some way to indicate how close the strings must be in order to match. This causes a difficulty when choosing the units in which proximity is measured. In existing full-text systems, distance is frequently measured in terms of words, which raises a number of significant questions regarding internationalization, but is probably an intuitive way to measure distance for most users.

Transformations

Transformations allow data to be restructured, allowing arbitrary structures to be created from the data of existing documents. Many query environments combine transformations with queries to provide general-purpose data manipulation; for instance, SQL allows new tables to be created by joining existing tables and allows tables to be sorted, XML-QL allows data to be restructured using general-purpose transformations in the CONSTRUCT clause of a query, and XSL provides general-purpose tree-to-tree transformations using a query-like pattern language to determine which nodes should be transformed. In this section, we consider whether a query language like XQL should also provide general-purpose transformations.

XML-QL supports transformations as part of the query language. Transformations are done by assigning variables in the WHERE clause, and using these variables in the CONSTRUCT clause to create a new tree. Here is one example, taken from the XML-QL proposal:

WHERE <book>
          <publisher><name>Addison-Wesley</></>
          <title> $t</>
          <author> $a</>
        </> IN "www.a.b.c/bib.xml"

  CONSTRUCT <result>
               <author> $a</>
               <title> $t</>
            </>

A different approach to transformations is discussed in [Schach98]. This paper uses XQL within XSL, capitalizing on the powerful tree transformation capabilities of XSL. For some kinds of queries and in some environments, it may be preferable to perform an XQL query first, then feed the results of that query into XSL to perform transformations.

To some extent, all of these approaches separate information retrieval from transformation. Although the underlying model of XML-QL integrates information retrieval and transformation, the syntax separates the information retrieval conditions (in the WHERE clause) from transformation (in the CONSTRUCT clause). In the approach of [Schach98]or other approaches that use XSL, transformations are done using a completely different language from the query language. Once we realize that transformations can be separated from the query language, we see that there are many reasonable approaches to performing transformations, e.g:

XSL transformations (as in [Schach98])
Document Object Model programs
A declarative transformation language (as in the XML-QL proposal, or perhaps using a separate declarative transformation language that is not part of the query language)

Separating transformations from the query language itself simplifies the query language significantly. On the other hand, a query language that also includes transformations is more integrated, and in some cases it can be more efficient than a loosely coupled approach to transformations.

One pragmatic approach is to consider which transformations are particularly useful in the query language itself, as opposed to transformations that can be performed just as well in a separate transformation environment like XSL. Two transformations seem worth considering prima facie: return operators (discussed in http://www.texcel.no/whitepapers/xql-design.html and sorting.

Return operators have functionality quite similar to that of SELECT clauses in SQL (though the syntax integrates them with the conditions expressed in the equivalent of a WHERE clause). There are two reasons to consider having return operators in the query language itself:

Users frequently want to control how much of a document is returned from a query, and return operators are much less complex when integrated into XQL.
Returns can be implemented much more efficiently as part of the query processor, and the choice of information to be returned also has ramifications for query optimization. Having return operators in the query language means that the database system has the information it needs to perform better query optimization, queries do not need to load extraneous information from the database, and less information needs to be sent to the transformation environment.

Consider an environment in which queries are executed on the server side, but transformation is done on the client side. If a query were done that included information from the beginning and end of a document, a language that does not support return operators would force the entire document to be sent to the client for the transformation to be performed. With return operators, only the information that is actually supposed to be returned is sent to the client, and transformations can be done using the resources of the client, freeing up the server for other tasks. Even if transformations are done on the same machine (in different processes), an efficient index implementation often allows the query to be evaluated without accessing the data itself, but an implementation that does not support return operators may be forced to load this data so that it can be pruned out again by the transformation engine, forcing both the query and the transformation processes to manipulate potentially large amounts of data that need not be retrieved in the first place.

Sorting is another operation that is traditionally performed as part of a query language. However, when XML is being queried, we assume that the children of a document node are normally found in the order in which they are used. A repository may be able to present these children in a different order, but in that case a different logical document is being queried. When a sort order is changed, it is usually changed as part of a larger transformation of the document, and it is unlikely that the database system would have indexes that allow this transformation to be done in a particularly efficient way. If this is true, then there is little advantage to performing the sort in the query language rather than the transformation environment.

Optimization, query processing, and indexes

A query language that will be used for large-scale repositories must be amenable to query optimization and efficient query processing, and it must be possible to build appropriate indexes on the data to support such queries. Since some of the better articles on these topics are somewhat difficult to find via a typical literature search, we felt it would be helpful to introduce them here. More articles on the subject will be found in the bibliography.

In [Macleod1989], Macleod describes a query language with capabilities similar to XQL; he also describes an index structure to support it. [Dao1997] describes a different approach to indexing structured data, and provides more information on query processing and optimization.

Navarro and Baeza-Yates [NB1997] focus on cababilities needed for a query language rather than a query language itself, and they focus more on querying processing than on index files. Their literature review is also extremely helpful.

[Clarke1995] is a particularly helpful article that provides an algebra for structured text search, defines higher-level operators based on this algebra, and shows how a query optimizer can be designed based on the algebra. It builds on the approach of [Burkowski1992]. Another very useful paper that uses a similar approach is [JK].

Theoretical foundations for order-sensitive queries

One of the strengths of the XML-QL proposal is its basis in theory (see the bibliography of [XML-QL]). Unfortunately, the theoretical foundation of XML-QL does not extend to order among document structure or content, but several mathematical formalisms exist that encompass both hierarchical and sequential relationships.

In this section, we consider several formal frameworks for query languages, attempting to provide an intuitive understanding of the basic aspects of each framework and some of the ways it might be used to evaluate query languages. These formalisms are useful for a number of purposes; for instance, we will discuss formalisms that are useful for comparing the expressive power of query languages, determining computational complexity, and designing query optimizers.

In order to compare the expressive power of query languages, we need formal frameworks that can encompass the structure of all query languages being considered. At least four such frameworks exist that are powerful enough to describe the languages discussed in the previous paragraph, as well as the retrieval portion of XML-QL. [Emerson] provides an overview of four such frameworks: branching-time propositional temporal logic, tree automata, tree regular expressions, and second-order monadic theory (with many successors). Emerson shows these four frameworks to have equivalent power for strings; when he tries to apply the same approach to trees, he concludes that not much is currently known about the relative power of these frameworks. Since query languages for XML inherently involve trees, this means that much work is still to be done before we can adequately compare systems described in one such framework with systems described in another. However, each of these frameworks is promising as a means of comparing query languages. All of these frameworks have greater expressive power than path expressions; in particular, they can impose conditions on sibling nodes and subordinates of siblings.

1. Regular path expressions

Query languages studied in the semistructured database community often use path expressions, which are regular expressions defined for the nodes of a path. A node matches the path expression if the path of this node is contained by that set. For instance, the XQL-ML proposal allows regular path expressions like these:

part*	Any hierarchy of zero or more elements in which every element has the element name "part".
*.part	Any hierarchy of elements consisting of a possibly empty series of ancestors and a part element.
part+	A non-empty hierarchy of elements that each have the name "part".
part\|component	A path consisting of one element, which is either a "part" element or a "component" element.
part+.(subpart\|component.piece)	A hierarchy containing a non-empty hierarchy of parts, which contain either a subpart or a component, which contains a piece.

However, path expressions cannot capture conditions on sibling elements, sibling of ancestor nodes, decendants of sibling nodes, etc. For example, we cannot state "Find a paragraph A that is followed by a figure B and that A and B are descendants of the same section". In other words, path expressions cannot take advantage of the order or tree structure of XML documents.

There are many query languages that can take advantage of order and tree structure. XQL is one example of such a language; PAT, an innovative language which has been used in both commercial and academic settings, is probably the best known and most influential language in this category. Other languages in this class are described in [BYN1996], which is a survey paper. Two more recent languages, not described in [BYN1996], are described in [Dao1997] and [NB1997]. It is difficult to compare the expressive power of these languages precisely, since they are based on very different theoretical frameworks, or sometimes even developed in an ad-hoc manner with no clear theoretical framework.

2. Tree Regular Expressions

Tree Regular Expressions [Thatcher1967] can be considered an extension of Path Expressions to allow them to represent tree structures. Although tree regular expressions are better known, hedge regular expressions(also known as forest regular expressions) [PQ1968] describe ordered sequences of trees, which correspond more closely to the structure of an XML document; hence, they are more directly applicable to XML query languages.

In a regular expression like "ab*c", the symbols are individual characters. Like path expressions, the symbols of a hedge regular expression are individual nodes. Hedge Regular Expressions use angle brackets to represent the tree structure relating nodes. The children of a node are placed. in angle brackets; if a node has no children, it is shown with empty angle brackets. For instance, the pointed hedge expression "b<>" describes a hedge consisting of a "b" node with no children, and the expression "a< b<> c<> >" describes trees with a root node labelled "a", containing two children: a "b" child followed by a "c" child.

Hedge regular expressions may also contain variables, traditionally labelled with the variables x, y, and z; such variables are used to indicate the place in a hedge where an operation occurs. For instance, the expression "a<x,x>*x(a|b)" is equivalent to "a<(a|b),(a|b)>*" . This matches any sequence of trees, each having an "a" root element and two children, where each of the children is either an "a" or a "b". This is an example of arbitrary sequence; there are similar operators that specify arbitrary depth.

However, hedge regular expressions are not sufficiently powerful to capture conditions on non-subordinate nodes such as ancestors, siblings, or descendants of sibling nodes. For purely theoretical reasons, [Nivat and Podelski] introduced Pointed Tree representations to capture these kinds of conditions for binary trees; however, as [Murata1998] has pointed out, their work also extends to hedges in a manner that is quite practical for the issues we are considering in this paper.

[Murata1998] introduces a means to use hedge regular expressions within a path expression; we call an expression which does this a pointed hedge representation. The pointed hedge representation for a given node consists of a triplet, e.g. <r1, s, r2>, where r1 and r2 represent hedge regular expressions. The first expression, r1, captures conditions on elder siblings of s; the second, r2, captures conditions on younger siblings.

Like temporal logic expressions, hedge regular expressions and pointed hedge representations may be used to compare the expressive power of query languages. Although they are much harder to read than temporal logic expressions, pointed hedge expressions have one important advantage: every hedge regular expression is equivalent to a hedge automaton, which can be used to evaluate the complexity of query processing algorithms. Just as string regular expressions and finite state automata defined for strings are equally expressive, hedge regular expressions and hedge automata are equally expressive.

3. Tree Automata

Tree Automata [Thatcher] are similar to finite state automata, but operate on trees rather than on strings. Like a finite state automaton, a tree automaton consists a set of states, a transition function, and a set of final states. Instead of traversing a string, a tree automaton traverses a tree from leaf nodes to the root node. For each node, it computes a state by applying the transition function to the symbol of that node and the states assigned to its subordinate nodes. A tree automaton accepts a tree if the state attached to the root node is one of the final states; subtrees can be evaluated in the same way.

In most of the literature, tree automata are assumed to operate on trees where each node has a fixed number of children, i.e. binary trees or n-ary trees. However, in the context of documents, it is important to assume that each node may have a different number of children. As it happens, the mechanisms of tree automata do not need to be modified, since they already cover this case adequately; nevertheless, we will use the term hedge automata (or forest automata) to emphasize the fact that we are defining operations on hedges, not merely on trees of fixed arity.

In a query environment, the result of a query may be seen as the set of subtrees accepted by the automaton, where the automaton corresponds to a hedge regular expression that express the query. The complexity of a query that can be expressed by a hedge automaton is always linear with respect to the number of nodes in the tree (as a worst case), which means that any query that can be captured by a hedge regular expression is also no worse than linear with respect to the number of nodes in the tree - an important result for any query language implementation.

4. Temporal Logic

Temporal logic is generally used to describe propositions that are evaluated with respect to time. The most common form of temporal logic is linear temporal logic, in which each proposition is an axis which may have different truth values at different points on the axis. For instance, the proposition "it is raining" may be true at 4 p.m., but not at 5 p.m. Branching temporal logic is used to describe multiple possible futures; for instance, at 5 p.m., the proposition "I am in Boston" may have different truth values, depending on whether the flight is delayed.

[Scotts-Furuta-Ruiz] and [Beeri-Kornatzky] have pointed out that the formalisms of temporal logic may be applied to queries for hypertext, both to describe time-sequences of browsing events and to describe constraints on the tree structure of documents. In the latter case, time operators like "in the future" or "in the past" are equivalent to hierarchical relationships like "descendant" or "ancestor". For example, to express the constraint that a section contains a footnote, we can say is-a(section) & "in-the-future" is-a(footnote). Such a constraint may also be used as a query, where the result of the query is the set of elements for which this constraint is satisified.

Query languages are a formalism for expressing constraints on trees, but query languages are used by many people who are not mathematicians. One advantage of temporal logic is that although it is a formal system, the language used for propositions is quite friendly, making it easier to introduce formalisms to those not accustomed to the dense manner in which they are normally expressed. Two temporal logic expressions may be shown to be equivalent, and transformations can be defined to transform one temporal logic expression to another. This is enormously useful in designing query optimizers and query processors.

However, because branching-time temporal logic cannot distinguish younger siblings and elder siblings, it needs to be extended to be able to express the constructs of a query language like XQL. This can be done by introducing further modal operators such as "in one of the younger sibling point" and "in the nearest elder sibling point".

5. Monadic Theory

[Consens and Milo] used monadic theory with many successors (a first-order theory of binary trees) to explore the expressive power and complexity of a variation of region algebra, showing that the region algebra they studied was unable to capture properties related to the nesting of regions, and also unable to capture order-related properties. They also used this framework to examine the computational complexity of several query-related problems. This paper is a good example of the use of formal frameworks to establish the properties of a query language.

First-order monadic theory with one successor has the same expressive power as propositional linear temporal logic; second-order monadic theory with one successorhas the same expressive power as qualified propositional linear temporal logic. [Consens and Milo]

Bibliography

[BYN1996] R.Baeza-Yates, G.Navarro Integrating contents and structure in text retrieval. SIGMOD Record, 25(1):67--79, Mar. 1996.

[Bertino1988] E.Bertino, F.Rabitti, and S.Gibbs. Query processing in a multimedia document system. ACM TOIS, 6(1):1--41, 1988.

[Blake1992] G.E. Blake, T.Bray, and F.W. Tompa. Shortening the OED: Experience with a grammar-defined database. ACM TOIS, 10(3):213--232, 1992.

[BCKLST1994] G.E. Blake, M.P. Consens, P.Kilpelainen, P.-A. Larson, T.Snider, and F.W. Tompa. Text/relational database management systems: harmonizing SQL and SGML. In ADB-94, LNCS, pp. 267--280. Springer-Verlag, 1994.

[Burkowski1992]F.J. Burkowski. An algebra for hierarchically organized text-dominated databases. Information Processing \& Management, 28(3):333--348, 1992.

[Christophides94] V.Christophides, S.Abiteboul, S.Cluet, and M.Scholl. From structured documents to novel query facilities. SIGMOD Record., 23(2):313--324, 1994.

[Clarke1995] C.L.A. Clarke, G.V. Cormack, and F.J. Burkowski. An algebra for structured text search and a framework for its implementation. The Computer Journal, 38(1):43--56, 1995.

[Colby1994] L.S. Colby, D.VanGucht, and L.V. Saxton. Concepts for modeling and querying list-structed data. Information Processing & Management., 30(5):687--709, 1994.

[Consens1994] M.P. Consens and T.Milo. Optimizing queries on files. In SIGMOD 1994, pp. 301--312, 1994.

[Consens1995] M.P. Consens and T.Milo. Algebras for querying text regions. In PODS 1995, pp. 11--22, 1995.

[Dao1997] T.Dao, R.Sacks-Davis, and J.A. Thom. An indexing scheme for structured documents and its implementation. In DASFAA '97, 1997.

[Emerson] E.A. Emerson. Temporal and modal logic. In J.van Leeuwen ed., Handbook of Theoretical Computer Science. Elsevier Science Publishers, 1990.

[GS1984] F.Gecseg and M.Steinby. Tree Automata. Akad\'emiai Kiad\'o, Budapest, 1984.

[GBS1992] G.H. Gonnet, R.A. Baeza-Yates, and T.Snider. Lexicographical indices for text: Inverted files vs. PAT trees. In R.A. Baeza-Yates and W.Frakes eds., Information Retrieval: Data Structures and Algotithm, pp. 66--82. Prentice-Hall, 1992.

[GT87] G.H. Gonnet and F.W. Tompa. Mind your grammar: a new approach to modelling text. In VLDB 1987, pp. 339--346, 1987.

[GPG94] M.Gyssens, J.Paredaens, and D.VanGucht. A grammar-based approach towards unifying hierarchical data models. SIAM Journal on Computing, 23(6):1093--1137, Dec. 1994.

[SGML86] ISO. Standard Generalized Markup Language (SGML), 1986.

[HyTime92] ISO. Hypermedia/Time-based Structuring Language (HyTime), 1992.

[DSSSL96] ISO. Document Style Semantics and Specification Language (DSSSL), 1996.

[JK] Jani Jaakkola and Pekka Kilpeläinen Nested Text Region Algebra as yet unpublished, draft version June, 1998.

[KM1995] P.Kilpeläinen and H.Mannila. Ordered and unordered tree inclusion. SIAM Journal on Computing, 24(2):340--356, Apr. 1995.

[Macleod] I.A. Macleod. SGML Documents and Non-linear Text Retrieval.

[Macleod1990] I.A. Macleod. Storage and retrieval of structured documents. Information Processing and Management, 26(2):197--208, 1990.

[Macleod1991] I.A. Macleod. A query language for retrieving information from hierarchic text structures. The Computer Journal, 34(3):254--264, June 1991.

[Murata1997] M.Murata. A theoretical foundation of the DSSSL location model. Math. Comput. Modelling, 25(4):95--107, 1997.

[Murata1998] M. Murata. Data model for document transforma tion and assembly (extended abstract). In Principles of Digital Document Processing '98, Vol. 1481 of Lecture Notes in Computer Science. Springer-Verlag Inc., 1998.

[NB1997] G.Navarro and R.Baeza-Yates. Proximal nodes: a model to query document databases by contents and structure. ACM TOIS (to appear), 1997.

[NP1993] M.Nivat and A.Podelski. Another variation on the common subexpression problem. Discrete Mathematics, 114:379--401, 1993.

[PQ1968] C.Pair and A.Quere. Definition et etude des bilangages reguliers. Information and Control, 13(6):565--593, Dec. 1968.

[Podelski1992] A.Podelski. A monoid approach to tree automata. In M.Nivat and A.Podelski eds., Tree Automata and Languages, pp. 41--56. North-Holland, 1992.

[SDMZ1995] R.Sacks-Davis, T.Arnold-Moore, and J.Zobel. Database systems for structured documents. IEICE Transactions on Information and Systems, E78-D(11):1335--1342, Nov. 1995.

[ST1992] A.Salminen and F.W. Tompa. PAT expressions: an algebra for text search. Acta Linguistica Hungarica, 41(1--4):277--306, 1992--1993.

[Takahashi75] M.Takahashi. Generalizations of regular sets and their application to a study of context-free languages. Information and Control, 27:1--36, 1975.

[TEIP3] Guidelines for electronic Text Encoding and Interchange. ACH/ACL/ALLC, April 1994.

[Thatcher1967] J.W. Thatcher. Characterizing derivation trees of context-free grammars through a generalization of finite automata theory. Journal of Computer and System Sciences, 1:317--322, 1967.

[Thomas1990] W.Thomas Automata on infinite objects. In J.van Leeuwen ed., Handbook of Theoretical Computer Science. Elsevier Science Publishers, 1990.

[XML98] XML (eXtensible Markup Language) http://www.w3.org/TR/1998/REC-xml-19980210, 10 February, 1998.