Database Desiderata for an XML Query Language
Department of Computer Science and Engineering
Oregon Graduate Institute
XML threatens to expand beyond its document markup origins to become the basis for data interchange on the Internet. Once it becomes pervasive, its not hard to imagine that many information sources will structure their external view as a repository of XML data, no matter what their internal storage mechanisms. It is not a great leap from there to viewing the combined information on the Internet as one big distributed XML database.
What is the role of a query language in this world? One could see it as a local adjunct to a browsing capability, providing a more expressive "find" command over one or more retrieved documents. Or it might serve as a souped-up version of XPointer, allowing richer forms of logical reference to portions of documents. Neither of these modes of use is very "databasey". From the database viewpoint, the enticing role of an XML query languagelet us call it XQueryis as a tool for structural and content-based query over the world-wide XML database. With that use in mind, we set forth a list of desired characteristics of XQuery.
In this section I set forth characteristics for XQuery that derive from its anticipated use as a net query language, along with an explanation of the need for each.
An XQuery should yield XML output. Such a closure property has many benefits. Derived databases (views) can be defined via a single XQuery. Query composition and decomposition is aided. It is transparent to applications whether they are looking at base data or a query result.
XQuery should be suitable for server-side processing. Thus, an XQuery should be self-contained, and not dependent on resources in its creation context for evaluation. While an XQuery might incorporate variables from a local context, there should be a "bound" form of the XQuery that can be remotely executed without needing to communicate with its point of origin. An example of local content that should be "bound away" is the local alias for a namespace. (See Requirement 2.11.)
Selection, extraction, reduction, restructuring and combination should all be possible in a single XQuery. This requirement is a consequence of Requirement 2.2, really. It should not be necessary to resort to another language or multiple XQueries to perform these operations. One reason is that an XQuery server might not understand the other language, necessitating moving fragments of the desired result back to the sender for final processing. Some of these operations greatly reduce data volumes, so are highly desirable to perform on the server side to reduce network requirements. Further, efficient query optimization and evaluation depends on having as much data access and manipulation described in advance as possible, to plan the best data retrieval, movement and processing strategies.
What I mean my these different operations, briefly:
The Appendix gives an example of these different manipulations.
XQuery should be usable on XML data when there is no schema (DTD) known in advance. XML data is structurally self-describing, and it should be possible to an XQuery to rely on such "just-in-time" schema information in its evaluation. This capability means XQueries can be used against an XML source with limited knowledge of its documents precise structures.
Conversely, when DTDs are available for a data source, it should be possible to judge whether an XQuery is correctly formed relative to the DTDs, and to calculate a DTD for the output. This capability can detect errors at compile time rather than run time, and allows a simpler interface for applications to manipulate a query result.
XQueries should preserve order and association of elements in XML data. The order of elements in an XML document can contain important informationa query shouldnt lose that information. Similarly, the grouping of sub-elements within elements is usually significant. For example, if an XQuery extracts<title> and <author> sub-elements from <book> elements in a bibliographic data source, it should preserve the <title>-<author> associations.
XQueries should be amenable to creation and manipulation by programs. Most queries will not be written directly by users or programmers. Rather, they will be constructed through user-interfaces or tools in application development environments.
An XQuery should be representable in XML. While there may be more than one syntax for XQuery, one should be as XML data. (Note that XSL is written in XML.) This property means that there do not need to be special mechanisms to store and transport XQueries, beyond what is required for XML itself. It also helps satisfy Requirement 2.7.
XQueries should be mutually embedding with XML. That is, an XQuery should be able to contain arbitrary XML data, and an XML document should be able to hold arbitrary XQueries. The latter capability allows XML document to contain both stored and virtual data. The former capability allows an XQuery to hold arbitrary constants, and allows for partial evaluation of XQueries. Representation of arbitrary constants helps with Requirement 2.2. Partial evaluation is useful in a distributed environment where data selected at one source is sent to another source and combined with data there.
XQuery should provide for following XLinks and XPointers. One expects much XML will contain external and internal cross-references, which a query should be able to traverse.
An XQuery should not be dependent on namespace aliases local to an XML document. An XQuery of course needs to disambiguate elements names that occur in more than one DTD for a document. However, it is unreasonable to expect the query creator to know the internal aliases that a document uses for those DTDs. Also, if a query is issued over a group of documents, they may well use different aliases for the same DTD.
XQuery should have an extension mechanism for conditions and operations specific to a particular datatypes. I am thinking mainly of specialized operations for selecting different kinds of multimedia content.
XQuery should be useful as a part of metadata descriptions. For example, a metadata interchange format for data warehousing transformations or business rules might have components that are queries. It would good if XQuery could be used in such cases, rather than defining an additional query language. Another possible metadata use would be in conjunction with XMI for expressing data model constraints. An implication is that queries should be able to stand alone, and not have to be appended to a URL or URI.
The items in this section have more to do with how the semantics of XQuery is expressed than with the particular capabilities of the language. They are meant to ensure the language has a clear definition and is amenable to efficient processing.
XQuery should have some formalization of its semantics. The formalization needs to be sufficient to support reasoning about XQueries, such as determining result structure, equivalence and containment. Query equivalence is a prerequisite to query optimization, while query containment is useful for semantic caching. (That is, a query can be associated with documents or document fragments in a cache, to determine if they can be used to satisfy later queries.) Query containment is also useful for determining if a push stream of data can be used to answer a particular query.
Expressions in XQueries should have referential transparency. That is, the meaning of an expression should be the same wherever it appears. Furthermore, expressions with equal result types should be allowed to appear in the same contexts. In particular, wherever an XML term is expected, an expression returning an XML term should be allowed. This latter requirements, interestingly, isnt met by SQL, much to the detriment of those who must implement and use it.
The need for clear and clean semantics for XQuery is probably best met by first producing a model for XML data that is independent of syntax (such as in DOM), and defining a logic or calculus over this model to serve as the formal basis to XQuery. Plunging right into syntax without a careful consideration of the underlying semantics is likely to give a language definition with ambiguity, gaps and hard-to-understand features.
Another observation is that XQuery should be less than a computationally complete language. Query languages derive great benefit from not trying to be computationally complete languages, such as queries against finite datasets having finite answers, and certain properties of queries being decidable. It can be quite tempting to add clever little features into a language, but one should be careful to distinguish whether they are simply syntactic convenience or actually increase computational.
This section lists a few other topics that I think are important to the design of XQuery, but where I am not quite certain what the right requirement is.
Some proposals I have seen stress the importance of a query being able to return locations within an XML document. Consequently, query results arent XML data. I need a better understanding of this requirement, to know if some object-identity-like construct is needed, or whether the requirement might be met with locations defined logically, such as with an XPointer-style canonical reference.
Query processing derives much of its power by being collection oriented, rather than single-item oriented. Thus, it would make sense to require that XQuery be collection oriented. However, given that XML data naturally represents collections, such a requirement almost seems redundant.
One way to get scaling in distributed applications with many user requests is to set up push streams of data with commonly requested information. While it may be too restrictive to insist XQuery be a stream processing language, it might be good if a natural subset could serve that purpose. Stream processing would require that a predicate can be evaluated on an element locally, or at least with a limited amount of information on the preceding stream and bounded lookahead.
One way to give XQuery a precise semantics is to provide a mapping from its features to an existing well-defined programming language, such as ML. Such a translation would provide for quick initial implementation of the language, and would support a degree of "semantics debugging" before finalization.
Any standardization effort is at the core a political process. Technical considerations such as the ones above are only part of what drives the debate and negotiations, and it is hard for academics such as myself to grasp all the considerations that enter into the decision-making process, such as market position, future product plans and competing standards. Nevertheless, understanding the intended use of a language, and divining what that use requires of the language, can only improve a standards prospects for success.
Preparation of this position paper has been supported by NCR/Teradata through PSDS, Inc. I thank Bruce McLean for discussions and comments on the paper, and Jeff Naughton, whose example I filched.
Consider trying to find the best prices on cars that ranked in the top 10 of the NHSC crash-safety tests. To produce that information, we want to combine information from a rankings document from NHSC, with documents from different auto dealers and brokers listing their prices. Say that we are interested in finding out if there are models in the top 10 of the rankings for sale under $30,000.
The NHSC provides XML data of the form
<model> <mo-name>Sable LT</mo-name>
while the dealers and brokers publish information in the form
<option opt="lthr seats"/>
Selection and Extraction:For our query, we want to select and extract <manufacturer> elements from the NHSC data where some model has <rank> less or equal to 10. For the dealer information, we want to select and extract <vehicle> elements where price is less than 30,000.
Reduction: For the<manufacturer> elements, we want to drop <model> sub-elements where <rank> is greater than 10. We also want to elide the <front-rating> and <side-rating> elements from the remaining models. For the selected <vehicle> elements, we want to drop the <color> and all <option> elements. The result items look like
Combination:We want our query to connect <manufacturer> and <vehicle> elements where <mn-name> = <make>, <mo-name> = <model> and <year> = <year>.
Restructuring: There are many ways to structure the resulting information. One possibility is as a collection of<car> elements of the form