Query Languages for XML Documents:
A QL '98 Position Paper

Michael Rys, Stanford University

This position paper presents several aspects that I consider important in the design of a query language for XML. They are based on my experience in database and information system research, in building database prototype systems, and as a developer of information-system-based applications.

Some Terminology

XML document: Any document marked up with well-formed XML.
XML data: Any XML document that contains only semistructured data structured by means of XML attributes and elements, and that does not contain any untagged CDATA or HTML text.
XML text: Any XML document that is not XML data.
Query language: A formal language to describe the search for data in a data collection, its restructuring and transformation (query), as well as changes to the original data (update).
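
The distinction between XML data and XML text can be made concrete with a small check. The following is only a sketch under one possible reading of the definitions: it assumes that "untagged" text means character data that sits next to child elements (mixed content), while text enclosed by a leaf element counts as tagged. The function name is_xml_data and the two sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET


def is_xml_data(document: str) -> bool:
    """Return True if no element mixes character data with child elements.

    Assumption: text inside a leaf element is "tagged"; text that appears
    beside child elements is "untagged" and makes the document XML text.
    """
    root = ET.fromstring(document)
    for elem in root.iter():
        children = list(elem)
        if not children:
            continue  # leaf element: its text is enclosed by its own tag
        # Non-whitespace text before the first child is untagged.
        if (elem.text or "").strip():
            return False
        # Non-whitespace text following any child is untagged as well.
        if any((child.tail or "").strip() for child in children):
            return False
    return True


record = "<person id='p1'><name>Ann</name><age>42</age></person>"
memo = "<memo>Call <name>Ann</name> about the report.</memo>"

print(is_xml_data(record))  # the record is XML data
print(is_xml_data(memo))    # the memo is XML text
```

Under a stricter reading, any document with free-form leaf text might also count as XML text; the check would then need schema information that the document alone does not provide.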

Why is a Query Language important?

XML will be (and already is) used to encode, provide and transfer partially structured data between data providers and consumers. In order to facilitate data and information retrieval for the consumer, it is necessary to provide query abstractions that allow access to the data in a declarative way (what do I want?). Standard navigational "query interfaces" that only allow navigation along predefined relationships will not scale to large amounts of data and are not well-suited for efficient information discovery.
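
The contrast between navigational and declarative access can be illustrated with a short sketch. It uses the limited path expressions of Python's xml.etree.ElementTree purely as a stand-in for a declarative query; this is not a proposal for the query language itself, and the catalog document is hypothetical sample data.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data.
DOC = """
<catalog>
  <book year="1997"><title>Foundations</title></book>
  <book year="1998"><title>Semistructured Data</title></book>
  <book year="1998"><title>Query Processing</title></book>
</catalog>
"""
root = ET.fromstring(DOC)

# Navigational access: the application walks the predefined
# parent/child relationships itself and filters by hand.
titles_nav = []
for book in root:
    if book.tag == "book" and book.get("year") == "1998":
        for child in book:
            if child.tag == "title":
                titles_nav.append(child.text)

# Declarative access: the application states what it wants and
# leaves the traversal to the query processor.
titles_decl = [t.text for t in root.findall("./book[@year='1998']/title")]

print(titles_nav)
print(titles_decl)
```

Both variants return the same titles, but only the declarative form leaves the system free to choose an evaluation strategy, for example by using an index on the year attribute.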

Fundamental Issues

XML Applications

Based on my experience, I see three main application domains in which XML plays, and will continue to play, an important role as a data representation for data management and interchange:
  1. Document management:
    Documents will be encoded as XML texts, where certain information about the structure and metainformation will be represented in XML structures but most of the text will not be XML-tagged.
  2. Transfer of data from single repositories:
    In this case, data will be encoded as XML documents (most likely as XML data). The data might be stored in a specific XML repository or in another database system (relational, object-oriented), but the clients only see XML.
  3. Information integration among multiple repositories:
    In this case, data from different sources needs to be transformed from its source representation into a common representation suitable for the integration process (performed, for example, by mediators). XML is well-suited as the lingua franca of the integration layer due to its flexibility and portability. Most likely, data from the different sources will be represented as XML data for the integration.
In all three scenarios, XML is used to represent the data. However, the operational requirements and the underlying data models differ across the three domains.

Domain specific data models

Domain specific operational requirements

Goal: Common Data Model and Common, Extensible Query Language

Given the different domain requirements, it will be important to decide what the target application domain of the query language will be. I hope that the communities can agree on a common data model, which would allow us to define a query language that provides operations for all three domains in a simple, elegant and consistent way. Clearly, this means that the language needs to provide operations normally found in database query languages, information integration systems and document management systems. It is also important that the query language can easily be extended, for example to accommodate new domains and their requirements (geographical queries etc.) and to add new document management operations.

Meta data

In all three domains above, meta information plays an important role. While XML provides a way to define simple meta information about XML documents in the form of a DTD, more complex meta information needs to be provided as well. For example, a DTD can express relationships among objects (XML elements) by means of referential attributes. However, there is no standard way to define integrity constraints or ontology information (besides the sub-element relationship).

It will be important to query such meta information as well. If it is represented in XML, the XML query language itself can be used to query the meta data. If the meta data is represented in RDF, then an RDF-QL needs to be specified in addition. If RDF is in turn represented in XML, the RDF-QL can be mapped to the XML query language.

A Data Model for XML

Graph Structure

An XML document can be viewed as a linearization of graph-structured data in which the order of the tagged and untagged elements is, in general, significant. Unfortunately, the XML element hierarchy can only express tree-structured data; any further graph structure has to be expressed using element attributes. Since there are many ways to linearize a graph, XML alone is not well-suited as its own data model. Either the data model needs to be a full graph-based model, or XML needs a canonical form for representing the graph.
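
As a rough illustration of how graph structure hides in attributes, the sketch below recovers the non-tree edges of a small document. It assumes, hypothetically, that an attribute named id identifies an element and an attribute named ref points to one; without a DTD, nothing in the document itself says which attributes are referential. The sample document and the function reference_edges are invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the element hierarchy is a tree, but the
# id/ref attributes add edges that form a cycle (p1 -> d1 -> p1).
DOC = """
<db>
  <person id="p1"><name>Ann</name><worksFor ref="d1"/></person>
  <person id="p2"><name>Bob</name><worksFor ref="d1"/></person>
  <dept id="d1"><name>Research</name><head ref="p1"/></dept>
</db>
"""


def reference_edges(document: str):
    """Return (source id, label, target id) for each resolvable reference."""
    root = ET.fromstring(document)
    # Collect every identifier declared in the document.
    ids = {e.get("id") for e in root.iter() if e.get("id") is not None}
    edges = []
    for owner in root.iter():
        if owner.get("id") is None:
            continue
        for child in owner:
            target = child.get("ref")
            # Keep only references that point at an existing identifier.
            if target in ids:
                edges.append((owner.get("id"), child.tag, target))
    return edges


for edge in reference_edges(DOC):
    print(edge)
```

The same graph could equally be linearized with the dept element first, or with the persons nested inside it, which is exactly why the document form alone does not serve well as the data model.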

The query language should be able to deal with graph structured data:

Other questions that need to be addressed are:

Extensional vs. Intensional Order

Often, especially in the context of documents but also in data management contexts, the extensional order (the order in which the elements actually appear in the document) is important. Thus, the data model should be able to preserve the extensional order of the XML documents.

The query language should therefore allow the user not only to specify an intensional order (e.g., via an order by clause), but also to specify the extensional order in the case of updates. When querying, it should be able to preserve the extensional order if the user requires it, on a query-by-query basis.
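
The two kinds of order can be sketched as follows, assuming that extensional order means the order in which elements appear in the document and that intensional order is one computed from the data by the query. The book document is hypothetical, and Python's sorted stands in for an order by clause.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document.
DOC = (
    "<book>"
    "<chapter>Motivation</chapter>"
    "<chapter>Data Model</chapter>"
    "<chapter>Operations</chapter>"
    "</book>"
)
root = ET.fromstring(DOC)

# Extensional order: the result keeps the order found in the document.
extensional = [c.text for c in root.findall("chapter")]

# Intensional order: the query imposes its own order on the result.
intensional = sorted(extensional)

# Update that specifies the extensional position of the new element:
# the chapter becomes the second child instead of being appended.
new_chapter = ET.Element("chapter")
new_chapter.text = "Terminology"
root.insert(1, new_chapter)
after_update = [c.text for c in root.findall("chapter")]

print(extensional)
print(intensional)
print(after_update)
```

A data model based on unordered sets could answer the second query but not the first, and it would have no way to express the position given in the update.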

Physical Design: Structured vs. Semistructured vs. Unstructured Data

For some applications, and in order to exploit performance opportunities, the physical design of the data model should exploit as much structural information as possible:

Query Language Operations

I don't want to go into a detailed description of all the operations. Instead, from the database and information integration point of view, I would like to refer to the research in the area of semistructured information processing. In particular, the XML-QL proposal and Stanford's Lore and TSIMMIS projects are, in my opinion, a very good starting point. For the area of information retrieval, Lore has presented some ideas with nearness- and similarity-based query operators, but there are certainly other contributions from the document management world.

Besides the already mentioned points, it is, in my opinion, important that

Some Database Issues

The following aspects should be possible with the chosen QL and data model: