XML Query Language Requirements of Large, Heterogeneous Organizations
Len Seligman, Arnon Rosenthal
{seligman, arnie}@mitre.org
The MITRE Corporation

The MITRE Corporation provides technical assistance, system engineering, and acquisition support to U.S. Government agencies and other large organizations. Our customers have high (not always realistic) hopes for XML and related standards. Candidate XML applications include:

  1. A standard interchange format to support mediation among heterogeneous data sources
  2. Different views of the same data
  3. Semantic markup, to improve precision in information retrieval and to support automated processing by agents and other applications

To realize the potential, one needs a powerful query language for XML. This position paper discusses our customers’ environments and then considers the resulting query language requirements. We close with a discussion of issues that should be addressed by this workshop.

Significant characteristics of our customers’ environments include the following:

Requirements

The characteristics just described lead to a number of query language requirements.

First, a powerful view mechanism is required to address structural and semantic heterogeneity, which XML itself does not address. For example, one should be able to transform either representation of employee information in Figure 1 into the other. Going from the first representation to the second, the transformation inverts the hierarchy, converts annual salary into weekly pay, and turns some element names (Worker and Drone) into ordinary values.

<Employees>
    <Worker>
        <Name>Dilbert</Name>
        <Salary>53000</Salary>
        <Department>Z11</Department>
    </Worker>
    <Drone>
        <Name>Wally</Name>
        <Salary>48000</Salary>
        <Department>Z11</Department>
    </Drone>
</Employees>

<Departments>
    <Dept Name="Z11">
        <Staff>
            <Person Name="Dilbert">
                <Weekly-Pay>1019.23</Weekly-Pay>
                <jobtitle>Worker</jobtitle>
            </Person>
            <Person Name="Wally">
                <Weekly-Pay>923.08</Weekly-Pay>
                <jobtitle>Drone</jobtitle>
            </Person>
        </Staff>
    </Dept>
</Departments>
Figure 1. Two XML Representations of Employee Information

One could perform such transformations in procedural code, but years of experience in heterogeneous database integration have demonstrated important advantages of declarative query languages. Declarative languages permit automated tools to reason about transformations. This is essential for optimizing query processing in environments where the application programmer should be shielded from details of physical data organization and access strategy. Such environments include end-user access, distributed environments where information may be moved, and "multi-tier" environments where users interact with virtual collections because the native form was chosen by a different organization with other priorities. In addition, without a declarative representation, it is difficult to merge views, which dissemination-based systems require in order to form communities of interest from multiple user interest profiles.
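To make the contrast concrete, the following is a minimal procedural sketch of the Figure 1 transformation (first representation to second), written in Python with the standard xml.etree.ElementTree module. The function name employees_to_departments and the 52-week divisor are assumptions made for illustration; they are not taken from any proposed query language. The inversion, unit conversion, and name-to-value steps are buried in loop logic, which is what makes such code hard for automated tools to analyze or merge.

```python
import xml.etree.ElementTree as ET

EMPLOYEES_XML = """
<Employees>
    <Worker>
        <Name>Dilbert</Name>
        <Salary>53000</Salary>
        <Department>Z11</Department>
    </Worker>
    <Drone>
        <Name>Wally</Name>
        <Salary>48000</Salary>
        <Department>Z11</Department>
    </Drone>
</Employees>
"""

WEEKS_PER_YEAR = 52  # assumption: weekly pay = annual salary / 52


def employees_to_departments(employees: ET.Element) -> ET.Element:
    """Invert the hierarchy: group people under their department."""
    departments = ET.Element("Departments")
    staff_by_dept = {}  # department name -> its <Staff> element
    for emp in employees:
        dept_name = emp.findtext("Department")
        staff = staff_by_dept.get(dept_name)
        if staff is None:
            dept = ET.SubElement(departments, "Dept", Name=dept_name)
            staff = ET.SubElement(dept, "Staff")
            staff_by_dept[dept_name] = staff
        person = ET.SubElement(staff, "Person", Name=emp.findtext("Name"))
        # Convert annual salary into weekly pay, rounded to cents.
        weekly = float(emp.findtext("Salary")) / WEEKS_PER_YEAR
        ET.SubElement(person, "Weekly-Pay").text = f"{weekly:.2f}"
        # The element name (Worker, Drone) becomes an ordinary value.
        ET.SubElement(person, "jobtitle").text = emp.tag
    return departments


result = employees_to_departments(ET.fromstring(EMPLOYEES_XML))
print(ET.tostring(result, encoding="unicode"))
```

The reverse direction would need a second, separately written routine, whereas a declarative view definition could in principle be analyzed by a tool to derive or check the inverse.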

The following query language operations are needed:

The XML query language must define the structures to which security and dissemination specifications are attached. The actual mechanisms, such as access controls and event subscriptions, would be defined by other standards. However, the granules that one wants to release or disseminate may not be the stored ones. Views give a mechanism to specify these desired granules. Declarative views might also allow administrators familiar with one DTD to provide specifications that will be executed over documents conforming to another DTD.

To express interest profiles over diverse information sources, one needs common relational algebra operations (i.e., select, project, join), plus path expressions for graph traversal. Path expressions are supported by both [Robie] and [Deutsch], whereas the relational operations are omitted by [Robie].
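As a rough illustration of how these operations combine, the sketch below hand-codes a path expression, a selection, a join, and a projection over two small XML sources in Python. The Locations source, the element names Site and Building, and the function high_earners_with_building are hypothetical and exist only for this example; a query language would let the same profile be stated declaratively.

```python
import xml.etree.ElementTree as ET

# Sample data. The <Locations> source is invented for illustration.
EMPLOYEES = ET.fromstring("""
<Employees>
    <Worker><Name>Dilbert</Name><Salary>53000</Salary><Department>Z11</Department></Worker>
    <Drone><Name>Wally</Name><Salary>48000</Salary><Department>Z11</Department></Drone>
</Employees>
""")
LOCATIONS = ET.fromstring("""
<Locations>
    <Site Dept="Z11"><Building>B7</Building></Site>
    <Site Dept="Q42"><Building>C3</Building></Site>
</Locations>
""")


def high_earners_with_building(min_salary):
    # Path expression: every child of the root, whatever its element name.
    people = EMPLOYEES.findall("./*")
    # Index the second source by its join key (the Dept attribute).
    building_by_dept = {
        site.get("Dept"): site.findtext("Building")
        for site in LOCATIONS.findall("Site")
    }
    rows = []
    for person in people:
        # Select: keep only people at or above the salary threshold.
        if int(person.findtext("Salary")) < min_salary:
            continue
        # Join: match the Department value against Site/@Dept.
        dept = person.findtext("Department")
        if dept not in building_by_dept:
            continue
        # Project: keep only the name and the building.
        rows.append((person.findtext("Name"), building_by_dept[dept]))
    return rows


print(high_earners_with_building(50000))
```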

Another requirement is that an XML query language allow users to control whether query results preserve the sequence of sibling nodes. Sequence is essential for document processing, but irrelevant in most structured data (e.g., in a relational table, the order of rows and columns is by definition not significant).
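The difference between the two modes can be sketched as follows. This Python fragment is only an illustration under our own assumptions: names_ordered and names_unordered are hypothetical helpers, and the unordered result is modeled as a bag (collections.Counter), which is one possible choice among several.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Two documents that differ only in the sequence of sibling nodes.
DOC_A = ET.fromstring(
    "<Staff><Person Name='Dilbert'/><Person Name='Wally'/></Staff>")
DOC_B = ET.fromstring(
    "<Staff><Person Name='Wally'/><Person Name='Dilbert'/></Staff>")


def names_ordered(root):
    """Document-order result: sibling sequence is part of the answer."""
    return [p.get("Name") for p in root.iter("Person")]


def names_unordered(root):
    """Bag result: duplicates are kept, sibling sequence is discarded."""
    return Counter(names_ordered(root))


# Order-preserving results distinguish the two documents...
print(names_ordered(DOC_A) == names_ordered(DOC_B))
# ...while unordered results treat them as the same data.
print(names_unordered(DOC_A) == names_unordered(DOC_B))
```

When order is declared irrelevant, a query processor is free to choose whatever evaluation order is cheapest, which is the usual motivation for making this a user-controlled option.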

Issues

We close with a discussion of issues that should be considered by this workshop.

  1. The W3C query language activity should specify a standard interface to the environment in which queries execute. A query doesn’t live in isolation; it needs to attach to an environment. Without the environment, "raw" query languages provide little interoperability (e.g., SQL without ODBC). ODBC and "Universal Data Access" (UDA) [Blakeley] are the reigning "standards" for such attachment. The query language activity should analyze them to understand the requirements better. Environment issues include:
    1. What are the API requirements for manipulating query results? Do we need something like SQL cursors, which allow users and applications to navigate query results?
    2. We should anticipate the need for programming language bindings. Do they raise any issues?
    3. How are queries invoked (e.g., as DCOM or CORBA requests)? This is probably a DOM issue.
  2. XML query tools must interoperate with database management systems and pre-existing information retrieval tools. What are the requirements for wrapping tools so that they can be used by XML query tools and vice versa? In particular, an effort should be made to make the query language meld well with SQL.
  3. How do we interoperate with systems that support a subset of the language’s features? UDA is especially strong here, supporting a simple abstraction—the row-set—that many sources can support, and then supporting additional features for more powerful sources.
  4. We should avoid unnecessary duplication. Query features are now being discussed in several efforts (e.g., XML, XSL, and DOM). Before supporting overlapping functionality, we should convince ourselves that the requirements are different. In addition, while alternate syntaxes may be appropriate for different requirements, query processing semantics should be consistent across efforts (e.g., what is returned by a query, treatment of null values, ordering of results).
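To illustrate the cursor question raised under the first issue, the following Python sketch shows one shape such an API might take. ResultCursor, fetch, and execute are hypothetical names invented here, not part of ODBC, UDA, DOM, or any proposed standard, and the sample document is made up. The cursor is forward-only and hands results to the application in batches, so a large result need not be materialized at once.

```python
import xml.etree.ElementTree as ET
from itertools import islice
from typing import Iterator, List


class ResultCursor:
    """Hypothetical forward-only cursor over the elements matched by a query."""

    def __init__(self, matches: Iterator[ET.Element]):
        self._matches = matches

    def fetch(self, n: int) -> List[ET.Element]:
        """Return up to n further results; an empty list means exhausted."""
        return list(islice(self._matches, n))


def execute(root: ET.Element, path: str) -> ResultCursor:
    """Evaluate a path lazily and wrap the matches in a cursor."""
    return ResultCursor(root.iterfind(path))


# Sample document, invented for illustration.
DOC = ET.fromstring(
    "<Staff><Person Name='Dilbert'/><Person Name='Wally'/>"
    "<Person Name='Alice'/></Staff>")

cursor = execute(DOC, "Person")
first_batch = [p.get("Name") for p in cursor.fetch(2)]
second_batch = [p.get("Name") for p in cursor.fetch(2)]
print(first_batch, second_batch)
```

A richer interface might add scrolling or positioned updates, as SQL cursors do; whether those are needed for XML query results is exactly the open question.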


References

[Blakeley] Jose Blakeley, Michael Pizzo, Microsoft Universal Data Access Platform, Proceedings of ACM-SIGMOD International Conference on the Management of Data, 1998

[Deutsch] Alin Deutsch, Mary Fernandez, Daniela Florescu, Alon Levy, Dan Suciu, XML-QL: A Query Language for XML

[Robie] Jonathan Robie, The Design of XQL