
XML Query Language Requirements of Large, Heterogeneous Organizations
Len Seligman, Arnon Rosenthal
{seligman, arnie}@mitre.org
The MITRE Corporation

The MITRE Corporation provides technical assistance, system engineering, and acquisition support to U.S. Government agencies and other large organizations. Our customers have high (not always realistic) hopes for XML and related standards. Candidate XML applications include:

A standard interchange format to support mediation among heterogeneous data sources
Different views of the same data
Semantic markup, to improve precision in information retrieval and to support automated processing by agents and other applications

To realize the potential, one needs a powerful query language for XML. This position paper discusses our customers’ environments and then considers the resulting query language requirements. We close with a discussion of issues that should be addressed by this workshop.

Significant characteristics of our customers’ environments include the following:

There will always be a need to interoperate with other organizations that use different vocabularies and different schemas. While many communities will develop agreements on DTDs and the semantics of their tagsets, applications can span several communities. For example, an Army medical application might use information conforming to Army logistics, military and civilian medical, and government finance DTDs, and the domains inevitably overlap. It is implausible that a global standard could be developed that suits all groups, plus other services (e.g., Navy), other domains (e.g., epidemiology), and all current and future coalition partners.
Security is essential on document collections, on parts of collections, and on parts of individual documents. The granularity of security concerns may differ from that of the original documents. Access controls must be easy to administer, support variable granularity, and be expressible as general rules that apply to thousands of documents.
Our customers have complex dissemination requirements, which span structured and unstructured data. Users need a rich language to express interest profiles. Also, users want to express profiles in their own terms, not those of the data sources.

Requirements

The characteristics just described lead to a number of query language requirements.

First, a powerful view mechanism is required to address structural and semantic heterogeneity, which are not addressed by XML. For example, one should be able to transform either representation of employee information in Figure 1 into the other. The transformation inverts the hierarchy, converts annual into weekly salary, and makes some element names (Worker and Drone) into ordinary values.

<Employees>
    <Worker>
        <Name>Dilbert</Name>
        <Salary>53000</Salary>
        <Department>Z11</Department>
    </Worker>
    <Drone>
        <Name>Wally</Name>
        <Salary>48000</Salary>
        <Department>Z11</Department>
    </Drone>
</Employees>

<Departments>
    <Dept Name="Z11">
        <Staff>
            <Person Name="Dilbert">
                <Weekly-Pay>1019.23</Weekly-Pay>
                <jobtitle>Worker</jobtitle>
            </Person>
            <Person Name="Wally">
                <Weekly-Pay>923.08</Weekly-Pay>
                <jobtitle>Drone</jobtitle>
            </Person>
        </Staff>
    </Dept>
</Departments>

Figure 1. Two XML Representations of Employee Information
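
To make the transformation concrete, the following is a minimal procedural sketch in Python that turns the first representation in Figure 1 into the second. It is an illustration only, not a proposed query language: the function name employees_to_departments and the constant WEEKS_PER_YEAR are our own, and dividing by 52 is an assumption inferred from the figures shown.

```python
import xml.etree.ElementTree as ET

EMPLOYEES_XML = """
<Employees>
    <Worker>
        <Name>Dilbert</Name>
        <Salary>53000</Salary>
        <Department>Z11</Department>
    </Worker>
    <Drone>
        <Name>Wally</Name>
        <Salary>48000</Salary>
        <Department>Z11</Department>
    </Drone>
</Employees>
"""

WEEKS_PER_YEAR = 52  # assumption: weekly pay in Figure 1 is annual salary / 52


def employees_to_departments(employees):
    """Invert the hierarchy so that people are grouped under their department."""
    departments = ET.Element("Departments")
    staff_by_dept = {}  # department name -> its <Staff> element
    for emp in employees:
        dept_name = emp.findtext("Department")
        if dept_name not in staff_by_dept:
            dept = ET.SubElement(departments, "Dept", Name=dept_name)
            staff_by_dept[dept_name] = ET.SubElement(dept, "Staff")
        person = ET.SubElement(
            staff_by_dept[dept_name], "Person", Name=emp.findtext("Name")
        )
        # Convert annual salary into weekly pay.
        weekly = float(emp.findtext("Salary")) / WEEKS_PER_YEAR
        ET.SubElement(person, "Weekly-Pay").text = f"{weekly:.2f}"
        # The element name (Worker, Drone) becomes an ordinary value.
        ET.SubElement(person, "jobtitle").text = emp.tag
    return departments


departments = employees_to_departments(ET.fromstring(EMPLOYEES_XML))
print(ET.tostring(departments, encoding="unicode"))
```

Everything the sketch does (grouping, arithmetic, turning a tag into a value) is buried in control flow, so a tool cannot easily inspect or optimize it.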

One could perform such transformations in procedural code, but years of experience in heterogeneous database integration have demonstrated important advantages of declarative query languages. Declarative languages permit automated tools to reason about transformations. This is essential for optimizing query processing in environments where the application programmer should be shielded from details of physical data organization and access strategy. Such environments include end-user access, distributed environments where information may be moved, and "multi-tier" environments where users interact with virtual collections because the native form was chosen by a different organization with other priorities. In addition, without a declarative representation it is difficult to merge views, which dissemination-based systems require in order to form communities of interest from multiple user interest profiles.

The following query language operations are needed:

Transformation. Arithmetic operators are required (e.g., for conversions like Celsius to Fahrenheit). String manipulation operations (e.g., concat, substring) would also be helpful. Finally, there should be a way to call arbitrary functions (e.g., dollars-to-yen), which in a web environment might be implemented as Java code.
Restructuring. Examples include hierarchy inversion and changing an element to an attribute. Significant restructuring can take place given the ability to nest queries and to "gather" their results—e.g., to create a new parent node with links to every member of a query’s result set. The XQL proposal of [Robie] does this gathering, but only at the root, where it wraps the result set with an "xql:result" tag.
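
The two operations above can be sketched in Python as follows. This is a hedged illustration under our own assumptions, not the design of XQL or of any other proposal: the names FUNCTIONS, call, gather, and element_to_attribute are hypothetical. The registry stands in for "a way to call arbitrary functions", and gather creates a new parent node for a query's result set at any level rather than only at the root.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical registry of functions a query could call by name.
FUNCTIONS = {
    "celsius_to_fahrenheit": lambda c: c * 9 / 5 + 32,
    "concat": lambda *parts: "".join(parts),
}


def call(name, *args):
    """Invoke a registered function, as a query processor might."""
    return FUNCTIONS[name](*args)


def gather(results, parent_tag):
    """Create a new parent element holding copies of every result node."""
    parent = ET.Element(parent_tag)
    for node in results:
        parent.append(copy.deepcopy(node))  # copy so the source tree is untouched
    return parent


def element_to_attribute(node, child_tag):
    """Restructure: replace a child element with an attribute of the same name."""
    child = node.find(child_tag)
    if child is not None:
        node.set(child_tag, child.text or "")
        node.remove(child)
    return node


doc = ET.fromstring(
    "<Employees>"
    "<Worker><Name>Dilbert</Name><Department>Z11</Department></Worker>"
    "<Drone><Name>Wally</Name><Department>Z11</Department></Drone>"
    "</Employees>"
)

# Inner query: employees in department Z11. Outer step: gather under <Staff>.
z11 = [e for e in doc if e.findtext("Department") == "Z11"]
staff = gather(z11, "Staff")
for person in staff:
    element_to_attribute(person, "Name")

print(ET.tostring(staff, encoding="unicode"))
print(call("celsius_to_fahrenheit", 100))
```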

The XML query language must define the structures to which security and dissemination specifications are attached. The actual mechanisms, such as access controls and event subscriptions, would be defined by other standards. However, the granules that one wants to release or disseminate may not be the stored ones. Views give a mechanism to specify these desired granules. Declarative views might also allow administrators familiar with one DTD to provide specifications that will be executed over documents conforming to another DTD.
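
As a rough illustration of a view that defines a releasable granule, the Python sketch below builds a view that omits named elements and applies the same rule to every document in a collection. The helper make_view and the choice to hide Salary are assumptions made for the example; an actual access control mechanism would be defined by other standards, as noted above.

```python
import copy
import xml.etree.ElementTree as ET


def make_view(hidden_tags):
    """Return a view function that omits the named elements from any document."""

    def view(root):
        released = copy.deepcopy(root)  # never modify the stored document
        # Snapshot the elements first; mutating a tree while iterating it is unsafe.
        for parent in list(released.iter()):
            for child in list(parent):
                if child.tag in hidden_tags:
                    parent.remove(child)
        return released

    return view


stored = ET.fromstring(
    "<Employees>"
    "<Worker><Name>Dilbert</Name><Salary>53000</Salary></Worker>"
    "<Drone><Name>Wally</Name><Salary>48000</Salary></Drone>"
    "</Employees>"
)

# One general rule, applicable to any number of documents.
without_salary = make_view({"Salary"})
collection = [stored]
released = [without_salary(doc) for doc in collection]

print(ET.tostring(released[0], encoding="unicode"))
```

The released granule here is smaller than the stored one, which is the point: the unit of release is specified by the view, not by how the data happens to be stored.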

To express interest profiles over diverse information sources, one needs the common relational algebra operations (select, project, and join), plus path expressions for graph traversal. Path expressions are supported by both [Robie] and [Deutsch], while the relational operations are omitted by [Robie].
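
The combination of path expressions with select, project, and join can be sketched in Python over the two representations of Figure 1. The sketch uses the limited path syntax of the standard xml.etree.ElementTree module; the 50000 threshold is an arbitrary value chosen for illustration, and nothing here reflects the syntax of [Robie] or [Deutsch].

```python
import xml.etree.ElementTree as ET

employees = ET.fromstring(
    "<Employees>"
    "<Worker><Name>Dilbert</Name><Salary>53000</Salary>"
    "<Department>Z11</Department></Worker>"
    "<Drone><Name>Wally</Name><Salary>48000</Salary>"
    "<Department>Z11</Department></Drone>"
    "</Employees>"
)
departments = ET.fromstring(
    '<Departments><Dept Name="Z11"><Staff>'
    '<Person Name="Dilbert"><Weekly-Pay>1019.23</Weekly-Pay></Person>'
    '<Person Name="Wally"><Weekly-Pay>923.08</Weekly-Pay></Person>'
    "</Staff></Dept></Departments>"
)

# Path expression plus select: employees whose salary exceeds the threshold.
selected = [
    e for e in employees.findall("./*") if float(e.findtext("Salary")) > 50000
]

# Project: keep only the name and department of each selected employee.
projected = [(e.findtext("Name"), e.findtext("Department")) for e in selected]

# Join: match employees to Person elements in the other document by name.
pay_by_name = {
    p.get("Name"): p.findtext("Weekly-Pay")
    for p in departments.findall("./Dept/Staff/Person")
}
joined = [
    (e.findtext("Name"), e.findtext("Salary"), pay_by_name[e.findtext("Name")])
    for e in employees
    if e.findtext("Name") in pay_by_name
]

print(projected)
print(joined)
```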

Another requirement is that an XML query language allow users to control whether query results preserve the sequence of sibling nodes. Sequence is essential for document processing, but irrelevant in most structured data (e.g., in a relational table, the order of rows and columns is by definition not significant).
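
One way to picture this control is a flag on the query call, sketched below in Python. The function query_text and its preserve_order parameter are hypothetical, and the two small documents are invented for the example. With order preserved the result is a list in document order; without it the result is a multiset, so two documents that differ only in sibling order yield equal answers.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def query_text(root, path, preserve_order=True):
    """Return the text of matching nodes, as a sequence or as a multiset."""
    texts = [node.text for node in root.findall(path)]  # document order
    if preserve_order:
        return texts
    return Counter(texts)  # order discarded, duplicates still counted


# Invented sample data: the same siblings in two different orders.
doc_a = ET.fromstring("<doc><para>First</para><para>Second</para></doc>")
doc_b = ET.fromstring("<doc><para>Second</para><para>First</para></doc>")

ordered_equal = query_text(doc_a, "para") == query_text(doc_b, "para")
unordered_equal = query_text(doc_a, "para", preserve_order=False) == query_text(
    doc_b, "para", preserve_order=False
)

print(ordered_equal, unordered_equal)
```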

Issues

We close with a discussion of issues that should be considered by this workshop.

The W3C query language activity should specify a standard interface to the environment in which queries execute. A query doesn’t live in isolation; it needs to attach to an environment. Without the environment, "raw" query languages provide little interoperability (e.g., SQL without ODBC). ODBC and "Universal Data Access" (UDA) [Blakeley] are the reigning "standards" for such attachment. The query language activity should analyze them to understand the requirements better. Environment issues include:

What are the API requirements for manipulating query results? Do we need something like SQL cursors, which allow users and applications to navigate query results?
We should anticipate the need for programming language bindings. Do they raise any issues?
How are queries invoked? (e.g., as DCOM or CORBA requests?) This is probably a DOM issue.
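
To give the cursor question some shape, here is a minimal Python sketch of a cursor over a materialized result set. The class ResultCursor and its method names are hypothetical and are not taken from ODBC, UDA, or the DOM; a real interface would also have to address issues such as lazy evaluation and updates.

```python
import xml.etree.ElementTree as ET


class ResultCursor:
    """Hypothetical scrollable cursor over the nodes returned by a query."""

    def __init__(self, results):
        self._results = list(results)
        self._pos = -1  # positioned before the first result

    def next(self):
        """Advance and return the next node, or None at the end."""
        if self._pos + 1 >= len(self._results):
            return None
        self._pos += 1
        return self._results[self._pos]

    def previous(self):
        """Step back and return the prior node, or None at the start."""
        if self._pos <= 0:
            return None
        self._pos -= 1
        return self._results[self._pos]


doc = ET.fromstring(
    "<Employees>"
    "<Worker><Name>Dilbert</Name></Worker>"
    "<Drone><Name>Wally</Name></Drone>"
    "</Employees>"
)

cursor = ResultCursor(doc.findall("./*/Name"))
node = cursor.next()
while node is not None:
    print(node.text)
    node = cursor.next()
```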

XML query tools must interoperate with database management systems and pre-existing information retrieval tools. What are the requirements for wrapping tools so that they can be used by XML query tools and vice versa? In particular, an effort should be made to make the query language meld well with SQL.
How do we interoperate with systems that support a subset of the language’s features? UDA is especially strong here, supporting a simple abstraction—the row-set—that many sources can support, and then supporting additional features for more powerful sources.
We should avoid unnecessary duplication. Query features are now being discussed in several efforts (e.g., XML, XSL, and DOM). Before supporting overlapping functionality, we should convince ourselves that the requirements are different. In addition, while alternate syntaxes may be appropriate for different requirements, query processing semantics should be consistent across efforts (e.g., what is returned by a query, treatment of null values, ordering of results).
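
The question of interoperating with sources that support only a subset of the language can be illustrated with a small Python sketch in the spirit of the row-set idea: every source offers a basic way to yield rows, and more capable sources advertise extra features. The classes Source and FilteringSource, the features attribute, and run_select are hypothetical names for this example and do not describe the actual UDA interfaces.

```python
class Source:
    """Hypothetical minimal abstraction: every source can yield its rows."""

    features = frozenset()

    def __init__(self, rows):
        self._rows = list(rows)

    def rows(self):
        return iter(self._rows)


class FilteringSource(Source):
    """A more capable source that can evaluate selections itself."""

    features = frozenset({"select"})

    def select(self, predicate):
        return (row for row in self._rows if predicate(row))


def run_select(source, predicate):
    """Push the selection to the source if it can do it, else filter locally."""
    if "select" in source.features:
        return list(source.select(predicate))
    return [row for row in source.rows() if predicate(row)]


rows = [("Dilbert", 53000), ("Wally", 48000)]
well_paid = lambda row: row[1] > 50000  # illustrative threshold

print(run_select(Source(rows), well_paid))
print(run_select(FilteringSource(rows), well_paid))
```

Both calls give the same answer; only the place where the work happens differs, which is what lets weak and powerful sources sit behind one interface.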

References

[Blakeley] Jose Blakeley and Michael Pizzo. Microsoft Universal Data Access Platform. Proceedings of the ACM SIGMOD International Conference on Management of Data, 1998.

[Deutsch] Deutsch, Fernandez, Florescu, Levy, and Suciu. XML-QL: A Query Language for XML.

[Robie] Jonathan Robie. The Design of XQL.