XML Query Language Requirements of Large, Heterogeneous Organizations
Len Seligman, Arnon Rosenthal
{seligman, arnie}@mitre.org
The MITRE Corporation
The MITRE Corporation provides technical assistance, system engineering,
and acquisition support to U.S. Government agencies and other large organizations.
Our customers have high (not always realistic) hopes for XML and related
standards. Candidate XML applications include:
-
A standard interchange format to support mediation among heterogeneous
data sources
-
Different views of the same data
-
Semantic markup, to improve precision in information retrieval and to support
automated processing by agents and other applications
To realize the potential, one needs a powerful query language for XML.
This position paper discusses our customers’ environments and then considers
the resulting query language requirements. We close with a discussion of
issues that should be addressed by this workshop.
Significant characteristics of our customers’ environments include the
following:
-
There will always be a need to interoperate with other organizations
that use different vocabularies and different schemas. While many communities
will develop agreements on DTDs and the semantics of their tagsets, applications
can span several communities. For example, an Army medical application
might use information conforming to Army logistics, military and civilian
medical, and government finance DTDs, and the domains inevitably overlap.
It is implausible that a global standard could be developed that suits
all groups, plus other services (e.g., Navy), other domains (e.g., epidemiology),
and all current and future coalition partners.
-
Security is essential for document collections, parts of collections,
and parts of individual documents. The granularity of security concerns
may differ from that of the original documents. Access controls must be
easy to administer, support variable granularity, and be expressible as
general rules that apply to thousands of documents.
-
Our customers have complex dissemination requirements, which span
structured and unstructured data. Users need a rich language to express
interest profiles. Also, users want to express profiles in their own terms,
not those of the data sources.
Requirements
The characteristics just described lead to a number of query language
requirements.
First, a powerful view mechanism is required to address structural and
semantic heterogeneity, which are not addressed by XML. For example, one
should be able to transform either representation of employee information
in Figure 1 into the other. The transformation inverts the hierarchy, converts
annual into weekly salary, and makes some element names (Worker and Drone)
into ordinary values.
<Employees>
  <Worker>
    <Name>Dilbert</Name>
    <Salary>53000</Salary>
    <Department>Z11</Department>
  </Worker>
  <Drone>
    <Name>Wally</Name>
    <Salary>48000</Salary>
    <Department>Z11</Department>
  </Drone>
</Employees>

<Departments>
  <Dept Name="Z11">
    <Staff>
      <Person Name="Dilbert">
        <Weekly-Pay>1019.23</Weekly-Pay>
        <jobtitle>Worker</jobtitle>
      </Person>
      <Person Name="Wally">
        <Weekly-Pay>923.08</Weekly-Pay>
        <jobtitle>Drone</jobtitle>
      </Person>
    </Staff>
  </Dept>
</Departments>
Figure 1. Two XML Representations of Employee Information
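To make the required transformation concrete, the following Python sketch converts the first representation in Figure 1 into the second. It is procedural code, which is exactly what a declarative query language should make unnecessary; it is shown only to spell out the three steps (hierarchy inversion, salary conversion, and turning element names into values). The function name employees_to_departments and the 52-week year are our assumptions, chosen to match the figures above.

```python
import xml.etree.ElementTree as ET

# First representation from Figure 1.
EMPLOYEES_XML = """
<Employees>
  <Worker><Name>Dilbert</Name><Salary>53000</Salary><Department>Z11</Department></Worker>
  <Drone><Name>Wally</Name><Salary>48000</Salary><Department>Z11</Department></Drone>
</Employees>
"""

WEEKS_PER_YEAR = 52  # assumption: matches the weekly figures shown in Figure 1


def employees_to_departments(employees):
    """Invert the hierarchy so that employees are grouped under departments."""
    departments = ET.Element("Departments")
    staff_by_dept = {}  # department name -> its <Staff> element
    for emp in employees:
        dept_name = emp.findtext("Department")
        staff = staff_by_dept.get(dept_name)
        if staff is None:
            dept = ET.SubElement(departments, "Dept", Name=dept_name)
            staff = ET.SubElement(dept, "Staff")
            staff_by_dept[dept_name] = staff
        person = ET.SubElement(staff, "Person", Name=emp.findtext("Name"))
        # Convert annual salary into weekly pay.
        weekly = float(emp.findtext("Salary")) / WEEKS_PER_YEAR
        ET.SubElement(person, "Weekly-Pay").text = f"{weekly:.2f}"
        # The element name (Worker, Drone) becomes an ordinary value.
        ET.SubElement(person, "jobtitle").text = emp.tag
    return departments


departments = employees_to_departments(ET.fromstring(EMPLOYEES_XML))
print(ET.tostring(departments, encoding="unicode"))
```

Nothing in this code can be inspected, inverted, or merged by a tool, which is the limitation the declarative approach discussed next is meant to remove.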
One could perform such transformations in procedural code, but there
are important advantages to declarative query languages, demonstrated by
years of experience in heterogeneous database integration. Declarative
languages permit automated tools to reason about transformations. This
is essential for optimizing query processing in environments where the
application programmer should be shielded from details of physical data
organization and access strategy. These include end user access, distributed
environments where information may be moved, and "multi-tier" environments
where users interact with virtual collections because the native form was
chosen by a different organization with other priorities. In addition,
without a declarative representation, it is difficult to merge views, which
is a requirement of dissemination-based systems, where communities of interest
must be formed from multiple user interest profiles.
The following query language operations are needed:
-
Transformation. Value conversion requires arithmetic operators (e.g., for
conversions like Celsius to Fahrenheit). String manipulation operations
(e.g., concat, substring) would also be helpful. Finally, there should be
a way to call arbitrary functions (e.g., dollars-to-yen), which in a web
environment might be implemented as Java code.
-
Restructuring. Examples include hierarchy inversion and changing an element
to an attribute. Significant restructuring can take place given the ability
to nest queries and to "gather" their results—e.g., to create a new parent
node with links to every member of a query’s result set. The XQL proposal
of [Robie] does this gathering, but only at the root, where it wraps the
result set with an "xql:result" tag.
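The operations in the list above can be illustrated with a small Python sketch. It applies an arithmetic conversion (Celsius to Fahrenheit), changes an element into an attribute, and gathers the results of an inner query under a new parent node. The Readings document and the helper names (celsius_to_fahrenheit, element_to_attribute, gather) are hypothetical and exist only for this illustration; for simplicity, the gathered node holds copies of the result nodes rather than links to them.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical sample data, not taken from any real source.
READINGS_XML = """
<Readings>
  <Reading><Site>Alpha</Site><TempC>100</TempC></Reading>
  <Reading><Site>Bravo</Site><TempC>-40</TempC></Reading>
  <Reading><Site>Charlie</Site><TempC>37</TempC></Reading>
</Readings>
"""


def celsius_to_fahrenheit(celsius):
    """Arithmetic conversion of the kind a query language should express."""
    return celsius * 9 / 5 + 32


def element_to_attribute(node, child_tag):
    """Return a copy of node whose child element becomes an attribute."""
    result = copy.deepcopy(node)
    child = result.find(child_tag)
    result.set(child_tag, child.text)
    result.remove(child)
    return result


def gather(parent_tag, nodes):
    """Create a new parent node that holds every member of a result set."""
    parent = ET.Element(parent_tag)
    parent.extend(nodes)
    return parent


readings = ET.fromstring(READINGS_XML)

# Inner query: select the readings above freezing.
warm = [r for r in readings.findall("Reading") if float(r.findtext("TempC")) > 0]

# Restructure and transform each selected node.
restructured = []
for reading in warm:
    node = element_to_attribute(reading, "Site")
    temp = node.find("TempC")
    temp.tag = "TempF"
    temp.text = f"{celsius_to_fahrenheit(float(temp.text)):.1f}"
    restructured.append(node)

# Outer step: gather the inner query's results under a new parent.
warm_readings = gather("WarmReadings", restructured)
print(ET.tostring(warm_readings, encoding="unicode"))
```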
The XML query language must define the structures to which security and
dissemination specifications are attached. The actual mechanisms, such
as access controls and event subscriptions, would be defined by other standards.
However, the granules that one wants to release or disseminate may not
be the stored ones. Views give a mechanism to specify these desired granules.
Declarative views might also allow administrators familiar with one DTD
to provide specifications that will be executed over documents conforming
to another DTD.
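As a rough sketch of how views can define release granules, the Python code below attaches access rules to named views rather than to the stored document. The directory view withholds salaries, so a rule on that view releases a granule that does not exist in storage. The view names, role names, and the read_view function are hypothetical; an actual access control mechanism would be defined by other standards, as noted above.

```python
import copy
import xml.etree.ElementTree as ET

# First representation from Figure 1.
EMPLOYEES_XML = """
<Employees>
  <Worker><Name>Dilbert</Name><Salary>53000</Salary><Department>Z11</Department></Worker>
  <Drone><Name>Wally</Name><Salary>48000</Salary><Department>Z11</Department></Drone>
</Employees>
"""


def directory_view(employees):
    """View that exposes names and departments but withholds salaries."""
    view = ET.Element("Directory")
    for emp in employees:
        entry = ET.SubElement(view, "Entry")
        for tag in ("Name", "Department"):
            ET.SubElement(entry, tag).text = emp.findtext(tag)
    return view


def full_view(employees):
    """View that exposes the stored document unchanged."""
    return copy.deepcopy(employees)


# Hypothetical registry: access rules are attached to views, not to storage.
VIEWS = {"directory": directory_view, "full": full_view}
GRANTS = {"directory": {"staff", "hr"}, "full": {"hr"}}


def read_view(view_name, role, document):
    """Evaluate a view for a role, refusing roles that were not granted access."""
    if role not in GRANTS[view_name]:
        raise PermissionError(f"role {role!r} may not read view {view_name!r}")
    return VIEWS[view_name](document)


employees = ET.fromstring(EMPLOYEES_XML)
staff_copy = read_view("directory", "staff", employees)
print(ET.tostring(staff_copy, encoding="unicode"))
```

Because the rule names a view, a single grant covers every document the view is evaluated over.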
To express interest profiles over diverse information sources, one needs
common relational algebra operations (i.e., select, project, join), plus
path expressions for graph traversal. Path expressions are supported by
both [Robie] and [Deutsch], but [Robie] omits the relational operations.
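The sketch below shows, in Python, what an interest profile combining these operations must compute. It uses a path expression to reach the employee elements of Figure 1, selects on salary, joins with a second source on department, and projects two fields. The Sites document, its contents, and the profile function are hypothetical and serve only as an illustration.

```python
import xml.etree.ElementTree as ET

# First representation from Figure 1.
EMPLOYEES_XML = """
<Employees>
  <Worker><Name>Dilbert</Name><Salary>53000</Salary><Department>Z11</Department></Worker>
  <Drone><Name>Wally</Name><Salary>48000</Salary><Department>Z11</Department></Drone>
</Employees>
"""

# Hypothetical second source that maps departments to cities.
SITES_XML = """
<Sites>
  <Site Dept="Z11"><City>Bedford</City></Site>
  <Site Dept="Q42"><City>McLean</City></Site>
</Sites>
"""


def profile(employees, sites, min_salary):
    """Return (name, city) pairs for employees earning at least min_salary."""
    # Path expression: any child element, whatever its tag, that has a Salary.
    candidates = employees.findall("./*[Salary]")
    # Select: keep the employees that satisfy the salary predicate.
    selected = [e for e in candidates if int(e.findtext("Salary")) >= min_salary]
    # Join: match each employee's Department against Site/@Dept.
    city_by_dept = {s.get("Dept"): s.findtext("City") for s in sites.findall("Site")}
    # Project: return only the name and the joined city.
    return [
        (e.findtext("Name"), city_by_dept[e.findtext("Department")])
        for e in selected
        if e.findtext("Department") in city_by_dept
    ]


matches = profile(ET.fromstring(EMPLOYEES_XML), ET.fromstring(SITES_XML), 50000)
print(matches)
```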
Another requirement is that an XML query language allow users to control
whether query results preserve the sequence of sibling nodes. Sequence
is essential for document processing, but irrelevant in most structured
data (e.g., in a relational table, the order of rows and columns is by
definition not significant).
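One way to read this requirement is as a choice between two notions of result equality. The Python sketch below compares two result sets either as sequences (order significant) or as multisets (order ignored). The same_result function and the sample document are hypothetical.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def canonical(node):
    """Serialize a node so that results can be compared by content."""
    return ET.tostring(node, encoding="unicode").strip()


def same_result(left, right, ordered):
    """Compare two query results under ordered or unordered semantics."""
    left_items = [canonical(n) for n in left]
    right_items = [canonical(n) for n in right]
    if ordered:
        # Document processing: sibling sequence is part of the answer.
        return left_items == right_items
    # Structured data: only the multiset of results matters.
    return Counter(left_items) == Counter(right_items)


doc = ET.fromstring("<Steps><Step>Open valve</Step><Step>Start pump</Step></Steps>")
forward = doc.findall("Step")       # findall returns matches in document order
backward = list(reversed(forward))  # same nodes, opposite order

print(same_result(forward, backward, ordered=True))
print(same_result(forward, backward, ordered=False))
```

When the user declares order irrelevant, a query processor is free to return results in whatever order is cheapest to produce.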
Issues
We close with a discussion of issues that should be considered by this
workshop.
-
The W3C query language activity should specify a standard interface
to the environment in which queries execute. A query doesn’t live in
isolation; it needs to attach to an environment. Without the environment,
"raw" query languages provide little interoperability (e.g., SQL without
ODBC). ODBC and "Universal Data Access" (UDA) [Blakeley]
are the reigning "standards" for such attachment. The query language activity
should analyze them to understand the requirements better. Environment
issues include:
-
What are the API requirements for manipulating query results? Do we need
something like SQL cursors, which allow users and applications to navigate
query results?
-
We should anticipate the need for programming language bindings. Do they
raise any issues?
-
How are queries invoked (e.g., as DCOM or CORBA requests)? This is probably
a DOM issue.
-
XML query tools must interoperate with database management systems and
pre-existing information retrieval tools. What are the requirements
for wrapping tools so that they can be used by XML query tools and vice
versa? In particular, an effort should be made to make the query language
meld well with SQL.
-
How do we interoperate with systems that support a subset of the language’s
features? UDA is especially strong here, supporting a simple abstraction—the
row-set—that many sources can support, and then supporting additional features
for more powerful sources.
-
We should avoid unnecessary duplication. Query features are now
being discussed in several efforts (e.g., XML, XSL, and DOM). Before supporting
overlapping functionality, we should convince ourselves that the requirements
are different. In addition, while alternate syntaxes may be appropriate
for different requirements, query processing semantics should be consistent
across efforts (e.g., what is returned by a query, treatment of null values,
ordering of results).
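To illustrate the cursor question raised among the environment issues above, here is a minimal Python sketch of a cursor-style interface over XML query results. The ResultCursor class is hypothetical; its method names are borrowed loosely from the cursor objects of Python's database API and are not a proposal for the actual interface.

```python
import itertools
import xml.etree.ElementTree as ET


class ResultCursor:
    """Hypothetical cursor that lets an application step through query results."""

    def __init__(self, root, path):
        # iterfind yields matches lazily, so results are produced on demand.
        self._matches = iter(root.iterfind(path))

    def fetchone(self):
        """Return the next result node, or None when the results are exhausted."""
        return next(self._matches, None)

    def fetchmany(self, size):
        """Return up to size further result nodes."""
        return list(itertools.islice(self._matches, size))

    def __iter__(self):
        """Iterate over whatever results have not been fetched yet."""
        return self._matches


doc = ET.fromstring(
    "<Items>" + "".join(f"<Item>{i}</Item>" for i in range(5)) + "</Items>"
)
cursor = ResultCursor(doc, "Item")
first = cursor.fetchone()
batch = cursor.fetchmany(2)
rest = [node.text for node in cursor]
print(first.text, [node.text for node in batch], rest)
```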
References
[Blakeley] Jose Blakeley, Michael Pizzo, Microsoft Universal Data Access
Platform, Proceedings of ACM-SIGMOD International Conference on the
Management of Data, 1998
[Deutsch] Deutsch, Fernandez, Florescu, Levy, and Suciu, XML-QL: A query
language for XML
[Robie] Jonathan Robie, The Design of XQL