XML Query and transformation language

Authors:

Adam Bosworth
Andrew Layman
Adriana Ardeleanu
David Schach

Contributors:

Jennifer Widom (Stanford)
Alon Levy (University of Washington)

Vision.
We believe that it will be enormously useful to have a single language for moving any type of information around the web, and we have worked hard to enable XML to be this language. Similarly, we believe that it will be enormously useful to have a single language for querying XML. We further believe that on the web it will not be practical for data providers to expose the underlying physical implementation of their storage, as SQL, for example, does today, because:
a) Implementations will vary across both companies and time, and the consumers of this data on the web need a consistent, invariant view. For example, one book vendor might provide books as a text file, another with a particular schema using Oracle, and yet another with a different model using ObjectStore. And the latter two might well change their schema and implementation as time passes.
b) The number of requestors of data can be huge. The costs of round-trips to the server are high, and the server costs for serving huge numbers of customers simultaneously are still higher. Thus, ideally, customers can ask once for the data they need to do their job and then go away, leaving the servers free to handle other requests. This model requires that rich sub-graphs of data can easily be requested and materialized by value.
We believe that this XML Query and transformation language can and will be used to ask both for the rich sub-graphs of data and for the explicit serialization of these graphs. The serialization (e.g. the XML grammar) will be important because it allows consumers to consume a consistent shape even as the implementations on the servers evolve across time and across servers. Thus, we hope that a working group will emerge from this workshop that can agree to work on a query and transformation language that is:

Expressive enough to be used for a rich set of graph-to-graph transformations,
Rich enough to describe the desired serialization, and
Optimizable.

Abstract:

We believe that XML can and will be used for two key purposes. First, it will be used as a uniform mechanism (really a legal fiction) for describing data whose actual storage model is some active store, such as a relational database or an application, where the provider wants to support logical views on this data without making any physical implementation commitments. Second, it will be used as a serialized data transport for all sorts of information, ranging from the serialized set of information that you want from an active provider such as a database, to documents, to private encodings of arbitrary graphs rendered in Perl. We believe that ideally one query and transformation language would be used for both purposes, where it is the job of the query and transformation language to:
1) Take the complex, potentially order-dependent input graphs and emit new graphs that restrict and reshape as appropriate, and
2) Describe the serialization of these new graphs such that the language can be explicit about what is serialized, what is not, and how it is serialized.
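
As a rough illustration of these two jobs (not the proposed language itself), the following Python sketch uses the standard library's xml.etree.ElementTree. The catalog, book, title, and price element names and the function names cheap_titles and serialize are hypothetical, chosen only for this example. The first function restricts and reshapes an input graph into a new graph; the second makes the serialization step explicit.

```python
import xml.etree.ElementTree as ET

# Hypothetical input document; the element names are assumptions for illustration.
SOURCE = """<catalog>
  <book><title>A</title><price>30</price></book>
  <book><title>B</title><price>12</price></book>
</catalog>"""


def cheap_titles(xml_text, max_price):
    """Job 1: restrict (filter by price) and reshape (keep only titles)."""
    source = ET.fromstring(xml_text)
    result = ET.Element("titles")
    for book in source.findall("book"):
        if float(book.findtext("price")) <= max_price:
            # Only the title is copied into the new graph; the price is not.
            ET.SubElement(result, "title").text = book.findtext("title")
    return result


def serialize(graph):
    """Job 2: emit the new graph in an explicit XML shape."""
    return ET.tostring(graph, encoding="unicode")


print(serialize(cheap_titles(SOURCE, 20)))
# prints <titles><title>B</title></titles>
```

The consumer sees only the titles grammar, whatever storage the provider uses behind the catalog.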

We do assume that all XML can be modeled as a graph, albeit one with order-dependent edges and with edges that reflect containment, e.g. a physical sub-element within an element (see data model below).
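
One way to picture this assumption is the minimal Python sketch below, which turns an XML document into nodes plus ordered containment edges. The Edge class, its fields, and the xml_to_graph function are hypothetical names introduced for illustration; they are not the canonical model referred to in this paper.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    parent: int    # id of the containing element
    child: int     # id of the contained sub-element
    position: int  # order of the child among its siblings
    kind: str = "contains"


def xml_to_graph(xml_text):
    """Return (nodes, edges): node id -> tag, plus ordered containment edges."""
    nodes = {}
    edges = []

    def visit(elem):
        node_id = len(nodes)  # ids are assigned in document order
        nodes[node_id] = elem.tag
        for position, child in enumerate(elem):
            child_id = visit(child)
            edges.append(Edge(node_id, child_id, position))
        return node_id

    visit(ET.fromstring(xml_text))
    return nodes, edges


nodes, edges = xml_to_graph("<book><title/><author/><author/></book>")
```

Here the two author sub-elements become two distinct edges from book, distinguished only by their position, which is what makes the edges order dependent.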

We agree with some of the other papers that a query and transformation language should not be encumbered with concepts strictly associated with a style-sheet language. However, it is worth noting that, in our view, the output languages will typically also be graphs with some serialization and, as such, should fall out of any query and transformation language that transforms graphs. For example, HTML and Adobe's PGML can both be thought of as graphs serialized into XML, although the de facto standard in HTML today violates this in some ways.

However, we do believe that it is important for a query and transformation language to describe not only graph<->graph transforms, but also how the resulting graph would be serialized. Why? Well, first, data is transferred around on the web. This means that the language must be precise about what is serialized. It should also be precise about the serialization shape (e.g. the resulting XML grammar), because several of the consuming applications (such as the browser or many applications written in C++ or Java) will expect specific grammars.

It is a goal that the query and transformation language be as close to the transformation part of XSL as possible.

It is a goal that the language be extensible. As examples:

Unions and intersections could be added,
Queries on text could easily be extended to ask questions such as "find all sentences with objects after verbs", where the position of elements matters, and
Aggregates could be extended to include new types of aggregates such as Mode or Median.
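
The last point can be sketched in Python as a small registry to which new aggregates are added without changing the query code. The names AGGREGATES, register_aggregate, and aggregate, and the sample prices document, are assumptions made for this illustration only.

```python
import statistics
import xml.etree.ElementTree as ET

# Built-in aggregates; the registry itself is a hypothetical design.
AGGREGATES = {"sum": sum, "count": len}


def register_aggregate(name, func):
    """Extend the set of aggregates available to queries."""
    AGGREGATES[name] = func


def aggregate(xml_text, path, name):
    """Apply the named aggregate to the numeric text of all elements at path."""
    values = [float(e.text) for e in ET.fromstring(xml_text).findall(path)]
    return AGGREGATES[name](values)


# New aggregate types are plugged in after the fact.
register_aggregate("median", statistics.median)
register_aggregate("mode", statistics.mode)

PRICES = "<prices><p>3</p><p>1</p><p>3</p><p>10</p></prices>"
print(aggregate(PRICES, "p", "median"))  # prints 3.0
```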

What Microsoft will be building.

Some may ask what Microsoft is doing about all this. It is a fair question. Today, we are building a component that will be shipped as a standard system component starting with IE 5.0. This component can be used for tokenizing XML. The same component can be used to fully parse the XML and build a tree/graph, or simply to pass tokens on to another piece of code that builds its own data structures. It supports XSL patterns today for quickly and efficiently finding nodes or collections of nodes within the tree/graph, and it supports full XSL transforms from an input tree/graph to an output tree/graph. The component is designed to run at high speed on both the server and the client, and will ship with IE 5.0 simply as a distribution mechanism. Any language, from Java to C++ to any scripting language, can use this component. We are also working with partners on building a Java version of this component.
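
The three modes of use described above (passing tokens along, building a tree, and finding nodes by pattern) can be illustrated by analogy with Python's standard xml.etree.ElementTree module. This is not the Microsoft component or its API; the function names token_stream and find_items and the sample order document are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document.
DOC = b"<order><item>pen</item><item>ink</item></order>"


def token_stream(data):
    """Token mode: report start/end events so a caller can build its own structures."""
    events = ET.iterparse(io.BytesIO(data), events=("start", "end"))
    return [(event, elem.tag) for event, elem in events]


def find_items(data):
    """Tree mode: fully parse, then locate a collection of nodes by path pattern."""
    tree = ET.fromstring(data)
    return [item.text for item in tree.findall("./item")]


print(token_stream(DOC)[:2])  # prints [('start', 'order'), ('start', 'item')]
print(find_items(DOC))        # prints ['pen', 'ink']
```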

Over time, we expect to greatly enhance the language used for tree/graph queries and transformations, as discussed below. We will be bringing a proposed language (see Submission below) to this conference. We also expect to put support for this language directly into our own stores, so that requests for complex graphs of information may be made directly against those stores with high efficiency.

We expect to work with partners and standards bodies to put together a framework for discovery on the web. This framework should enable engines to search for providers of information and goods and services who support specific services, specific schema, and specific parameterized queries or simply the entire general query and transformation language.

What we're not proposing.

In this conference, however, we are not proposing anything to solve the general service discovery problem. Nor are we proposing anything that solves the general problem of Metadata, which we'll simplify into:
1) Common schemas that could be shared for discovering documents or data
2) Figuring out how to ask questions across information providers who do not share common schema

Submission:

We hope to bring to this conference:
1) a proposed canonical model for XML for describing graphs,
2) a proposed language for querying and transforming XML, general enough to handle joins, aggregation, parameterization, general searching within a graph, and general graph construction, along with the specifics that describe how to serialize this graph into the desired XML grammar,
3) a description of the underlying data model that this query and transformation language assumes.
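
To make the capabilities listed in item 2 concrete, here is a minimal Python sketch of a parameterized query that joins two documents, aggregates, constructs a new graph, and serializes it. It is written with the standard xml.etree.ElementTree module and is not the proposed language. The books and reviews documents, the isbn and rating attributes, the ratedBooks output grammar, and the rated_books function are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical inputs from two different providers.
BOOKS = """<books>
  <book isbn="1"><title>XML Basics</title></book>
  <book isbn="2"><title>Graph Stores</title></book>
</books>"""

REVIEWS = """<reviews>
  <review isbn="1" rating="4"/>
  <review isbn="1" rating="2"/>
  <review isbn="2" rating="5"/>
</reviews>"""


def rated_books(books_xml, reviews_xml, min_rating):
    """Join books to reviews on isbn, average the ratings, keep books at or above min_rating."""
    # Group ratings by isbn (the join key).
    ratings = {}
    for review in ET.fromstring(reviews_xml).findall("review"):
        ratings.setdefault(review.get("isbn"), []).append(float(review.get("rating")))

    # Construct the output graph in the desired grammar.
    result = ET.Element("ratedBooks")
    for book in ET.fromstring(books_xml).findall("book"):
        scores = ratings.get(book.get("isbn"), [])
        if not scores:
            continue  # no matching review: the join drops this book
        average = sum(scores) / len(scores)  # aggregation
        if average >= min_rating:            # parameterization
            out = ET.SubElement(result, "book", title=book.findtext("title"))
            out.set("averageRating", str(average))
    return ET.tostring(result, encoding="unicode")


print(rated_books(BOOKS, REVIEWS, min_rating=4))
```

With min_rating=4, only "Graph Stores" (average 5.0) appears in the result, since "XML Basics" averages 3.0.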