Querying and Transforming XML

Authors:

David Schach, Microsoft Corporation
Joe Lapp, webMethods Inc.
Jonathan Robie, Texcel Inc.

Contributors:

Michael Hyman, Microsoft Corporation
Jonathan Marsh, Microsoft Corporation

Abstract:

Many applications need queries and transformations for information represented in XML, and these two subjects are closely linked. XSL provides a powerful transformation language for XML documents. The XQL Proposal, a superset of the XSL pattern syntax, addresses the information retrieval aspects of queries. This paper describes the benefits of using the XSL transformation language together with the XQL Proposal to provide an integrated environment for queries and transformations.


Contents:

1. Introduction
2. Overview of XSL
3. Additional Query Capabilities
4. Relationship to XML-QL
4.1. Variables and Joins
4.2. Object Identifiers
4.3. Integrating Data From Multiple XML Sources
4.4. Database Management
5. Conclusion

1. Introduction

The XML 1.0 Recommendation has enabled a standard way to interchange data on the web. With this new capability inevitably comes the need for mechanisms for querying XML, shaping extracted data including sorting and filtering, transformation from one XML grammar to another, and often the presentation of this data to an end user.

The similarities between the approach proposed in XML-QL: A Query Language for XML (http://www.w3.org/TR/NOTE-xml-ql/) and the transformation language being defined by the XSL Working Group are striking. We believe some convergence of these two approaches would result in a powerful and flexible query and transformation language for XML.

XSL and the XSL pattern language offer some unique advantages as a basis for developing a query and transformation language:

  1. already a W3C work in progress
  2. modularity of the pattern (query) syntax, the transformation capability, and the formatting grammar
  3. unified mechanism for shaping XML and for generating presentation
  4. accommodates both regular data and highly irregular or recursive data structures

2. Overview of XSL

XSL comprises three modular areas:

  1. XSL patterns - XSL defines a "pattern" syntax which identifies nodes within an XML document. This capability provides the analogue of a SQL WHERE clause.
  2. "xsl" namespace - The transformation portion of XSL is expressed as an XML namespace which controls the materialization of query results as an XML document.
  3. "fo" namespace - The formatting capabilities of XSL are expressed as an XML grammar.

The XSL Pattern syntax is simple yet provides powerful query capabilities. Its purpose is to identify a subset of an XML document based on ancestry chains, wildcards, and qualifiers such as attribute value tests. The syntax is concise and modeled after familiar methods for directory navigation. As a string-based syntax it can reside within attribute values, script languages, and URLs.

The "xsl" namespace defines a set of commands enabling the materialization of query results as XML - in the same data grammar, a different data grammar, or in a presentation grammar such as HTML.

An XSL document contains elements from the xsl namespace, which may:

The formatting capabilities of XSL are expressed as an XML grammar - the "fo" namespace - completely independent of the transformation process.

3. Additional Query Capabilities

The companion document XQL Proposal, written by Microsoft, Texcel and webMethods, might be seen as a superset of the XSL Pattern syntax that provides Boolean operations, richer comparison operations, position, ranges, and other features.

A language like the XSL Pattern syntax or XQL is useful in its own right, and can be used in many environments besides XSL to query individual documents or collections of documents, whether those documents are represented as text or as DOM nodes. Queries are easily embedded as strings in programming languages or in XML or HTML attributes. Furthermore, this approach does not force any particular format for realizing results; results might be returned as XML text, as pointers to nodes, as structures representing regions of the document, or as primitive structures such as strings, integers, booleans, and arrays. This flexibility allows these languages to be used in many environments in the W3C.

In September 1998, the XQL Proposal was provided to the XSL Working Group as input when considering extensions to the XSL pattern syntax (http://www.w3.org/Style/XSL/Group/1998/09/XQL-proposal.html).

4. Relationship to XML-QL

The XML-QL submission does an excellent job of framing the set of problems needing to be addressed by a general purpose query and transformation language for XML. The capabilities described for XML-QL are often similar to those provided by the XSL transformation and pattern languages.

Both approaches are block structured and template oriented. Both offer the ability to return trees or graphs, create new elements in the output, and query XML. The biggest differences are syntactic. XSL uses a URL-like syntax for specifying patterns and queries and the inherent XML document structure to delimit query blocks. XML-QL uses an XML-like query-by-example pattern to select data and explicit keywords to delimit blocks.

The basic structure of our approach is:

<xsl:for-each select = "XSL pattern">
  <!-- Construct result -->
</xsl:for-each>

The corresponding structure of XML-QL is:

WHERE XML-QL pattern
CONSTRUCT output

In comparing the two languages, we will include some examples found in the XML-QL W3C submission and show how some of these queries can be written using our approach. The examples in this document operate on the following sample XML, also taken from the XML-QL submission.

<bib>
  <book year="1995">
    <!-- A good introductory text -->
    <title> An Introduction to Database Systems </title>
    <author> <lastname> Date </lastname> </author>
    <publisher> <name> Addison-Wesley </name> </publisher>
  </book>

  <book year="1998">
    <title> Foundation for Object/Relational Databases: The Third Manifesto </title>
    <author> <lastname> Date </lastname> </author>
    <author> <lastname> Darwen </lastname> </author>
    <publisher> <name> Addison-Wesley </name> </publisher>
  </book>
</bib>

A simple example from the XML-QL submission selects authors of books published by Addison-Wesley:

WHERE <book>
  <publisher><name>Addison-Wesley</name></publisher>
  <author>$a</author>
</book>
CONSTRUCT $a

The equivalent using our approach is:

<xsl:for-each select = "book[publisher/name = 'Addison-Wesley']/author">
  <xsl:value-of />
</xsl:for-each>
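
As a rough illustration of what this pattern selects, the following sketch emulates it with Python's standard xml.etree.ElementTree module rather than with any XSL processor. The helper names text_of and authors_by_publisher are hypothetical, and trimming whitespace before comparing values is an assumption made so that " Addison-Wesley " in the sample data matches; the pattern language's actual whitespace rules are not described here.

```python
import xml.etree.ElementTree as ET

# The sample bibliography used throughout this paper.
BIB = """
<bib>
  <book year="1995">
    <title> An Introduction to Database Systems </title>
    <author> <lastname> Date </lastname> </author>
    <publisher> <name> Addison-Wesley </name> </publisher>
  </book>
  <book year="1998">
    <title> Foundation for Object/Relational Databases: The Third Manifesto </title>
    <author> <lastname> Date </lastname> </author>
    <author> <lastname> Darwen </lastname> </author>
    <publisher> <name> Addison-Wesley </name> </publisher>
  </book>
</bib>
"""


def text_of(node):
    """Concatenate all text below a node and trim surrounding whitespace."""
    return "".join(node.itertext()).strip()


def authors_by_publisher(bib, publisher):
    """Emulate book[publisher/name = publisher]/author relative to <bib>."""
    return [
        text_of(author)
        for book in bib.findall("book")
        # The qualifier: keep a book if any publisher/name equals the value.
        if any(text_of(n) == publisher for n in book.findall("publisher/name"))
        for author in book.findall("author")
    ]


bib = ET.fromstring(BIB)
authors = authors_by_publisher(bib, "Addison-Wesley")
print(authors)  # ['Date', 'Date', 'Darwen']
```

The author "Date" appears twice because the pattern returns one node per matching author element; nothing in it removes duplicates.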

The original XSL Proposal (http://www.w3.org/TR/NOTE-XSL.html), submitted in August 1997, proposed a "query by example" syntax similar to XML-QL, but this approach later ran into some challenges. The current URL-like syntax works well for both simple and complicated constraints. For example, suppose we want authors of books published by either of two publishers. This is easily written:

<xsl:for-each select = "book[publisher[name = 'Addison-Wesley' 
                                  $or$ name = 'Microsoft Press']]/author">
  <xsl:value-of />
</xsl:for-each>

There are a number of features in XML-QL that are not currently addressed by our approach. These include:

We believe that these are interesting features to consider in attempting convergence of the two approaches.

4.1. Variables and Joins

Our approach does not support variables - a pattern simply returns the set of nodes that satisfy the pattern's condition. XML-QL patterns allow variables and return a set of all possible bindings of variables and values that satisfy the condition. This functionality is needed to support joins and more complicated queries and transformations. However, many queries in XML-QL that require variables can be accomplished through other means in XSL.

For example, consider the XML-QL query which groups results by book title:

WHERE 
  <book>$p</> IN "www.a.b.c/bib.xml",
  <title>$t</>,
  <publisher><name>Addison-Wesley</></> IN $p
CONSTRUCT 
  <result>
    <title>$t</>
    WHERE <author>$a</> IN $p
    CONSTRUCT <author>$a</>
  </>

This can be written using our approach without variables.

<xsl:for-each select = "book[publisher/name = 'Addison-Wesley']">
  <result>
    <title><xsl:value-of select="title" /></title>
    <xsl:for-each select = "author">
      <author><xsl:value-of /></author>
    </xsl:for-each>
  </result>
</xsl:for-each>
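
The nesting in this template can be sketched in Python with the standard xml.etree.ElementTree module: the outer loop plays the role of the outer xsl:for-each and the inner loop the nested one, with no variables needed because each inner query runs relative to the current book. The function name results_by_title is hypothetical, and trimming whitespace from text values is an assumption.

```python
import xml.etree.ElementTree as ET

# The sample bibliography used throughout this paper.
BIB = """
<bib>
  <book year="1995">
    <title> An Introduction to Database Systems </title>
    <author> <lastname> Date </lastname> </author>
    <publisher> <name> Addison-Wesley </name> </publisher>
  </book>
  <book year="1998">
    <title> Foundation for Object/Relational Databases: The Third Manifesto </title>
    <author> <lastname> Date </lastname> </author>
    <author> <lastname> Darwen </lastname> </author>
    <publisher> <name> Addison-Wesley </name> </publisher>
  </book>
</bib>
"""


def text_of(node):
    """Concatenate all text below a node and trim surrounding whitespace."""
    return "".join(node.itertext()).strip()


def results_by_title(bib, publisher):
    """Build one <result> per matching book, holding its title and authors."""
    results = []
    for book in bib.findall("book"):  # outer for-each
        names = book.findall("publisher/name")
        if not any(text_of(n) == publisher for n in names):
            continue
        result = ET.Element("result")
        title = book.findtext("title", default="").strip()
        ET.SubElement(result, "title").text = title
        for author in book.findall("author"):  # nested for-each
            ET.SubElement(result, "author").text = text_of(author)
        results.append(result)
    return results


results = results_by_title(ET.fromstring(BIB), "Addison-Wesley")
for result in results:
    print(ET.tostring(result, encoding="unicode"))
```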

However, variables are helpful for joins. There is no equivalent in our approach to the following XML-QL query, which selects articles by authors who have written books after 1995.

WHERE 
  <article>
    <author>
      <firstname>$f</>
      <lastname>$l</>
    </>
  </> CONTENT_AS $a IN "www.a.b.c/bib.xml",

  <book year = $y>
    <author>
      <firstname>$f</>
      <lastname>$l</>
    </>
  </> IN "www.a.b.c/bib.xml",
  $y > 1995

CONSTRUCT 
  <article>$a</>

Variables could be added to our approach to support joins.
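
To make concrete what the variables $f and $l contribute, the sketch below performs the same join in Python with the standard xml.etree.ElementTree module. The data is hypothetical, since the sample bibliography in this paper has no article or firstname elements, and the function names are illustrative only. As a simplification, each matching article is returned once, whereas a binding-based language would yield one result per satisfying binding.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: invented names and titles, for illustration only.
BIB = """
<bib>
  <book year="1998">
    <author><firstname>Ann</firstname><lastname>Example</lastname></author>
  </book>
  <book year="1995">
    <author><firstname>Bob</firstname><lastname>Sample</lastname></author>
  </book>
  <article>
    <title>Hypothetical article A</title>
    <author><firstname>Ann</firstname><lastname>Example</lastname></author>
  </article>
  <article>
    <title>Hypothetical article B</title>
    <author><firstname>Bob</firstname><lastname>Sample</lastname></author>
  </article>
</bib>
"""


def author_keys(node):
    """Return the set of (firstname, lastname) pairs for authors under node."""
    return {
        (a.findtext("firstname", "").strip(), a.findtext("lastname", "").strip())
        for a in node.findall("author")
    }


def articles_by_recent_book_authors(bib, after_year):
    """Select articles whose author also wrote a book later than after_year."""
    # Collect the ($f, $l) bindings produced by books with year > after_year.
    recent = set()
    for book in bib.findall("book"):
        if int(book.get("year", "0")) > after_year:
            recent |= author_keys(book)
    # Join: keep articles that share at least one ($f, $l) binding.
    return [art for art in bib.findall("article") if author_keys(art) & recent]


matches = articles_by_recent_book_authors(ET.fromstring(BIB), 1995)
titles = [art.findtext("title") for art in matches]
print(titles)
```

The shared set of name pairs is doing the work that a variable binding would do in a pattern: it carries a value found in one part of the document into a test applied to another part.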

4.2. Object Identifiers

Object identifiers and Skolem functions work together to ensure that only one instance of an element is created in the output. In addition, object identifiers are helpful when building a graph. However, object identifiers cause problems when the query result is a stream, because they allow XML-QL to produce an element anywhere in the graph at any time. As a result, the entire graph has to be buffered before it can be streamed. This has serious implications when transforming data on the Internet.
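
The buffering problem can be seen in a small Python sketch, written with the standard xml.etree.ElementTree module. It is not XML-QL's implementation: the SkolemTable class, the function name "PersonID", and the list of bindings are all hypothetical. The table maps a Skolem function name and its arguments to exactly one output element, so later bindings can keep adding content to an element created earlier.

```python
import xml.etree.ElementTree as ET


class SkolemTable:
    """Map (function name, arguments) to exactly one output element."""

    def __init__(self):
        self._elements = {}

    def element(self, func, *args, tag):
        """Return the element for func(*args), creating it on first use."""
        key = (func, args)
        if key not in self._elements:
            self._elements[key] = ET.Element(tag)
        return self._elements[key]

    def all_elements(self):
        """Return every element created so far, in creation order."""
        return list(self._elements.values())


# Hypothetical (lastname, title) bindings, in the order a query yields them.
bindings = [
    ("Date", "An Introduction to Database Systems"),
    ("Date", "Foundation for Object/Relational Databases: The Third Manifesto"),
    ("Darwen", "Foundation for Object/Relational Databases: The Third Manifesto"),
]

table = SkolemTable()
for lastname, title in bindings:
    # The same lastname always yields the same <person> element.
    person = table.element("PersonID", lastname, tag="person")
    if person.find("lastname") is None:
        ET.SubElement(person, "lastname").text = lastname
    ET.SubElement(person, "title").text = title

# Only now, after every binding has been seen, is it safe to serialize.
people = table.all_elements()
for person in people:
    print(ET.tostring(person, encoding="unicode"))
```

Because a binding arriving at any point may extend an element that was created first, no element can be written out until the input is exhausted, which is the streaming cost described above.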

4.3. Integrating Data From Multiple XML Sources

XML-QL allows each WHERE clause to query data from a different document. XSL currently queries only a single XML document. Querying from multiple documents could be done by adding a source attribute to <xsl:for-each> and <xsl:apply-templates>.

4.4. Database Management

An important area commonly associated with query languages is database management, including operations such as create, delete, update, and insert. This area is not currently addressed by either XML-QL or XSL.

5. Conclusion

The problems involved in querying XML are closely linked to transformation and result construction capabilities. The similarities between XML-QL and XSL suggest that these two proposals should cross-fertilize. A collaboration along these lines could produce a powerful general purpose query and transformation mechanism for XML.