Position Paper for the W3C query language Workshop December 3, 1998

Editor:

Adam Bosworth, Microsoft

Contributors:

Alon Levy, University of Washington

Jennifer Widom, Stanford

Roy Goldman, Stanford

Jason McHugh, Stanford

Andrew Layman, Microsoft

Adriana Ardelwanu, Microsoft

David Schach, Microsoft

 

Purpose of this document

Discuss requirements for an XML query language and then propose a straw-man solution largely for the purpose of illustrating the requirements and at least one manner in which they can be met. The specific syntax below, therefore, shouldn’t be considered seriously. It is used only to illustrate the function points being made and to demonstrate that many of the requirements can be met. It is assumed that a working group within the W3C would actually work both on the underlying algebra of any query language and on the syntax.

Query language and Requirements

There are two excellent submissions to this conference, "database Desiderata for an XML Query Language" by David Maier and "Position Paper on XML Query" by Paul Cotton, David Fallside, and Ashok Malhotra from IBM that largely mirror my own views on requirements for any query language for XML. So, through judicious plagiarizing, I’ll summarize their points and agree with them.

  1. The language should have closure. It should take XML in and generate XML out. It should be a full graph to graph transform meaning that it can consume and generate XML-LINK’s and ID/IDREF’s, not just hierarchical trees. Since in XML order can matter, the language must therefore enable the specific ordering of elements that are output into the target graph.
  2. The language should be suitable to be run either on the server or the client.
  3. The language should support what Mr. Maier calls Selection, extraction, reduction, restructuring, and combination. The paper from IBM more specifically requires complex selection criteria and Boolean operators and I enthusiastically support this requirement.
  4. The language should not require the incoming XML to have a schema but should take advantage of one when present. (This paper believes that we need to amend XML to allow the inline description of schema without the requirement of validation).
  5. The language should preserve incoming order if requested to (a slight amendment from Mr. Maier’s list).
  6. The language should support programmatic manipulation and here I’ll extend this to require that the language itself be an XML grammar since we can, by definition, manipulation an XML grammar. Others suggest that the XML grammar be an optional syntax. I do not understand the tools strategy and documentation strategy for this. The IBM paper cited above requires that "It should be possible to embed XML queries within XML documents". I agree with this.
  7. The language should support data types including an extensibility mechanism. I’ll generalize this requirement to add that the language should support data types, extensibility on same, aggregation, extensibility on same, vector operations, and extensibility on same. More on this below.
  8. There should be an underlying algebra for the language.
  9. The language should make every effort to optimize well for stream-based processing where possible (slight rewording of Mr. Maeir’s requirement).
  10. The language should support parameterization and thus intelligent pre-compilation where possible. This is a slight modification of a requirement from the IBM paper cited above.
  11. IBM suggests that the language should use a "template model" for construction of XML. I agree with this for reasons discussed below, but primarily having to do with XML’s innate heterogeneity and frequent cyclic nature.
  12. The language should be comprehensible while maintaining efficient ways to describe the desired action. In short, it shouldn’t be too hard to use, too verbose to enter, or too hard to teach.

A Proposal

The rest of this paper is a proposal. As the W3C has some ongoing relevant work that is either shipping or will be shortly, namely XSL, I decided to try to define a query language for XML starting with the graph transformations operations within XSL. The key paper that influenced this proposal was the XML-QL paper submitted by Alin Deutsch, Mary Fernandez, Daniela Florescu, Alon Levy, and Dan Suciu. Their ideas are stolen everywhere. There was direct help from the folks who built LORE at Stanford, Jenifer widom, Roy… and their help is gratefully acknowledged. There was direct help from Alon Levy of XML-QL fame and his help is also gratefully acknowledged. Lastly there was a lot of help from some folks at Microsoft listed above. The mistakes and the lack of precision, however, are mine.

It is the assertion of this paper that XSL contains the germ of a query language. The paper defines a query language as one which takes as input a graph of data, applies suitable filters, and emits a new graph of suitable shape for the requested subset. SQL, for example, can be thought of as a language which takes as input a graph of data (a database) and emits a suitably shaped table, a very boring form of graph. Now any style sheet language must take as input any XML and emit as output any rich graph the paper requires, suitably filtering the input in the process. As such XSL contains the transformation primitives necessary for a query language. There are substantive limitations in the current model and even in the current implementations. As many may not be familiar with XSL, the paper first provides a primer on XSL’s transformation capabilities and its limitations. Those of you familiar with XSL will want to skip over at least the primer if not the limitations but should read the section on nomenclature as it is used throughout the paper. The paper then goes on to suggest specific extensions to XSL such that it can be used as a general graph to graph transform language.

A Primer on XSL

Why isn’t XSL like SQL?

XSL today has two key differences from SQL that have profoundly affected its language and model.

  1. It is intended to consume graphs that are innately unpredictable, irregular, and cyclic. Documents are, of course, the paradigmatic example of this.
  2. It is intended to output XML, not a table. As has been shown in the accompanying data-model document, this isn’t innately limiting because XML can describe the serialization of any graph. Thus the language contains what I call construction rules, intended to recursively describe how to process and emit the incoming graph. The language also uses as a building block "patterns" which are described in detail below, but which are used both to construct collections of incoming nodes to process and to handle heterogeneity by matching nodes against these "patterns".

These two differences led to the model of XSL today.

Review of XSL.

XSL can be thought of as an ordered collection of templates. Each template has an associated pattern (see below). These templates process input nodes that match the pattern. Each template contains a nested set of construction rules in XML. These rules describe how to process and emit the nodes that the template processes. Construction rules consist of an XML tree of construction nodes.

Some Nomenclature:

The industry hasn’t standardized fully on whether we call the items within XML "elements" or "nodes". Since "nodes" is the more general term and includes attributes, this paper will use the term "node". Similarly the terms "path expression" and "pattern" appear to be used with about equal frequency. This paper will use the term "pattern". We call the nodes that processing incoming nodes and emit outgoing nodes "Construction nodes". We call each incoming node {in}. Ocasionally in this paper we’ll call the emitted node {on}. For any node with a nodeName equal to some string s, we’ll call the node {s}. Construction rules are always scoped against {in}. For example, if the construction rules are being applied to nodes of type {PERSON} and the construction rules refer to NAME, they are referring to the child element {NAME} of {in}. If they refer to @Age they are referring to the "Age" attribute of {in}. If they don’t want to refer to something scoped against {in}, their expressions are explicit as in

@AGE > /LIMITS/MAX/AGE

which means that the "Age" attribute is being compared to the value of the {AGE} child element of the {MAX} node which is a child of the {LIMITS} node which is a child of the root. You will find more on this syntax under "Patterns" below.

More nomenclature. Deviations in this paper from XSL:

For purposes of clarity, this paper will violate XSL in a couple of ways. It will use the symbols ">", "<", "==", "&&", "all", any", and "||" rather than the currently approved XSL symbols "$gt$", "$lt$", "$eq$", "$and$, "$all$, "$any$, and "$or$. Similarly, XSL defines a function, ID(pattern) which doesn’t returns the ID of the node the pattern defines as one might think. Instead, it assumes that the pattern points to a node whose value is an IDREF that refers to another node with a matching ID, finds that other node, and returns it. Accordingly, in this paper we call this function "ref". Also, all attributes in this paper with the name ID are assumed to be of type ID.

 

Warning on nomenclature:

XSL has an attribute called "select". It actually has nothing to do with SQL selection. It is used to describe a pattern for constructing a vector of nodes. It should be called "pattern". But for consistancy with the existing XSL spec, this paper uses the current attribute "select" .

Each template is defined as a node with an attribute, "match" whose pattern (see below) describes a vector of nodes. If the incoming node being processed is contained within that vector, then the construction nodes contained within the template node are executed. For example,

<xsl:template match="PERSON">

some construction nodes here

</xsl:template>

would apply the nested construction nodes to all incoming nodes {in} that are {PERSON} nodes.

By default/convention, there is a template for the entire data-set with the pattern "/" as in

<xsl:template match="/">

construction rules to get started

</xsl:template>

The use of "xsl:" here assumes a containing node for the entire style sheet, normally <xsl:stylesheet> with the namespace suitably defined as in

<xsl:stylesheet xlmns:xsl="http//www.w3.org/TR/WD-xsl">

….everything else

</xsl:stylesheet>

Construction Rules

The simplest construction rule is simply well formed XML. Thus the construction rule

<SomeNode>Kilroy was here</SomeNode>

given any {in} will simply emit the node {SomeNode} containing the text "Kilroy was here".

<xsl:value-of>:

The next simplest construction rule is one that emits the text of the input node. The construction rule

<SomeNode><xsl:value-of /></SomeNode>

emits the node {SomeNode}" containing the text from {in}. This can be qualified. The node can have an attribute "select" set to a pattern (See below for details on patterns). For example, the construction rule

<SomeNode><xsl:value-of select="Name" /><BD><xsl:value-of select="Birthday"/></BD></SomeNode>

emits the node {SomeNode}. This node will contain text from the child node {Name} of {in} and will contain a childnode {BD} whose text is from the child node {Birthday} of {in}. Note. If there are multiple child names {Name} of {in} the "first" is chosen. If {in} isn’t a leaf node, the concatenated text of all of its children in depth order is emitted.

<xsl:copy>:

There is a construction rule to emit the entire input node back out. The construction rule

<xsl:copy attributes="@*"/>

will emit the {in} as an {out} along with all of its incoming attributes, but with none of its child elements.

 

<xsl:attribute>:

The construction node <xsl:attribute> sets the value of an attribute of the node that contains it. If the parent node doesn’t already contain the attribute, it adds it automatically. For example, the construction rule

<SomeNode>

<xsl:attribute name="foo"><xsl:value-of /></xsl:attribute>

</SomeNode>

emits a {SomeNode} with an attribute "foo" whose text value is equal to the text of the input node.

This can be very useful. For example given the incoming XML

<BUTTON Target="http://www.microsoft.com/xml">

<Description>I will take you to the Microsoft XML web site</Description>

</BUTTON>

the construction rule

<P>

<xsl:attribute name="onclick">=window.navigate(‘<xsl:value-of select="@Target"/>’);</xsl:attribute>

If you click on me, <xsl:value-of select="Description"/>

</P>

would emit a classic HTML {P} node complete with target navigation if the user clicked on it. The resulting graph emitted, shown as HTML, in other words, would be

<P onclick="window.navigate(‘http://www.microsoft.com/xml’);">If you click on me, I will take you to the Microsoft XML web site</P>

 

<xsl:if>:

Sometimes the {in} nodes will be of varying types. To handle heterogeneity, there is a construction rule that will only fire if the incoming node fits a pattern. The construction rule

<xsl:if match="Some Pattern">

<SomeNode>…some construction rules</SomeNode>

</xsl:if>

will only emit the node {Somenode} is {in} is logically within the pattern described by the attribute match. For example

<xsl:if match="PERSON[@age>40]">

<OldFogey><xsl:value-of select="Name"/></OldFogey>

</xsl:if>

will emit {OldFogey} nodes only for those {in}’s that were {PERSON} nodes and had the value of their attribute "age" > 40.

 

<xsl:apply-templates>:

More generally there is a construction rule that says to use the first ordered template whose pattern describes a vector that the node {in} is logically within. Thus the construction rule

<xsl:apply-templates/>

will simply delegate the construction for {in} to the appropriate template. If there is no match anywhere, nothing is emitted. Otherwise the appropriate {on} is emitted as described by the construction rules within the template. Templates may cycle and recurse. Again, this construction rule can be qualified by a pattern. The construction rule

<xsl:apply-templates select="PERSON"/>

will create a vector of nodes for all {PERSON} child nodes contained in {in} and then choose the appropriate template to use for the construction rule of each. Note that these might differ. For example there might be the following templates

<xsl:template match ="PERSON[@salary > 50000]" >

<xsl:appl-templates/>

some construction rules

</xsl:template>

<xsl:template match = "PERSON[@salary $le$ 50000]">

<xsl:apply-templates/>

some different construction rules

</xsl:template>

In this case some of the {in}’s fed into the apply-template would be processed by the first template and some by the second. Note. If an {in} to an apply-template matches against more than one template, the first one encountered in the style-sheet wins. Also notice that these templates can handle cycles. If {PERSON} nodes contain {PERSON} nodes and so on, the templates used for processing the first ones encountered will recursively apply the appropriate ones and so on. In short, cycles within the incoming XML pose no problem.

 

<xsl:for-each>:

Everything thus far allows us to construct XML that handles heterogeneity and irregularity and cycles. There are a host of other more minor ones to emit custom XML tags like comments or PI’s or CDATA or entity references or DTD declarations, copy all attributes of {in} to the output node, and so on. None of these really matter. The last construction rule that really matters is one that in many ways will be most familiar to the SQL user. The construction rule

<xsl:for-each select="Some Pattern">…<Construction rules>..</xsl:for-each>

will process in order all of the nodes described by the pattern (see below). Thus the construction rule

<xsl:for-each select="//PERSON">

<Gente><xsl:value-of select="NAME"/></Gente>

</xsl:for-each>

will emit an output node {Gente} for each {in} of type {PERSON} containing the text for the child node {NAME}. At Microsoft, we have extended this syntax to support an explicit ordering so that the construction rule

<xsl:for-each select="//PERSON" order-by="+ NAME">

<Gente>

<Nombre><xsl:value-of select="NAME"></Nombre>

<Anos><xsl:value-of select="AGE"/></Anos>

</Gente>

</xsl:for-each>

will emit for each {in} (which will be limited to all {PERSON} nodes in the input) the {on} {Gente} containing the child node {Nombre} with the text from the child node {NAME} of {in} and the child node {Anos} with the text from the child node {AGE} of {in}, all ordered by name. Notice that the attribute select really should be called "from" in SQL parlance. Why? Well in SQL, FROM describes which records are to be used as input for construction and in XML the closest direct corollary to records are nodes and the pattern describes which nodes are to be used as input for the construction rules.

 

Basic Patterns:

Today, the pattern is used to define a vector of node pointers where the nodes being referenced are those described in the rightmost element of the pattern. For example, the pattern

/PERSON/ADDRESS

will create a vector of node pointers to all {ADDRESS} that are direct children of all {PERSON} nodes that are direct children of the root.

 

Branching:

It is possible to branch. For example, the pattern

/(PERSON | COMPANY)/ADDRESS

will create a vector of node pointers to all {ADDRESS} nodes that are direct children of either a {PERSON} or a {COMPANY}.

 

Predicates:

Each element within the pattern can contain an optional limiting predicate. This filters the nodes at that level. For example, the pattern

/PERSON[@Age == 50]/ADDRESS

will create a vector of node pointers to all {ADDRESS} nodes that are direct children of any {PERSON} node whose AGE attribute is equal to 50. In general, one can describe these patterns as being of the form

(/TagName["["Predicate"]"] | "@"AttributeName))+.

There can only be one "@AttributeName" within the pattern (other than within predicates and within ref’s described below) and it must be at the end of the pattern. This is because there is no structure today in XML attributes.

Recursion:

The patterns can run recursively using the "//" operator. Thus the pattern

//PERSON//ADDRESS

will create a vector of node pointers to all ADDRESS nodes contained at any level of depth within all PERSON nodes contained at any level of depth within the document.

Note. The pattern must, today stop if an attribute is chosen since attributes don’t contain children. For example, the pattern

/PERSON/@AGE/NAME

isn’t legal because there can be no NAME elements that are direct children of the @AGE attribute.

Handling predicates on sub-vectors.

The syntax of the predicate is actually more complex that it first appears. Imagine the following XML:

<PERSON Age=39>

<RELATIVE Title="Uncle" Name="Bob" Age=75">

<RELATIVE Title="Dad" Name="Stanley" Age=71">

<RELATIVE Title="Mom" Name="Anne" Age=69">

<RELATIVE Title="Sister" Name="Virginia" Age=30">

</PERSON>

The pattern

/PERSON[@Age > 40]

is fairly straight-forward. It produces a vector of {PERSON}’s with the age attribute > 40. But what about the pattern

/PERSON[RELATIVE/@Age > 40]

Does this mean all {PERSON} nodes for whom there exists at least one {RELATIVE} child node with the attribute age > 40? Does it mean all {PERSON} nodes for whom all {RELATIVE} child nodes have the attribute age > 40? Does it mean check the first one? Well, patterns define two operators, "any" and "all". The first means find at least one child node for which the predicate holds true. The latter means don’t process unless the predicate holds true for all child nodes. The current implementation default is to pick the first one if neither all or any are specified, but this is probably wrong. Thus the predicate above really should be either

/PERSON[any RELATIVE/@Age > 40] or

/PERSON[all RELATIVE/@Age > 40]

Traversing through IDREF’s:

While you can traverse IDREF’s, the model isn’t transparent. This is the function this paper calls "ref". Thus given the XML

<PERSON WORKING-ON="4"/>, and

<PROJECT Name="CriticalProject 4 ID="4"/>"

the pattern

"//PERSON/ref(@WORKING-ON)/@Name

will return the attribute node with the value "CriticalProject". This pattern first finds all {PERSON} nodes anywhere in the incoming XML. For each one, it uses their "Working-On" attribute to find a referenced node with a matching ID attribute. For each of those, it finds the "Name" attribute. This function can be used whenever a node would be returned. As another example, the pattern

/PERSON/(ref(@work) | ref(@home))/@zip

will return a vector of the zip-codes of any person listing both their work and their home zip codes.

 

NodeName():

NodeName() is a legal function to use within a predicate. It returns the nodeName of the node at that level of the pattern. There is also a construction node <xsl:node-name/> which outputs as text the name of the {in} node.

Note on a current limitation:

Today, today patterns cannot traverse XML-LINK’s. For example, given the XML:

<PERSON><WORKING-ON xml-link="simple" href="?/PROJECT[@ID == 4]"/></PERSON>, and

<PROJECT Name="CriticalProject 4 ID="4"/>"

(we assume here that the query of the locator is expressed using the pattern syntax), it isn’t legal to write the pattern

"PERSON/WORKING-ON/@NAME

to traverse from the {PERSON} through the {WORKING-ON} XML-LINK node to the {PROJECT} node and then reference its @NAME attribute. In other words, the following isn’t legal

"//PERSON/ref(WORKING-ON)/@Name

since the value of the element WORKING-ON isn’t an ID that can be looked up for a node.

 

Differences between patterns that create vectors and that check membership in some vector.

Patterns are slightly different when used for checking for inclusion in a match attribute. Remember the vector associated with a normal pattern is computed from left to right starting with either the {in} node or the entire document if the pattern begins with "/". However, the pattern used for checking membership is evaluated from right to left. First the rightmost element is checked to see if it is the same kind of node as {in}. If it is, then the pattern moves leftwards checking to see if the node’s parent is correct and so on. It is a subtle distinction, but occasionally an important one and will need some enhancing, as we shall see.

For example, in general, when the pattern is used to describe constructing a vector, the pattern

/PERSON[@age > 50]

will find all nodes that are direct children of the root node of type {PERSON} and then filter for those whose attribute "age" has a value greater than 50. But, if being used as a check for inclusion, the pattern

PERSON[@age > 50]

Is used since the pattern is a test of the node to see if it is of the type {PERSON} regardless of what its parent is.

Limitations of XSL

Today, there are a host of limitations of XSL.

Proposed Extensions to XSL

Generalize Links and links traversal.

As discussed elsewhere, much of the data XML will be used to describe is likely to be a graph, not just a tree. XML between XML:LINK and ID/IDREF’s can describe any graph, even a semi-structured one without any particular schema. We need to enhance the patterns to traverse graphs more robustly. Thus the following proposals are made for enhancing XSL.

Extend patterns with **:

Extend the patterns to support /**/ meaning find any child node or attribute. While there are important distinctions between attributes and child nodes, it isn’t reasonable to assume that all XML grammars or instances will make these distinctions. For example, we might encounter XML like the following:

<PERSON id=5, Name="Adam Bosworth worksfor="6" annoys="4">

<Employs>7</Employs>

<Employs>8</Employs>

<Employs>9</Employs>

<Employs>10</Employs>

</PERSON>

<PERSON id=6 Name="DavidV"/>

<PERSON id=4 Name="Bill Gates"/>

This is not recommended XML. The use of implicit IDREF’s as the values inside of elements is put here merely to show what may happen, not what should and why the query language must be able to ask questions over both attributes and elements easily. This extension, "**", in and of itself doesn’t help traversal. It is merely equivalent to "(* | @*)". However, Consider the pattern

/PERSON[Name=’Adam Bosworth’]/ref(**)

This is now a pattern that searches every node that my {PERSON} node either contains or references overtly.

Containment recursion:

Extend Patterns for "path expressions" or recursion.

Generalize the patterns to support recursion on any element or set of elements of the pattern using (element|pattern fragment)+|Number. Thus the pattern

/(part)+/brand

will return all vectors found with one or more containing {part} nodes. Similarly,

/(PERSON | EMPLOYEE | MANAGER)*/Name

will return all the names of any level of containment of {PERSON}, {EMPLOYEE}, and {MANAGER} nodes. See "Path Expressions in the XML-QL" submission for more details.

Extend ref:

Extend ref to automatically traverse through XML:LINK nodes, not just IDREF’s. Given the XML

<PERSON><WORKING-ON xml-link="simple" href="#4"/></PERSON>

<PROJECT Name="CriticalProject 4 ID="4"/>"

the pattern

/PERSON/ref(WORKING-ON)/@NAME

will return the node {@NAME} containing the text "CriticalProject". The query language will automatically traverse through XML-LINK’s whose Xpointer is in pattern syntax so that given the XML

<PERSON>

<WORKING-ON xml-link="simple" href="?XML-PTR=/PROJECT[@Name="Critical]"/>

</PERSON>

<PROJECT Name="Critical 4 ID="4"/><COST units="K">50</COST></PROJECT>

the pattern

/PERSON/ref(WORKING-ON)/COST

will return the {COST} node containing the attribute "units" equal to "K" and the text "50".

 

Extend ref to describe the target node type:

Extend the ref function to take an optional argument describing the pattern that the target node must match. Thus the pattern

/PERSON[@Name=’Adam Bosworth’]/ref(@*,PERSON)

will return a vector of all the {PERSON}’s related to me through any IDREF.

Extend ref to describe both the target node type and the element to be used as the lookup key:

Thus the pattern

/PERSON[@Name=’Adam Bosworth’]/ref(@worksfor,PERSON,@empid)

will return a vector of all {PERSON}’s whose empid attribute value matches that of the my worksfor attribute.

Extend ref to allow recursion:

Extend the ref function to take a fourth optional argument. This argument is used for recursion to handle cyclic relationship. This is necessary because the "//" operator is essentially unbounded in a graph and expensive to compute if one assumes trapping on nodes already visited. Thus the pattern

/PERSON[@Name=’Adam Bosworth’]/ref(@*,PERSON,,5)/@Name

will return all the names of anyone related to me, even 5 levels away. If the number passed in here is "//", then keep searching until all nodes processed or have run out. If there is no initial argument, then search for containment.

Extend patterns to allow "inverse" traversal of an edge:

Extend the patterns to include a "backref" function. This takes the same arguments as ref. This inverts the transition from a known IDREF to an ID. The pattern within it describes a vector of nodes which are presumed to contain IDREF’s to the current node either through the ID attribute of the current node or through an explicit attribute if named. For example, the pattern

/ADDRESS[@zip=’11201’]/backref(@work)/@name

will find the names of all the people working at locations with zip code 11201. This pattern says to find all {ADDRESS}’s with zip attributes with value 11201. Then for each of these, find any node that refers to these {ADDRESS} nodes with an @work relationship. In other words, find all nodes with an "@work" attribute that is an IDREF to an {ADDRESS} node. This will return the {PERSON} nodes which refer to these addresses. Then for each of these {PERSON} nodes, it returns their names.

Extensions for Variables

Extend patterns. Generalize from

(/TagName["["predicate"]"])+

to

(/TagName["["predicate"]"][as $name])+

Call these "names" variables. If they are defined, then rather than a pattern describing a vector of nodes, a pattern describes a vector of tuples of nodes, each named with a variable. Each variable should be thought of as a node that is a stand-in for the node described in the pattern. Let’s look at some examples. Today the following is a legal pattern:

/PERSON[@name=’Adam’]/ADDRESS[@street = ‘9324 SE 57th St"]

But it returns a vector of {ADDRESS} nodes. It doesn’t provide the {PERSON} nodes.

However, the pattern

/PERSON[@name=’Adam’] as $p/ADDRESS[@street = ‘9324 SE 57th St"]

will now return a vector of tuples. Each node "contains" both the {ADDRESS} nodes and {$P} nodes where the {$P} nodes are actually of type {PARENT} nodes. Note that this is different from today’s model where each {in} is an {ADDRESS} node.

These variables since they represent nodes can be used to begin a path. The following pattern

/PERSON[@name=’Adam’] as $p; $p/ADDRESS[@street = ‘9324 SE 57th St’] as $a

will return a vector of tuples of $p nodes and $a nodes.

We use the "$" to prefix variables in order to disambiguate between nodes with a name and variables with a name . For example, given the pattern

/PERSON[@name=’Adam’] as p; p/ADDRESS[@street = ‘9324 SE 57th St’] as a

It would be ambiguous what the second pattern is starting with. Is it asking for all nodes of type {p} or all {PERSON} nodes limited to those with the attribute "name" equal to "Adam"?

Tag Variables:

Some of the proposed query languages have constructed "tag variables" to remember the path or to filter the path followed. For example, the XML-QL submission includes an example where the query finds all publications published in 1995 in which smith is either an author or an editor. Given the existence of the DOM and the nodeName() function within patterns this is technically unnecessary. The XML-QL example

WHERE <$p>

<title>$t</title>

<year>1995</>

<$e>Smith</>

</> IN "www.a.b.c/bib.xml", $e IN [author,editor]

CONSTRUCT <$p>

<title>$t</title>

<$e><Smith</>

</>

would, in XSL, be

<xsl:for-each select="/* as $p/title[year == 1995] as $t/(author|editor) as $e">

<xsl:copy select="$p">

<title><xsl:value-of select="$t"/></title>

<xsl:copy select="$e/>

</xsl:copy>

</xsl:for-each>

However, there are queries that are hard or impossible to do in XSL with this construct. Consider the query in which I want to list of pairs of related people and the relation between them. I can give the pattern

//PERSON as $p1/ref("**",PERSON) as p2

and this will create a vector of tuples of all pairs of related people, but not how they are related. Since XML is as much about semi-structured data as structured data, schema discovery is an important query goal and as such this is a reasonable question. Accordingly, and only until something more elegant comes to mind, this paper proposes the syntax

//pattern element as $name/ref("pattern" as $name,…) as $name…

so that, for example, the pattern

//PERSON as $p1/ref("**" as $relation,PERSON) as $p2

will return a tuple of nodes where the $relation node will hold the linking node if one exists (an attribute or an XML-LINK) and an empty node if the relation is a contains relation between the $p1 and $p2 nodes. As such, of course, we can use these "tag" variables within the "where" attribute to limit relations as we see fit. Similarly, the pattern

//PERSON as $p1/ref("**", $t) as $p2

returns all nodes related to PERSON unless the "where" attribute is used to limit the target types.

 

Add support for one important type of variables, namely parameters:

Include

(<xsl:parameter [prompt=".."] [type="some DCD type"] as="$name"/>)*

inside the main query. This allows, if some other assumptions listed below are met, the parameterization of the query. Each name associated with the "as" attribute will be usable within the query as a variable.

Scoping of variables:

Clearly, once we introduce variables within this architecture, we will have to answer questions about scope. Are these variables global? Do they have nested scoping like Pascal? Is it static or dynamic? Since this entire proposal is nothing but a straw-man to show that XSL can be extended, the paper deliberately doesn’t tackle this question.

Generalize the for-each.

The proposed generalizations below enable the <xsl:for-each> construction node to describe fully general selection, restructuring, reduction, and combination across any set of incoming XML graphs. In order to do this we introduce the idea that each pattern that describes a vector may have an associate "src" attribute which points to some XML graph through a URL. We also introduce the idea that variables may be stand-ins for expressions rather than nodes to allow selection on expressions and construction of nodes with values of these expressions.

Simple form:

The simplest form of generalization for the for-each is roughly the one today and would be:

<xsl:for-each

select="(pattern";")+"

where="boolean expression defined over the names from the xsl:from’s"

/>

Construction rules as normal

</xsl:for-each>

There are several new things here. This form allows multiple patterns separated by semi-colons. If any variables are defined then rather than the pattern producing a vector of nodes, it will produce a vector of tuples of nodes whose "nodeName" will be the variable name. Secondly, the "where" attribute now contains a general predicate with general boolean logic capable of relating the nodes describes by the variables in as complex and extensible a manner as is necessary. Conceptually, the tuples will be produced using a cross product across the vectors unless the vectors directly derive from a variable. For example

<xsl:for-each

select="/PERSON as $p; $p/ref(RELATIVE) as $r"

where "$r/@age < $p/@age"

>

<xsl:copy select="$p">

<YOUNGREL><xsl:value-of select="$r/@NAME"/></YOUNGREL>

</xsl:copy>

</xsl:for-each>

would take as input pairs of nodes $p of type {PERSON} and $r of type {RELATIVE} for all relatives for any given person who were younger than that person and emit {PERSON} nodes containing {YOUNGREL}s with the names of their younger relatives.

General form:

This adds four types of child nodes to <xsl:for-each>: <xsl:from>:, <xsl:vector>, <xsl:order-by> and <xsl:set>. The most general form of the construction rule <xsl:for-each> would be

<xsl:for-each

where="boolean expression defined over the names from the xsl:from’s">

[<xsl:vector [type="Union|Intersection|Xproduct"]>]

(<xsl:vector>…</xsl:vector>|

<xsl:from select="pattern" [as="name"] [src="url"]/>

)+

[</xsl:vector>]

[<xsl:order-by select="pattern" [direction="+|-"] />]*

[<xsl:set value="[pattern/]expression" as="name"]*

Construction rules as normal

</xsl:for-each>

Why generalize "select" from an attribute to a collection of nodes? Well, anytime we have to start parsing attributes for semi-colons, it is a sure sign that the attribute should really be a collection of elements. Thus the "simple form" above should merely be thought of as a convenient abbreviation. The semantics of this construction rule is that it generates a vector of tuples of nodes. The "where" attribute allows the restriction of nodes in the resulting vector across all vectors described by the "xsl:from" elements using their variable names and classic SQL joining syntax. What are these "<xsl:vector> nodes? Well, as Mr. Bray has eloquently described in his submission "Element Sets: A minimal Basis for an XML Query Engine" and as SQL has shown us, there are numerous interesting vector operations that may make sense given a set of <xsl:from> nodes. Hence the "type" attribute. It is deliberately left an attribute so that we have some built in extensibility in this arena. However, we expect the SQL "Crossproduct" to be the most common and thus if "type" is omitted, "Xproduct" is assumed to be the default

For example, given the XML

<PERSON> <Name>Mike Maples</Name>….. <TITLE>VA</TITLE>..</PERSON>

<PERSON> <Name>Adam Bosworth</Name>….. <TITLE>GD</TITLE>..</PERSON>

<PERSON><Name> Istvan Cseri</Name>….. <TITLE>SA</TITLE>..</PERSON>

<TITLES>

<TITLE SN="VA">On vacation</TITLE>

<TITLE SN="SA">Systems Architect"</TITLE>

<TITLE SN="GD">General Drone (aka Manager)</TITLE>

the construction rule

<xsl:for-each where="$p/title= = $t/@sn">

<xsl:from select="/PERSON" as="$p"/>

<xsl:from select="/TITLES/TITLE" as="$t"/>

<xsl:sort select="$p/Name" direction="+"/>

<p>…<xsl:value-of select="$p/Name"/> is a <xsl:value-of select="$t"/>.</p>

</xsl:for-each>

would process all tuples of {PERSON} nodes $p paired with nodes {TITLE} $t restricted to those pairs in which the "sn" attribute of the {TITLE} node matches the {Title} child node of the {PERSON} node. In short it will list out the sentences that describe for each {PERSON} node, each person’s name and title. In general, joins will not be necessary as links support will solve the problem more elegantly. For example, this example could have been written as:

<xsl:for-each select="/PERSON" order-by="+Name">

<p><xsl:value-of select="Name"/> is a

<xsl:value-of select="ref(TITLE,/TITLES/TITLE,@SN)"/>.

<p>

</xsl:for-each>

Remember, the ref here will treat each /PERSON/TITLE as an ID, look up all nodes of type /TITLES/TITLE whose attribute SN matches this ID, and then emit the value of the selected matching node.

The "src" attribute in <xsl:from>:

Each For-each contains not one, but multiple patterns. Each pattern has an optional "in <URL> which allows the joining of data from multiple different data sources.

<xsl:order-by>:

Each For-each may contain a collection of Order-by elements defined across the tuples above where the attribute describes a pattern on them which should return a single value or sequence of values for each row in the cross-product.

<xsl:set>:

The <xsl:set> construction node creates named expressions that occur in the tuples as value nodes. For example

<xsl:for-each select="ORDER/LINEITEM">

<xsl:set value="@price * @quantity" as="$total"/>

<total><xsl:value-of select="$total"></total>

</xsl:for-each>

will emit a {total} node with the total value of each line item.

 

The "Where attribute" and Predicates:

The current XSL language doesn’t support expressions. For example, given the construction rule:

<xsl:for-each select="/ORDER/LINEITEM[(@price * @quantity) < 50]"

isn’t legal XSL today. We propose to fix this in two ways. The "Where" attribute of <xsl:for-each>, as is shown above, takes complex boolean expressions scoped by the variables defined either in the <xsl:from> elements or in the <xsl:set> elements. Thus the following construction rule

<xsl:for-each select="ORDER/LINEITEM" where="$total < 50">

<xsl:set value="@price * @quantity" as="$total"/>

</xsl:for-each>

will process only those {LineItems} whose (@price * @quantity) are less than 50. These should not be necessary for simple patterns so that, in fact, it should be possible to extend the pattern predicates such that the example above could be written as

<xsl:for-each select="ORDER/LINEITEM[(@price * @quantity) < 50">

</xsl:for-each>

 

Node compares versus value compares:

We’ve deliberately not used "=" in this paper, using "==" instead. This is because we wanted to reserve "=" for use as a "value" compare. For nodes that are atomic values, the distinction is moot. But for nodes that are intermediate nodes, one can imagine two sorts of comparisons. The first is an identity comparison asking if the nodes are actually the same. The second is a deep comparison checking the values of all of their sub-elements and attributes, value by value. It is assumed that we would use "=" for this latter meaning and that using it for atomic nodes against literals or expressions would be harmless.

Generalized Cleanup of XSL

Text Operators:

Fix the operators on text to be extensible and, of course, to allow much richer text comparisons such as sounds-like, full regular expressions, and so on. As with aggregates, there should be a built in set that is defined and mandatory and an optional extensible set.

Any and All

Fix any and all to operate on full binary expressions, not just binding to the right. For example,

any(Salary > 50 And Level>3)

will return true if any node in the vector described by the nodes that Salary and Level are scoped against was true for this expression. In other words for at least one one, both must be true.

all(Salary > 50 And Level>3)

will return true only if this expression were true for all nodes in the vector.

Construction Extensions:

Expressions:

As shown above, we added a construct for creating variables from expressions. And once we have the variables, we can emit these nodes. This shouldn’t be necessary if the expression is just to be used and emitted once. But for construction rules, we cannot just enter

<xsl:for-each select="/ORDER/LINEITEM">

<foobar>@price * @quantity</foobar>

</xsl:for-each>

because this would create a set of {foobar} nodes containing the text "@price * @quantity". Accordingly, we should extend path expressions to allow the rightmost element to be an expression. Thus the example above can now be handled using

<xsl:for-each select="/ORDER/LINEITEM">

<foobar><xsl:value-of select="@price * @quantity"/></foobar>

</xsl:for-each>

of course, this highlights a key difference between the use of "select" as a pattern for constructing a vector of nodes and "select" here which uses the pattern to return a single node. This use of select really is different from the other and it is suggested that we deprecate "select" and use "value" instead. There will be times when we want to name these expressions for future reference and times when we want to declare their data-type. Thus it is suggested that the general model for <xsl:value-of> be extended to

<xsl:value-of value="pattern../expression" [as="$name"] [type="A DCD type"]/>.

Sharing a target node:

In the literature, this is commonly known as "fusion". One limitation of this proposed model, even with both the link support and the ability to retrieve and join together data from multiple data sources, is the ability to have separate construction rules target the same {on} or output node. To construct graphs, this is clearly necessary. The general model for so doing is to identify the {om} by a unique ID with the understanding that the query engine will only build one such node and thus this node will be a common target across construction rules. We are going to assume the existence of Skolem functions called SK({}) which yield a unique ID for any input node that they are given. In theory, this is all we need. We could thus describe the ID attribute of the node using the xsl:attribute contruction node. For example, if the Skolen function is called "Skolem(pattern)", the construction rule

<xsl:for-each select="//Employee as $e">

...<Emp>...<xsl:attribute name="id">Skolem($e)</xsl:attribute>..</emp>

</xsl:for-each>

will create an {Emp} node whose ID attribute is uniquely related to the {in} of type {Employee}. Other construction rules targeting an {Emp} node with the same ID attribute would be considered to be targeting the same output node. However, this is a verbose syntax for such a common operation. Accordingly, I suggest a construction node,

<xsl:unique [fcnname="some Skolem function name"] [select="pattern"] [ta="Some target attribute"]/>

to set the parent’s ID attribute. If there is no "select" attribute the {in} node is used for computing the Skolem. If there is no "fcnname" attribute, some default Skolem function is used. If there is no "ta" attribute, the target attribute of the parent node is the ID of the parent {on}. Thus the example above becomes

<xsl:for-each select="//Employee">

...<Emp>...<xsl:unique/>..</emp>

</xsl:for-each>

Referencing created nodes:

It will also be common to want to create edges to existing nodes when creating a graph, whether to existing {in} nodes or to newly created {on} nodes. For example, if I am processing incoming tasks and employees and creating new nodes associated with the tasks, I may well want them to include either references to the original task or to {on} nodes created with <xsl:unique> from the original tasks. The same operator can be used, simply targeting the appropriate attribute to contain the IDREF.

Given the incoming XML

<Employee ID="3" Name="Adam Bosworth" Age="43"..>

<Task AssignedTo="3" Priority="0".Name="Sweep the floor".>

<Task AssignedTo="3" Priority="1" Name="Check the algebra"..>

<Task AssignedTo="3" Priority="2" Name="Design the grammar"..>

the following construction rule would create new {dude} nodes from the incoming {Employee nodes}, new {busy}, {hard}, or {fun} items from the incoming {Task} nodes, and have all of these point to the right {dude} nodes.

<xsl:for-each select="/Employee">

<dude><xsl:unique/><who><xsl:value-of select="@Name"/></who></dude>

</xsl:for-each>

<xsl:for-each select="/Task as $t">

<xsl:if match="$t[@Priority == 0]">

<busy><xsl:unique ta="who" select="$t/ref(@AssignedTo)"<value-of select="$t/@Name"></busy>

</xsl:if>

<xsl:if match="$t[@Priority == 1]">

<hard><xsl:unique ta="who" select="$t/ref(@AssignedTo)"<value-of select="$t/@Name"></hard>

</xsl:if>

<xsl:if match="$t[@Priority == 2]">

<real><xsl:unique ta="who" select="$t/ref(@AssignedTo)"<value-of select="$t/@Name"></real>

</xsl:if>

</xsl:for-each>

this works because the new {dude} nodes are given a unique ID computing from the {in} {Employee} nodes and the new {busy},{hard}, or {fun} nodes have their "who" attribute set to the same ID using the linking work discussed above.

Extensibility:

No matter what expression language we can agree upon, there will always be the need to allow procedural code to play a role in constructing the {on}’s. The obvious model for this is to allow any <xsl:for-each> node to include "events" for any language describing which function should be invoked, what language it is, and a standard model for the event. For example, on can imagine

<xsl:for-each…>

<xsl:event language="Java" "interface="onCreateNode">

<![CDATA[

class handleCreate extends XSLEvent {

public void createNode(DOMNode in, DOMNode out) {

Do some stuff to update out using information from in.

}

public void nodeChanged(…) {…}

}

]]>

</xsl:event>

as a way to define events for a construction rule. The "CDATA" part is odious, of course, but then see changes really required in XML for more on this below.

Aggregates:

Review

Many query languages don’t do a good job of handling aggregates. For example, while it is possible with considerable difficulty in SQL to select all employees who earn more than the average for their level, it isn’t possible to show for each one what that employee’s ratio to the average actually is. It is also hard to impossible in SQL to construct a query that produces, for every day of the year, how many employees have a birthday on that day and who exactly they are anyway. There are essentially four operations required to make all these things possible:

Aggregation:

Aggregation essentially describes a function which given a vector of nodes (usually ordered) computes a single value. It will usually have 3 steps: (initialize, update, retrieve value) that it will go through with the sequence being initialize, for each {in} update, and then retrieve. For efficiency/optimization purposes it will often have a fourth step that allows intermediate results to be passed into it. While we want this architecture to be extensible to include the common types of aggregates not currently included such as Mode, Median, and so on, it seems that there is probably a minimum set that any query language ought to support.

Grouping and scopes:

For any aggregate, there is typically a spanning vector of input values. For example, if the case of employees who earn more than the average for their division, the aggregate "average" or "mean" is computed repeatedly for each vector of employees per division. It might also be interesting to compute an aggregate over a larger scope such as the overall "average" salary of the employees. There the scope is all records. In order to be able to distinguish between these two cases, the expression computing the aggregate value must make it clear what the underlying scope is. However, this is a bit more complex than it might first appear. Imagine that we know the birthdays of all employees and want to show the total employees by month. Is this month across all years or simply within a year? Clearly it matters. Thus the scope must be clear. It is common to think of this being resolved using "Grouping". By "Grouping" I mean a process in which data is, logically speaking, sorted into bins and the aggregates are computed over these bins.

Variables for Aggregates:

Most query languages allow the use of aggregates for restricting data, (e.g. find all employees who earn more than the average for the division), but not the use of the same aggregates for actual construction of values that are emitted. Yet, by far the easiest model for customers to understand in this regard is one in which the aggregates are named and then may be referenced in expressions. For example, if the language is to return the salary that the employee earns as a percentage of the total for the division, the easiest model is to name the aggregate that represents the total for the division and then emit an expression dividing the

employee’s salary by that name.

Define vectors of interest:

As with the birthdays example above, it is common to want to create a query in which a set of values are emitted, even if these values don’t occur in any instance in the database. For example, there may be dates for which no employee has a birthday on that date. But we still might want the query to emit an entry for that date simply noting that no employees were born then. One can work around this in SQL with trouble by creating dummy rows in dummy tables, but it seems that the query language should make it easy to create vectors of interesting values, either derived from the data or simply described as a desired set to start with. One might call this creating virtual rows. These vectors may have fixed starting and ending values or may derive them from aggregates such as the Min or Max over some scope. They may be partitioned by a function (log of the value for example) or by an interval described numerically for numbers, in terms of time for date and time, in terms of text ranges for text over some letter and so on. All these partitioning functions clearly share a common model. They take a starting value, possible derived, an ending value, also possibly derived and emit a vector of nodes with a starting and ending value. We’ll call these two nodes {min} and {max} It seems that it will be useful to have the workings of at least some of these "known" to the query language for optimization purposes.

The proposed model below is inspired by these four requirements.

Proposed Model

Distinct:

Extend <xsl:for-each> and <xsl:from> to allow an optional attribute "distinct" with legal values of "values" or "nodes". Omission means that distinct isn’t required. As an example, the construction rule:

<xsl: for-each select="$D" distinct="values"|"nodes">

<Group>

<Division><xsl:value-of select="$D"></Division>

<EmployeesForDivision>

<xsl:for-each select = "/EMPLOYEE">

<xsl:set value="@Division" as="$D">

….

</xsl:for-each>

</EmployeesForDivision>

</Group>

</xsl:for-each>

would emit a {Group} node for each unique value of @Division for which at least one employee exists. Contained within this {Group} node, would be a {Division} node containing the unique value and an {EmployeesForDivision} node containing the {Employee} nodes for the division.

<xsl:agg>:

Introduce a new construction operator,

<xsl:agg type="some agg fcn" select="pattern" as ="name"/>

this construction operator associates a name given by the "as" attribute to an aggregate computed by a function names in the "type" attribute over a vector of nodes described by the pattern in the "select" attribute. The scope is the range of the containing <xsl:for-each> construction node. The construction rule

<xsl:for-each select="$L" distinct="values" as="$DL">

<xsl:agg type="sum" select="$P" as="$PoverLetter" />

<xsl:for-each select="STATE"…>

<xsl:agg type="sum" select="$P" as="$PoverState"/>

<xsl:for-each select="CITY" as="$c" where="$L == $DL">

<xsl:set value="firstLetter(@Name)" as="$L"/>

<City><xsl:value-of select="@Name"/>

<pop>

<xsl:value-of value="@Population" as="$P"/>

</pop>

<xsl:set value="$P/$PoverLetter" as="$PctLet"/>

<PctLet><xsl:value-of value="$PctLet"/></PctLet>

<xsl:value-of value="$P/$PoverState" as="$PctState/">

<PctState><xsl:value-of select="$PctState"/></PctState>

</City>

</xsl:for-each>

</xsl:for-each>

</xsl:for-each>

will emit XML looking something like

<City>New York<pop>8000000</pop><PctLet>80</PctLet><PctState>60</PctState></City>

..

with nodes for each city containing as a value the percentage the populating of the city is of the total of cities beginning with the same first letter and {pctState} nodes containing as a value the percentage the population of the city is of all cities in the same state that also beginning with the same first letter.

<xsl:range>:

Add a new construction operator for building a vector of nodes,

<xsl:range [type = "some legal type as described by DCD"]

[fcn="name of chunking function"] as="name">

<xsl:start [value=""]>construction rule to emit a start value</xsl:start>

<xsl:stop [value=""]>construction rule to emit a stop value</xsl:stop>

<xsl:count [value=""]>construction rule to emit a count</xsl:count>

<xsl:interval>construction rule to emit a desired interval</xsl:interval>

</xsl:range>

which can define the sorts of vectors described above. The {start} is mandatory, but can be set using the value attribute. Of the {stop}, {interval}, and {count} elements, any two are mandatory. It is assumed that given any two the third can be computed. For example, the construction rule

<xsl:for-each>

<xsl:range type="int" fcn=".." as="$chunk>

<xsl:start value="1"/>

<xsl:stop><xsl:value-of select="$max"/></xsl:stop>

<xsl:count value="100"/>

</xsl:range>

<xsl:agg type="Max" select="$s" as="$max"/>

<rangeofemployees><xsl:copy select="$chunk"/>

<xsl:for-each select="/EMPLOYEE"

where="$s > $chunk/min and $s < $chunk/max">

<xsl:set select="@salary" as="$s"/>

construction rules to emit employee data

</xsl:for>

</rangeofemployees>

</xsl:for>

will emit a collection of {rangeofemployees} nodes, each containing a node with child {min} and {max} nodes and a collection of the nodes produced by the production rules for the employees whose salaries where in the right range. This model makes heavy use of variables in a somewhat new way. The "$s" defined inside of the for-each on {EMPLOYEE}’s, is used both to compute the max for the group in the outer level and to restrict the employees based upon the interval.

 

Group-By

This paper doesn’t, somewhat surprisingly, suggest a new group-by construction node. It may be that we will need one, but it seemed that the ability to select distinct from one’s children was sufficient. Thus it seemed it would just be a convenient abbreviation. If so, doubtless it will emerge, but it wouldn’t have added anything to the paper to suggest it. If the paper is in error on this point, doubtless many of you will let us know.

Issues:

Optimization:

We actually think that this syntax is as theoretically capable of optimization as any other language that produces such complex graphs. That is to say that it is tough, but all optimizers for complex deeply nested trees of correlated sub-queries are tough. However, the construction rules absolutely must not be able to affect the input data and any predicates in construction rules (such as <xsl:if select="..">) must be explicit and know to the optimizer so construction rules aren't extensible unlike aggregates and string comparison functions.

Ordering Relations:

We haven't yet fully addressed the importance of ordering of relations, but generally we want to say that within a given type of relation in the data model, the relation can be such that order matters. the obvious common case is marked up text which is lots of nodes related by "has" relations whose order matters profoundly. But we also think it will be common to point to a list of things through a relation such as "paper" in which the order matters.

Summary

This proposal was not written to put a final stamp on the query language for XML. Or really even a beginning. It was written to highlight the possibility that we can evolve from what we’re shipping today towards a much richer and more graceful query language tomorrow without necessarily having to make all XML users learn two things:

  1. One language for transforming XML into output
  2. Another language for transforming XML into XML

It was also written to try to make more concrete the very large list of work items that the W3C needs to address and to motivate the W3C to kick off a working group to start doing just that. It was written to highlight how much needs to be done yet in defining support for types, for extensibility, for graph construction operations, for schema, for data-models, and for a query language itself. We can do no less. XML’s time has come. It soon will be the ubiquitous way to move data around the net as many of us predicted almost 2 years ago. Now we must start defining ways to make sense out of this data.