XML-GL: A Graphical Language for Querying
and Reshaping XML Documents

Stefano Ceri, Sara Comai, Ernesto Damiani, Piero Fraternali, Stefano Paraboschi, Letizia Tanca
Email: ceri,comai,fraterna,parabosc,tanca@elet.polimi.it, edamiani@crema.unimi.it
Politecnico di Milano, Dipartimento di Elettronica e Informazione
Università di Milano, Polo di Crema

Abstract:

We present XML-GL, a graphical query language for XML documents. XML-GL derives from GLog [1,2] a general purpose, logic- and graph-based language for querying structured and semi-structured data. Here we list a number of interesting requirements for a query language for XML documents, and give a few examples of XML-GL features addressing such requirements. 1

The Requirements for an XML Query Language

  XML is a recent recommendation of the World Wide Web Consortium for a meta-language to define markups for content publishing on the Web. Derived from SGML, XML allows document producers to define and use the set of tags that best mirrors the structure and conceptual properties of the information they want to publish.

The use of XML will bring a major change in the structure of Web information, which will become more and more a collection of semistructured objects, i.e., pieces of content for which at least a partial representation of structure (known as schema in the database lexicon) is available. This evolution brings forth the necessity of novel languages for extracting information from XML content, much in the same way as traditional query languages (notably SQL) have been used for extracting information from structured data, e.g., relational database.

We believe that the definition of a query language for XML must be associated with the study and production of representation tools for metadata. Indeed, an XML document can be compliant to a Document Type Definition (DTD), that specifies the types of markup elements that can appear in the document, their attributes and their containment relationships. If an XML document adheres to a DTD, it is said to be valid. On the other hand, XML documents need not be valid; if they lack a DTD but respect some syntactic rules for tag placement they are said to be well-formed. The notion of DTD in XML well lends itself to be interpreted as a schema. Schema availability is very useful for query formulation as well as for query optimization; thus, we will list in the sequel requirements both for a query language and for a metadata representation language.

1.
The language should be flexible enough to allow query formulation both for valid and well-formed documents. Availability of a DTD should result in a facilitation of both expressing and evaluating a query in XML-GL.
2.
We want to address queries to a whole WEB site, taking into account links between different documents, rather than querying only one document at a time.
3.
We want the possibility to access metadata as well as data, by using the same query paradigm.
4.
We want to support the extraction of an XML document from a query specific subsets or rearrangement of the information present in the source XML document, by means of powerful pattern matching and string manipulation operations.
5.
We want to be able to reshape the source XML documents, by specifying new links between existing elements as well as new elements in the XML document resulting from a query. This last process requires the introduction of new tags in XML.
6.
We want to specify regular expressions on paths, i.e. the possibility of following recursively arbitrarily long paths in a site, possibly specifying conditions on path nodes. This feature is particularly helpful in the case the search is directed towards certain document patterns, that may be placed arbitrarily in the XML source.
7.
We also want to support arbitrary computations on the numeric content of documents by means of built-in functions.
8.
Given that XML is positional, we may want to allow an ordered query interpretation, i.e. one in which the positions of XML tags, items and data is meaningful. However, such an interpretation may be overly restrictive, so the query language should also offer an unordered interpretation (probably as default).
9.
In addition, under given query interpretations, XML-GL should be able to extract similar documents to the ones that fully satisfy the queries.
10.
Finally, we want our language to be readable, and to allow an easy reading of the metadata, i.e., an easy interpretation of the DTD.

The specific novelty of XML-GL with respect to other proposals for querying XML documents, like XML-QL and XSL, is that queries are formulated visually using a graph-based formalism close to the structure of XML documents (e.g., comparable to the visual representations of XML documents offered by XML authoring tools like the Near & Far XML editor). However, XML-GL is not a visual interface over a conventional, textual query language, but a graph-based query language with both its syntax and semantics defined in terms of graph structures and operations.

XML-GL

  We introduce a data model for XML documents, called XML-GDM (XML Graphical Data Model), which we use to represent both the expected structure of XML documents (i.e., their DTDs) and actual documents. The data model has a graphical representation: a syntax directed translation produces graphical schemas of DTDs or of documents. We envision a typical use of XML-GL in which users are presented with the graphical representation of the DTD and can produce queries in XML-GL with a WYSIWYG interface, with suitable drag and fill commands.

Well-formed documents without a DTD can be queried in XML-GL in the same way as valid documents, thus addressing our first requirement. However, since no information about the structure of the document is available at query formulation time, querying of documents without DTD might require flexible interpretations of results. According to our second and third requirements, an XML query is addressed to all the documents in a site, and can be used for querying the DTD.

To present XML-GDM, we introduce an example of a DTD and an XML document conforming to it. Figure 1 represents a DTD which specifies the structure of documents containing book orders, together with the corresponding XML-GDM schema. A possible XML document conforming to this DTD is shown in Figure 2.


 
<!ELEMENT order (shipto, contact?, item+, date)> 
<!ATTLIST order number PCDATA #REQUIRED> 
<!ELEMENT shipto (fulladdress|reference)> 
<!ELEMENT contact (reference|PCDATA)> 
<!ELEMENT fulladdress (company, addressline+, city)> 
<!ELEMENT reference EMPTY> 
<!ATTLIST reference customer IDREF> 
<!ELEMENT person (firstname?,lastname,fulladdress)> 
<!ATTLIST person id ID> 
<!ELEMENT company PCDATA> 
<!ELEMENT addressline PCDATA> 
<!ELEMENT city PCDATA> 
<!ELEMENT date (day, month, year)> 
<!ELEMENT day PCDATA> 
<!ELEMENT month PCDATA> 
<!ELEMENT year PCDATA> 
<!ELEMENT item (book, quantity, discount?)> 
<!ELEMENT book (isbn,title?,price)> 
<!ELEMENT isbn PCDATA> 
<!ELEMENT title PCDATA> 
<!ELEMENT price PCDATA> 
<!ELEMENT quantity PCDATA> 
<!ELEMENT discount PCDATA>

  
Figure 1: DTD and XML-GL schema of the Running Example

 
<?xml version="1.0" standalone="no" encoding="UTF-8"?> 
<DOCTYPE ORDER SYSTEM "order.dtd"> 
 
<ORDER> 
 <SHIPTO><REFERENCE customer="C00001"></REFERENCE></SHIPTO> 
 <CONTACT>Tim Bell</CONTACT> 
 <DATE><DAY>14</DAY><MONTH>11</MONTH><YEAR>1998</YEAR></DATE>
 <ITEM> 
  <BOOK><ISBN>15536455</ISBN><TITLE>Introduction to XML</TITLE> 
   <PRICE>24.95</PRICE> 
  </BOOK> 
  <QUANTITY>6</QUANTITY> 
  <DISCOUNT>.40</DISCOUNT> 
 </ITEM> 
 <ITEM> 
  <BOOK><ISBN>15532155</ISBN><TITLE>Introduction to Internet</TITLE> 
   <PRICE>22.50</PRICE> 
  </BOOK> 
  <QUANTITY>10</QUANTITY> 
  <DISCOUNT>.42</DISCOUNT> 
 </ITEM> 
</ORDER> 
 
<ORDER> 
 <SHIPTO> 
  <FULLADDRESS><COMPANY>ASA</COMPANY> 
   <ADDRESSLINE>18 Harvard str.</ADDRESSLINE> 
   <CITY>Los Angeles</CITY> 
  </FULLADDRESS> 
 </SHIPTO> 
 <CONTACT><REFERENCE customer="C00002"></REFERENCE></CONTACT> 
 <DATE><DAY>20</DAY><MONTH>11</MONTH><YEAR>1998</YEAR></DATE>
 <ITEM> 
  <BOOK><ISBN>15536455</ISBN><TITLE>Introduction to XML</TITLE> 
   <PRICE>24.95</PRICE> 
  </BOOK> 
  <QUANTITY>6</QUANTITY> 
  <DISCOUNT>.40</DISCOUNT> 
 </ITEM>  
 <ITEM> 
  <BOOK><ISBN>15532155</ISBN><TITLE>Introduction to Internet</TITLE> 
   <PRICE>22.50</PRICE> 
  </BOOK> 
  <QUANTITY>10</QUANTITY> 
  <DISCOUNT>.42</DISCOUNT> 
 </ITEM> 
</ORDER> 
 
<PERSON id="C00001"> 
<LASTNAME>Moore</LASTNAME><FULLADDRESS><COMPANY>ABC</COMPANY> 
<ADDRESSLINE>10 Michigan str.</ADDRESSLINE><CITY>Los Angeles</CITY></FULLADDRESS> 
</PERSON> 
<PERSON id="C00002"> 
<LASTNAME>Andrews</LASTNAME><FULLADDRESS><COMPANY>ASA</COMPANY><ADDRESSLINE>18 Harvard 
str.</ADDRESSLINE><CITY>Los Angeles</CITY></FULLADDRESS> 
</PERSON>



   Figure 2: Document of the Running Example

An XML-GL query can be applied either to a single XML document or to a set of documents, e.g., those composing a WWW site. The query produces a new XML document as the result. Thus, the application of a query results in a transformation of the source XML document(s) into a new XML document.


An XML-GL query consists of four parts:

1.
The extract part identifies the target of the query, by indicating the target documents and the target elements inside these documents; by drawing a parallel with SQL, the extract part can be seen as the counterpart of the from clause, which establishes the relations targeted by the query.
2.
The match part (optional) specifies logical conditions that the target elements must satisfy in order to be part of the query result; continuing the parallel with SQL, the condition part can be seen as the counterpart of the where clause, which chooses the target tuples part of the result.
3.
The construct part specifies the structure of the result document; the same query can be formulated with different contruction parts, to obtain results formatted differently. With respect to SQL, the construct part can be seen as the extension of the create view statement, which permits the user to design a new relation from the result of a query. The construct part allows both the creation of new elements, the definition of new links, and the restructuring of information local to a given element, thus satisfying the fourth and fifth requirements.
4.
The clip part (optional) specifies the elements from the target documents to be retained in the constructed result. If no clipping is specified, elements of the target documents that are part of the result are considered together with all the subelements they may recursively contain. With respect to SQL, the clip part can be seen as the extension of the select clause, which permits the user to define which columns of the result tuples should be retained in the final output of the query.
Note that currently XML-GL is only a query language. Thus, in XML-GL we are now defining new documents and not modifying existing ones.

Some examples illustrate the main features of XML-GL; the first one shows the simplest form of XML-GL query: the Match-Construct.

Select all the person elements from document orders.xml.


  
Figure 3: Example of match-construct query.

The query is represented in Figure 3. Assuming that orders.xml is the document of Figure 2, the query result is:

 
<RESULT> 
  <PERSON id="C00001"> 
    <LASTNAME>Moore</LASTNAME> 
    <FULLADDRESS> 
      <COMPANY>ABC</COMPANY> 
      <ADDRESSLINE>10 Michigan str.</ADDRESSLINE> 
      <CITY>Los Angeles</CITY> 
    </FULLADDRESS> 
  </PERSON> 
 
  <PERSON id="C00002"> 
    <LASTNAME>Andrews</LASTNAME> 
    <FULLADDRESS> 
      <COMPANY>ASA</COMPANY> 
      <ADDRESSLINE>18 Harvard str.</ADDRESSLINE> 
      <CITY>Los Angeles</CITY> 
    </FULLADDRESS> 
  </PERSON> 
</RESULT>

The query is a pair of related graphs. The graph on the left describes the extract and match components; in this simple example, this graph includes the target object of the query (element person in this case), which may optionally contain the indication of the document or set of documents to which the query is targeted. Target documents may be identified using URLs with wildcards, so that the string ``http://www.elet.polimi.it/ceri/ord*.xml''inside the person box would target the query to all .xml documents of the ceri directory whose name starts with the string ``ord''. Without this identification, the query target is the whole site.

The graph on the right describes the construct and clip parts. In the case of Figure 3, the triangle denotes the simplest form of document construction: all elements found by the match part are inserted one after the other into the predefined result tag.

A more complex query follows:

Select all orders not containing any item priced more than 25$ and shipped to an address in Los Angeles (Figure 4).


  
Figure 4: Example of a more complex query.

The query result is the following:

 
<RESULT> 
 <ORDER> 
  <SHIPTO> 
   <FULLADDRESS><COMPANY>ASA</COMPANY> 
    <ADDRESSLINE>18 Harvard str.</ADDRESSLINE> 
    <CITY>Los Angeles</CITY> 
   </FULLADDRESS> 
  </SHIPTO> 
  <CONTACT><REFERENCE customer="C00002"></REFERENCE></CONTACT> 
  <DATE><DAY>20</DAY><MONTH>11</MONTH><YEAR>1998</YEAR></DATE>
  <ITEM> 
   <BOOK><ISBN>15536455</ISBN><TITLE>Introduction to XML</TITLE> 
    <PRICE>24.95</PRICE> 
   </BOOK> 
   <QUANTITY>6</QUANTITY> 
   <DISCOUNT>.40</DISCOUNT> 
  </ITEM>  
  <ITEM> 
   <BOOK><ISBN>15532155</ISBN><TITLE>Introduction to Internet</TITLE> 
    <PRICE>22.50</PRICE> 
   </BOOK> 
   <QUANTITY>10</QUANTITY> 
   <DISCOUNT>.42</DISCOUNT> 
  </ITEM> 
 </ORDER> 
</RESULT>

Besides the basic features shown by these examples, XML-GL addresses the remaining requirements by allowing:

Finally, the examples shown already demonstrate that readability is a major feature of XML-GL, since a graph-based formalism is well suited to the representation of hypermedia information.

References

1
Jan Paredaens, Peter Peelman, Letizia Tanca: G-Log: A Graph-Based Query Language. TKDE 7(3): 436-453 (1995)

2
Jan Paredaens, Peter Peelman, Letizia Tanca: Merging Graph-Based and Rule-Based Computation: The Language G-Log. DKE 25(3): 267-300 (1998)


Footnotes

...requirements.
A specification of XML-GL will be available at the WEB site of the authors by the time of the workshop.