A query language is essential, if XML is to serve effectively as an exchange medium for large data sets. The design of query languages for XML is in its infancy, and the choice of a standard may be governed more by user acceptance than by any understanding of underlying principles. One would hope that expressive power, performance, and compatibility with other languages will be considered in choosing among alternatives, but it is likely that several contenders will co-exist for some time.
It is worth observing that, during the 20-year development of relational query languages, several competing languages were developed; and even today there are several relational query language standards. In spite of this, a great deal of technology was developed that was independent of the surface syntax of a query language. This included technology ``below'' the language such as efficient execution models and work ``above'' the level of language - such as techniques for view definition and maintenance, triggers, etc. At Penn we are working on some of these language-independent issues. We include a summary of them here. They include execution and data models to support XML and semistructured query languages; the use of schemas and constraints in optimizing XML query languages; and tools for extracting data form existing sources and presenting it as XML.
Data extraction, transformation, and integration are all well-understood database problems. Their solution relies on a query language, either relational (SQL) or object-oriented (OQL). These query languages do not apply immediately to XML, because the XML data differs from traditional relational or object-oriented data. XML data, however, is very similar to a data model recently studied in the research community: the semistructured data model. Several research query languages have been designed and implemented for semistructured data [6,10,2]. XML-QL [9] applies the experience accumulated in the design and implementation of these languages to query XML documents using pattern matching.
There are differences between the rather general model used in semi-structured data, and the XML model. While this may not affect the surface syntax of the query language, it will certainly affect the semantics of query languages and the optimizations that are possible. One of the research areas at Penn is the investigation of new models for semistructured data that may be more appropriate to XML. In particular, the principle that paths uniquely identify (nested) elements is not supported by current models for semi-structured data, and an alternative model is needed.
Among the various proposals for structuring or adding semantics to XML, a number advocate the need for constraints [5,12,11,1] A class of integrity constraints, called path constraints, has been introduced and studied in [7]. These constraints are capable of expressing natural integrity constraints that are not only a fundamental part of the semantics of the data, but are also important in query optimization. In particular, they are useful for specifying and querying XML documents.
For example, let us assume that we have a set of projects (Proj) and a set of departments (Dept). Each department is involved in one or more projects (element DProj); each project is managed by one unique department (element PDept). Eventhough the cyclicity of element nesting requires the use of IDREF attributes, DProj and PDept will be treated as nested element tags in the following examples. The following are path constraints, which describe a typical inclusion constraint and an inverse relationship between projects DProj and departments PDept.
(INCL): <constraint>
<inclusion path = "Dept.DProj" memberOf ="Proj" />
</constraint>
(INV): <constraint>
<prefix path = "Dept" />
<inverse path = "DProj" inverseOf = "PDept"/>
</constraint>
These constraints state that:
For each <
Dept>
element, and each of its <
DProj>
element dp, dp is also a <
Proj>
element.
For each <
Dept>
element d and each of its project elements
<
DProj>
p, p is a <
Proj>
element such that its <
PDept>
element is d.
where <Dept> <DProj>$dp</> <DName>$dn</> </Dept> <- DB, <Proj>$dp</> <- DB, <Budget>$b</> <- $dp, <PName>$pn</> <- $dp, $b > 10000 collect <PName>$pn</> <DName>$dn</A> |
where <Dept> <DName>$dn</A> <DProj> <Budget>$b</> <PName>$pn</> </DProj> </Dept> <- DB, $b > 10000 collect <PName>$pn</> <DName>$dn</A> |
where <Proj> <Budget>$b</> <PName>$pn</> <PDept.DName>$dn</> </Proj> <- DB, $b > 10000 collect <PName>$pn</> <DName>$dn</A> |
Original query Q | Q' (=Q with INCL) | Q'' (=Q qith INV) |
To see how these constraints help in query optimization, consider the XML-QL [9] queries in figure 1 (in XML-QL, IDREF attributes are dereferenced implicitely and treated as nested element tags).
Query Q, returning the names of projects with budget over 10000 and their corresponding departments, performs two iterations over the database, but its evaluation can be optimized to execute just a single iteration over the database: query Q' (equivalent to Q due to INCL) iterates over Dept entries, while query Q'' (equivalent to Q' due to INV) iterates over Proj entries.
Updating documents is currently done in a very different fashion to database updates. Structural editors guarantee some integrity of the document structure, but they do not ensure any kind of consistency with schemas or type descriptions [13]. Just as we need languages for querying XML documents, we need languages for updating them.
Typically, information sources on the Internet are shared with many other users. Often, users keep local copies of documents to make querying more efficient. In this case, changes to the original information source needs to be propagated in an efficient way. Downloading the entire documents from the source - whether they were updated or not , is often quite inefficient. Large scientific information sources, such as genetic databases (e.g. Swissprot and GenBank [4,8]), therefore provide information about what changes have been made to the database in a given time span. With availability of large data sets specified in XML, there will be a greater demand for a generic language for specifying updates for XML data.
Another important issue related to data evolution is to store information about the changes made in the information source. The time, date, the author, and the motivation of the update are often relevant information stored.
For example, in public genetic databases, such as Swissprot or GenBank [4,8], DNA sequences or protein information are changed frequently, since more structural information becomes available over time. Furthermore, DNA sequences are annotated with other, additional information that is obtained by complex statistical data analysis.
XML already provides a syntax for annotating data. We are also interested in a solution for annotating changes to data. This is in the same spirit as ``mend'' annotation proposed for XML[15]. We seek a solution that is simple and sufficiently general to capture changes in semi-structured data as well as conventional databases so that one could answer questions such as: "Where did this data item come from?", "How did it evolve to this state?". This would require a system of annotations that is capable of describing data at different levels of granularity. One reason for maintaining such information is that corrections and errors may be traced back to the originator in an straight-forward fashion.
A larger problem is to be able to extract structure from HTML documents and export it into other formats. This is extremely useful for data inter-operation (between Web sources and legacy databases or among Web sources themselves), automation of Web information processing and data migration. From a database perspective, it means to build wrappers for Web sources.
The WysiWyg Web Wrapper Factory developed at Penn is an attempt to offer a toolkit to build such Web wrappers. In our case, extracting information for HTML documents consists of: (1) building a parse tree (according to the Document Object Model like in [3,16])) and (2) evaluating extraction rules. Extraction rules consist of path expressions [2] along the tree and are expressed using the HTML Extraction Language (HEL).
The language comes with some powerful operators (navigation operators, regular expressions, conditions) to reach information at the tag level and deeper, and permits to return some nested structures in order to truly capture the information expressed in the document. As an illustration, should we want to return the list of pairs (url , name) for hyperlinks pointing to a .edu domain in a Web document, we could write the following rule (see [14] for details):
links = html->a[i:*] ( .txt # .getAttr(href) ) where html->a[i:*].getAttr(href) =~ "http://[^/]+.edu";The extracted structure here is a list of pairs that can be mapped into any user-defined data-structure: relational database, XML, etc. The major difference between HEL and [3,16] is that we can extract arbitrarily complex structures from the Web document, and not just strings.
To facilitate the writing of extraction rules we also offer a
Wysiwyg interface that takes a Web document and add some invisible
annotations: for each piece of text, the corresponding path is encoded
in a new tag. The document is returned to the user who can click from
his browser on any piece of information and get magically the
corresponding extraction rule.
The HEL language and this visual approach can also be applied to XML
documents.
We have been using successfully the toolkit to build a large range of wrappers for diverse Web resources like knowledge repositories, Web catalogues and databases. Most of the time, we came up with a robust up-and-running wrapper within minutes thanks to our wizards. The maintenance has also been quite easy. We have been using the toolkit together with XML-QL (by exporting XML structures) to have a powerful query language for Web sources.
There are a number of issues surrounding query languages for XML that are just as important as the query language itself. We have described some of them: the semantics of the underlying semistructured data model, the basic operations on data, the interaction of these operations with contraints, the nature of updates, and the problems of generating XML efficiently from existing sources.
We would like to thank our colleagues in the database group at Penn for their input.