A Unified Approach for Querying Structured Data and XML

Serge Abiteboul	Jennifer Widom, Tirthankar Lahiri
serge.abiteboul@inria.fr	{widom,tlahiri}@db.stanford.edu

1. Position

XML is a syntax for exchanging data on the Web, and a query language for XML should provide convenient and expressive access to this data. For effective data exchange it is essential that a data model for XML and its associated query language be well-accepted standards with well-defined, clean semantics. The semantics should capture the present state of XML, should adapt easily to future extensions, and should not ignore the years of research and development in data models and query languages.

First consider the data model. Data found on the Web presents a great deal of diversity, from loosely structured, irregular documents (e.g., HTML home pages) to very structured information (e.g., extracted from relational database management systems). Clearly, XML's underlying data model should permit very loosely structured data. However, we believe that the model should equally encompass well-structured data, since applications can take full advantage of rich data sources only if they understand the semantics of these sources.

Now consider the query language. The language should provide standard keyword-based pattern search (based on full-text indexing), as found in information retrieval and standard search engines on the Web. It should also include some form of navigation through XML tag structure. We go one step further and say that the language should provide declarative and expressive access to structured portions of the data in the standard database style, e.g., using SQL, OQL, or a variant thereof. A clear, formal semantics is also required. We believe that OQL makes an appropriate basis for an XML query language.

Therefore, instead of inventing yet another new data model and query language, we propose to start with the accepted de facto standard for object databases, the ODMG model and its query language OQL [Cat94]. What must be achieved is an extension of this model and language to handle loosely structured data (i.e., lack of typing), pattern search, and navigation (following tags or links). It should be noted that the object flavor of the model and language facilitates the task of specifying standard functionalities in the direction of DOM, XLL, etc.

An extension of ODMG along these lines has already been designed and implemented in the Ozone system that we now describe briefly. Full details on Ozone can be found in [LAW98].

2. Ozone

Ozone is implemented on top of Ardent Software's O₂ system, an ODMG-compliant object-oriented DBMS. An implementation based on an object-relational system and SQL3 is under consideration. Note that the description of Ozone in [LAW98] is based on the Object Exchange Model [AQM+97,PGMW95], a data model predating but very similar to XML. We describe Ozone here in terms of XML directly.

2.1 Data Model

The Ozone data model extends ODMG. ODMG has all the necessary features of object-orientation: classes encapsulating attributes and methods, subtyping, and multiple inheritance allowing complex class hierarchies. All values in an ODMG database must have a valid type defined in the schema. The possible types in ODMG are: atomic types (integer, float, string, etc.), structured types (records), collection types (sets, bags, lists, and arrays), and user-defined classes (with inheritance, methods, etc.).

Ozone extends ODMG with a new class XML to represent untyped XML information. Objects in the class XML are of two categories: XMLcomplex and XMLatomic. (Note that these categories do not represent subclasses in the strict sense of the term, since XML objects are untyped.) An XMLcomplex object encapsulates (i) a list of (tag,value) pairs and (ii) a list of (attribute,value) pairs, where values are XML objects. Allowing structured, strongly-typed ODMG objects to have components of type XML provides "crossover points" from structured to unstructured information.

The value of an XMLatomic object may have any valid ODMG type. Thus, in addition to the ODMG atomic types integer, float, string, etc., the value of an XMLatomic object may for example be of type Product (a class in the schema), list(tuple(a:integer, b:Product, c:XML)), etc. Intuitively, atomic XML objects are untyped containers for typed values. Note that an XMLatomic object containing an object of type XML is interpreted as a reference (IDREF). Allowing atomic XML objects to encapsulate arbitrary structured data provides "crossover points" from unstructured to structured information.
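The two categories of XML objects and the two kinds of crossover point can be illustrated with a minimal Python sketch. This is not Ozone's implementation (Ozone is built on an ODMG database); the class layout, the Product fields, and the helper is_reference are assumptions made only for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple, Union


@dataclass
class XMLAtomic:
    """Untyped container for one typed value (integer, string, class instance, ...)."""
    value: Any


@dataclass
class XMLComplex:
    """Lists of (tag, value) and (attribute, value) pairs; each value is an XML object."""
    children: List[Tuple[str, "XMLObject"]] = field(default_factory=list)
    attributes: List[Tuple[str, "XMLObject"]] = field(default_factory=list)


XMLObject = Union[XMLAtomic, XMLComplex]


@dataclass
class Product:
    """Hypothetical strongly typed class from a schema."""
    name: str
    price: float
    notes: XMLObject  # crossover point: structured -> unstructured


def is_reference(obj: XMLObject) -> bool:
    """An atomic object whose value is itself an XML object acts as a reference."""
    return isinstance(obj, XMLAtomic) and isinstance(obj.value, (XMLAtomic, XMLComplex))


# Untyped review data attached to a typed Product.
review = XMLComplex(children=[("author", XMLAtomic("Ann")), ("rating", XMLAtomic(4))])
widget = Product(name="widget", price=9.5, notes=review)

# Untyped catalog: one entry wraps a typed Product (unstructured -> structured),
# the other wraps an XML object and is therefore read as a reference.
catalog = XMLComplex(children=[("item", XMLAtomic(widget)), ("see", XMLAtomic(review))])
```

In this sketch the "item" entry holds a typed value inside an untyped container, while the "see" entry is a reference in the sense described above, so is_reference distinguishes the two.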

We must also consider how DTDs affect our data model and query language. We can certainly represent a document that is valid with respect to a particular DTD using our untyped XML class, but in that case the structural information provided by the DTD is lost. Alternatively, we can exploit the DTD to represent the data within the rich type system of ODMG, as described in [ACC+97] (although to be precise we note that it is necessary to extend ODMG with type union in order to do so).
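To see why type union is needed, consider a DTD content model that contains an alternation. The following Python sketch maps a small DTD fragment to typed classes; the DTD itself, the class names, and the helper figure_sources are hypothetical and are not taken from [ACC+97].

```python
from dataclasses import dataclass
from typing import List, Union

# Hypothetical DTD fragment used only for illustration:
#   <!ELEMENT section (title, (paragraph | figure)*)>
#   <!ELEMENT title (#PCDATA)>
#   <!ELEMENT paragraph (#PCDATA)>
#   <!ELEMENT figure EMPTY>
#   <!ATTLIST figure src CDATA #REQUIRED>


@dataclass
class Paragraph:
    text: str


@dataclass
class Figure:
    src: str


# The alternation (paragraph | figure) has no single record type;
# it is modeled as a union of the two element types.
Block = Union[Paragraph, Figure]


@dataclass
class Section:
    title: str
    body: List[Block]  # the repetition (...)* becomes a list of the union type


def figure_sources(section: Section) -> List[str]:
    """Typed access: collect the src attribute of every figure in the section."""
    return [block.src for block in section.body if isinstance(block, Figure)]


intro = Section("Intro", [Paragraph("Some text."), Figure("plot.png")])
```

Without a union type, the list element type would have to fall back to an untyped representation, which is exactly the loss of structural information described above.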

2.2 Query Language

The query language for the Ozone system is OQL-S. OQL-S is not a new query language--it is syntactically identical to OQL. Furthermore, the semantics of OQL-S on structured data is identical to OQL on standard ODMG data. However, OQL-S extends OQL with additional semantics (based largely on the Lorel query language for semistructured data [AQM+97]) that allow it to access unstructured XML data. Furthermore, the semantics of OQL-S allows smooth navigation from structured data to unstructured data and vice-versa, and provides intuitive evaluation of expressions that mix the two kinds of data. OQL-S provides navigation of unstructured data without the possibility of run-time errors, while providing full type-checking support for the structured portions of the data.
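The navigation behavior can be approximated by the following Python sketch, in which a path that does not match the untyped data yields an empty result instead of an error. It mimics the behavior described above and is not OQL-S itself; the functions step and navigate, the dotted path syntax, and the Product class are assumptions. Static type checking of the structured portions is not modeled.

```python
from dataclasses import dataclass, field, fields, is_dataclass
from typing import Any, List, Tuple


@dataclass
class XMLAtomic:
    value: Any  # untyped container for a typed value


@dataclass
class XMLComplex:
    children: List[Tuple[str, Any]] = field(default_factory=list)


@dataclass
class Product:  # hypothetical typed class
    name: str
    reviews: XMLComplex  # crossover into untyped data


def step(obj: Any, label: str) -> List[Any]:
    """Follow one label from obj; a label that does not match yields []."""
    if isinstance(obj, XMLAtomic):
        # Unwrap the container and continue in the value it holds.
        return step(obj.value, label)
    if isinstance(obj, XMLComplex):
        return [value for tag, value in obj.children if tag == label]
    if is_dataclass(obj) and label in {f.name for f in fields(obj)}:
        return [getattr(obj, label)]  # attribute of a typed object
    return []


def navigate(root: Any, path: str) -> List[Any]:
    """Evaluate a dotted path, collecting every object reachable along it."""
    current = [root]
    for label in path.split("."):
        current = [nxt for obj in current for nxt in step(obj, label)]
    return current


reviews = XMLComplex([
    ("review", XMLComplex([("rating", XMLAtomic(4))])),
    ("review", XMLComplex([("author", XMLAtomic("Ann"))])),  # no rating here
])
widget = Product("widget", reviews)
catalog = XMLComplex([("item", XMLAtomic(widget))])

ratings = [a.value for a in navigate(widget, "reviews.review.rating")]
missing = navigate(widget, "reviews.review.price")
names = navigate(catalog, "item.name")
```

Here the first path moves from a typed Product into untyped review data, and the last moves from an untyped catalog into a typed Product. The second review has no rating and is simply skipped.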

3. Conclusions

Our conclusions are twofold:

In designing a data model and query language for XML we should not ignore years of research and development in the database community. ODMG and OQL appear to be good candidates for XML.

An appropriate data model and query language should allow clean integration of structured data together with XML. Again, ODMG and OQL (as extended in, e.g., the Ozone system described above) are good candidates.

References

[ACC+97] S. Abiteboul, S. Cluet, V. Christophides, T. Milo, G. Moerkotte, and J. Simeon. Querying documents in object databases. International Journal on Digital Libraries, 1(1), pp. 5--19, 1997.

[AQM+97] S. Abiteboul, D. Quass, J. McHugh, J. Widom, and J. Wiener. The Lorel query language for semistructured data. International Journal on Digital Libraries, 1(1), pp. 68--88, 1997.

[Cat94] R.G.G. Cattell, editor. The Object Database Standard: ODMG-93. Morgan Kaufmann, San Francisco, California, 1994.

[LAW98] T. Lahiri, S. Abiteboul, and J. Widom. Ozone: Integrating structured and semistructured data. http://www-db.stanford.edu/pub/papers/ozone.ps

[PGMW95] Y. Papakonstantinou, H. Garcia-Molina, and J. Widom. Object exchange across heterogeneous information sources. In Proceedings of the Eleventh International Conference on Data Engineering, pp. 251--260, 1995.