Query languages for scientific data

Author

Peter Vanderbilt, pv@nas.nasa.gov, a member of NASA Ames' Data Analysis group.

Abstract

The goal of this position paper is to provide some goals and requirements for XML query languages which will, in the author's opinion, make them more generally useful. While these requirements stem from thinking about query languages for access to scientific data, it is believed that they also apply to applications in general.

The key requirements are:

These will be discussed in the following sections. The organization of this paper is:

1.  Properties of Data Sources

It is difficult to generalize about scientific and engineering data because different scientific disciplines have very different data requirements. For our purposes it suffices to say that there are datasets and metadata. A dataset is obtained by experiment, observation or simulation. Examples of datasets include output from measuring devices, output from still and video cameras, text files created by humans, and output from simulations run on supercomputers.

Metadata is data associated with a dataset, typically identifying that dataset in various ways. Metadata can include parameters used in running an experiment or simulation. It can also include information derived from a dataset. The distinction between the dataset and its metadata is often not well-defined, either logically or in terms of physical data structures.

The way we view data sources is that they:

Our key requirement is to have uniform access to all these kinds of data sources.

2.  Approach

Our approach is as follows.

An instance of the data model is a value that represents the state of the data source (including all the metadata and datasets). The evaluation of a query against such an instance returns some subset of that instance, perhaps with reformatting or processing applied.

Note that a data model instance and the evaluation of a query against it are logical concepts, a way to abstractly describe the behavior of such a system. An actual implementation of a data source need not materialize the data model instance and then evaluate a query against it. Some implementations may transform a query into one supported by the underlying database. Other implementations may handle a query directly but materialize only those parts of the data source instance that are needed. Probably most implementations will use a combination of approaches.

3.  Role of XML

Where does XML fit into this picture?

In our opinion, the key role for XML is to encode query results.

We believe that there are advantages, discussed below, to thinking of the data model at a higher level. We believe that the data model need not be strictly XML-based for the following reasons.

As to the issue of using XML as a storage format, we feel that this is an implementation issue for data sources.

4.  Role of a Query Protocol

Before going further, we would like to call out one of our assumptions: that the mechanism for uniform access to data sources is a query protocol. A query protocol has

A real protocol may carry other information with a query (such as authentication credentials, accounting information, user preferences, browser capabilities and version information) and would need to allow for exceptional returns.

We think that one useful protocol would be XML over HTTP. In general, a query would be carried in an XML wrapper using a (new) HTTP method, say "QUERY". An error would be returned as an XML document describing it. A normal result could come back in different formats, depending on its Content-Type. Some Content-Types would indicate a general query result encoded in XML. Others would indicate results that are images, movies or whatever.
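
As a rough illustration, a client-side exchange might look like the following Python sketch. The "QUERY" method is the hypothetical one proposed above; the host name, path, namespace and query element names are invented for the example.

    import http.client

    # A query wrapped in XML; the element names and namespace are invented.
    query_xml = """<?xml version="1.0"?>
    <query xmlns="urn:example:query">
      <select>dataset/title</select>
      <where>dataset/@year &gt; 1995</where>
    </query>"""

    # Send the query with the hypothetical "QUERY" method over HTTP.
    conn = http.client.HTTPConnection("datasource.example.org")
    conn.request("QUERY", "/experiments", body=query_xml,
                 headers={"Content-Type": "text/xml"})
    resp = conn.getresponse()

    # The response's Content-Type says whether the body is an XML-encoded
    # result, an XML error document, or opaque data such as an image.
    print(resp.status, resp.getheader("Content-Type"))
    result = resp.read()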

We believe that such a protocol should also have interoperability with regular web entities:

5.  Properties of the Data Model

We believe the data model should be higher-level than what is available in plain XML. In other words, a data model schema would contain more information than the corresponding DTD. In particular, we believe the data model should

The first requirement, that the data model support existing database technology, is to ensure that databases can be exported through the data model interface without loss of functionality. This implies that the data model needs, in some way, to be able to represent relational tables, objects, relationships, attributes, methods, joins and projections. For practical purposes, it should be straightforward to map (appropriate subsets of) the query language into SQL and OQL.

The requirement regarding XML and semi-structured data recognizes that not all data has well-understood, regular structure.

The remaining points are each discussed below.

5.1.  Support for Opaque Data

We believe that the data model should be able to support arbitrary kinds of data, including images, movies, scientific datasets, text documents and software. The following are examples of queries we think should be supported:

As discussed above, one should be able to return opaque data from a query. We also believe that one should be able to pass opaque data as input to a query. This is especially useful when accessing two or more data sources.

5.2.  Support for User-defined Operations

We believe a query system is much more powerful if it can incorporate user-defined (actually data-source-defined) functions and predicates. These could be used within a query to operate on results (as in a "select" clause) or as part of a search (as in a "where" clause).

The previous section gave some examples of functions, some of which might be user-defined. Other examples are:

Note: in some cases, functions may produce non-XML results. Examples include the movie-producing or thumbnail-producing functions mentioned in the previous section. There could also be functions that transform XML subtrees into HTML documents.
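
As a rough sketch of the idea, the Python fragment below has the data source register named functions and predicates which a query evaluator then invokes. The operation names, the record structure and the evaluator itself are all invented for illustration.

    # Data-source-defined operations, registered by name.
    FUNCTIONS = {
        "year_of": lambda rec: rec["date"][:4],                   # usable in a "select"
    }
    PREDICATES = {
        "after": lambda rec, year: int(rec["date"][:4]) > year,   # usable in a "where"
    }

    def evaluate(records, where, select):
        """Filter with a registered predicate, then apply a registered function."""
        pred_name, arg = where
        kept = [r for r in records if PREDICATES[pred_name](r, arg)]
        return [FUNCTIONS[select](r) for r in kept]

    datasets = [{"date": "1994-07-01"}, {"date": "1997-03-15"}]
    print(evaluate(datasets, where=("after", 1995), select="year_of"))  # ['1997']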

5.3.  Support for Higher-level Data Types

We believe it is useful to incorporate into the data model typing above and beyond what XML supports. Specifically, we recommend considering

Each of these will be discussed shortly. But first, a little motivation.

5.3.1.  The role of types

A type supplies information about a value. For our purposes, saying that a value is of some type means that the value belongs to some specific abstract set and that its physical value is given by a particular representation (which can be viewed as a mapping from physical to abstract values). A type allows data to be interpreted at a higher level, providing support for:

Type information might also be used to associate an XML document with its DTD and certain processing instructions (PIs).

5.3.2.  Adding higher-level types for CDATA

XML's universal type for everything other than elements is CDATA, which denotes "character data" (aka strings). The problem is, for example, that "9" is less than "10" as a number while "9" is greater than "10" as a string. So if a query contains a less-than operation applied to the two CDATA values, "9" and "10", the result is ambiguous.

By associating higher-level types with CDATA, this sort of ambiguity can be resolved. We suggest that the data model allow one to specify a higher-level type name, like "Int", "DateTime" or "URL", instead of PCDATA (in element declarations) and CDATA (in attribute declarations).
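
The following Python sketch makes the ambiguity concrete and shows what a declared type buys; the type name "Int" and the typed_value helper are hypothetical.

    # As untyped character data the comparison goes one way, as numbers the other.
    print("9" < "10")            # False: lexicographic comparison of strings
    print(int("9") < int("10"))  # True:  numeric comparison

    # What a declared higher-level type buys: the (made-up) type name "Int"
    # selects the numeric interpretation of the character data.
    def typed_value(cdata, type_name):
        return int(cdata) if type_name == "Int" else cdata

    print(typed_value("9", "Int") < typed_value("10", "Int"))  # True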

5.3.3.  Adding higher-level types for opaque data

Similarly, one should be able to specify higher-level types for opaque data to allow such values to be partitioned into appropriate classes.

It might be that the type of an opaque value determines the MIME type to be associated with that value during transfer and might also determine its transport encoding and/or compression.

5.3.4.  Allowing elements to be treated as "objects"

XML elements, since they are tagged by namespace-qualified names, are adequately typed. However, there is value in treating XML subtrees as "objects". In particular, instead of operating on the content of an element, a query should be able to apply a method to the object; the method's implementation (as defined by the object's class) would then operate on the content.

This recognizes that an XML element often represents some higher-level entity, such as a poem, a list of stock quotes, the parameters for an experiment, a query or a schema. By using the object paradigm, queries can be made more intuitive, operating in terms of methods defined for that class of objects.

Also, since a query operates at the method level, the actual state is encapsulated, which means, among other things, that the representation of an object's state can change without forcing queries (or the schema) to change. And if there is an appropriate form of inheritance, there can be many different XML object classes that implement the same interface.

Presumably these objects, in addition to having methods, would have attributes (also called properties) and could be in relationships. There could be a standard method on objects, say copy(), that returns the object's state as XML. Objects would have identity, presumably represented by ID attributes.
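
A minimal Python sketch of the idea follows, using a made-up StockQuotes class that "fronts" for an XML subtree. The class, its latest() method and the element names are illustrative; only copy() corresponds to the standard method suggested above.

    import xml.etree.ElementTree as ET

    class StockQuotes:
        """A made-up object class wrapping an XML subtree."""

        def __init__(self, element):
            self._element = element              # encapsulated state

        def latest(self, symbol):
            # A method operates on the content, so queries need not know
            # how the quotes are actually represented.
            for q in self._element.findall("quote"):
                if q.get("symbol") == symbol:
                    return float(q.get("price"))
            return None

        def copy(self):
            # The standard method returning the object's state as XML.
            return ET.tostring(self._element, encoding="unicode")

    doc = ET.fromstring('<quotes id="q1"><quote symbol="XYZ" price="42.5"/></quotes>')
    obj = StockQuotes(doc)
    print(obj.latest("XYZ"))    # 42.5
    print(obj.copy())           # the state, serialized back to XML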

Note that having an object model might also allow for more efficient mapping of lower-level object facilities, like those of OODBs, into the data model. In particular, XML objects could "front" for some lower-level entity, and a method's implementation could actually operate at the more efficient, lower level.

5.3.5.  Allowing element "choices" to be named

In our limited experience with XML, we have found that we end up using the same choice lists (element content lists separated by "|") over and over. We believe the data model should allow one to name these choices and the query language should allow these choice names to be used in path expressions.

For instance, one might create a type name, "publication", for books, articles and pamphlets. Then "publication" could be used in an element content description to mean the choice list "(book|article|pamphlet)" and "publication" could be used in a query's path expression to denote any of those element tags.

It is recognized that one could create "wrapping" elements to achieve the same thing, but we found that such wrappers made the resulting XML very verbose, especially when the DTD is highly recursive.
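
As a rough sketch of how such names might behave, the Python fragment below maps the choice name "publication" to the tags it stands for and uses it in a path-like step. The table and matching function are purely illustrative, not a proposal for concrete syntax.

    import xml.etree.ElementTree as ET

    # The named choice and the element tags it stands for.
    CHOICES = {"publication": {"book", "article", "pamphlet"}}

    def matches(element, name):
        """True if the element's tag is the name itself or one of the tags
        the named choice expands to."""
        return element.tag == name or element.tag in CHOICES.get(name, set())

    library = ET.fromstring("<library><book/><pamphlet/><memo/></library>")
    # Using the choice name in a path-like step selects books, articles and
    # pamphlets, but not the memo.
    print([child.tag for child in library if matches(child, "publication")])
    # ['book', 'pamphlet']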

5.3.6.  User-defined types

We believe it is impractical for there to be one fixed set of types both because there are too many of them and because the set is constantly growing. Thus data sources should be able to define their own types. There should also be a way for collections of data sources to share sets of types (so that each type means the same thing across the collection of data sources).

5.4.  Self-Identifying Data Sources

We believe it is important for data sources to be able to describe themselves. In particular, a data source should be able to return its own schema as the result of some standard query. Presumably the schema is returned in XML, possibly annotated with a style sheet for viewing in regular web browsers. The schema would announce the types, functions, predicates and object classes supported by the data source as well as giving the type of each data value exported.

Since schemas may evolve, a version id should be associated with (and contained in) each instance of the schema. Then a version id can be sent with a query and an exception thrown if the schema has evolved in an incompatible way.

6.  Other Suggestions

6.1.  Data transparency

It is desirable to be able to carry arbitrary text transparently within an XML document. XML's CDATA section only allows text that doesn't contain the CDEnd string ("]]>"). What is needed is a simple reversible mapping that takes arbitrary text into text without CDEnd. Then, to carry text data within XML, apply the mapping and wrap the result in a CDATA section; to retrieve the data, extract it from the CDATA section and apply the reverse mapping. (It would be nice if this mapping also correctly handled end-of-line conventions and character-set and encoding issues.)
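
One possible mapping, sketched in Python below, doubles an escape character and uses it to break up any occurrence of CDEnd. The choice of "~" as the escape character is arbitrary, and the scheme is offered only as an illustration (it does not address the end-of-line or character-set issues).

    def escape(text):
        # Double the escape character, then break up any CDEnd occurrence.
        return text.replace("~", "~~").replace("]]>", "]~]>")

    def unescape(text):
        # Apply the inverse substitutions in the reverse order.
        return text.replace("]~]>", "]]>").replace("~~", "~")

    raw = "arbitrary text with ]]> inside"
    wrapped = "<![CDATA[" + escape(raw) + "]]>"
    assert "]]>" not in escape(raw)
    assert unescape(escape(raw)) == raw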

It is also desirable to carry binary data transparently. One approach is to define a reversible mapping from binary data to text without CDEnd (like MIME's Base64) and encapsulate within CDATA. Another approach is to use some sort of multipart representation so that binary data can be carried in its native form over 8 bit transports. (In either case, it would be nice if the encoding could carry a MIME type with the data.)
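
A small Python sketch of the first approach follows; the <binary> wrapper element and its type attribute are invented here to show how a MIME type could travel with the data.

    import base64

    def wrap_binary(data, mime_type):
        # Base64 output is plain ASCII and never contains the CDEnd string.
        b64 = base64.b64encode(data).decode("ascii")
        return '<binary type="%s"><![CDATA[%s]]></binary>' % (mime_type, b64)

    payload = bytes(range(256))                     # arbitrary 8-bit data
    wrapped = wrap_binary(payload, "image/png")
    # Recovering the payload: extract the Base64 text and decode it.
    inner = wrapped.split("<![CDATA[")[1].split("]]>")[0]
    assert base64.b64decode(inner) == payload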

6.2.  Queries as Data

It is useful for queries to be data, so that