Query languages for scientific data

Author

Peter Vanderbilt, pv@nas.nasa.gov, a member of NASA Ames' Data Analysis group.

Abstract

The goal of this position paper is to provide some goals and requirements for XML query languages which will, in the author's opinion, make them more generally useful. While these requirements stem from thinking about query languages for access to scientific data, it is believed that they also apply to applications in general.

The key requirements are:

These will be discussed in the following sections. The organization of this paper is:

1.  Properties of Data Sources

It is difficult to generalize about scientific and engineering data because different scientific disciplines have very different data requirements. For our purposes it suffices to say that there are datasets and metadata. A dataset is obtained by experiment, observation or simulation. Examples of datasets include output from measuring devices, output from still and video cameras, text files created by humans, and output from simulations run on supercomputers.

Metadata is data associated with a dataset, typically identifying that dataset in various ways. Metadata can include parameters used in running an experiment or simulation. It can also include information derived from a dataset. The distinction between the dataset and its metadata is often not well-defined, either logically or in terms of physical data structures.

The way we view data sources is that they:

Our key requirement is to have uniform access to all these kinds of data sources.

2.  Approach

Our approach is as follows.

An instance of the data model is a value that represents the state of the data source (including all the metadata and datasets). The evaluation of a query against such an instance returns some subset of that instance, perhaps with reformatting or processing applied.

Note that a data model instance and the evaluation of a query against it are logical concepts, a way to abstractly describe the behavior of such a system. An actual implementation of a data source need not materialize the data model instance and then evaluate a query against it. Some implementations may transform a query into one supported by the underlying database. Other implementations may handle a query directly but materialize only those parts of the data source instance that are needed. Probably most implementations will use a combination of approaches.

3.  Role of XML

Where does XML fit into this picture?

In our opinion, the key role for XML is to encode query results.

We believe that there are advantages, discussed below, to thinking of the data model at a higher level. We believe that the data model need not be strictly XML-based for the following reasons.

As to the issue of using XML as a storage format, we feel that this is an implementation issue for data sources.

4.  Role of a Query Protocol

Before going further, we would like to call out one of our assumptions: that the mechanism for uniform access to data sources is a query protocol. A query protocol has

A real protocol may carry other information with a query (such as authentication credentials, accounting information, user preferences, browser capabilities and version information) and would need to allow for exceptional returns.

We think that one useful protocol would be XML over HTTP. In general, a query would be carried in an XML wrapper using a (new) HTTP method, say "QUERY". An error would be returned as an XML document describing it. A normal result could come back in different formats, depending on its Content-Type. Some Content-Types would indicate a general query result encoded in XML. Others would indicate results that are images, movies or whatever.
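
As a rough illustration, a client-side exchange might look like the following Python sketch. The "QUERY" method is the hypothetical one proposed above; the host name, path, namespace and query element names are invented for the example.

    import http.client

    # A query wrapped in XML; the element names and namespace are invented.
    query_xml = """<?xml version="1.0"?>
    <query xmlns="urn:example:query">
      <select>dataset/title</select>
      <where>dataset/@year &gt; 1995</where>
    </query>"""

    # Send the query with the hypothetical "QUERY" method over HTTP.
    conn = http.client.HTTPConnection("datasource.example.org")
    conn.request("QUERY", "/experiments", body=query_xml,
                 headers={"Content-Type": "text/xml"})
    resp = conn.getresponse()

    # The response's Content-Type says whether the body is an XML-encoded
    # result, an XML error document, or opaque data such as an image.
    print(resp.status, resp.getheader("Content-Type"))
    result = resp.read()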

We believe that such a protocol should also have interoperability with regular web entities:

5.  Properties of the Data Model

We believe the data model should be higher-level than what is available in plain XML. In other words, a data model schema would contain more information than the corresponding DTD. In particular, we believe the data model should

The first requirement, that the data model support existing database technology, is to ensure that databases can be exported through the data model interface without loss of functionality. This implies that the data model needs, in some way, to be able to represent relational tables, objects, relationships, attributes, methods, joins and projections. For practical purposes, it should be straightforward to map (appropriate subsets of) the query language into SQL and OQL.

The requirement regarding XML and semi-structured data recognizes that not all data has well-understood, regular structure.

The remaining points are each discussed below.

5.1.  Support for Opaque Data

We believe that the data model should be able to support arbitrary kinds of data, including images, movies, scientific datasets, text documents and software. The following are examples of queries we think should be supported:

As discussed above, one should be able to return opaque data from a query. We also believe that one should be able to pass opaque data as input to a query. This is especially useful when accessing two or more data sources.

5.2.  Support for User-defined Operations

We believe a query system is much more powerful if it can incorporate user-defined (actually data-source-defined) functions and predicates. These could be used within a query to operate on results (as in a "select" clause) or as part of a search (as in a "where" clause).

The previous section gave some examples of functions, some of which might be user-defined. Other examples are:

Note: in some cases, functions may produce non-XML results. Examples include the movie-producing or thumbnail-producing functions mentioned in the previous section. There could also be functions that transform XML subtrees into HTML documents.
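
As a rough sketch of the idea, the Python fragment below has the data source register named functions and predicates which a query evaluator then invokes. The operation names, the record structure and the evaluator itself are all invented for illustration.

    # Data-source-defined operations, registered by name.
    FUNCTIONS = {
        "year_of": lambda rec: rec["date"][:4],                   # usable in a "select"
    }
    PREDICATES = {
        "after": lambda rec, year: int(rec["date"][:4]) > year,   # usable in a "where"
    }

    def evaluate(records, where, select):
        """Filter with a registered predicate, then apply a registered function."""
        pred_name, arg = where
        kept = [r for r in records if PREDICATES[pred_name](r, arg)]
        return [FUNCTIONS[select](r) for r in kept]

    datasets = [{"date": "1994-07-01"}, {"date": "1997-03-15"}]
    print(evaluate(datasets, where=("after", 1995), select="year_of"))  # ['1997']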

5.3.  Support for Higher-level Data Types

We believe it is useful to incorporate into the data model typing above and beyond what XML supports. Specifically, we recommend considering

Each of these will be discussed shortly. But first, a little motivation.

5.3.1.  The role of types

A type supplies information about a value. For our purposes, saying that a value is of some type means that the value belongs to some specific abstract set and that its physical value is given by a particular representation (which can be viewed as a mapping from physical to abstract values). A type allows data to be interpreted at a higher level, providing support for:

Type information might also be used to associate an XML document with its DTD and certain processing instructions (PIs).

5.3.2.  Adding higher-level types for CDATA

XML's universal type for everything other than elements is CDATA, which denotes "character data" (aka strings). The problem is, for example, that "9" is less than "10" as a number while "9" is greater than "10" as a string. So if a query contains a less-than operation applied to the two CDATA values, "9" and "10", the result is ambiguous.

By associating higher-level types with CDATA, this sort of ambiguity can be resolved. We suggest that the data model allow one to specify a higher-level type name, like "Int", "DateTime" or "URL", instead of PCDATA (in element declarations) and CDATA (in attribute declarations).
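
The following Python sketch makes the ambiguity concrete and shows what a declared type buys; the type name "Int" and the typed_value helper are hypothetical.

    # As untyped character data the comparison goes one way, as numbers the other.
    print("9" < "10")            # False: lexicographic comparison of strings
    print(int("9") < int("10"))  # True:  numeric comparison

    # What a declared higher-level type buys: the (made-up) type name "Int"
    # selects the numeric interpretation of the character data.
    def typed_value(cdata, type_name):
        return int(cdata) if type_name == "Int" else cdata

    print(typed_value("9", "Int") < typed_value("10", "Int"))  # True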

5.3.3.  Adding higher-level types for opaque data

Similarly, one should be able to specify higher-level types for opaque data to allow such values to be partitioned into appropriate classes.

It might be that the type of an opaque value determines the MIME type to be associated with that value during transfer and might also determine its transport encoding and/or compression.

5.3.4.  Allowing elements to be treated as "objects"

XML elements, since they are tagged by namespace-qualified names, are adequately typed. However, there is value in treating XML subtrees as "objects". In particular, instead of operating on the content of an element, a query should be able to apply a method to the object; the method's implementation (as defined by the object's class) would then operate on the content.

This recognizes that an XML element often represents some higher-level entity, such as a poem, a list of stock quotes, the parameters for an experiment, a query or a schema. By using the object paradigm, queries can be made more intuitive, operating in terms of methods defined for that class of objects.

Also, since a query operates at the method level, the actual state is encapsulated, which means, among other things, that the representation of an object's state can change without forcing queries (or the schema) to change. And if there is an appropriate form of inheritance, there can be many different XML object classes that implement the same interface.

Presumably these objects, in addition to having methods, would have attributes (also called properties) and could be in relationships. There could be a standard method on objects, say copy(), that returns the object's state as XML. Objects would have identity, presumably represented by ID attributes.
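
A minimal Python sketch of the idea follows, using a made-up StockQuotes class that "fronts" for an XML subtree. The class, its latest() method and the element names are illustrative; only copy() corresponds to the standard method suggested above.

    import xml.etree.ElementTree as ET

    class StockQuotes:
        """A made-up object class wrapping an XML subtree."""

        def __init__(self, element):
            self._element = element              # encapsulated state

        def latest(self, symbol):
            # A method operates on the content, so queries need not know
            # how the quotes are actually represented.
            for q in self._element.findall("quote"):
                if q.get("symbol") == symbol:
                    return float(q.get("price"))
            return None

        def copy(self):
            # The standard method returning the object's state as XML.
            return ET.tostring(self._element, encoding="unicode")

    doc = ET.fromstring('<quotes id="q1"><quote symbol="XYZ" price="42.5"/></quotes>')
    obj = StockQuotes(doc)
    print(obj.latest("XYZ"))    # 42.5
    print(obj.copy())           # the state, serialized back to XML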

Note that having an object model might also allow for more efficient mapping of lower-level object facilities, like those of OODBs, into the data model. In particular, XML objects could "front" for some lower-level entity, and a method's implementation could actually operate at the more efficient, lower level.

5.3.5.  Allowing element "choices" to be named

In our limited experience with XML, we have found that we end up using the same choice lists (element content lists separated by "|") over and over. We believe the data model should allow one to name these choices and the query language should allow these choice names to be used in path expressions.

For instance, one might create a type name, "publication", for books, articles and pamphlets. Then "publication" could be used in an element content description to mean the choice list "(book|article|pamphlet)" and "publication" could be used in a query's path expression to denote any of those element tags.

It is recognized that one could create "wrapping" elements to achieve the same thing, but we found that such wrappers made the resulting XML very verbose, especially when the DTD is highly recursive.
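
As a rough sketch of how such names might behave, the Python fragment below maps the choice name "publication" to the tags it stands for and uses it in a path-like step. The table and matching function are purely illustrative, not a proposal for concrete syntax.

    import xml.etree.ElementTree as ET

    # The named choice and the element tags it stands for.
    CHOICES = {"publication": {"book", "article", "pamphlet"}}

    def matches(element, name):
        """True if the element's tag is the name itself or one of the tags
        the named choice expands to."""
        return element.tag == name or element.tag in CHOICES.get(name, set())

    library = ET.fromstring("<library><book/><pamphlet/><memo/></library>")
    # Using the choice name in a path-like step selects books, articles and
    # pamphlets, but not the memo.
    print([child.tag for child in library if matches(child, "publication")])
    # ['book', 'pamphlet']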

5.3.6.  User-defined types

We believe it is impractical for there to be one fixed set of types both because there are too many of them and because the set is constantly growing. Thus data sources should be able to define their own types. There should also be a way for collections of data sources to share sets of types (so that each type means the same thing across the collection of data sources).

5.4.  Self-Identifying Data Sources

We believe it is important for data sources to be able to describe themselves. In particular, a data source should be able to return its own schema as the result of some standard query. Presumably the schema is returned in XML, possibly annotated with a style sheet for viewing in regular web browsers. The schema would announce the types, functions, predicates and object classes supported by the data source as well as giving the type of each data value exported.

Since schemas may evolve, a version id should be associated with (and contained in) each instance of the schema. Then a version id can be sent with a query and an exception thrown if the schema has evolved in an incompatible way.

6.  Other Suggestions

6.1.  Data transparency

It is desirable to be able to carry arbitrary text transparently within an XML document. XML's CDATA section only allows text that doesn't contain the CDEnd string ("]]>"). What is needed is a simple reversible mapping that takes arbitrary text into text without CDEnd. Then, to carry text data within XML, apply the mapping and wrap the result in a CDATA section; to retrieve the data, extract it from the CDATA section and apply the reverse mapping. (It would be nice if this mapping also correctly handled end-of-line conventions and character-set and encoding issues.)
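
One possible mapping, sketched in Python below, doubles an escape character and uses it to break up any occurrence of CDEnd. The choice of "~" as the escape character is arbitrary, and the scheme is offered only as an illustration (it does not address the end-of-line or character-set issues).

    def escape(text):
        # Double the escape character, then break up any CDEnd occurrence.
        return text.replace("~", "~~").replace("]]>", "]~]>")

    def unescape(text):
        # Apply the inverse substitutions in the reverse order.
        return text.replace("]~]>", "]]>").replace("~~", "~")

    raw = "arbitrary text with ]]> inside"
    wrapped = "<![CDATA[" + escape(raw) + "]]>"
    assert "]]>" not in escape(raw)
    assert unescape(escape(raw)) == raw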

It is also desirable to carry binary data transparently. One approach is to define a reversible mapping from binary data to text without CDEnd (like MIME's Base64) and encapsulate within CDATA. Another approach is to use some sort of multipart representation so that binary data can be carried in its native form over 8 bit transports. (In either case, it would be nice if the encoding could carry a MIME type with the data.)
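
A small Python sketch of the first approach follows; the <binary> wrapper element and its type attribute are invented here to show how a MIME type could travel with the data.

    import base64

    def wrap_binary(data, mime_type):
        # Base64 output is plain ASCII and never contains the CDEnd string.
        b64 = base64.b64encode(data).decode("ascii")
        return '<binary type="%s"><![CDATA[%s]]></binary>' % (mime_type, b64)

    payload = bytes(range(256))                     # arbitrary 8-bit data
    wrapped = wrap_binary(payload, "image/png")
    # Recovering the payload: extract the Base64 text and decode it.
    inner = wrapped.split("<![CDATA[")[1].split("]]>")[0]
    assert base64.b64decode(inner) == payload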

6.2.  Queries as Data

It is useful for queries to be data, so that