Peter Vanderbilt, pv@nas.nasa.gov, a member of NASA Ames' Data Analysis group.
Abstract
This position paper proposes some goals and requirements for XML query languages which will, in the author's opinion, make them more generally useful. While these requirements stem from thinking about query languages for access to scientific data, it is believed that they also apply to applications in general.
The key requirements are:
These will be discussed in the following sections. The organization of this paper is:
It is difficult to generalize about scientific and engineering data
because different scientific disciplines have very different data
requirements. Let's just say that there are datasets and
metadata. A dataset is obtained by experiment, observation or
simulation. Examples of datasets include output from measuring
devices, output from still and video cameras, text files created by
humans, and output from simulations run on supercomputers.
Metadata is data associated with a dataset, typically identifying that
dataset in various ways. Metadata can include parameters used in
running an experiment or simulation. It can also include information
derived from a dataset. The distinction between the dataset and its
metadata is often not well-defined, either logically or in terms of
physical data structures.
The way we view data sources is that they:
Our key requirement is to have uniform access to all these kinds of
data sources.
Our approach is as follows.
An instance of the data model is a value that represents the state of
the data source (including all the metadata and datasets). The
evaluation of a query against such an instance returns some subset of
that instance, perhaps with reformatting or processing applied.
Note that a data model instance and the evaluation of a query against
it are logical concepts, a way to abstractly describe the behavior of
such a system. An actual implementation of a data source need not
actually materialize the data model instance and then evaluate a query
against it. Some implementations may transform a query to one
supported by the underlying database. Other implementations may
handle a query directly but materialize only those parts of the data
source instance needed. Probably most implementations will use a
combination of approaches.
Where does XML fit into this picture?
In our opinion, the key role for XML is given by the second one, using
XML to encode query results.
We believe that there are advantages, discussed below, to thinking of
the data model at a higher level. We believe
that the data model need not be strictly XML-based for the following
reasons.
As to the issue of using XML as a storage format, we feel that this is
an implementation issue for data sources.
Before going further, we would like to call out one of our
assumptions: that the mechanism for uniform access to data sources is
a query protocol. A query protocol has
A real protocol may carry other information with a query (such as
authentication credentials, accounting information, user preferences,
browser capabilities and version information) and would need to allow
for exceptional returns.
We think that one useful protocol would be to use XML over HTTP. In
general, a query would be carried in an XML wrapper using a (new) HTTP
method, say "QUERY". An error return would be returned as an XML
document describing the error. A regular return could come back in
different formats, depending on Content-type. Some Content-types
would indicate a general query result encoded in XML. Others would
indicate results that are images, movies or whatever.
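As a rough illustration of the XML-over-HTTP idea, the following
Python sketch builds the text of such a request. Only the "QUERY"
method comes from the proposal above; the wrapper element names
(query-request, query), the version attribute, the request path, the
host name and the query text are all hypothetical, chosen purely for
illustration.

```python
import xml.etree.ElementTree as ET


def build_query_request(query_text, host, path="/datasource", accept="text/xml"):
    """Return the bytes of an HTTP request carrying a query in an XML wrapper.

    The element names and the request path are illustrative assumptions,
    not part of any defined protocol.
    """
    wrapper = ET.Element("query-request", {"version": "1"})
    ET.SubElement(wrapper, "query").text = query_text
    body = ET.tostring(wrapper, encoding="utf-8")

    head = [
        f"QUERY {path} HTTP/1.1",           # the (new) method proposed above
        f"Host: {host}",
        "Content-Type: text/xml; charset=utf-8",
        f"Accept: {accept}",                # the result format the client wants
        f"Content-Length: {len(body)}",
    ]
    return "\r\n".join(head).encode("ascii") + b"\r\n\r\n" + body


request = build_query_request(
    "select thumbnail(image) where run = 42",   # made-up query syntax
    host="data.example.org",
    accept="image/png",
)
```

Here the Accept header stands in for the idea that a regular return
may come back in different formats; an error return would instead
carry an XML document describing the error.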
We believe that such a protocol should also have interoperability with
regular web entities:
We believe the data model should be higher-level than what is
available in plain XML. In other words, a data model schema would
contain more information than the corresponding DTD. In particular,
we believe the data model should
The first requirement, that the data model support existing database
technology, is to ensure that databases can be exported through the
data model interface without loss of functionality. This implies that
the data model needs to, in some way, be able to represent relational
tables, objects, relationships, attributes, methods, joins and
projections. For practical purposes, it should be straightforward to
map (appropriate subsets of) the query language into SQL and OQL.
The requirement regarding XML and semi-structured data recognizes that
not all data has well-understood, regular structure.
The remaining points are each discussed below.
We believe that the data model should be able to support arbitrary
kinds of data, including images, movies, scientific datasets, text
documents and software. The following are examples of queries we
think should be supported:
As discussed above, one should be able to return opaque data from a
query. We believe that one should also be able to pass opaque data as
input to a query. This is especially useful when accessing two or
more data sources.
We believe a query system is much more powerful if it can incorporate
user-defined (actually data-source-defined) functions and predicates.
These could be used within a query to operate on results (as in a
"select" clause) or as part of a search (as in a "where" clause).
The previous section gave some examples of functions, some of which
might be user-defined. Other examples are:
Note: in some cases, functions may produce non-XML results. Examples
include the movie-producing or thumbnail-producing functions mentioned
in the last section. There could also be functions that transform XML
subtrees into HTML documents.
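One way to picture data-source-defined functions and predicates is as
a registry that the query evaluator consults by name. The sketch below
is only that, a picture: the registry, the decorator, the record
layout and the function names (mach_above, summary) are all
hypothetical and are not drawn from any existing query language.

```python
FUNCTIONS = {}   # hypothetical registry: exported name -> callable


def register(name):
    """Decorator a data source could use to export a function or predicate."""
    def decorator(fn):
        FUNCTIONS[name] = fn
        return fn
    return decorator


@register("mach_above")
def mach_above(record, threshold):
    """Predicate, usable in the "where" part of a query."""
    return record["mach"] > threshold


@register("summary")
def summary(record):
    """Function, usable in the "select" part of a query."""
    return f"run {record['run']} at Mach {record['mach']}"


def evaluate(records, select, where, where_args=()):
    """Filter with the named predicate, then apply the named function."""
    predicate = FUNCTIONS[where]
    projection = FUNCTIONS[select]
    return [projection(r) for r in records if predicate(r, *where_args)]


runs = [{"run": 1, "mach": 0.8}, {"run": 2, "mach": 2.5}]
results = evaluate(runs, select="summary", where="mach_above", where_args=(1.0,))
```

The evaluator itself knows nothing about Mach numbers; everything
domain-specific lives in functions supplied by the data source.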
We believe it is useful to incorporate into the data model typing
above and beyond what XML supports. Specifically, we recommend
considering
Each of these will be discussed shortly. But first, a little
motivation.
A type supplies information about a value. For our purposes, saying
that a value is of some type means that that value is of some specific
abstract set and that its physical value is given by a particular
representation (which can be viewed as a mapping from physical to
abstract values). A type allows data to be interpreted at a
higher-level, providing support for:
Type information might also be used to associate an XML document with
its DTD and certain processing instructions (PIs).
XML's universal type for everything other than elements is CDATA,
which denotes "character data" (aka strings). The problem is, for
example, that "9" is less than "10" as a number while "9" is greater
than "10" as a string. So if a query contains a less-than operation
applied to the two CDATA values, "9" and "10", the result is ambiguous.
By associating higher-level types with CDATA, this sort of ambiguity
can be resolved. We suggest that the data model allow one to specify
a higher-level type name, like "Int", "DateTime" or "URL", instead of
PCDATA (in element declarations) and CDATA (in attribute
declarations).
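The ambiguity, and the way a type name resolves it, can be shown in a
few lines of Python. The table mapping type names to parsers is a
hypothetical stand-in for whatever the data model's schema would
actually provide.

```python
from datetime import datetime

# Hypothetical mapping from higher-level type names to parsers.
PARSERS = {
    "CDATA": str,                        # plain character data
    "Int": int,
    "DateTime": datetime.fromisoformat,
}


def less_than(left, right, type_name="CDATA"):
    """Compare two character-data values under the named type."""
    parse = PARSERS[type_name]
    return parse(left) < parse(right)


as_strings = less_than("9", "10")          # compared as text
as_ints = less_than("9", "10", "Int")      # compared as numbers
as_dates = less_than("1998-11-30", "1998-12-03", "DateTime")
```

Compared as strings, "9" is not less than "10"; compared as Int
values, it is. With the type recorded in the schema, a query processor
would not have to guess which comparison was meant.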
Similarly, one should be able to specify higher level types for opaque
data to allow those things to be partitioned into appropriate
classes.
It might be that the type of an opaque value determines the MIME type
to be associated with that value during transfer, and might also
determine the transport encoding and/or compression.
XML elements, since they are tagged by namespace-qualified names, are
adequately typed. However, there is value to treating XML subtrees as
"objects". In particular, a query, instead of operating on the
content of an element, should be able to apply a method to the object
and the method's implementation (as defined by the object's class)
would operate on the content.
This recognizes that an XML element often represents some higher-level
entity, such as a poem, a list of stock quotes, the parameters for an
experiment, a query or a schema. By using the object paradigm,
queries can be made more intuitive, operating in terms of methods
defined for that class of objects.
Also, since a query operates at the method level, the actual state is
encapsulated. Among other things, this means that the representation
of an object's state can change without forcing queries (or the
schema) to change. And if there is an appropriate form of inheritance,
there can be many different XML object classes that implement the same
interface.
Presumably these objects, in addition to having methods, would have
attributes (also called properties) and could be in relationships.
There could be a standard method on objects, say copy(), that returns
the object's state as XML. Objects would have identity, presumably
represented by ID attributes.
Note that having an object model might also allow for more efficient
mapping of lower-level object facilities, like those of OODBs, into
the data model. In particular, XML objects could "front" for some
lower-level entity and the method's implementation could actually
operate at the more efficient, lower level.
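The following Python sketch suggests what treating elements as objects
might look like. The class names, the tag-to-class table and the
line_count() method are hypothetical; only copy() corresponds to a
method suggested above. Two classes with different internal
representations implement the same interface, so a query written
against line_count() does not depend on either representation.

```python
import xml.etree.ElementTree as ET


class XmlObject:
    """Wraps an element and hides its content behind methods."""

    def __init__(self, element):
        self._element = element

    def copy(self):
        """Return the object's state as XML text."""
        return ET.tostring(self._element, encoding="unicode")


class Poem(XmlObject):
    def line_count(self):
        # State representation: one <line> child element per line.
        return len(self._element.findall("line"))


class Verse(XmlObject):
    def line_count(self):
        # Different representation: newline-separated character data.
        return len((self._element.text or "").splitlines())


CLASSES = {"poem": Poem, "verse": Verse}   # hypothetical tag -> class table


def as_object(element):
    """Look up the class for an element's tag and wrap the element."""
    return CLASSES[element.tag](element)


doc = ET.fromstring(
    "<anthology>"
    "<poem id='p1'><line>Roses are red</line><line>Violets are blue</line></poem>"
    "<verse id='v1'>One line\nAnother line\nA third</verse>"
    "</anthology>"
)
counts = {child.get("id"): as_object(child).line_count() for child in doc}
```

Object identity is represented here by the id attribute, in the spirit
of the ID attributes mentioned above.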
In our limited experience with XML, we have found that we end up using
the same choice lists (element content lists separated by "|") over
and over. We believe the data model should allow one to name these
choices and the query language should allow these choice names to be
used in path expressions.
For instance, one might create a type name, "publication", for books,
articles and pamphlets. Then "publication" could be used in an element
content description to mean the choice list "(book|article|pamphlet)"
and "publication" could be used in a query's path expression to denote
any of those element tags.
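A minimal sketch of the idea in Python, assuming a hypothetical table
of named choices kept alongside the schema. The helper names and the
sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical table of named choices.
CHOICES = {"publication": ("book", "article", "pamphlet")}


def content_model(name):
    """Expand a choice name into the choice list it abbreviates."""
    return "(" + "|".join(CHOICES[name]) + ")"


def find_all(parent, name):
    """Children of parent matching a tag or a named choice, in document order."""
    tags = CHOICES.get(name, (name,))
    return [child for child in parent if child.tag in tags]


library = ET.fromstring(
    "<library>"
    "<book>B</book><map>M</map><article>A</article><pamphlet>P</pamphlet>"
    "</library>"
)
publications = [e.text for e in find_all(library, "publication")]
maps = [e.text for e in find_all(library, "map")]
```

An ordinary tag name is treated as a one-element choice, so the same
path step works for both plain tags and choice names.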
It is recognized that one could create "wrapping" elements to achieve
the same thing, but we found that such wrappers made the resulting XML
very verbose, especially when the DTD is highly recursive.
We believe it is impractical for there to be one fixed set of types
both because there are too many of them and because the set is
constantly growing. Thus data sources should be able to define their
own types. There should also be a way for collections of data sources
to share sets of types (so that each type means the same thing across
the collection of data sources).
We believe it is important for data sources to be able to describe
themselves. In particular, a data source should be able to return its
own schema as the result of some standard query. Presumably the
schema is returned in XML, possibly annotated with a style sheet for
viewing in regular web browsers. The schema would announce the types,
functions, predicates and object classes supported by the data source
as well as giving the type of each data value exported.
Since schemas may evolve, a version id should be associated with (and
contained in) each instance of the schema. Then a version id can be
sent with a query and an exception thrown if the schema has evolved in
an incompatible way.
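The version check might be sketched as follows. The class, the
exception name, the notion of an explicit set of compatible versions
and the shape of the result are all assumptions made for the sake of
the example; a real data source would decide compatibility in its own
way.

```python
class SchemaVersionError(Exception):
    """Raised when a query was written against an incompatible schema."""


class DataSource:
    def __init__(self, schema_version, compatible_versions):
        self.schema_version = schema_version
        # Versions whose queries are still valid against the current schema.
        self.compatible_versions = set(compatible_versions) | {schema_version}

    def schema(self):
        """Return the schema as XML, carrying its own version id."""
        return f'<schema version="{self.schema_version}"/>'

    def query(self, text, schema_version):
        """Evaluate a query (stubbed out here) after checking its version id."""
        if schema_version not in self.compatible_versions:
            raise SchemaVersionError(
                f"query uses schema {schema_version}, "
                f"data source is at {self.schema_version}"
            )
        return f'<result schema-version="{self.schema_version}"/>'


source = DataSource("2.1", compatible_versions=["2.0"])
ok = source.query("some query", schema_version="2.0")
```

A client that cached version 1.0 of the schema would get an exception
rather than a silently wrong answer, and could then fetch the schema
again.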
It is desirable to be able to carry arbitrary text transparently
within an XML document. XML's CDATA section only allows text that
doesn't contain the CDEnd string ("]]>"). What is needed is a simple
reversible mapping that takes arbitrary text into text without CDEnd.
Then to carry text data within XML, apply the mapping and wrap it in a
CDATA section. To retrieve the data, extract it from the CDATA
section and apply the reverse mapping. (It would be nice if this
mapping also correctly handled end-of-line mapping and charset and
character encoding issues.)
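One candidate for such a mapping, offered only as a sketch, is to
escape "&" and ">" in the style of XML entity references. The encoded
text then contains no ">" at all, so it cannot contain CDEnd. The
function names are hypothetical, and the sketch does not address
end-of-line, charset or character encoding issues, nor characters that
XML does not allow at all.

```python
CD_START = "<![CDATA["
CD_END = "]]>"


def encode_text(text):
    """Map arbitrary text to a CDATA section whose content has no CDEnd."""
    # Escape "&" first so that the "&" introduced by "&gt;" stays unambiguous.
    escaped = text.replace("&", "&amp;").replace(">", "&gt;")
    return CD_START + escaped + CD_END


def decode_section(section):
    """Reverse encode_text: strip the CDATA wrapper and undo the escaping."""
    if not (section.startswith(CD_START) and section.endswith(CD_END)):
        raise ValueError("not a CDATA section")
    escaped = section[len(CD_START):-len(CD_END)]
    # Undo the replacements in the opposite order.
    return escaped.replace("&gt;", ">").replace("&amp;", "&")


original = "x[y[0]]>1 && z &gt; 0 ]]"
section = encode_text(original)
restored = decode_section(section)
```

The price of this mapping is that "&" and ">" are no longer literal
inside the section, so the receiver must know that the mapping was
applied.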
It is also desirable to carry binary data transparently. One approach
is to define a reversible mapping from binary data to text without
CDEnd (like MIME's Base64) and encapsulate within CDATA. Another
approach is to use some sort of multipart representation so that
binary data can be carried in its native form over 8-bit transports.
(In either case, it would be nice if the encoding could carry a MIME
type with the data.)
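The first approach can be sketched in Python with the standard Base64
codec, whose output alphabet contains neither "]" nor ">" and so can
never contain CDEnd. The element name "opaque" and its attributes are
hypothetical; they merely show one way a MIME type could travel with
the data.

```python
import base64
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr


def wrap_binary(data, mime_type):
    """Encode bytes as Base64 inside a CDATA section, tagged with a MIME type."""
    encoded = base64.b64encode(data).decode("ascii")
    return (
        f"<opaque mime-type={quoteattr(mime_type)} encoding=\"base64\">"
        f"<![CDATA[{encoded}]]></opaque>"
    )


def unwrap_binary(xml_text):
    """Return (mime_type, bytes) from the output of wrap_binary."""
    element = ET.fromstring(xml_text)
    # An empty CDATA section yields no text, hence the fallback to "".
    return element.get("mime-type"), base64.b64decode(element.text or "")


payload = bytes(range(256))                 # every possible byte value
wrapped = wrap_binary(payload, "application/octet-stream")
mime_type, recovered = unwrap_binary(wrapped)
```

Base64 expands the data by about a third, which is the cost the
multipart approach would avoid.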
It is useful for queries to be data, so that
2. Approach
3. Role of XML
4. Role of a Query Protocol
5. Properties of the Data Model
5.1. Support for Opaque Data
5.2. Support for User-defined Operations
5.3. Support for Higher-level Data Types
5.3.1. The role of types
5.3.2. Adding higher-level types for CDATA
5.3.3. Adding higher-level types for opaque data
5.3.4. Allowing elements to be treated as "objects"
5.3.5. Allowing element "choices" to be named
5.3.6. User-defined types
5.4. Self-Identifying Data Sources
6. Other Suggestions
6.1. Data transparency
6.2. Queries as Data