1. Introduction
Rather than attempt to be exhaustive, we will offer a few thoughts
below on some emphases that Oracle considers important in providing query
languages for the Web. Omission of a topic should not be taken to
imply that we do not consider it to be a requirement or a part of a feasible
approach.
Since this is a brief contribution to an initial workshop, we choose
to concentrate on database aspects of querying XML documents, where we
may be able to complement submissions being made from different perspectives.
2. Position
2.1 Simplicity and Commonality
We hope to see high priority given to simplicity of concepts and exploitation
of commonality. The success of the Web as a whole owes much to these
virtues, and XML is a simplification of SGML which promises to find a wide
range of applicability. Likewise in the query language world, the
simplicity of the relational model and the range of applicability of its
query concepts have been paramount in its widespread adoption. It
would be ironic indeed if in attempting to bring together these fields,
we lost sight of the key reasons for their successes.
2.2 Scope
The semantics of a query language for XML should ideally span the complete
range from querying a single document (as in XSL usage) to querying vast
collections of XML documents, since there is no sharp dividing line in
the spectrum, and no apparent technical obstacle to using a common semantic
model. Users benefit from ease of learning and use, and vendors benefit
from focusing technology development on a single model.
2.3 Semantics and Syntax
A major requirement, in our view, is a language with well-conceived
and well-defined semantics. The semantic definition should reach
at least the level of precision of the ANSI/ISO SQL standard, or might
even be more formal. Different syntaxes might then be used as appropriate
for different purposes, e.g. an XML syntax, or a path syntax as in XSL/XQL,
or an SQL/OQL-like syntax, and possibly others.
2.4 XML Data Model
Fundamental to query semantics is the precise definition of the underlying
data model assumed for an XML document (with or without DCD). Questions
such as the treatment of attributes, links, and element identity need to
be addressed (see Locatability below).
2.5 Query structure
Our hope would be that, in the interests of exploiting commonality,
a single general semantic form of query would satisfy all requirements
(as above, this could have various syntactic representations) - something
of this nature:
    select <result>
    from <source list>
    where <predicate>
Each source in the <source list> could be the name (or alias) of
a single object, or an SQL/OQL-like construct associating an alias with
each member of a collection of objects in turn. An object might be
of any type supported by the query language, including "XML", so that the
sources being queried might be all XML, or none of them XML, or a mixture
of XML and other sources. Even an object type other than XML might
have an attribute of type XML nested within it. Each source could
in general be specified by an expression, which might be, or contain, a
query of appropriate result type.
The <predicate> would then be used to filter each combination of
sources in turn, and for each combination for which the expression in the
<predicate> evaluated to true, the <result> would be computed, i.e.
the overall result in general would be a collection of evaluated <result>s.
Again, the expression in a <predicate> might contain queries.
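As an illustration only, the semantic form described above can be sketched in Python: every combination of sources is enumerated, the predicate filters the combinations, and the result expression is evaluated for each one that survives. The function name run_query, the aliases e and d, and the sample data are all our own assumptions for this sketch and are not part of any proposal; a real engine would of course not enumerate combinations naively.

```python
from itertools import product
import xml.etree.ElementTree as ET


def run_query(sources, predicate, result):
    """Evaluate a select/from/where query (hypothetical helper).

    sources   -- dict mapping an alias to a collection of objects
    predicate -- function taking a binding dict, returning True/False
    result    -- function taking a binding dict, returning any value
    """
    aliases = list(sources)
    out = []
    # Consider each combination of one member per source in turn.
    for combo in product(*(sources[a] for a in aliases)):
        binding = dict(zip(aliases, combo))
        if predicate(binding):
            out.append(result(binding))
    return out


# One XML source and one non-XML source (invented sample data).
doc = ET.fromstring(
    "<staff>"
    "<emp><name>Ann</name><dept>10</dept></emp>"
    "<emp><name>Bob</name><dept>20</dept></emp>"
    "</staff>"
)
departments = [(10, "Sales"), (20, "Research")]

rows = run_query(
    sources={"e": doc.findall("emp"), "d": departments},
    predicate=lambda b: int(b["e"].findtext("dept")) == b["d"][0],
    result=lambda b: (b["e"].findtext("name"), b["d"][1]),
)
```

Here the sources are a mixture of XML elements and ordinary tuples, and the overall result is a collection of evaluated results, one per qualifying combination.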
So what kind of expression is permitted in specifying the <result>?
Since this is where we have seen a number of different proposals, we will
consider this question next under its own heading.
2.6 Nature of Query Results
For some purposes in the XML world, a query may be intended to (literally)
select from its sources, rather than computing anything new from them.
That selection may be described as a collection of elements, or as an XML
document (with or without DCD?). The question of whether a pure selection
makes its own copy of the source, or is a reference to the original source,
is important enough that we will discuss it under the heading "Locatability"
below.
However, for other purposes, the result needs to be a newly constructed
or transformed XML document, or a piece of derived information such as
the sum of salary and commission values that might be wrapped to become
an XML document.
But in this last case, is it always desirable to add the XML wrapper
around a number, e.g. if the query is to be nested and its numerical result
is to be the operand of a comparison?
Our position is that we would like to encourage exploration of the
general approach that the <result> may be an expression of any type,
and that this determines (as is the case, e.g., in OQL) the overall result
type of the query, and where it may validly be used in a <source list>
or <predicate> or <result>. This is both for the convenient
use of nested queries when dealing solely with XML sources, and for the
smooth use of the same model when dealing with both XML and other data
types.
Where it is desired that the result be a constructed XML document,
then some kind of constructor function may be needed conceptually in the
result. (We say "conceptually" because we are still discussing the
semantic model here, and some syntactic forms might make the XML constructor
implicit.) If data extracted or computed from the sources are
to be inserted into some XML template templateN, say, then the constructor
could be implemented as a function templateN, and the query would select
templateN(...,...,...) as its result.
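A minimal sketch of this idea, under our own assumptions, is given below in Python. The same derived value (salary plus commission) is used unwrapped as the operand of a comparison in a nested context, and wrapped by a constructor function at the top level. The names total_pay and template_pay, the element name pay, and the sample data are hypothetical; template_pay merely plays the role of templateN.

```python
import xml.etree.ElementTree as ET


def total_pay(emp):
    """Scalar-valued result: salary plus commission as a plain number."""
    return float(emp.findtext("salary")) + float(emp.findtext("commission"))


def template_pay(name, total):
    """Hypothetical constructor standing in for templateN."""
    elem = ET.Element("pay")
    elem.set("name", name)
    elem.text = str(total)
    return elem


staff = ET.fromstring(
    "<staff>"
    "<emp><name>Ann</name><salary>1000</salary><commission>200</commission></emp>"
    "<emp><name>Bob</name><salary>900</salary><commission>50</commission></emp>"
    "</staff>"
)

# Nested use: the numeric result is the operand of a comparison,
# so no XML wrapper is wanted.
high_earners = [
    e.findtext("name") for e in staff.findall("emp") if total_pay(e) > 1000
]

# Top-level use: the same values are wrapped by the constructor.
wrapped = [
    template_pay(e.findtext("name"), total_pay(e)) for e in staff.findall("emp")
]
serialized = [ET.tostring(w, encoding="unicode") for w in wrapped]
```

The point of the sketch is that the result type follows from the result expression: a number where a number is selected, an XML element where the constructor is applied.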
2.7 Locatability
In some XML situations, it may be valuable for a query result to contain
references to elements of an XML document rather than copies of the elements.
Thus in XQL, it appears to be possible to navigate to ancestors of an element
in the tree structure of a document, where that element may be the result
of a query, which would not be possible if the result were a copy rooted
in that element. We would like to obtain a better understanding of
the concept of a reference to an XML element - is it the same as a possible
value of a link?
Defining the result of a pure selection to be a reference to an element
has the advantage of retaining maximal information, and it is always possible
then to discard information by dereferencing and making a copy if desired,
e.g. in a tool using the top-level result of a query.
However, the most striking difference between copy and reference semantics
is of course when updates are possible, and this leads us to the next topic.
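The distinction can be made concrete with a small Python sketch, offered as an illustration rather than as a proposed design. It uses the standard xml.etree.ElementTree module, which keeps no parent pointers, so a parent map is built explicitly; the document content and variable names are our own assumptions.

```python
import copy
import xml.etree.ElementTree as ET

doc = ET.fromstring("<book><chapter><title>Intro</title></chapter></book>")

# Map each element of the original tree to its parent.
parent_of = {child: parent for parent in doc.iter() for child in parent}

ref = doc.find("./chapter/title")  # reference into the original tree
dup = copy.deepcopy(ref)           # detached copy rooted in that element

# Ancestors are reachable from the reference but not from the copy.
ref_parent = parent_of.get(ref)
dup_parent = parent_of.get(dup)

# An update made through the reference is seen in the document;
# the copy is unaffected.
ref.text = "Overview"
title_in_doc = doc.findtext("./chapter/title")
title_in_copy = dup.text
```

Dereferencing and copying (the deepcopy step) discards the position of the element in its document, which is why we suggest that the reference be the default result of a pure selection.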
2.8 Updatability
We have seen little discussion of updatability in the context of query
languages for XML, but it is of course a major consideration in the database
environment.
First, in scaling up to deal with large numbers of XML documents, the
query language can offer the same kind of set-at-a-time declarative power
for insertion, update, and deletion as it does for retrieval.
Even if updates to XML documents in a database are being made via an
XML editor, there are the more general database mechanisms of authorization
and transaction management to be taken into account.
The question of copy versus reference semantics for a query also becomes
entwined with update when, as in SQL, a cursor is opened to iterate over
the set of results from a query, or when a query is used to define a view,
and updates are performed via the cursor or the view.
We believe that all these aspects of updatability need to be addressed
when specifying an XML query language that will scale into the database
environment.
2.9 Querying Text
Besides providing structural query capability over XML features such as
element types and attributes, an XML query language should also incorporate
powerful text-searching capabilities over the textual content of XML documents.
See, e.g., the SQL/MM FullText extensions to SQL3.
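To illustrate how a structural condition and a text condition might combine in one predicate, here is a deliberately simple Python sketch. It does not implement SQL/MM or any real text engine: the helper contains_words is hypothetical, performs only whole-word, case-insensitive matching, and the sample documents are invented.

```python
import xml.etree.ElementTree as ET


def contains_words(elem, words):
    """True if every word occurs in the element's textual content.

    Hypothetical helper: text nested in child elements is included,
    and matching is on whitespace-separated, lower-cased tokens.
    """
    text = " ".join(elem.itertext()).lower()
    tokens = set(text.split())
    return all(w.lower() in tokens for w in words)


library = ET.fromstring(
    "<library>"
    "<article year='1998'><title>Query languages</title>"
    "<body>Relational engines can be <b>extended</b> for XML.</body></article>"
    "<article year='1997'><title>Markup</title>"
    "<body>SGML predates XML.</body></article>"
    "</library>"
)

# Structural condition (attribute value) and text condition together.
hits = [
    a.findtext("title")
    for a in library.findall("article")
    if a.get("year") == "1998"
    and contains_words(a.find("body"), ["extended", "relational"])
]
```

Note that the text condition ignores markup inside the body, so the word inside the nested element is still found.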
2.10 Spanning XML and Other Data
We do not see XML rapidly ousting all other forms of storing information
in databases, and hence we see not only an immediate need, but also a long-term
need, for queries to be able to span XML and other data. Fortunately,
the general model we have been espousing above lends itself to this, treating
XML as one type of data among many, each data type having its own functions
and operators usable in expressions.
2.11 Quality, Scalability, and Economy
To take advantage of widespread existing knowledge of querying databases,
and for reasons of quality, scalability, and economy in implementation,
it is very desirable to be able to extend the powerful, robust, and efficient
existing database query engines to support XML queries. This will
also facilitate integration of queries over both XML documents and other
database data.
We believe this evolutionary approach benefits users as much as vendors,
in that it is the only way that today's technological expectations of a
database system can be carried forward into the XML world. It would
take several years to build new database systems that surpass today's relational
and object-relational systems, which have been maturing for 20 years, and
meanwhile today's systems would have moved further ahead. Moreover,
the integration of XML and non-XML data afforded by extending existing
database systems would be hard to match when building new systems.
2.12 Non-Requirement
We assume it is not an initial requirement to define a user-friendly,
search-engine style of query language, since there is unlikely to be early
consensus on exactly what results should be returned. However, such
engines can be built on top of programming interfaces to a query language
with well-defined semantics, and experience may ultimately lead to consensus
on a higher-level search language.