Query Language Requirements for a Distributed GeoSpatial Data Clearinghouse
John D. Evans <email@example.com>
M.I.T. Dept. of Urban Studies & Planning
Federal Geographic Data Committee
The U.S. Federal Geographic Data Committee (FGDC) works to facilitate the
sharing of geographic information through a National
Spatial Data Infrastructure [1].
One component of this effort is a distributed directory of geospatial data
known as the National Geospatial Data Clearinghouse [2].
Through this Clearinghouse, users can find out what geospatial data exist,
locate the data collections they need, evaluate their usefulness for particular
applications, and retrieve or order them from providers.
This Clearinghouse works through search and retrieval across distributed
repositories of geospatial metadata [3],
based on the ANSI/NISO Z39.50 [4]
search-and-retrieval protocol and the FGDC/ASTM Content
Standard for Digital Geospatial Metadata [5]. (A 1994 executive
order [6] requires all
U.S. federal agencies to document their geospatial data according to this
Standard.) Since its early days in 1994, the Clearinghouse has relied
on both full-text searches and "fielded search" on SGML versions of these
metadata files, using an FGDC Metadata
DTD (Document Type
Definition) [7]. This metadata structure, and the common Z39.50 protocol, have
proven effective in getting the Clearinghouse effort underway: it now counts
close to 100 nodes, maintained by independent organizations around the world.
However, further growth and broader interoperability of the Clearinghouse
will demand a more generic approach: Z39.50 fielded search against an SGML
structure leaves much of the needed functionality to client and server
components, thus limiting their diversity and slowing the growth and impact
of the distributed Clearinghouse. What is needed is a more general set of
spatially enabled "catalog services" within a heterogeneous environment.
This in turn will require a generalized query language that can support
the variety of spatial and alphanumeric searches and queries currently
performed through the Clearinghouse.
This position paper discusses requirements of geospatial data and metadata
query, including some that may be less apparent in other domains: inequality
operators for spatial search, nesting and recursion of queries, aggregation
functions, and the expression of a query's context.
Spatial search and inequality operators
One important requirement for Clearinghouse queries is spatial search,
which chooses data or metadata elements based on their geographic location.
For example, the query might define a "search rectangle," as depicted here
using the U.S. Navy's Master Environmental Library (MEL) interface. This
query would need to find, say, an aerial photograph whose metadata lists
the following bounding coordinates:
These coordinates, often expressed in degrees of latitude and longitude,
define a "data rectangle" which must be compared to the one given in the
user's query. This sort of comparison, within a continuous 2-D or 3-D space,
relies on the inequality operators (<=, >=, <, and >). A query language
that supports only simple string pattern-matching will be unable to handle
such comparisons.
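The comparison of a data rectangle with a search rectangle can be sketched in a few lines of Python. This is only an illustration of why inequality operators are needed, not a description of any Clearinghouse implementation: the BoundingBox type, the function names, and all coordinates are hypothetical, and the sketch assumes that neither rectangle crosses the 180-degree meridian.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """Rectangle in decimal degrees (hypothetical type for this sketch)."""
    west: float
    east: float
    south: float
    north: float


def overlaps(data: BoundingBox, search: BoundingBox) -> bool:
    """True if the two rectangles share at least one point.

    Assumes neither rectangle crosses the 180-degree meridian.
    """
    return (data.west <= search.east and data.east >= search.west
            and data.south <= search.north and data.north >= search.south)


def contains(search: BoundingBox, data: BoundingBox) -> bool:
    """True if the data rectangle lies entirely inside the search rectangle."""
    return (search.west <= data.west and data.east <= search.east
            and search.south <= data.south and data.north <= search.north)


# Made-up coordinates, chosen only to exercise the comparisons.
search_rectangle = BoundingBox(west=-72.0, east=-70.0, south=41.0, north=43.0)
photo = BoundingBox(west=-71.2, east=-71.0, south=42.3, north=42.4)
far_away = BoundingBox(west=-123.0, east=-122.0, south=37.0, north=38.0)

print(overlaps(photo, search_rectangle))
print(overlaps(far_away, search_rectangle))
```

Every test here is a numeric inequality on a coordinate value; none of them can be expressed as a string pattern match on the metadata text.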
Nested and recursive queries
Another important aspect of geospatial metadata: the lineage and processing
history of data often determine their adequacy for a particular use. Although
this is true of any data, the problem is especially important with geographic
data, which are approximate representations of real-world features (e.g.,
coastlines, roads, or watersheds). The precision and accuracy of this approximation
vary across different data sources, and often change as the data undergo
various processing steps (thinning, resampling, projection, etc.). Thus,
metadata search and query in the geospatial domain should be able to trace
back through the "family tree" of a data resource. This may require nesting
or recursion of queries across several metadata documents (not unlike a
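As a rough sketch of what tracing a data resource's "family tree" involves, the following Python walks a small in-memory catalog recursively. The catalog, its dataset identifiers, and the ancestors function are all hypothetical; a real Clearinghouse query would fetch each metadata document from a possibly different server rather than from a dictionary.

```python
from typing import Dict, Iterator, List, Optional, Set

# Hypothetical catalog: dataset id -> ids of the sources it was derived from.
CATALOG: Dict[str, List[str]] = {
    "mosaic": ["photo_a", "photo_b"],
    "photo_a": ["film_scan_a"],
    "photo_b": [],
    "film_scan_a": [],
}


def ancestors(dataset_id: str,
              catalog: Dict[str, List[str]],
              seen: Optional[Set[str]] = None) -> Iterator[str]:
    """Yield every dataset that dataset_id was directly or indirectly derived from.

    The seen set guards against cycles and repeated visits.
    """
    if seen is None:
        seen = {dataset_id}
    for source_id in catalog.get(dataset_id, []):
        if source_id in seen:
            continue
        seen.add(source_id)
        yield source_id
        # Recurse into the source's own lineage.
        yield from ancestors(source_id, catalog, seen)


print(list(ancestors("mosaic", CATALOG)))
```

A query language without nesting or recursion would force the client to issue one query per generation and stitch the results together itself.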
New spatial metadata is required whenever a new piece of data is derived
from one or more sources. For instance, an image mosaic assembled
from adjacent aerial photographs, as depicted below, needs a composite
metadata document built from the original metadata document(s).
Each of the four source images shown here has a metadata
document listing the date of photography, as in the following example:
Metadata for the image mosaic should include a date range obtained by taking
the minimum and maximum of the date field of the individual sources. In general,
creating composite metadata documents will require a full set of aggregation
operators: depending on their semantics, different fields might be summarized
by a count, sum, or mean. All of these aggregate operators
(and perhaps others as well) will be important to the proper handling of
geographic information and its metadata.
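The aggregation step can be illustrated with a minimal Python sketch. The four source records, their field names (photo_date, area_km2, ground_resolution_m), and their values are invented for this example and are not taken from any actual metadata document.

```python
from datetime import date
from statistics import mean

# Hypothetical metadata for four source photographs.
source_documents = [
    {"id": "photo_nw", "photo_date": date(1996, 4, 12), "area_km2": 25.0, "ground_resolution_m": 1.0},
    {"id": "photo_ne", "photo_date": date(1996, 4, 12), "area_km2": 25.0, "ground_resolution_m": 1.0},
    {"id": "photo_sw", "photo_date": date(1996, 5, 3), "area_km2": 25.0, "ground_resolution_m": 2.0},
    {"id": "photo_se", "photo_date": date(1997, 3, 20), "area_km2": 25.0, "ground_resolution_m": 2.0},
]


def summarize(sources):
    """Build composite metadata for a mosaic from its sources' metadata."""
    dates = [doc["photo_date"] for doc in sources]
    return {
        "begin_date": min(dates),    # earliest photography date
        "end_date": max(dates),      # latest photography date
        "source_count": len(sources),
        "total_area_km2": sum(doc["area_km2"] for doc in sources),
        "mean_ground_resolution_m": mean(doc["ground_resolution_m"] for doc in sources),
    }


mosaic_metadata = summarize(source_documents)
print(mosaic_metadata)
```

Which operator suits which field depends on the field's meaning: a date becomes a range, an area is summed, and a resolution might be averaged or, more conservatively, reported as its worst value.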
Expressing client-side and server-side context
Many kinds of distributed queries require some means of expressing the
background information, or context [8], needed to properly understand the
query. For example, a "shopping" query might need to express the currency in
which asking and selling prices are to be understood.
In the geographic domain, this context information can also be crucial:
spatial queries are meaningful only within a particular coordinate
system and planar projection (Mercator, Lambert, etc.), which must be expressed
unambiguously to Clearinghouse servers. For instance, as the accompanying
figure illustrates, a rectangular search area (or a satellite-image footprint)
that includes the North Pole is poorly expressed as a simple latitude-longitude
bounding box. Rather, the query's chosen projection should be communicated to the
server by means of a half-dozen geodetic and cartographic parameters. Omitting
this context information about a spatial query, or making simplifying assumptions,
can lead to unexpected query results.
[Figure: Some geospatial queries (bright blue) are poorly approximated by a
latitude-longitude bounding box (dark blue). Panels: North Polar Stereographic
projection; simple latitude-longitude grid (Equatorial Cylindrical Equidistant
projection). From Swick and Knowles [9].]
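The effect shown in the figure can be reproduced numerically. The following Python sketch takes a square search area centered on the North Pole in a polar stereographic plane, converts its outline to longitude and latitude, and computes the naive bounding box of those coordinates. It rests on several assumptions that are not from the original text: a spherical Earth of radius 6371 km, a projection with true scale at the pole, a 1000 km half-width, and hypothetical function names.

```python
import math

EARTH_RADIUS_KM = 6371.0  # spherical approximation (assumption of this sketch)


def polar_stereo_to_lonlat(x_km, y_km):
    """Inverse north polar stereographic projection on a sphere."""
    rho = math.hypot(x_km, y_km)  # distance from the pole in the map plane
    lat = 90.0 - 2.0 * math.degrees(math.atan(rho / (2.0 * EARTH_RADIUS_KM)))
    lon = math.degrees(math.atan2(x_km, -y_km))
    return lon, lat


def square_perimeter(half_width_km, steps_per_edge=100):
    """Yield points along the outline of a square centered on the pole."""
    a = half_width_km
    for i in range(steps_per_edge):
        t = -a + 2.0 * a * i / steps_per_edge
        yield (t, -a)   # bottom edge
        yield (a, t)    # right edge
        yield (-t, a)   # top edge
        yield (-a, -t)  # left edge


def naive_lonlat_box(points):
    """Bounding box (west, east, south, north) of the projected outline."""
    lons, lats = zip(*(polar_stereo_to_lonlat(x, y) for x, y in points))
    return min(lons), max(lons), min(lats), max(lats)


west, east, south, north = naive_lonlat_box(square_perimeter(1000.0))
print(f"lon {west:.1f} to {east:.1f}, lat {south:.1f} to {north:.1f}")
```

Under these assumptions the box spans nearly all longitudes, yet its northern edge falls short of 90 degrees, so the pole itself, which lies at the center of the search area, is outside the box. A server that receives only the box cannot recover the intended search area without knowing the projection.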
Finally, queries across the FGDC Clearinghouse share a number of requirements
with queries in other domains. The first is the ability to refer unambiguously
to hierarchically defined data items within complex structured documents.
For instance, queries should distinguish a photograph's publication date
from its photography date, even though both fields are named <caldate>
in their respective blocks. Second, in order to handle complex constraint
("where") clauses, the Boolean operators (or / and / not) will also be
needed, along with parentheses that define the order of evaluation. Third,
queries against one or more XML documents should themselves construct valid
XML documents, and not just return unstructured information.
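These three requirements can be illustrated together using Python's standard xml.etree.ElementTree module. The element nesting in the sample documents is a simplified, hypothetical stand-in for the FGDC Metadata DTD, and the matches and select functions are invented for this sketch.

```python
import xml.etree.ElementTree as ET

# Two simplified, hypothetical metadata documents. Both dates are stored in
# elements named <caldate>, distinguished only by their position in the tree.
DOC_A = """
<metadata id="photo_a">
  <citation><pubdate><caldate>19970115</caldate></pubdate></citation>
  <timeperd><sngdate><caldate>19960412</caldate></sngdate></timeperd>
</metadata>
"""
DOC_B = """
<metadata id="photo_b">
  <citation><pubdate><caldate>19960101</caldate></pubdate></citation>
  <timeperd><sngdate><caldate>19950601</caldate></sngdate></timeperd>
</metadata>
"""


def matches(doc):
    """Photographed during 1996 AND NOT published before 1997."""
    # Path expressions pick out one <caldate> or the other unambiguously.
    photographed = doc.findtext("timeperd/sngdate/caldate", default="")
    published = doc.findtext("citation/pubdate/caldate", default="")
    # YYYYMMDD strings of equal length compare correctly as text.
    return ("19960101" <= photographed <= "19961231") and not (published < "19970101")


def select(documents, predicate):
    """Return the matching documents wrapped in a well-formed <results> element."""
    results = ET.Element("results")
    for doc in documents:
        if predicate(doc):
            results.append(doc)
    return results


documents = [ET.fromstring(DOC_A), ET.fromstring(DOC_B)]
results = select(documents, matches)
print(ET.tostring(results, encoding="unicode"))
```

The query's output is itself an XML document, so it can be passed to a further query or to any other XML-aware tool.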
1. Federal Geographic Data Committee, 1998. National
Spatial Data Infrastructure.
2. Federal Geographic Data Committee, 1998. FGDC Geospatial
Data Clearinghouse Activity.
3. Federal Geographic Data Committee, 1998. FGDC Metadata.
4. Finnigan, Sonya, and Ward, Nigel, Z39.50 Made Simple.
5. Federal Geographic Data Committee, 1998. Content
Standard for Digital Geospatial Metadata.
6. Executive Office of the President, 1994. Executive
Order #12906: Coordinating Geographic Data Acquisition and Access: The
National Spatial Data Infrastructure.
7. Schweitzer, Peter, Nebert, Douglas, Miller, Eric, Hart,
Quinn, Frew, Jim, and Warnock, Archie, 1998. FGDC Metadata DTD 2.0.0.
8. The Context Interchange Project at MIT.
9. Swick, Ross S., and Knowles, Kenneth. Geographic
Database Search Interfaces and the Equatorial Cylindrical Equidistant Projection.