Query Language Requirements for a Distributed GeoSpatial Data Clearinghouse

John D. Evans <jdevans@mit.edu>
    M.I.T. Dept. of Urban Studies & Planning
    Federal Geographic Data Committee

The U.S. Federal Geographic Data Committee (FGDC) works to facilitate the sharing of geographic information through a National Spatial Data Infrastructure [1]. One component of this effort is a distributed directory of geospatial data known as the National Geospatial Data Clearinghouse [2]. Through this Clearinghouse, users can find out what geospatial data exist, locate the data collections they need, evaluate their usefulness for particular applications, and retrieve or order them from providers.

This Clearinghouse works through search and retrieval across distributed repositories of geospatial metadata [3], based on the ANSI/NISO Z39.50 search-and-retrieval protocol [4] and the FGDC/ASTM Content Standard for Digital Geospatial Metadata [5]. (A 1994 executive order [6] requires all U.S. federal agencies to document their geospatial data according to this Standard.) Since its early days in 1994, the Clearinghouse has relied on both full-text search and "fielded search" on SGML versions of these metadata files, using an FGDC Metadata DTD (Document Type Definition) [7]. This metadata structure and the common Z39.50 protocol have proven effective in getting the Clearinghouse effort underway: it now counts close to 100 nodes, maintained by independent organizations around the globe.

However, further growth and broader interoperability of the Clearinghouse will demand a more generic approach. Z39.50 fielded search against an SGML structure leaves much of the needed functionality to client and server components, which limits their diversity and slows the growth and impact of the distributed Clearinghouse. What is needed is a more general set of spatially enabled "catalog services" within a heterogeneous environment. This in turn will require a generalized query language that can support the variety of spatial and alphanumeric searches and queries currently performed through the Clearinghouse.

This position paper discusses requirements of geospatial data and metadata query, including some that may be less apparent in other domains: inequality operators for spatial search, nesting and recursion of queries, aggregation functions, and the expression of a query's context.

Spatial search and inequality operators

[Figure: Interface to the U.S. Navy's Master Environmental Library (MEL) at http://www-mel.nrlmry.navy.mil]
One important requirement for Clearinghouse queries is spatial search, which chooses data or metadata elements based on their geographic location. For example, the query might define a "search rectangle," as depicted here using the U.S. Navy's Master Environmental Library (MEL) interface. This query would need to find, say, an aerial photograph whose metadata lists the following bounding coordinates:
These coordinates, often expressed in degrees of latitude and longitude, define a "data rectangle" which must be compared to the one given in the user's query. This sort of comparison, within a continuous 2-D or 3-D space, relies on the inequality operators (<=, >=, <, and >). A query language that only supports simple string pattern-matching will be unable to handle spatial search.
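
The comparison between a search rectangle and a data rectangle can be sketched in a few lines of Python. This is only an illustration of why inequality operators are needed, not a description of any Clearinghouse server; the BoundingBox type, the function names, and all coordinate values are hypothetical, and the sketch ignores rectangles that cross the 180-degree meridian.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """Rectangle in decimal degrees (hypothetical structure, not an FGDC type)."""
    west: float
    east: float
    south: float
    north: float


def overlaps(query: BoundingBox, data: BoundingBox) -> bool:
    """True when the two rectangles share at least one point."""
    return (query.west <= data.east and data.west <= query.east
            and query.south <= data.north and data.south <= query.north)


def contains(query: BoundingBox, data: BoundingBox) -> bool:
    """True when the data rectangle lies entirely inside the search rectangle."""
    return (query.west <= data.west and data.east <= query.east
            and query.south <= data.south and data.north <= query.north)


# Made-up values for illustration only.
search_area = BoundingBox(west=-72.0, east=-70.0, south=41.0, north=43.0)
air_photo = BoundingBox(west=-71.2, east=-71.0, south=42.3, north=42.4)
elsewhere = BoundingBox(west=-123.0, east=-122.0, south=37.0, north=38.0)

matches = [box for box in (air_photo, elsewhere) if overlaps(search_area, box)]
```

Every test above is a numeric comparison on a continuous axis; none of them can be expressed as string pattern-matching on the coordinate text.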

Nested and recursive queries

Another important aspect of geospatial metadata: the lineage and processing history of data often determine their adequacy for a particular use. Although this is true of any data, the problem is especially important with geographic data, which are approximate representations of real-world features (e.g., coastlines, roads, or watersheds). The precision and accuracy of this approximation vary across different data sources, and often change as the data undergo various processing steps (thinning, resampling, projection, etc.). Thus, metadata search and query in the geospatial domain should be able to trace back through the "family tree" of a data resource. This may require nesting or recursion of queries across several metadata documents (not unlike a Web-indexing "spider").
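
A recursive lineage query of this kind might look roughly like the following Python sketch. It assumes, purely for illustration, that each metadata document lists the identifiers of its source documents and the processing step that produced it; the CATALOG dictionary, its field names, and the record identifiers are all hypothetical, and in a distributed setting each lookup would be a query to a remote server rather than a dictionary access.

```python
from typing import Dict, Iterator, Optional, Set, Tuple

# Hypothetical catalog: identifier -> metadata record.
CATALOG: Dict[str, dict] = {
    "mosaic": {"process": "mosaicking",
               "sources": ["photo_a_resampled", "photo_b"]},
    "photo_a_resampled": {"process": "resampling", "sources": ["photo_a"]},
    "photo_a": {"process": "aerial photography", "sources": []},
    "photo_b": {"process": "aerial photography", "sources": []},
}


def trace_lineage(doc_id: str,
                  catalog: Dict[str, dict],
                  depth: int = 0,
                  seen: Optional[Set[str]] = None
                  ) -> Iterator[Tuple[int, str, str]]:
    """Yield (depth, identifier, process) for a record and all its ancestors.

    The 'seen' set guards against cycles and repeated visits, much as a
    Web-indexing spider remembers the pages it has already fetched.
    """
    if seen is None:
        seen = set()
    if doc_id in seen or doc_id not in catalog:
        return
    seen.add(doc_id)
    record = catalog[doc_id]
    yield depth, doc_id, record["process"]
    for source_id in record["sources"]:
        # The nested query: the same search applied to each source document.
        yield from trace_lineage(source_id, catalog, depth + 1, seen)


family_tree = list(trace_lineage("mosaic", CATALOG))
```

A user could then filter the resulting family tree, for example to reject any data set whose ancestry includes a resampling step.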

Aggregation functions

New spatial metadata is required whenever a new piece of data is derived from one or more sources. For instance, an image mosaic assembled from adjacent aerial photographs, as depicted below, needs a composite metadata document built from the original metadata document(s).
[Figure: Mosaic of four air-photo excerpts from http://silo.mit.edu]
Each of the four source images shown here has a metadata document listing the date of photography, as in the following example:
Metadata for the image mosaic should include a date range obtained by taking the minimum and maximum of the date field across the individual sources. In general, creating composite metadata documents will require a full set of aggregation operators: depending on their semantics, different fields might be summarized by a count, sum, or mean. All of these aggregate operators (and perhaps others as well) will be important to the proper handling of geographic information and its metadata.
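
As a rough sketch of how such aggregation could produce composite metadata, the Python fragment below summarizes four source records. The records, their field names (photo_date, pixel_size_m), and all values are invented for illustration and do not come from the actual images.

```python
from datetime import date
from statistics import mean

# Hypothetical metadata for the four source photographs.
SOURCES = [
    {"title": "photo NW", "photo_date": date(1995, 4, 12), "pixel_size_m": 0.5},
    {"title": "photo NE", "photo_date": date(1995, 4, 12), "pixel_size_m": 0.5},
    {"title": "photo SW", "photo_date": date(1995, 6, 3), "pixel_size_m": 1.0},
    {"title": "photo SE", "photo_date": date(1996, 5, 20), "pixel_size_m": 1.0},
]

# The aggregate operators named in the text, mapped to Python built-ins.
AGGREGATES = {"min": min, "max": max, "count": len, "sum": sum, "mean": mean}


def aggregate(records, field, op):
    """Apply one aggregate operator to a single field across all records."""
    return AGGREGATES[op]([record[field] for record in records])


def composite_metadata(records):
    """Build a composite record; which operator suits which field is a
    semantic choice (dates get a range, pixel sizes a mean and a maximum)."""
    return {
        "source_count": aggregate(records, "title", "count"),
        "date_range": (aggregate(records, "photo_date", "min"),
                       aggregate(records, "photo_date", "max")),
        "mean_pixel_size_m": aggregate(records, "pixel_size_m", "mean"),
        "coarsest_pixel_size_m": aggregate(records, "pixel_size_m", "max"),
    }


mosaic_metadata = composite_metadata(SOURCES)
```

The choice of operator per field is the part a query language cannot infer on its own: a mean date would be meaningless here, while a date range is exactly what a user of the mosaic needs.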

Expressing client-side and server-side context

Many kinds of distributed queries require some means of expressing the background, or context [8], information needed to properly understand the query. For example, a "shopping" query might need to express the currency in which asking and selling prices are to be understood.

[Figure: The same query shown in a North Polar Stereographic projection and on a simple latitude-longitude grid (Equatorial Cylindrical Equidistant projection). Some geospatial queries (bright blue) are poorly approximated by a latitude-longitude bounding box (dark blue). From Swick and Knowles [9].]
In the geographic domain, context information can be just as crucial: spatial queries are meaningful only within a particular coordinate system and planar projection (Mercator, Lambert, etc.), which must be expressed unambiguously to Clearinghouse servers. As illustrated above, a rectangular search area (or a satellite-image footprint) that includes the North Pole is poorly expressed as a simple latitude-longitude bounding box. Rather, the query's chosen projection should be communicated to the server by means of a half-dozen geodetic and cartographic parameters. Omitting this context information about a spatial query, or making simplifying assumptions, can lead to unexpected query results [9].
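
The polar case can be checked numerically with a short Python sketch. It makes several simplifying assumptions that are not part of the original argument: a spherical Earth of radius 6371 km, a north polar stereographic projection with its plane tangent at the pole, and a hypothetical square search area 2000 km on a side, centered on the pole. All function and constant names are illustrative.

```python
import math

EARTH_RADIUS_KM = 6371.0   # spherical approximation (assumption)
HALF_SIDE_KM = 1000.0      # hypothetical square search area, centered on the pole


def polar_stereographic(lat_deg: float, lon_deg: float):
    """Project (lat, lon) to planar (x, y) in km, plane tangent at the North Pole."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    r = 2.0 * EARTH_RADIUS_KM * math.tan(math.pi / 4.0 - lat / 2.0)
    return r * math.sin(lon), -r * math.cos(lon)


def latitude_at_distance(r_km: float) -> float:
    """Inverse of the radial part: latitude of a point r_km from the pole."""
    return math.degrees(math.pi / 2.0 - 2.0 * math.atan(r_km / (2.0 * EARTH_RADIUS_KM)))


def in_square(x: float, y: float) -> bool:
    """Membership test in the projected (planar) search area."""
    return abs(x) <= HALF_SIDE_KM and abs(y) <= HALF_SIDE_KM


# Smallest latitude-longitude box that holds the whole square: it must reach
# the square's corners and span every longitude, so it is really a polar cap.
BBOX_SOUTH = latitude_at_distance(HALF_SIDE_KM * math.sqrt(2.0))


def in_latlon_bbox(lat_deg: float) -> bool:
    """The bounding box covers all longitudes, so only latitude matters."""
    return lat_deg >= BBOX_SOUTH


# A point at 79 N, 90 E lies beyond the square's edge but inside the box.
x, y = polar_stereographic(79.0, 90.0)
false_hit = in_latlon_bbox(79.0) and not in_square(x, y)
```

Under these assumptions the bounding box reaches down to roughly 77 degrees north, while the square's edges lie near 81 degrees north at their midpoints, so a server that receives only the bounding box would return data the user never asked for.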

Other requirements

Finally, queries across the FGDC Clearinghouse share a number of requirements with queries in other domains. First, they need the ability to refer unambiguously to hierarchically defined data items within complex structured documents. For instance, queries should distinguish a photograph's publication date from its photography date, even though both fields are named <caldate> in their respective blocks. Second, in order to handle complex constraint ("where") clauses, the Boolean operators (or / and / not) will also be needed, along with parentheses that define the order of evaluation. Third, queries against one or more XML documents should themselves construct valid XML documents, and not just return unstructured information.
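
These three requirements can be illustrated together with Python's standard xml.etree.ElementTree module. The sketch is not a proposed query language: the document structure below is a simplified, hypothetical stand-in for FGDC metadata (only the <caldate> name comes from the text above), and the element names in the result document are likewise invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified metadata documents. Dates are YYYYMMDD strings,
# so plain string comparison orders them chronologically.
DOCS = [ET.fromstring(text) for text in (
    "<metadata><title>Photo A</title>"
    "<citation><pubdate><caldate>19960115</caldate></pubdate></citation>"
    "<timeperd><sngdate><caldate>19950412</caldate></sngdate></timeperd>"
    "</metadata>",
    "<metadata><title>Photo B</title>"
    "<citation><pubdate><caldate>19950412</caldate></pubdate></citation>"
    "<timeperd><sngdate><caldate>19940620</caldate></sngdate></timeperd>"
    "</metadata>",
)]

# Full paths tell the two <caldate> fields apart.
PUB_DATE = "citation/pubdate/caldate"
PHOTO_DATE = "timeperd/sngdate/caldate"


def photographed_1995_published_later(docs):
    """Return a <results> document rather than unstructured text."""
    results = ET.Element("results")
    for doc in docs:
        photographed = doc.findtext(PHOTO_DATE)
        published = doc.findtext(PUB_DATE)
        # Boolean operators with parentheses fixing the order of evaluation.
        if (photographed >= "19950101" and photographed <= "19951231") \
                and not (published < "19960101"):
            hit = ET.SubElement(results, "hit")
            ET.SubElement(hit, "title").text = doc.findtext("title")
            ET.SubElement(hit, "photographed").text = photographed
    return results


result_xml = ET.tostring(photographed_1995_published_later(DOCS), encoding="unicode")
```

Photo B contains a <caldate> of 19950412, but as a publication date; a query that matched on the bare element name would wrongly return it, while the path-qualified query returns only Photo A.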


1. Federal Geographic Data Committee, 1998. National Spatial Data Infrastructure.
2. Federal Geographic Data Committee, 1998. FGDC Geospatial Data Clearinghouse Activity.
3. Federal Geographic Data Committee, 1998. FGDC Metadata.
4. Finnigan, Sonya, and Ward, Nigel. Z39.50 Made Simple.
5. Federal Geographic Data Committee, 1998. Content Standard for Digital Geospatial Metadata.
6. Executive Office of the President, 1994. Executive Order #12906: Coordinating Geographic Data Acquisition and Access: The National Spatial Data Infrastructure.
7. Schweitzer, Peter, Nebert, Douglas, Miller, Eric, Hart, Quinn, Frew, Jim, and Warnock, Archie, 1998. FGDC Metadata DTD 2.0.0.
8. The Context Interchange Project at MIT.
9. Swick, Ross S., and Knowles, Kenneth. Geographic Database Search Interfaces and the Equatorial Cylindrical Equidistant Projection.