Experiences Designing Query Languages for Hierarchically Structured Text Documents

Experiences Designing Query Languages for
Hierarchically Structured Text Documents
Dr. Mary Ann Malloy
John C. Schneider
The MITRE Corporation

Background

The MITRE Corporation is an independent, not-for-profit company that works as a strategic partner with its government clients to address public interest issues of critical importance. MITRE was established in 1958 by employees from the Massachusetts Institute of Technology (MIT) Lincoln Laboratory to integrate the first electronic command and control (C2) system, designed for continental air defense. Although the first three letters of MITRE's name help to honor its origin, the name was chosen for its meaning, "the fitting together of surfaces."

C2 Information Interoperability

Today, command and control has become a thriving industry generating thousands of individual information systems in the United States and across the globe. The challenge of "fitting together surfaces", or in common terms, interoperability, has grown exponentially with the number of systems.

One of the most challenging aspects of this problem is information interoperability, or the ability to exchange information services. To be effective in today's military environment, a wide variety of dissimilar information systems, developed independently by different countries must share timely, accurate, complete information in a form that is mutually understandable. They must do so within the competing force's decision cycle despite differences in cooperating force cultures, command structures, operational procedures and languages.

Message Text Formats

To address the C2 information interoperability problem, the United States and its Allies have invested a great deal of time and resources formalizing information standards to reduce the ambiguity of natural language and increase opportunities for automation. One of the most prolific of these standards is the Message Text Format (MTF) standard, which began almost 30 years ago. Today, the United States MTF (USMTF) program governs a significant portion of all structured text information exchanged to support the full spectrum of U.S. military operations, including intelligence, air operations, fire support, maritime operations, logistics, medical, etc. It includes an actively managed repository of meta-data describing over 300 classes of hierarchically structured, strongly typed, text documents sharing over 5500 standard data element definitions. The North Atlantic Treaty Organization (NATO) MTF program manages over 225 additional document types to support a similar array of NATO operations.

MITRE's Role

For over 15 years, MITRE has assisted the government by conceiving, exploring and prototyping proof-of-concept technologies to exploit the MTF standard. Many of these concepts and technologies mirror those being developed or considered by the WWW Consortium (W3C) community today for the Extensible Markup Language (XML). For example, MITRE's contributions to the MTF standard include:

A validating/non-validating parser, which also provides selective validation
An object-oriented, hierarchical document and meta-document object model
A validating document creation and editing system
A query language for extracting and transforming document elements
A language for specifying and enforcing constraints between arbitrary document elements
A document processing system for extracting and delivering document elements in accordance with registered client application requests
A hypertext help system for browsing meta-data repositories
A document compression technology that exploits shared meta-data to achieve unprecedented savings in bandwidth constrained environments.

MITRE's research in these areas has resulted in several successes, lessons learned and new ideas that have yet to be explored. In addition, MITRE has collected a wealth of anecdotal knowledge analyzing the deployment of these technologies in complex, dynamic, real-world environments.

MITRE is interested in sharing this experience as a way to expedite the evolution of XML and its companion technologies toward a cost effective set of standards for integrating the collective multi-national C2 system. MITRE believes the adoption of XML by the MTF community will result in tremendous cost savings, improved technologies and an unprecedented tempo of evolution.

Query Language Features

This section distills some of MITRE's experiences with MTF standards into a collection of recommended WWW query language features. For the purpose of illustration, we refer to one of the query languages we designed, the Allied Query Language (AQL).

A Class of Query Languages

The future WWW information-space will contain an unparalleled collection of documents, applications and users representing a wide range of sophistication, structure and information requirements. Attempts to develop a single query language for the WWW could result in a mechanism that is too complex for the novice user browsing unstructured information and too cumbersome or inefficient for the automated agent analyzing strongly typed information sources. Therefore, we recommend a general class of query languages be developed, which includes specialized querying languages for unstructured, semi-structured and highly structured information. The members of the resulting family of query languages should be capable of invoking one another's services.

Simplicity

The query language(s) should be simple and easily accessible to those who have used query languages in other contexts. To accomplish this goal in AQL, we adopted a syntax similar to the Structured Query Language (SQL), including a SELECT clause for specifying elements of interest, a FROM clause for specifying the domain of interest and a WHERE clause for constraining the result set. We provided consistent, simple extensions to this model as required. The resulting query language makes it easy to define relational views over hierarchical documents.

Robust Element Addressing

The element addressing scheme should facilitate the creation of queries that are sufficiently general to apply to a wide class of document types, including the class of document types defined by successive versions of a single DTD or Schema. Based on our experience, we expect document types to be configuration managed as a succession of versions driven by user requirements, where releases often occur more frequently than the associated software updates. It is therefore important for queries to provide a sufficient abstraction from differences in a class of document types, especially when the query user is uninterested in those differences.

The most general element addresses, which apply to the widest range of document types, specify the smallest amount of context. Therefore, we recommend that element addresses implicitly match any unspecified context to facilitate building queries that are less likely to break due to document type revisions. For example, the element address "author" specifies no context and can therefore be used to match all "author" elements contained in any version of any document. As needed, additional context can be provided to select a particular type of author, such as "newspaper-article/author" or "/project/software-module/author". As indicated by these examples, we believe the element addressing scheme documented in the August 1998 XSL working draft is sufficiently robust.

Unambiguous Element Addressing

The XML 1.0 recommendation allows DTDs to declare sibling elements with the same name. For example, consider the following DTD fragment:

<!ELEMENT road_race (name, start_time, location, end_time, location)>

Note that the two "location" sub-elements of "road_race" share identical names and thus, identical paths. I.e., the path "/road_race/location" identifies both the start location and the end location of the road race. Although the merits of this element declaration are arguable, the declaration is legal. And although we may not have influence to "improve" the declaration, we may need to differentiate between the start and end locations in a query. Therefore, the element addressing scheme should provide a method of differentiating between independent sibling elements with the same name.

There are at least two methods to achieve this differentiation. First, a sibling qualifier could be used to specify the desired element by its context. For example, the element address "start_time[next()]" might be used to identify the element immediately following the "start_time" element. Second, a numerical position qualifier could be used to specify the desired element according to its sequence relative to other siblings with the same name. For example, the element address "location[2]" might be used to identify the second use of "location" in the declaration. Note, the requirement for sibling qualifiers is documented as an open issue in the August 1998 XSL working draft.

Search Domain Specification

The domain of all WWW documents is very large. User's may wish to restrict the domain of a WWW query to a single document or a larger collection of documents. The domain could be restricted by establishing constraints on the document source, modification time, type, etc. For example, consider the following domain specifications:

<xql:from uri="http://my.business.com/employees.xml"> 
<xql:from uri="http://my.hobby.com"> 
<xql:from email="quotes@my.stocks.com"> 
<xql:from email="Accounting Department">
<xql:from time="after 30JUN1998"> 
<xql:from time="last month">
<xql:from doctype="http://my.repository.org/widget.dtd">

Using Boolean operators, domain specifications could be combined to express domains such as "All budget documents published by the finance department last month" or "All quarterly statement documents I receive from my mutual fund by e-mail."

We consider the capability to specify the time domain particularly important for web applications. The web information-space is constantly changing. Therefore, it will sometimes be useful to restrict the query domain to documents that have been updated within a given timeframe to insure the integrity of linked information.

Flexible Query Result Formats

The query language must support a wide variety of users with diverse needs; therefore, it will require flexible options for specifying the query result format. While a casual user may require the query results in plain text with no intervening markup, a database agent may require a relational view of the query results. In addition, there will be users that wish to apply arbitrary transformations to the query result, including element reordering, insertion of intervening text and algorithmic transformations via in-line functions. At a minimum, we recommend the query language support query results formatted as text, a relational table or XML (possibly post-processed using XSL). Of critical importance is the ability to use the query language to map the hierarchical XML structure onto a relational view to facilitate integration with existing databases.

Error Handling

In the rapidly changing WWW environment, robust tools are a necessity. Each DTD or Schema modification could undoubtedly result in a large collection of invalid documents. A successful query infrastructure will provide a mechanism for detecting, reporting and sometimes providing useful results from invalid documents. To support this functionality, the query language itself should provide a mechanism for specifying the level of validity checking and reporting the user desires. Likewise, the query result should provide a mechanism for reporting requested error information to the user.

In the AQL, we provided a basic error handling mechanism with an interesting twist. Users can specify whether they are interested in all validation results, no validation results or only the validation results for the elements selected by the query. The latter feature is very useful in isolating systems from document type changes, which have no impact on the information needed from the document (e.g., new sibling elements). The query result can report whether there are errors in the document, errors in the query result, or both. Using this construct, completely usable results can be extracted from invalid documents when the validity errors occur outside the requested elements.

Query Directed Processing

Executing ad-hoc queries against the web information-space is a formidable task when considering the potential size of the domain. Consequently, we expect query performance to be a recurring issue. One of the ways we alleviated the performance problem in AQL was by using a form of lazy evaluation, called query-directed parsing and validation. When enabled, this feature focused the parsing and validation engines on the minimal subset of tasks necessary to extract and validate the information selected by the query. This improved performance searching large, complex domains.

Searching Document Subsets

At times, the query user is interested in extracting several sub-elements from within a particular document element. To simplify the this task in AQL, we provided a WITHIN clause to re-focus the scope of the query from the entire document to a specified element. We recommend a similar construct to simplify the task of the query designer.

Summary

The U.S. and its Allies have invested a great deal of time and resources to develop widespread, military information standards to alleviate the C2 interoperability problem. For over 15 years, MITRE has assisted the military community by conceiving, exploring and prototyping proof-of-concept technologies for exploiting these standards. MITRE is interested in sharing these experiences to expedite the evolution of XML and its related standards toward a set of cost effective, mature, enabling technologies for C2 interoperability. This paper distills some of MITRE's experiences designing query languages for hierarchical text documents into a list of recommended WWW query language features. We sincerely hope this contribution proves useful in facilitating the development of a set of query languages for the WWW.