Position Paper: DSTC Requirements for a Web Query Language

Background

The DSTC has a number of resource discovery products, including: metadata aware search engines, metadata repositories and distributed search services. All of these can be queried via the web. Some of them in turn query other web databases. A standard web query interface would allow better interoperability between our products and other web based metadata software.

This document outlines our requirements for a web query language for these products.

Some Scenarios

We want to support structured boolean querying of metadata repositories.

Example 1
Find all records with Creator equal to "J. Smith" and Date after "January 1997".
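
As an illustration only, Example 1 could be modelled as a small tree of attribute tests joined by a boolean operator. The class names (Term, And), the relation names and the sample records below are assumptions made for this sketch, not part of any proposed syntax. It also assumes that "after January 1997" means later than 31 January 1997.

```python
from dataclasses import dataclass
from datetime import date
from typing import Any


@dataclass(frozen=True)
class Term:
    """One attribute test: <field> <relation> <value>."""
    field: str
    relation: str  # "equals" or "after" (hypothetical relation names)
    value: Any

    def matches(self, record: dict) -> bool:
        actual = record.get(self.field)
        if actual is None:
            return False  # a record without the field never matches
        if self.relation == "equals":
            return actual == self.value
        if self.relation == "after":
            return actual > self.value
        raise ValueError(f"unsupported relation: {self.relation}")


@dataclass(frozen=True)
class And:
    """Boolean conjunction of any number of sub-queries."""
    operands: tuple

    def matches(self, record: dict) -> bool:
        return all(op.matches(record) for op in self.operands)


# Example 1, reading "after January 1997" as "later than 31 January 1997".
example_1 = And((
    Term("Creator", "equals", "J. Smith"),
    Term("Date", "after", date(1997, 1, 31)),
))

# Made-up sample records, used only to exercise the query.
records = [
    {"Creator": "J. Smith", "Date": date(1997, 3, 14)},
    {"Creator": "J. Smith", "Date": date(1996, 11, 2)},
    {"Creator": "A. Jones", "Date": date(1998, 5, 1)},
]
hits = [r for r in records if example_1.matches(r)]
```

Only the first sample record satisfies both tests. An Or or Not node could be added in the same way as And.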

The metadata in the repositories will conform to a variety of domain specific metadata standards. We want the ability to be explicit about the origin of the metadata element being queried and the structure of our query values.

Example 2
Find records with VCARD Name equal to "J. Smith" AND with ISO 8601 encoded Dublin Core Date after "19970701".
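
One possible way to make the origin and encoding explicit is to carry them alongside each query term. The sketch below is hypothetical: the QualifiedTerm tuple, the attribute set labels ("VCARD", "DublinCore") and the encoding labels ("string", "ISO8601") are names invented here. It assumes stored values use the same encoding as the query value, and that ISO 8601 dates appear in the basic YYYYMMDD form used in Example 2.

```python
from datetime import datetime
from typing import NamedTuple


class QualifiedTerm(NamedTuple):
    attribute_set: str  # origin of the element, e.g. "DublinCore"
    element: str        # element within that set, e.g. "Date"
    relation: str       # "equals" or "after"
    encoding: str       # how to interpret the value, e.g. "ISO8601"
    value: str


def decode(encoding: str, value: str):
    """Turn an encoded string into a comparable Python value."""
    if encoding == "ISO8601":
        # Assumes the basic calendar form YYYYMMDD, as in Example 2.
        return datetime.strptime(value, "%Y%m%d").date()
    if encoding == "string":
        return value
    raise ValueError(f"unknown encoding: {encoding}")


def term_matches(term: QualifiedTerm, record: dict) -> bool:
    raw = record.get((term.attribute_set, term.element))
    if raw is None:
        return False
    actual = decode(term.encoding, raw)
    wanted = decode(term.encoding, term.value)
    if term.relation == "equals":
        return actual == wanted
    if term.relation == "after":
        return actual > wanted
    raise ValueError(f"unsupported relation: {term.relation}")


# Example 2 as a conjunction (AND) of two qualified terms.
example_2 = [
    QualifiedTerm("VCARD", "Name", "equals", "string", "J. Smith"),
    QualifiedTerm("DublinCore", "Date", "after", "ISO8601", "19970701"),
]

# A made-up record keyed by (attribute set, element).
record = {("VCARD", "Name"): "J. Smith", ("DublinCore", "Date"): "19980115"}
is_hit = all(term_matches(t, record) for t in example_2)
```

Because the date is decoded before comparison, the "after" test compares calendar dates rather than character strings.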

We would also like to be able to specify which metadata fields are returned in the results and how many records are returned.

Example 3
Find records with VCARD Name equal to "J. Smith". Return the Dublin Core Date, Dublin Core Description and Dublin Core Identifier of the first 20 records.
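
A minimal sketch of what Example 3 asks of a server: select the matching records, stop after the requested number, and project each record onto the requested fields. The search function, its parameter names and the dotted field labels ("DC.Date" and so on) are assumptions made for illustration.

```python
from itertools import islice


def search(records, predicate, fields, limit):
    """Return only the requested fields of the first `limit` matching records."""
    matching = (r for r in records if predicate(r))
    return [
        {f: r[f] for f in fields if f in r}
        for r in islice(matching, limit)
    ]


# Made-up data: 25 records by J. Smith followed by 5 by someone else.
records = [
    {
        "VCARD.Name": "J. Smith",
        "DC.Date": "1998",
        "DC.Description": f"Report {i}",
        "DC.Identifier": f"urn:example:{i}",
        "DC.Title": f"Title {i}",
    }
    for i in range(25)
] + [
    {"VCARD.Name": "A. Jones", "DC.Identifier": f"urn:example:j{i}"}
    for i in range(5)
]

example_3 = search(
    records,
    predicate=lambda r: r.get("VCARD.Name") == "J. Smith",
    fields=("DC.Date", "DC.Description", "DC.Identifier"),
    limit=20,
)
```

Here 25 records match, but only the first 20 are returned, and fields that were not requested (such as the title) are omitted.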

Some metadata has nested structure. For example, metadata describing a film may contain metadata describing sequences within the film. The sequence metadata may contain metadata describing individual scenes and so on. The query language should support queries on metadata with nested structure:

Example 4
Return any Dublin Core Descriptions for the first, second and third MPEG7 Scenes from movies with Dublin Core Creator "Martin Scorsese"
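
To illustrate the kind of traversal Example 4 implies, the sketch below stores nested metadata as dictionaries with a "parts" list and walks them depth first. The data layout, the key names and the film itself are invented for this example. It also assumes "first, second and third" means the first three scenes of each matching movie in document order.

```python
from itertools import islice


def scenes_in_order(node):
    """Yield every Scene nested below `node`, depth first, in document order."""
    for child in node.get("parts", []):
        if child["kind"] == "Scene":
            yield child
        yield from scenes_in_order(child)


def first_scene_descriptions(movies, creator, count=3):
    """Descriptions of the first `count` scenes of each movie by `creator`."""
    results = []
    for movie in movies:
        if movie.get("DC.Creator") != creator:
            continue
        for scene in islice(scenes_in_order(movie), count):
            # "Return any ... Descriptions": scenes without one are skipped.
            if "DC.Description" in scene:
                results.append(scene["DC.Description"])
    return results


# A made-up film: two sequences, each containing two scenes.
movies = [
    {
        "kind": "Movie",
        "DC.Creator": "Martin Scorsese",
        "parts": [
            {"kind": "Sequence", "parts": [
                {"kind": "Scene", "DC.Description": "Opening titles"},
                {"kind": "Scene"},
            ]},
            {"kind": "Sequence", "parts": [
                {"kind": "Scene", "DC.Description": "Night drive"},
                {"kind": "Scene", "DC.Description": "Diner"},
            ]},
        ],
    },
]

example_4 = first_scene_descriptions(movies, "Martin Scorsese")
```

The second scene has no description, so the result holds the descriptions of the first and third scenes only.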

Some information communities use distributed search engines to simultaneously query existing heterogeneous information sources. Such applications are enhanced if it is possible to dynamically discover the schema of the underlying information sources.

Example 5
What query attributes does this repository support?

Requirements

Attribute based boolean query language. The query language should be able to specify attribute based boolean queries.

Multiple attribute sets. Different communities will require their own sets of attributes. For this reason, the query language must be flexible enough to allow attributes from different communities. It should be possible to develop the query language and the attribute sets separately. That is, the W3C should develop the query infrastructure and information communities should develop the attribute sets they require.

Sharing of Attributes. Communities will not want to reinvent the wheel every time they need a new attribute. It must be possible to share attributes between communities. An important part of sharing is the identification of the origin and definition of the attributes in a query.

Identifying the source of attributes also allows attributes from different communities to be mapped. For example, an application can know that Dublin Core Creator is the same as GILS Author and map a Dublin Core query onto a GILS database.
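
A sketch of such a mapping, assuming each query term names its attribute as an (attribute set, element) pair. The table contents, the tuple layout and the translate function are illustrative only. A real mapping would need to be agreed between the communities concerned.

```python
# Hypothetical equivalence table between attribute sets.
DUBLIN_CORE_TO_GILS = {
    ("DublinCore", "Creator"): ("GILS", "Author"),
}


def translate(query, mapping):
    """Rewrite each (attribute, relation, value) term using `mapping`.

    Raises KeyError if a term has no known equivalent, so a client can
    tell the user that the query cannot be sent to the target database.
    """
    translated = []
    for attribute, relation, value in query:
        if attribute not in mapping:
            raise KeyError(f"no equivalent for {attribute}")
        translated.append((mapping[attribute], relation, value))
    return translated


dublin_core_query = [(("DublinCore", "Creator"), "equals", "J. Smith")]
gils_query = translate(dublin_core_query, DUBLIN_CORE_TO_GILS)
```

The translation is only possible because each term says which attribute set its element comes from. A bare "Creator" could not be mapped reliably.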

Attribute Categories. Attributes tell the server how to interpret the values given in the query. There are a number of categories of attributes that an information community may wish to define. For example:

The field to search on (e.g. Dublin Core Date)
The matching relationship between the field and the query value (e.g. equals, after)
The encoding and type of the query value (e.g. ISO 8601 encoded, or 16 bit integer)

Interoperability and Extensibility. A number of us have the dream that one day there will be a "Lowest Common Denominator" or "Cross Domain" attribute set that every metadata repository supports. Such a set would provide a base level of interoperability across metadata repositories.

Information communities should also be able to extend this base set of attributes for their private use.

Discovery of Attributes. It should be possible to discover the attributes (and possibly attribute definitions) being used by a metadata repository. This enhances interoperability by allowing an information client to configure itself to query newly discovered metadata repositories.
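
As a rough sketch of how a client might use such a facility, the code below asks a repository which attributes it supports and checks a planned query against the answer before sending it. The Repository class, the describe_attributes method and the attribute labels are all hypothetical. In practice the description would be fetched over the network rather than from an in-memory object.

```python
class Repository:
    """Stand-in for a remote metadata repository (hypothetical interface)."""

    def __init__(self, supported):
        self._supported = frozenset(supported)

    def describe_attributes(self):
        """Answer the question "what query attributes do you support?"."""
        return sorted(self._supported)


def unsupported_attributes(query_attributes, repository):
    """List the attributes in a planned query that the repository lacks."""
    supported = set(repository.describe_attributes())
    return [a for a in query_attributes if a not in supported]


repo = Repository([
    ("DublinCore", "Creator"),
    ("DublinCore", "Date"),
])

planned = [("DublinCore", "Creator"), ("VCARD", "Name")]
missing = unsupported_attributes(planned, repo)
```

A client that finds missing attributes can drop those terms, map them to supported ones, or skip the repository altogether.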

Ease of Implementation. It should be easy to implement a search engine supporting the query language.

The DSTC recommends that the query language use HTTP as the transport mechanism and that the syntax of returned metadata records should be based on XML, possibly in RDF format.
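
To make this recommendation concrete, the following sketch encodes a query as an HTTP GET URL and reads records from an XML response. Everything specific in it is an assumption made for illustration: the host name, the parameter names (term, return, limit), the textual form of each term and the XML element names. No request is actually sent. The response is a hard-coded sample.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

BASE_URL = "http://repository.example/query"  # hypothetical endpoint


def build_query_url(terms, fields, limit):
    """Encode query terms, requested fields and result size as a GET URL."""
    params = [("term", f"{attr} {rel} {value}") for attr, rel, value in terms]
    params += [("return", field) for field in fields]
    params.append(("limit", str(limit)))
    return BASE_URL + "?" + urlencode(params)


def parse_records(xml_text):
    """Turn each <record> element into a dict of child tag -> text."""
    root = ET.fromstring(xml_text)
    return [
        {child.tag: child.text for child in record}
        for record in root.findall("record")
    ]


url = build_query_url(
    terms=[("VCARD.Name", "equals", "J. Smith")],
    fields=["DC.Date", "DC.Identifier"],
    limit=20,
)

# A made-up response in an invented XML layout.
SAMPLE_RESPONSE = """<results>
  <record>
    <date>19980115</date>
    <identifier>http://repository.example/doc/1</identifier>
  </record>
</results>"""

parsed = parse_records(SAMPLE_RESPONSE)
```

Because urlencode percent-encodes values as UTF-8 by default, the same approach carries query values outside the ASCII range.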

Security/Authentication. Some customers require secure or authenticated access to their data or subsets of their data. The query infrastructure should support this.

Specification of Returned Results. The query language should allow the client to specify the format of the results, the fields to be returned, and the size of the result set.

Internationalisation. The query infrastructure should support queries and results described using the Unicode character set. Additionally, the query infrastructure should be able to identify the language of the query values and returned records.

Other Work

Other groups have looked at the issues of web based query languages and information retrieval infrastructures. We should take care to learn from their experience.

The new Z39.50 Attribute Architecture provides a (non-web) infrastructure for supporting most of our requirements.

The Stanford STARTS project examined using HTTP/CGI as the transport for Z39.50 queries.

Nigel Ward <nigel@dstc.edu.au>
Renato Iannella <renato@dstc.edu.au>
Hoylen Sue <hoylen@dstc.edu.au>
Rob McArthur <mcarthur@dstc.edu.au>
Jane Hunter <jane@dstc.edu.au>

DSTC Pty Ltd
Resource Discovery Unit
Research Data Network CRC
Level 7, General Purpose South Building, The University of Queensland, Qld 4072, Australia.

Last modified: Wed Nov 18 15:12:12 EST 1998