URCs as a substrate for distributed searching

Ron Daniel Jr.
Advanced Computing Lab
MS B287
Los Alamos National Laboratory
Los Alamos, NM, USA 87545

The increasing number of resources on the web makes centralized indices less and less satisfactory. Some form of distributed cataloging and indexing effort seems necessary. But exactly what form? What sort of cataloging information should be collected? Who will create the descriptions and how will they be managed over the lifetime of the resource and beyond? What protocols will be used to transfer the descriptions? How will queries be encoded and what query facilities will be provided? What sort of forward knowledge must be propagated to allow reasonable query forwarding?

These questions cannot be answered once and for all. We must have a system that can adapt to change, that can allow many different experimental solutions to co-exist, while preserving to the greatest degree possible the intellectual investment that has been made in describing network resources. The library community has already shown us the power of shared cataloging.

Uniform Resource Characteristics (URCs) were proposed by the IETF's Uniform Resource Identifier working group as a structure for containing information on networked resources. The draft documents for the URC standards specify an abstract service that can have many different concrete realizations, and specify how those different realizations can interoperate. The key ideas behind URCs are:

Allow for a variety of attribute sets, known as "URC subtypes".
An attribute set that is appropriate for describing HTML pages is not likely to adequately describe scientific datasets. Any reasonable indexing system must allow different descriptive schemas to be used, and must address namespace conflicts between them.
Standardize the meaning of a very few elements.
Having a variety of descriptive schemas means that systems will frequently encounter descriptions in unknown schemas. However, elements such as URL, URN, URC, and Content-type have rigorous definitions and are so pervasive that standardizing them will allow a great deal of useful work to be performed even when the whole of the descriptive schema is not known.
Don't specify one syntax; instead, specify a canonical representation that can be mapped into and out of a variety of syntaxes.
Specifying one syntax is a recipe for disaster over the long haul, and leads to religious battles in the short term. PICS, IAFA, SGML, and MARC each have adherents for good reasons. An appropriate canonical representation should accommodate all of these.
Standardize the basic operations to manipulate the canonical representation, and let different query and transformation languages be developed to utilize those operations in novel ways.
Services will want to compete on the simplicity or power of their search capabilities. A mechanism that permits such competition will also allow search capabilities to evolve gracefully.
Don't specify one protocol; instead, specify how the canonical representation and the operations on it are encoded in particular protocols.
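
The key ideas above can be illustrated with a small sketch. It is not taken from the URC drafts: the class name Urc, the subtype name "example-html", the "URC-subtype" header, and both syntaxes (a header-style text form and a JSON form) are assumptions chosen only for illustration, and they stand in for real syntaxes such as those named above. The sketch shows one canonical representation being mapped into and out of two syntaxes, and a consumer that does not know the subtype still extracting the few standardized elements.

```python
import json
from dataclasses import dataclass, field
from typing import List, Tuple

# Elements the text names as having rigorous, pervasive definitions.
STANDARD_ELEMENTS = {"URL", "URN", "URC", "Content-type"}


@dataclass
class Urc:
    """Hypothetical canonical representation: a subtype plus ordered pairs."""
    subtype: str
    attributes: List[Tuple[str, str]] = field(default_factory=list)

    def standard(self) -> List[Tuple[str, str]]:
        """Return only the elements a consumer can use without knowing the subtype."""
        return [(n, v) for n, v in self.attributes if n in STANDARD_ELEMENTS]


def to_text(urc: Urc) -> str:
    """Map the canonical form into an illustrative 'Name: value' syntax."""
    lines = [f"URC-subtype: {urc.subtype}"]
    lines += [f"{name}: {value}" for name, value in urc.attributes]
    return "\n".join(lines)


def from_text(text: str) -> Urc:
    """Map the illustrative 'Name: value' syntax back into the canonical form."""
    subtype, attributes = "", []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split at the first colon only, so values such as URLs stay intact.
        name, _, value = line.partition(":")
        name, value = name.strip(), value.strip()
        if name == "URC-subtype":
            subtype = value
        else:
            attributes.append((name, value))
    return Urc(subtype, attributes)


def to_json(urc: Urc) -> str:
    """Map the canonical form into a second, JSON-based illustrative syntax."""
    return json.dumps({"subtype": urc.subtype,
                       "attributes": [list(pair) for pair in urc.attributes]})


def from_json(text: str) -> Urc:
    """Map the JSON-based syntax back into the canonical form."""
    data = json.loads(text)
    return Urc(data["subtype"], [(n, v) for n, v in data["attributes"]])


record = Urc("example-html", [
    ("URL", "http://www.acl.lanl.gov/URC/"),
    ("Content-type", "text/html"),
    ("Title", "URC home page"),  # subtype-specific element (hypothetical)
])

# Converting between syntaxes goes through the canonical form.
as_json = to_json(from_text(to_text(record)))
```

Because every syntax maps to and from the same canonical form, adding a new syntax in this sketch requires only one new pair of mapping functions rather than a converter for each existing syntax.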

At the W3C Workshop on Distributed Indexing/Searching I would like to present a summary of the current state of the URC effort and give some examples of how URCs can be used for distributed resource discovery.

For more information, see http://www.acl.lanl.gov/URC/

Last Modified: 14 May, 1996
This page is part of the DISW 96 workshop.
Last modified: Thu Jun 20 18:20:11 EST 1996.