An Architecture for Aggregating Distributed Data and Meta-Data Objects

Position Paper for W3C Distributed Indexing/Searching Workshop

Carl Lagoze, Cornell University, Computer Science Department, lagoze@cs.cornell.edu

The Cornell Digital Library Research Group has been researching and developing protocols and architectures for distributed digital libraries. One result of this work has been Dienst, a protocol and reference implementation for distributed multi-format document libraries. Dienst is the technical foundation for the Networked Computer Science Technical Reports Library (NCSTRL), a collaborative effort by a number of universities to make their computer science technical reports and other relevant materials available to the public.

Our model for distributed indexing in the current implementation of Dienst is rather simple. Each document is stored in a repository server as a uniquely named (with handles) aggregation of meta-data (cataloging data) and one or more representations, or formats, of the document. Each repository has an associated index server that scans the bibliographic entries for the documents, indexes the contents, and responds to simple bibliographic searches. User search, browsing, and retrieval access to the distributed collection is provided by a set of user interface servers, which broadcast a user search to individual index servers and then combine the returned result sets into a uniform hit list. A set of backup index servers, which periodically gather indexing information from the distributed indexes, provides a level of fault tolerance for the distributed indexing process.
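The broadcast-and-merge step described above can be sketched as follows. This is a minimal illustration and not the Dienst implementation: the `IndexServer` class, the `Hit` record, the `broadcast_search` function, and the substring matching on titles are all assumptions made for the example, and real index servers would be contacted over the network rather than held in memory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    handle: str  # unique document name (a handle); example values are made up
    title: str


class IndexServer:
    """Hypothetical in-memory stand-in for one repository's index server."""

    def __init__(self, name, records):
        self.name = name
        self.records = records  # {handle: title}, i.e. the bibliographic entries

    def search(self, term):
        # Simple bibliographic search: case-insensitive substring match on titles.
        term = term.lower()
        return [Hit(h, t) for h, t in self.records.items() if term in t.lower()]


def broadcast_search(servers, term):
    """Send the query to every index server and merge the result sets."""
    merged = {}
    for server in servers:
        for hit in server.search(term):
            # Handles are unique names, so a repeated handle is the same document
            # (for instance, one reported by both a primary and a backup index).
            merged.setdefault(hit.handle, hit)
    # Uniform hit list: one entry per document, in a stable order.
    return sorted(merged.values(), key=lambda hit: hit.handle)
```

Because every query goes to every server, the cost of a search in this sketch grows with the number of index servers, whether or not a given server holds anything relevant.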

Our current work is focused on near-term enhancements to the existing Dienst architecture and, for the longer term, the definition of a more flexible and extensible repository and indexing infrastructure. Our near-term goal is to move from the current broadcast search model to a selective broadcast model. In this model, a user query is first filtered through a coarse-granularity search, which returns a set of index servers, to which a fine-granularity (document-level) search is then submitted. We are examining two technologies, GlOSS and Harvest, as the means of achieving this goal.
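The two-stage selective broadcast can be illustrated with a small sketch. It does not reproduce the algorithms used by GlOSS or Harvest; the per-server word-set summary, the `select_servers` and `selective_search` functions, and the whole-word matching are simplifying assumptions chosen only to show how a coarse stage narrows the set of servers before the document-level search runs.

```python
class IndexServer:
    """Hypothetical in-memory stand-in for one repository's index server."""

    def __init__(self, name, records):
        self.name = name
        self.records = records  # {handle: title}

    def summary(self):
        # Coarse-grained description of this server: the set of words it indexes.
        words = set()
        for title in self.records.values():
            words.update(title.lower().split())
        return words

    def search(self, term):
        # Fine-grained (document-level) search: whole-word match on titles.
        term = term.lower()
        return sorted(h for h, t in self.records.items() if term in t.lower().split())


def select_servers(summaries, term):
    """Coarse stage: keep only the servers whose summary mentions the term."""
    return [name for name, words in summaries.items() if term.lower() in words]


def selective_search(servers, term):
    # In practice the summaries would be gathered ahead of time, not per query.
    summaries = {s.name: s.summary() for s in servers}
    by_name = {s.name: s for s in servers}
    chosen = select_servers(summaries, term)
    hits = []
    for name in chosen:
        hits.extend(by_name[name].search(term))  # fine stage, chosen servers only
    return chosen, sorted(hits)
```

In this sketch a server whose summary lacks the query term is never contacted in the second stage, which is the saving the selective model is after.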

Our longer-term infrastructure work is based on the Kahn/Wilensky Framework for digital library objects. We are motivated by a number of requirements related to meta-data. An individual digital library object may have multiple packages of meta-data associated with it. For example, the meta-data for an object may be a combination of a full MARC record; a simple cataloging record such as that represented by the Dublin Core; a description of the terms and conditions describing the rules of access to the object; and the like. Each of these meta-data packages may itself be a library object, with its own associated meta-data. A meta-data object may be shared by several digital library objects. For example, the set of objects in a single repository may share a single terms and conditions meta-data object. Furthermore, a meta-data object associated with a digital library object may reside in a separate repository. Finally, there is the need to accommodate complex meta-data types (e.g., Java applets), such as those required to mediate complex terms and conditions.

In response to these requirements we are examining and prototyping a container architecture for aggregating related digital library objects. The container architecture is recursive, in that objects within containers may themselves be containers; distributed, in that objects may be indirectly referenced from containers using URNs; and extensible, in that objects are strongly typed within a CORBA-based distributed object framework and the type system can be extended through a CORBA-like type registry. We plan to use this architecture as the basis for the next generation of Dienst and as a foundation for future research in distributed indexing, object replication, searching over heterogeneous meta-data, and methods for expressing terms and conditions to protect intellectual property.
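The three properties of the container architecture (recursive, distributed, extensible) can be made concrete with a small sketch. It is an illustration under stated assumptions and not the prototype itself: it uses plain Python objects instead of CORBA, and the `LibraryObject` class, the `TYPE_REGISTRY` set, the `resolver` dictionary standing in for URN resolution, and the `urn:x-example:` names are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical type registry; the real design would use a CORBA-like registry.
TYPE_REGISTRY = {"container", "document"}


def register_type(name):
    """Extend the type system with a new object type."""
    TYPE_REGISTRY.add(name)


@dataclass
class LibraryObject:
    urn: str
    type_name: str
    # Members are either nested LibraryObject instances (recursion) or URN
    # strings that reference objects held elsewhere (distribution).
    members: list = field(default_factory=list)

    def __post_init__(self):
        # Strong typing: only registered types may be instantiated.
        if self.type_name not in TYPE_REGISTRY:
            raise ValueError(f"unregistered type: {self.type_name}")


def flatten(obj, resolver, seen=None):
    """Return the URNs of obj and everything reachable from it, depth first."""
    seen = set() if seen is None else seen
    if obj.urn in seen:
        return []  # a shared object is listed once, however often it is referenced
    seen.add(obj.urn)
    urns = [obj.urn]
    for member in obj.members:
        # Indirect reference: look the URN up; direct member: use it as is.
        target = resolver[member] if isinstance(member, str) else member
        urns.extend(flatten(target, resolver, seen))
    return urns
```

A shared terms and conditions object, for example, could be registered as a new type with `register_type("terms-and-conditions")`, stored once, and referenced by URN from several containers; `flatten` then reports it a single time.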

This page is part of the DISW 96 workshop.
Last modified: Thu Jun 20 18:20:11 EST 1996.