DI/SW Position: Search and Meta-Search on a Diverse Web

David Eichmann

Research Institute for Computing and Information Systems
University of Houston - Clear Lake
Houston, TX

Introduction

One of the most significant challenges facing builders of indexing and search systems for the Web is the diversity of goals and capabilities of the content providers - the operators of the thousands of servers we so readily view as a single information resource. In operating the RBSE Spider [1], we've encountered reactions ranging from 'stay off my server' (usually expressed as a blanket exclusion clause in /robots.txt) to 'why haven't you indexed us yet?' (usually expressed in a mail message directly to me...).

A Tale of Three Prototypes

Two things remain clear through all of this - users want to track information relevant to their interests and are increasingly demanding efficient access to information. We are currently involved in the design, development and evaluation of three complementary systems to address these issues. The MORE repository system [3] is a meta-data based cataloging environment, providing separate hierarchies of meta-classes and collections, and supporting controlled access to proprietary collections through the definition of user groups. The RBSE Spider [1] retains both the structure of the Web in a relational graph representation and a full text index of the HTML documents encountered. The spider selects candidates for retrieval and indexing using a set of cached heuristics. The architecture readily supports multiple discovery modes through respecification of the candidate retrieval query. Sulla [4] is a user agent with the ability to acquire and act upon an interest profile of its user and the ability to act ethically [2].
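
To make the idea of a relational graph representation with a respecifiable candidate query concrete, here is a minimal sketch using Python's standard sqlite3 module. The table layout, the column names and the two example queries are assumptions made for illustration; they are not the Spider's actual schema or heuristics.

```python
import sqlite3

def build_graph(conn):
    # Hypothetical schema: one row per URL, one row per hyperlink.
    conn.executescript("""
        CREATE TABLE node (url TEXT PRIMARY KEY,
                           retrieved INTEGER NOT NULL DEFAULT 0);
        CREATE TABLE link (src TEXT NOT NULL, dst TEXT NOT NULL,
                           PRIMARY KEY (src, dst));
    """)

def add_link(conn, src, dst):
    # Record both endpoints and the edge between them.
    conn.execute("INSERT OR IGNORE INTO node (url) VALUES (?)", (src,))
    conn.execute("INSERT OR IGNORE INTO node (url) VALUES (?)", (dst,))
    conn.execute("INSERT OR IGNORE INTO link (src, dst) VALUES (?, ?)",
                 (src, dst))

def mark_retrieved(conn, url):
    conn.execute("UPDATE node SET retrieved = 1 WHERE url = ?", (url,))

# Two hypothetical discovery modes; switching mode means switching query.
CANDIDATES_ANY = """
    SELECT url FROM node WHERE retrieved = 0 ORDER BY url
"""
CANDIDATES_MOST_CITED = """
    SELECT n.url FROM node n JOIN link l ON l.dst = n.url
    WHERE n.retrieved = 0
    GROUP BY n.url
    ORDER BY COUNT(*) DESC, n.url
"""

def candidates(conn, query, limit=10):
    # Return the next URLs to retrieve under the given discovery mode.
    return [row[0] for row in conn.execute(query + " LIMIT ?", (limit,))]

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_graph(conn)
    add_link(conn, "http://a.example/", "http://b.example/")
    add_link(conn, "http://a.example/", "http://c.example/")
    add_link(conn, "http://b.example/", "http://c.example/")
    mark_retrieved(conn, "http://a.example/")
    print(candidates(conn, CANDIDATES_MOST_CITED))
```

Because the graph lives in ordinary tables, a new discovery mode in this sketch is just another SELECT statement passed to candidates().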

The Pragmatics of Indexing and Search on the Web

Given the size of the Web and the diversity of its contents, how do you build a useful index? We've taken a non-traditional tack with the Spider. Our current architecture supports the exclusion standard, but also allows the operator to specify constraint patterns that candidate URLs must match to be indexed, and concept profiles (currently high relevance terms) that are used to rank newly identified URLs for indexing. The result is an index that, with only 40,000 documents, performs as well as Alta Vista in certain concept areas (e.g., agents and ontologies). Sulla interrogates a variety of search engines, each with its own search algorithm and scoring scheme. We've experimented with a number of approaches to merging returned results and have settled for the moment on the relative rank of a hit from a given engine as the basis for generating aggregate scores.
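
The two mechanisms described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the Spider's or Sulla's actual code: the function names are hypothetical, the text scored against the concept profile is assumed to be something like a link's anchor text, and the reciprocal-rank formula is just one plausible way to turn relative rank into an aggregate score.

```python
import re
from collections import defaultdict

def select_candidates(urls_with_text, constraint_patterns, profile_terms):
    """Keep URLs matching a constraint pattern, ordered by profile hits.

    urls_with_text: (url, text) pairs; the text is assumed to be whatever
    is known about the URL before retrieval, e.g. its anchor text.
    """
    compiled = [re.compile(p) for p in constraint_patterns]
    terms = {t.lower() for t in profile_terms}
    scored = []
    for url, text in urls_with_text:
        if not any(p.search(url) for p in compiled):
            continue  # URL fails every operator-specified constraint
        words = re.findall(r"[a-z]+", text.lower())
        score = sum(1 for w in words if w in terms)
        scored.append((score, url))
    # Highest profile score first; ties broken alphabetically by URL.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [url for _, url in scored]

def merge_by_rank(result_lists):
    """Merge per-engine result lists using only each hit's rank.

    Assumed scoring: a hit at rank r contributes 1/r, summed over engines.
    Engine-specific scores are ignored since they are not comparable.
    """
    totals = defaultdict(float)
    for hits in result_lists.values():
        for rank, url in enumerate(hits, start=1):
            totals[url] += 1.0 / rank
    return sorted(totals, key=lambda url: (-totals[url], url))

if __name__ == "__main__":
    merged = merge_by_rank({
        "engine_a": ["x", "y", "z"],
        "engine_b": ["y", "z"],
    })
    print(merged)
```

Working from rank alone sidesteps the problem that each engine's raw scores come from a different algorithm and scale.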

What Next?

The robots.txt file contains little information about server performance or load, or about the rate at which its operator is willing to have the server accessed by an agent. On a global basis, this is our prime interest. On a local basis, we're shifting our tools from simple word indexing to concept indexing. A shared project in this area has a far better chance of success than convincing the world to do their own tagging.
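
As a small illustration of the gap, the sketch below parses two made-up robots.txt files with Python's standard urllib.robotparser module. The agent name, the URLs and the file contents are assumptions for the example. The Crawl-delay line in the second file is a non-standard extension that some servers and crawlers recognize; it is not part of the exclusion standard as discussed here, which expresses only which paths an agent may visit.

```python
from urllib.robotparser import RobotFileParser

# A blanket exclusion: says "stay out" but nothing about load or rate.
BLANKET = "User-agent: *\nDisallow: /\n"

# A file using the non-standard Crawl-delay extension (seconds per request).
WITH_DELAY = "User-agent: *\nCrawl-delay: 10\nDisallow: /private/\n"

def read_policy(robots_txt, agent, url):
    """Return (may_fetch, crawl_delay) for the agent under this file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # crawl_delay() is None when the file states no delay.
    return parser.can_fetch(agent, url), parser.crawl_delay(agent)

if __name__ == "__main__":
    page = "http://example.org/index.html"
    print(read_policy(BLANKET, "ExampleSpider", page))
    print(read_policy(WITH_DELAY, "ExampleSpider", page))
```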

Bibliography

  1. Eichmann, D., "The RBSE Spider - Balancing Effective Search Against Web Load," First International Conference on the World Wide Web, Geneva, Switzerland, May 25-27, 1994, pages 113-120.
  2. Eichmann, D., "Ethical Web Agents," Proc. Second International World-Wide Web Conference: Mosaic and the Web, Chicago, IL, October 17-20, 1994, pages 3-13.
  3. Eichmann, D., T. McGregor and D. Danley, "Integrating Structured Databases Into the Web: The MORE System," Computer Networks and ISDN Systems, v. 24, n. 2, 1994.
  4. Eichmann, D. and J. Wu, "Sulla - A User Agent for the Web," poster, Fifth International Conference on the World Wide Web, Paris, France, May 6-10, 1996, poster proc. pages 1-9.

Acknowledgements

This work is supported in part by a grant from Texas Instruments, Inc., by NASA as part of the Repository Based Software Engineering program, Cooperative Agreement NCC-9-30, research activity RB-02A, and by a grant from the Texas Advanced Technology Program.
This page is part of the DISW 96 workshop.
Last modified: Thu Jun 20 18:20:11 EST 1996.