DI/SW Position: Search and Meta-Search on a Diverse Web
Research Institute for Computing and Information Systems
University of Houston - Clear Lake
Houston, TX
One of the most significant challenges facing builders of indexing and search
systems for the Web is the diversity of goals and capabilities of the
content providers - the operators of the thousands of servers we so readily
view as a single information resource. In operating the RBSE Spider [1],
we've encountered reactions ranging from 'stay off
my server' (usually expressed as a blanket exclusion clause in /robots.txt)
to 'why haven't you indexed us yet?' (usually expressed in a mail message
directly to me...).
A Tale of Three Prototypes
Two things remain clear through all of this - users want to track
information relevant to their interests and are increasingly demanding
efficient access to information. We
are currently involved in the design, development and evaluation
of three complementary systems to address these issues.
The MORE repository system [3] is a meta-data based cataloging environment,
providing separate hierarchies of meta-classes and collections and support
for controlled access to proprietary collections through the definition of
user groups.
The RBSE Spider [1] retains both the
structure of the Web in a relational graph representation
and a full text index of the HTML documents encountered.
The spider selects candidates for retrieval and indexing
using a set of cached heuristics. The architecture readily supports
multiple discovery modes through respecification of the candidate retrieval
Sulla [4] is a user agent with the ability to
acquire and act upon an interest profile of its user and the ability to act
ethically [2].
The Pragmatics of Indexing and Search on the Web
Given the size of the Web and the diversity of its contents, how do you
build a useful index? We've taken a non-traditional tack with the Spider.
Our current architecture
supports the exclusion standard, but also allows the operator to specify
constraint patterns that candidate URLs must match to be indexed
and concept profiles (currently high relevance terms) that are used to
rank newly identified URLs for indexing. The result is an index that, with
only 40,000 documents, performs as well as Alta Vista in certain concept
areas (e.g., agents and ontologies).
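The selection step described above can be sketched roughly as follows. This is a minimal illustration, not the Spider's actual code: the function names, the sample patterns, the term weights, and the rule that a URL must match at least one constraint pattern are all assumptions made for the example, and a single robots.txt is used for simplicity where a real crawler would keep one per host.

```python
import re
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_lines, user_agent, url):
    """Apply the exclusion standard using the lines of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


def matches_constraints(url, patterns):
    """Assumption: a candidate qualifies if it matches at least one pattern."""
    return any(pattern.search(url) for pattern in patterns)


def profile_score(context_text, profile):
    """Sum the weights of profile terms found in text associated with the URL."""
    words = re.findall(r"[a-z]+", context_text.lower())
    return sum(profile.get(word, 0.0) for word in words)


def rank_candidates(candidates, patterns, profile, robots_lines,
                    user_agent="example-spider"):
    """Filter (url, context_text) pairs, then order them by profile score."""
    kept = [
        (profile_score(text, profile), url)
        for url, text in candidates
        if matches_constraints(url, patterns)
        and allowed_by_robots(robots_lines, user_agent, url)
    ]
    # Highest score first; the sort is stable, so ties keep discovery order.
    kept.sort(key=lambda pair: -pair[0])
    return [url for _, url in kept]


# Hypothetical operator configuration and candidates.
patterns = [re.compile(r"^http://example\.edu/")]
profile = {"agents": 3.0, "ontologies": 3.0}
robots_lines = ["User-agent: *", "Disallow: /private/"]
candidates = [
    ("http://example.edu/misc.html", "campus map"),
    ("http://example.edu/private/notes.html", "agents"),
    ("http://example.com/shop.html", "agents ontologies"),
    ("http://example.edu/agents/intro.html", "software agents and ontologies"),
]
ranked = rank_candidates(candidates, patterns, profile, robots_lines)
```

Here the excluded and off-pattern URLs are dropped before scoring, so the concept profile only orders candidates the operator has already agreed to index.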
Sulla interrogates a variety of search engines, each with its own search
algorithm and scoring scheme. We've experimented with a number of
approaches to merging returned results and have settled for the moment on
the relative rank of a hit from a given engine as the basis for generating
aggregate scores.
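One way to picture rank-based merging is the sketch below. The scoring formula is an assumption chosen for illustration, not Sulla's actual scheme: each hit earns (n - position) / n from an engine that returned n hits, and a URL's aggregate score is the sum of what it earns across engines. The function name and the engine labels are hypothetical.

```python
from collections import defaultdict


def merge_by_relative_rank(results_by_engine):
    """Merge ranked URL lists using only each hit's position within its list.

    results_by_engine maps an engine label to its hits, best first.
    Engine-specific raw scores are ignored because they are not comparable.
    """
    scores = defaultdict(float)
    for urls in results_by_engine.values():
        n = len(urls)
        for position, url in enumerate(urls):
            # The top hit earns 1.0; the last hit earns 1/n.
            scores[url] += (n - position) / n
    # Highest aggregate score first; ties are broken alphabetically by URL.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical result lists from two engines.
merged = merge_by_relative_rank({
    "engine_a": ["u1", "u2", "u3", "u4"],
    "engine_b": ["u2", "u1"],
})
```

Working from position alone sidesteps the fact that each engine reports scores on its own scale, at the cost of discarding whatever confidence information those scores carried.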
What Next?
The robots.txt file contains little information regarding server
performance/load or the rate at which its operator is willing to have it
accessed by an agent. On a global basis, this is our prime interest. On a local
basis, we're shifting our tools from simple word indexing to concept
indexing. A shared project in this area has a far better chance of
success than convincing the world to do its own tagging.
[1] Eichmann, D., ``The RBSE Spider - Balancing Effective Search Against
Web Load,'' First International Conference on the World Wide Web,
Geneva, Switzerland, May 25-27, 1994, pages 113-120.
[2] Eichmann, D., ``Ethical Web Agents,'' Proc. Second International
World-Wide Web Conference: Mosaic and the Web, Chicago, IL, October
17-20, 1994, pages 3-13.
[3] Eichmann, D., T. McGregor and D. Danley, ``Integrating Structured
Databases Into the Web: The MORE System,'' Computer Networks and ISDN
Systems, v. 24, n. 2, 1994.
[4] Eichmann, D. and J. Wu, ``Sulla - A User Agent for the Web,'' poster,
Fifth International Conference on the World Wide Web, Paris, France,
May 6-10, 1996, poster proc. pages 1-9.
This work is supported in part by a grant from Texas Instruments, Inc.,
by NASA as part of the Repository Based Software Engineering
program, Cooperative Agreement NCC-9-30, research activity RB-02A,
and by a grant from the Texas Advanced Technology Program.
This page is part of the DISW 96 workshop.
Last modified: Thu Jun 20 18:20:11 EST 1996.