Fulcrum's Position on WWW Distributed Indexing and Searching
World Wide Web Consortium Distributed Indexing/Searching
Workshop
Fulcrum Technologies Inc. Position Paper
Mike Heffernan, Glen Seeds
In response to the Call for Participation for the above workshop,
Fulcrum Technologies Inc. submits for discussion its position on the
issues requested for comment.
Robot Exclusion, Web Crawling and Web Indexing
The general approach of "web crawling" - that of unleashing a
software program to periodically retrieve and index the contents
of the Internet into a central searchable collection - is
inherently unscalable.
A far stronger approach is to avoid large central indexes, and adopt
a strategy of distributed indexing and searching.
Distributed Indexing and Distributed Searching
Fulcrum supports the continued evolution of standard protocols
(like Z39.50) and query syntax that enable distributed search
applications in an immediate response model.
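As an illustration of the immediate response model, the following is a minimal sketch of a client that sends one query to several index servers in parallel and returns whatever has arrived by a deadline. It does not implement Z39.50 or any other real protocol: each server is represented by a hypothetical Python callable, and the names distributed_search, servers and timeout are assumptions made for this example only.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def distributed_search(query, servers, timeout=2.0):
    """Send one query to every index server and collect replies until a deadline.

    servers maps a server name to a callable taking the query and returning
    a list of hits. The callables are stand-ins (an assumption of this
    sketch) for whatever search protocol a real deployment would speak.
    Returns (results, late), where late lists the servers that missed
    the deadline.
    """
    pool = ThreadPoolExecutor(max_workers=max(1, len(servers)))
    futures = {pool.submit(fn, query): name for name, fn in servers.items()}

    # Immediate response model: wait at most `timeout` seconds in total.
    done, not_done = wait(list(futures), timeout=timeout)

    results = {}
    for fut in done:
        name = futures[fut]
        try:
            results[name] = fut.result()
        except Exception:
            # A failing server contributes no hits instead of failing the search.
            results[name] = []

    # Do not block on slow servers; their replies are simply discarded.
    pool.shutdown(wait=False)
    late = sorted(futures[f] for f in not_done)
    return results, late
```

The caller receives partial results plus the names of the servers that were too slow, which is the trade-off an immediate response model accepts: speed over completeness.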
Agent-based search distribution is an important long-term technique.
Agents should traverse a set of distributed indexes, as opposed to
executing a direct web crawl. Agent-based queries will ultimately be
more effective, as they do not need to return immediately with a
quick result.
The significant challenges in introducing distributed indexing
and distributed searching are not necessarily technical in nature.
Large Web indexes are now corporate assets that require significant
investment in equipment, software and bandwidth to construct and
maintain.
In order for distributed indexing to succeed, some consideration will
need to be given to the business model.
Matching Semantics of Document Properties and Meta-data
There must be a common syntax for recording meta-data and a common
semantic ontology for interpreting it. Fulcrum supports the
development of a common syntax and basic ontology to allow for
rudimentary meta-data extraction from data sources and document
collections.
In order for this development to be practical and timely, it will be
necessary to limit its scope. Some of the current efforts to capture
a web-oriented ontology, such as Yahoo's, could serve as the foundation
for this effort.
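To make the idea of a basic shared ontology concrete, here is a minimal sketch of rudimentary meta-data extraction: property names used by individual sources are mapped onto a small common vocabulary, and anything outside that vocabulary is dropped. The four common fields and every alias in the table are hypothetical, chosen only for illustration; no such vocabulary is defined in this paper.

```python
# Hypothetical alias table: source-specific property name -> common field.
# The common fields (title, author, date, subject) are assumptions of this sketch.
FIELD_ALIASES = {
    "title": "title",
    "headline": "title",
    "author": "author",
    "creator": "author",
    "from": "author",
    "date": "date",
    "last-modified": "date",
    "subject": "subject",
    "keywords": "subject",
}


def normalize_metadata(raw):
    """Map one source's properties onto the common vocabulary.

    Property names are matched case-insensitively. Unknown properties are
    ignored, and the first value seen for a common field is kept.
    """
    record = {}
    for key, value in raw.items():
        common = FIELD_ALIASES.get(key.strip().lower())
        if common is not None and common not in record:
            record[common] = value.strip()
    return record
```

Keeping the vocabulary this small reflects the limited scope argued for above: a handful of agreed fields is enough for basic cross-collection searching, while source-specific properties stay local.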
Merging Result List Scores
Fulcrum supports the development of a clear multi-vendor standard to
allow document index collections to compare and rationalize the
statistical weight of a query against the collection as a whole.
The resulting collection weighting should be engineered to be a
viable mechanism for subsequent normalization of the individual
document relevance scores presented to the user.
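One way such a collection weighting could be used is sketched below. It assumes that each collection reports raw relevance scores on its own private scale, plus a single weight describing how well the query matches that collection as a whole. The scheme shown (scale each score by the top score in its collection, then multiply by the collection weight) is only one illustrative choice, not a proposed standard, and the names Hit, merge_results and collection_weights are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    doc_id: str
    score: float  # raw relevance score, meaningful only within its own collection


def merge_results(results_by_collection, collection_weights):
    """Merge per-collection hit lists into one ranked list.

    Each raw score is first scaled into [0, 1] by the top score in its own
    collection, then multiplied by that collection's weight for the query.
    Collections with no reported weight are treated as weight 0.0.
    Returns (collection, doc_id, normalized_score) tuples, best first.
    """
    merged = []
    for name, hits in results_by_collection.items():
        if not hits:
            continue
        top = max(h.score for h in hits)
        weight = collection_weights.get(name, 0.0)
        for h in hits:
            local = h.score / top if top > 0 else 0.0
            merged.append((name, h.doc_id, local * weight))
    merged.sort(key=lambda item: item[2], reverse=True)
    return merged
```

For example, with one collection scoring on a 0 to 10 scale at weight 0.5 and another scoring on a 0 to 1 scale at weight 1.0, the best document from the second collection ranks first even though its raw score is numerically far smaller.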
This page is part of the DISW 96 workshop.
Last modified: Thu Jun 20 18:20:11 EST 1996.