Fulcrum's Position on WWW Distributed Indexing and Searching

World Wide Web Consortium Distributed Indexing/Searching Workshop

Fulcrum Technologies Inc. Position Paper

Mike Heffernan, Glen Seeds

In response to the Call for Participation for the above workshop, Fulcrum Technologies Inc. submits for discussion its position on the issues requested for comment.

Robot Exclusion, Web Crawling and Web Indexing

The general approach of "web crawling" - that of unleashing a software program to periodically retrieve and index the contents of the Internet into a central searchable collection - is inherently unscalable.

A far stronger approach is to avoid large central indexes, and adopt a strategy of distributed indexing and searching.

Distributed Indexing and Distributed Searching

Fulcrum supports the continued evolution of standard protocols (like Z39.50) and query syntax that enable distributed search applications in an immediate response model.
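
As a rough illustration of the immediate response model, the sketch below sends one query to several independent indexes in parallel and returns whatever has arrived by a deadline. It is only a sketch under stated assumptions: the function name distributed_search, the timeout_s parameter, and the use of plain Python callables in place of real protocol clients (such as Z39.50 sessions) are all hypothetical and are not part of any Fulcrum product or standard.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def distributed_search(query, indexes, timeout_s=2.0):
    """Send `query` to every index and collect the answers that arrive in time.

    `indexes` maps a collection name to a callable taking the query string and
    returning a list of hits. The callables are hypothetical stand-ins for
    remote search servers reached over a standard protocol.
    """
    pool = ThreadPoolExecutor(max_workers=max(1, len(indexes)))
    # Fan the query out: one task per index, all running concurrently.
    futures = {pool.submit(search, query): name for name, search in indexes.items()}
    # Immediate response model: wait only until the deadline.
    done, _late = wait(futures, timeout=timeout_s)
    results = {}
    for future in done:
        # An index that failed is skipped rather than failing the whole search.
        if future.exception() is None:
            results[futures[future]] = future.result()
    # Do not block on indexes that missed the deadline.
    pool.shutdown(wait=False)
    return results
```

Indexes that fail or miss the deadline are simply left out of the result, so the user always gets a prompt, possibly partial, answer. An agent based search could use the same fan-out without the deadline.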

Agent-based search distribution is an important long-term technique. Agents should traverse a set of distributed indexes, as opposed to executing a direct web crawl. Agent-based queries will ultimately be more effective, as they do not need to return immediately with a quick result.

The significant challenges in introducing distributed indexing and distributed searching are not necessarily technical in nature. Large Web indexes are now corporate assets that require significant investment in equipment, software and bandwidth to construct and maintain. In order for distributed indexing to succeed, some consideration will need to be given to the business model.

Matching Semantics of Document Properties and Meta-data

There must be a common syntax for recording meta-data and a common semantic ontology for interpreting it. Fulcrum supports the development of a common syntax and basic ontology to allow for rudimentary meta-data extraction from data sources and document collections.

In order for this development to be practical and timely, it will be necessary to limit its scope. Some of the current efforts to capture a web-oriented ontology, like Yahoo, could serve as the foundation for this effort.

Merging Result List Scores

Fulcrum supports the development of a clear multi-vendor standard to allow document index collections to compare and rationalize the statistical weight of a query against the collection as a whole. The resulting collection weighting should be engineered to be a viable mechanism for subsequent normalization of the individual document relevance scores presented to the user.
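
One possible reading of this two-stage scheme is sketched below. Each collection reports a weight for the query against the collection as a whole, and that weight rescales the collection's own document scores before the lists are merged. This is an assumption-laden illustration, not a proposed standard: the Hit type, the merge_results function, the max-score normalization, and the multiplicative use of the collection weight are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    doc_id: str
    score: float  # relevance score on the collection's own (vendor-specific) scale


def merge_results(results_by_collection, collection_weights):
    """Merge per-collection result lists into one ranked list.

    results_by_collection: collection name -> list of Hit
    collection_weights:    collection name -> weight of the query against that
                           collection as a whole (assumed comparable across vendors)
    Returns (normalized_score, collection_name, doc_id) tuples, best first.
    """
    merged = []
    for name, hits in results_by_collection.items():
        if not hits:
            continue
        top = max(hit.score for hit in hits)
        # Collections with no reported weight contribute a score of zero.
        weight = collection_weights.get(name, 0.0)
        for hit in hits:
            # Step 1: map the local score onto [0, 1] within its collection.
            local = hit.score / top if top > 0 else 0.0
            # Step 2: rescale by the collection-level weight.
            merged.append((weight * local, name, hit.doc_id))
    merged.sort(key=lambda item: item[0], reverse=True)
    return merged
```

With this choice, a top-ranked document from a weakly matching collection can still rank below a mid-ranked document from a strongly matching one, which is the kind of normalization the collection weight is meant to enable.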

Last modified: Thu Jun 20 18:20:11 EST 1996.