by Daniel LaLiberte, NCSA

There continues to be commercial value in attempting to index everything on the web, yet estimates suggest that only a modest percentage of existing documents have been indexed, while the web continues to grow at an exponential rate. Meanwhile, users around the world search these centralized indexes, or replicas of them, and the better search servers quickly become overloaded, reducing their effectiveness. Furthermore, the indexing that is done is very generic, i.e., not at all specific to the subject domains of the documents, reducing its usefulness. All these market pressures dictate that search servers will eventually need to specialize in various ways to hold on to an increasingly narrow niche.

Automatic searching through a web of interlinked, related, domain-specific indexes can be done in many different ways. First, I describe the basic design of a centrally controlled search process, and then consider a distributed search process. Both types of search process should be considered by those creating internet standards in support of searching; extensions to HTTP are briefly suggested.
Centrally controlled searching starts by evaluating a query at some set of nodes and continues on neighboring nodes. How the neighboring nodes are selected determines the overall behavior of the search. Neither a depth-first nor a breadth-first search is generally appropriate, because both effectively search blindly. Rather, we should continue to search only the parts of the web that have some promise of satisfying the query, so a measure of relevance for each match is required. Given a set of nodes with varying degrees of relevance, the neighbors of the most relevant nodes should be searched next. Furthermore, each link to a neighboring node should specify the relationship to that node and include a description of it; this information can be used in the relevance measure.
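The relevance-guided expansion described above amounts to a best-first search over the web of indexes. A minimal sketch in Python, assuming hypothetical `neighbors` and `relevance` functions supplied by the caller (nothing here is part of any standard):

```python
import heapq

def best_first_search(start_nodes, neighbors, relevance, query, limit=20):
    """Relevance-guided search: always expand the most relevant
    unvisited node next, rather than depth- or breadth-first.

    neighbors(node) yields (neighbor, link_description) pairs;
    relevance(query, node, link_description) returns a score,
    higher meaning more relevant. Both are caller-supplied."""
    # Max-heap via negated scores; seed with the starting nodes.
    frontier = [(-relevance(query, n, None), n) for n in start_nodes]
    heapq.heapify(frontier)
    visited = set(start_nodes)
    results = []
    while frontier and len(results) < limit:
        neg_score, node = heapq.heappop(frontier)
        results.append((node, -neg_score))
        for nbr, link_desc in neighbors(node):
            if nbr not in visited:
                visited.add(nbr)
                # The link's description feeds into the relevance measure.
                heapq.heappush(frontier,
                               (-relevance(query, nbr, link_desc), nbr))
    return results
```

The `limit` parameter bounds the search, so only the promising region of the web is ever visited; unexplored neighbors of low-relevance nodes simply stay in the frontier.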
The basic query capability could be supported in HTTP with a SEARCH request given a URI for a collection to be searched (or '*' for the whole site), and parameters of the query in header lines. The PEP Protocol header would specify the type of query being used. The result of the request would usually be metadata for the resulting collection, or perhaps a small collection could be returned in a multi-part result. Metadata about collections and elements of collections may be fetched (for example, to be incorporated into a searchable index) by using a META request. A META request specifies the URI of the collection or elements and header lines to indicate which type of metadata is desired.
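For concreteness, here is roughly what such requests might look like on the wire. The header names, the query syntax, and the PEP Protocol header value below are all illustrative, since no such extension has been standardized:

```
SEARCH /reports/* HTTP/1.1
Host: search.example.org
Protocol: {http://example.org/query-types/simple}
Query: subject = "distributed indexing"
Max-Results: 20

META /reports/paper42 HTTP/1.1
Host: search.example.org
Metadata-Type: description
```

The SEARCH response would typically carry metadata describing the resulting collection; the META response would carry the requested metadata for the named element.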
Centralized control of searching seems to be required to select the nodes to be searched, perhaps concurrently, and to merge the results; but now we consider how searching can be done with no central control. By distributing control, we gain much greater processing power, with each remote server performing part of the search, and we avoid the bottleneck of a centralized controller managing the entire process. The results must ultimately come back to the originator, but even this merging process can be decentralized.
A distributed searching algorithm must overcome several problems. To avoid looping, queries MUST be identified and each node MUST remember the queries it has recently processed. Also, a query MUST include both a time-to-live and a maximum distance to limit its range in time and space.
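These three requirements (query identifiers, a remembered set of recent queries, and a time-to-live plus hop limit) can be sketched as follows; the class and method names are illustrative, and direct method calls stand in for network messages:

```python
import time

class QueryNode:
    """One node in a web of search servers (an illustrative sketch).

    Each query carries a unique id, an expiry time (time-to-live),
    and a remaining hop count (maximum distance); every node remembers
    recently seen query ids so that a looping query is dropped."""

    def __init__(self, name, neighbors=()):
        self.name = name
        self.neighbors = list(neighbors)
        self.seen = set()      # recently processed query ids
        self.matches = []      # queries this node has evaluated

    def receive(self, query_id, expires_at, hops_left, now=None):
        now = time.time() if now is None else now
        # Drop expired queries and already-seen ones (loop avoidance).
        if now >= expires_at or query_id in self.seen:
            return
        self.seen.add(query_id)
        self.matches.append(query_id)   # stand-in for evaluating the query
        # Forward only while the maximum distance has not been reached.
        if hops_left > 0:
            for nbr in self.neighbors:
                nbr.receive(query_id, expires_at, hops_left - 1, now)
```

With a cycle of nodes, the `seen` set ensures each node processes a given query exactly once; with a long chain, the hop count stops propagation even before the time-to-live expires.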
Results from matching a query at a node could be sent directly back to the originator; this requires that the originator's address be available as part of the query, and moreover, the originator may be overloaded by too many incoming results. Better would be to merge the results along the reverse path of the query. Or perhaps nodes at local maxima (those that match the query well) could be given responsibility for merging the results of their neighborhood, which are returned to ancestor local maxima and ultimately to the originator. This last approach requires asynchronous notification of each merge node, which should wait until the time-to-live expires before returning the merged results.
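The reverse-path merging option can be sketched as a recursive procedure in which each node evaluates the query, forwards it to unvisited neighbors, and merges its children's results with its own before passing them back to whichever node sent it the query. This is a sketch only, with synchronous calls standing in for network messages and all names hypothetical; the local-maximum variant would replace the single return path with asynchronous merge nodes:

```python
def search_and_merge(node, query, relevance, neighbors, visited=None):
    """Evaluate the query at this node, forward it, and return the
    merged results for this node's whole subtree of the search.

    relevance(query, node) scores a match; neighbors(node) lists
    the nodes linked from it. Both are caller-supplied."""
    if visited is None:
        visited = set()
    visited.add(node)
    results = [(node, relevance(query, node))]
    for nbr in neighbors(node):
        if nbr not in visited:
            # The child merges its own subtree before returning.
            results.extend(
                search_and_merge(nbr, query, relevance, neighbors, visited))
    # Merge step: rank by relevance before passing back along the
    # reverse path, so the originator never sees an unmerged flood.
    return sorted(results, key=lambda r: -r[1])
```

Because merging happens at every hop on the way back, the originator receives a single ranked list rather than one message per matching node.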
A problem for both centralized and distributed searching is that a query must be transformed into a form that the receiving node understands, and similarly the results must be transformed on the way back out. This assumes heterogeneity in the search protocols available across the web, which I believe will always be the case, because searching is the kind of problem that can never be sufficiently standardized for all search domains without resorting to the full generality of arbitrary function calls.