Representations of URLs by Web Search Services

Erik Selberg & Oren Etzioni

Current global Web search services, such as Lycos and Alta Vista, are unable to provide comprehensive coverage of the Web. One solution has been meta-search services, such as MetaCrawler and SavvySearch, which query each base service. While the economic issues of a meta-search site can be resolved amicably and profitably between the meta-service and the base services, there remain detailed technical issues which should be addressed.

Meta-search services collate results from many different search services, such as Lycos or Alta Vista. One of the many challenges such services face is that each base service represents the contents of its database in a different manner. To compensate, meta-services must employ a variety of heuristics and custom code in order to collate information in a manner appropriate for users. This approach is problematic because it is not robust to changes in the base services' representations, and it is wasteful: the meta-engine must often compute information about the data, possibly by downloading it over a congested network, that the base service could have provided.
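As a minimal sketch of the custom code this entails, consider the following Python fragment. The two result formats and the service names are hypothetical, invented purely for illustration; the point is that each base service requires its own extraction heuristics, all of which break when a service changes its output.

    import re

    def parse_service_a(html):
        """Service A (hypothetical): results appear as '<li><a href=URL>title</a> (score)'."""
        pattern = re.compile(r'<li><a href="([^"]+)">([^<]+)</a> \((\d+)\)')
        return [{"url": u, "title": t, "score": int(s)}
                for u, t, s in pattern.findall(html)]

    def parse_service_b(text):
        """Service B (hypothetical): plain-text lines of 'score<TAB>URL'."""
        results = []
        for line in text.splitlines():
            score, url = line.split("\t", 1)
            results.append({"url": url, "title": None, "score": int(score)})
        return results

    # The meta-engine must dispatch on whichever service produced the results page.
    PARSERS = {"ServiceA": parse_service_a, "ServiceB": parse_service_b}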

Most global Web search services use a confidence score as their only indication of relevance to the user's query. This score is just a number; most services use integers from 0 to 1000, with 1000 denoting a "perfect match." Meta-search engines typically normalize the score and rank based upon a summation of the normalized scores. This method is problematic, since one service's notion of a "high score" is dramatically different from another's. For example, given the query "Used Car," one service may give a high score because the word "Car" appears in the title, whereas another will give an equally high score because "Used Car" appears somewhere in the body text.
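The following Python sketch (with hypothetical service names, URLs, and scores) shows one way such normalize-and-sum collation might be done. It also illustrates the problem: two very different notions of relevance are folded into a single number before being added together.

    from collections import defaultdict

    def normalize(results, max_score=1000):
        """Map one service's raw scores onto [0, 1]."""
        return {url: score / max_score for url, score in results.items()}

    def collate(per_service_results):
        """Sum normalized scores per URL across services and rank by the total."""
        totals = defaultdict(float)
        for service, results in per_service_results.items():
            for url, score in normalize(results).items():
                totals[url] += score
        return sorted(totals.items(), key=lambda item: item[1], reverse=True)

    per_service = {
        "ServiceA": {"http://example.com/used-cars": 900},   # "Car" in the title
        "ServiceB": {"http://example.com/used-cars": 850,    # "Used Car" in the body
                     "http://example.com/other": 400},
    }
    for url, total in collate(per_service):
        print(url, round(total, 2))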

What is needed is a richer formulation of the results returned by search services. This representation should include things such as:

Further, the ability to obtain information on the metric used for calculating ranks should be available, as well as the ability to obtain information as to why particular URLs were excluded. For example, it should be possible to submit a query text and a URL to a search engine and ask why that URL was not returned with the results for the query.
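As a sketch of what such an interface might look like, a meta-service could issue a request such as the one constructed below. The endpoint, parameter names, and "explain" mode are purely hypothetical assumptions; no current service offers them.

    import urllib.parse

    def build_explain_request(base_url, query_text, url):
        """Construct a request asking the service why a URL was excluded from a query's results."""
        params = urllib.parse.urlencode({
            "q": query_text,     # the original query text
            "url": url,          # the URL whose absence we want explained
            "mode": "explain",   # hypothetical flag requesting an explanation
        })
        return base_url + "?" + params

    print(build_explain_request("http://search.example.com/query",
                                "Used Car",
                                "http://example.com/used-cars"))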

These features, and undoubtedly others, are needed to enable meta-search services to perform as well as they are able. Without this information, meta-search engines must either infer the data, which wastes computation time; download the page and extract the information, which wastes network bandwidth; or do without, which produces less than optimal results. The obvious solution is to create a standard representation which allows search services to convey the most information about their results to their users, be they human or artificial.

