Information gathering
Introduction
The two major ways to search for information are browsing through catalogues and typing keywords into indexes.
Method      | Indexes                                      | Catalogues
------------|----------------------------------------------|--------------------------------
Advantages  | - The fastest when searching for technical,  | - Provides context
            |   precise information (e.g., IETF RFC 898)   | - Presentation is usually more
            | - The most exhaustive                        |   human-readable
Drawbacks   | - You have to do the sorting yourself        | - Not very exhaustive
            | - Keyword-dependent                          | - Classification-dependent
Example     | AltaVista                                    | The WWW Virtual Library
In practice, most such sites offer both an index and a catalogue. The most famous of them, based on WebCrawler's "Top 25 Most Linked-to Sites", are Yahoo!, WebCrawler, Lycos, Infoseek, and Starting Point.
Besides this difference in the way information is retrieved, there are major distinctions between the services you will find on the Web, based on two factors: the information-collecting process, and what happens after you submit your keywords (for indexes) or select a specific category (for catalogues).
- Who collects the information?
- Humans or robots.
- From whom (or what)?
- From humans (through forms, e-mail, and so on) or from Web documents (e.g. HTML).
- How do they collect it?
- Humans may create some sort of super home page based on their past Web navigation, or collect the information through user registration (e-mail, form, and so on).
Robots may automatically crawl around the Web (a minimal sketch of such a robot follows this list).
There are currently strong moves towards distributed indexing/searching, and W3C has recently organized a workshop on the subject (see also the Harvest project user's manual and a quick description of its model).
- What information do they collect?
- Text only; text plus meta-information retrieved automatically (URL, date last updated, ...); or text plus meta-information supplied by the subscriber, such as a description or a list of keywords.
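To make the robot approach concrete, here is a minimal sketch, in Python, of a robot that fetches a page, keeps the kind of meta-information listed above (title, description, keywords), and queues the links it finds. It illustrates the principle only and is not the code of any of the services discussed here; the starting URL and page limit are arbitrary choices.

    # Minimal sketch of a Web robot: fetch a page, keep its meta-information,
    # and queue the links it contains. Illustrative only; a real robot would
    # also honour robots.txt, throttle its requests, and handle more errors.
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class PageParser(HTMLParser):
        """Collects the title, <meta> description/keywords and outgoing links."""

        def __init__(self, base_url):
            super().__init__()
            self.base_url = base_url
            self.title = ""
            self.meta = {}          # e.g. {"description": ..., "keywords": ...}
            self.links = []
            self._in_title = False

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "title":
                self._in_title = True
            elif tag == "meta" and "name" in attrs and "content" in attrs:
                self.meta[attrs["name"].lower()] = attrs["content"]
            elif tag == "a" and "href" in attrs:
                self.links.append(urljoin(self.base_url, attrs["href"]))

        def handle_endtag(self, tag):
            if tag == "title":
                self._in_title = False

        def handle_data(self, data):
            if self._in_title:
                self.title += data.strip()

    def crawl(start_url, max_pages=10):
        """Breadth-first crawl returning {url: collected meta-information}."""
        seen, queue, collected = set(), [start_url], {}
        while queue and len(collected) < max_pages:
            url = queue.pop(0)
            if url in seen:
                continue
            seen.add(url)
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except OSError:
                continue                      # unreachable page: skip it
            parser = PageParser(url)
            parser.feed(html)
            collected[url] = {"title": parser.title, **parser.meta}
            queue.extend(parser.links)        # follow the links we just found
        return collected

    if __name__ == "__main__":
        # Hypothetical starting point, purely for illustration.
        for url, info in crawl("http://www.example.org/", max_pages=3).items():
            print(url, info)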
The major classification systems currently in use, like the Library of Congress, Dewey, or the Universal Decimal Classification (UDC), are about a century old, and it is rather hard to fit new technologies into them. In addition, they tend to be ethnocentric, reflecting the culture of their developers...
A major step forward was taken by Ted Nelson: in the 1960s, he invented the term hypertext and developed the Xanadu project. Moreover, the development of computers (and their hardware components like CPUs, physical memory, and hard disks) now allows the construction of inverted indexes at unheard-of scales.
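To recall what an inverted index is: instead of storing, for each document, the words it contains, one stores, for each word, the list of documents containing it, so that a keyword query reduces to looking up and intersecting a few lists. A tiny Python sketch, with a toy corpus chosen purely for illustration:

    # Tiny inverted index: map each word to the set of documents containing it.
    from collections import defaultdict

    documents = {                      # toy corpus, purely for illustration
        "doc1": "browsing catalogues provides context",
        "doc2": "indexes answer keyword queries quickly",
        "doc3": "catalogues and indexes complement each other",
    }

    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)

    # A keyword query is then a simple intersection of posting lists.
    def search(*words):
        postings = [index.get(w.lower(), set()) for w in words]
        return set.intersection(*postings) if postings else set()

    print(search("catalogues"))             # {'doc1', 'doc3'}
    print(search("catalogues", "indexes"))  # {'doc3'}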
Recently, the explosion of the World Wide Web has opened tremendous possibilities. Let's see some of what is here today. I will often refer to an article in Wired (May 96), entitled "Seek and Ye Shall Find (Maybe)".
A catalogue: The WWW Virtual Library
About 200 volunteers take part in this project, but it is still far from covering all fields of information.
A catalogue: Yahoo!
Yahoo! has about 20 human classifiers who catalogue the URLs that are e-mailed to them or found by a robot.
Wired says they already can't keep up with the half-million sites existing today, and that they run into sensitive classification problems, like where to place the Messianic Jewish Alliance of America (MJAA members are born of Jewish mothers and believe Jesus Christ is the messiah).
Moreover, Wired points out that the skyrocketing number of human classifiers required is threatening the consistency of Yahoo!'s point of view on classification.
A distributed index: Inktomi
Inktomi runs on four Sun SPARCstation 10s, each of them taking care of its own part of an inverted index that covers about 10% of the Web. There is no word-proximity information. Wired claims it is one of the cheapest indexes around in terms of hardware requirements (for them).
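The partitioning described above amounts to splitting one big inverted index (as in the earlier sketch) across several machines and merging their answers at query time. A rough sketch, assuming a simple hash-based split; the partition count and hashing scheme are illustrative choices, not Inktomi's actual design:

    # Sketch of a partitioned inverted index: documents are spread over
    # several partitions (one per machine), each partition indexes only its
    # own documents, and a query asks every partition and merges the answers.
    from collections import defaultdict

    NUM_PARTITIONS = 4
    partitions = [defaultdict(set) for _ in range(NUM_PARTITIONS)]

    def add_document(doc_id, text):
        part = partitions[hash(doc_id) % NUM_PARTITIONS]   # pick one "machine"
        for word in text.lower().split():
            part[word].add(doc_id)

    def search(word):
        hits = set()
        for part in partitions:        # every partition may hold matches
            hits |= part.get(word.lower(), set())
        return hits

    add_document("a", "distributed indexes scale out")
    add_document("b", "indexes without proximity information")
    print(search("indexes"))           # results merged from all partitions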
A powerful index: AltaVista
AltaVista offers word-proximity information and runs on several powerful computers (the "Web indexer" has 6 GB of physical memory, a 210 GB hard disk, and 10 processors, and is the most powerful machine built by DEC). According to Wired, AltaVista downloads and analyzes about 2.5 million documents per day (3 million according to DEC ;-) ), out of the 21 million it knows of, out of the 30 to 50 million documents contained in the Web.
One thing I personally like about AltaVista is its "link:" feature: it returns all pages containing a link to the specified page!
One thing I don't like is that they don't index our server! (Because, as part of DEC, a member of the W3 Consortium, they have access to privileged information :-( .)
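The "link:" feature amounts to maintaining a reverse mapping from each page to the pages that link to it, which the indexer can build while crawling. A minimal sketch of the principle only (not AltaVista's implementation); the site URLs are made up:

    # Sketch of the idea behind a "link:" query: while crawling, record for
    # each target URL the set of pages linking to it, then answer "link:X"
    # with a simple lookup.
    from collections import defaultdict

    backlinks = defaultdict(set)

    def record_links(page_url, outgoing_urls):
        """Called once per crawled page with the links it contains."""
        for target in outgoing_urls:
            backlinks[target].add(page_url)

    def link_query(target_url):
        """All known pages containing a link to target_url."""
        return backlinks.get(target_url, set())

    record_links("http://site-a.example/", ["http://www.w3.org/"])
    record_links("http://site-b.example/",
                 ["http://www.w3.org/", "http://site-a.example/"])
    print(link_query("http://www.w3.org/"))   # both example pages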
A latent semantic index: Excite
By using statistical algorithms, Latent Semantic Indexing can retrieve relevant documents even when they do not share any words with your query.
Wired explains that Excite makes heavy use of this technology to get rid of the problems of homonymy (same word, different meanings) and synonymy (different words, same meaning).
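As an illustration of the statistical idea (not Excite's actual algorithm, which is not public in this form), here is a minimal Latent Semantic Indexing sketch: build a term-document matrix, reduce it with a truncated singular value decomposition, and compare queries to documents in the reduced space, where words that co-occur with the same other words end up close together. The toy corpus and the choice of two latent dimensions are arbitrary.

    # Minimal Latent Semantic Indexing sketch using a truncated SVD.
    import numpy as np

    docs = [
        "car engine repair",             # doc 0: no word in common with "automobile"
        "automobile engine repair",
        "automobile motor maintenance",
        "baking bread at home",
    ]
    vocab = sorted({w for d in docs for w in d.split()})
    term_doc = np.array([[d.split().count(w) for d in docs] for w in vocab], float)

    # Keep only the k strongest latent dimensions.
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    k = 2
    doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T      # documents in the latent space

    def similarities(query):
        counts = np.array([query.split().count(w) for w in vocab], float)
        q = counts @ U[:, :k]                   # fold the query into the space
        return doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1)
                               * np.linalg.norm(q) + 1e-9)

    # "automobile" shares no word with doc 0 ("car engine repair"), yet doc 0
    # still gets a high similarity, while the baking document scores ~0.
    print(similarities("automobile").round(2))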
A meta-index: MetaCrawler
MetaCrawler basically sends your request to all of the indexes described above, and more! One drawback, though, is that because it waits for replies from all these services, it is slower to return results than most of them.
It was presented at the 4th WWW Conference in Boston, and this academic project from the University of Washington has another hidden advantage: it throws away all the advertisements that most other indexes are trying to sell you...
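The fan-out-and-merge idea behind such a meta-index can be sketched as follows. The back-ends here are fake stand-ins with made-up delays, and a real meta-searcher would also have to merge rankings and strip out advertisements, which is not shown:

    # Sketch of a meta-search fan-out: send the same query to several
    # back-end services in parallel, wait (with a deadline) for their
    # answers, and merge the results. The back-ends are fake stand-ins.
    import time
    from concurrent.futures import (ThreadPoolExecutor, as_completed,
                                    TimeoutError as FuturesTimeout)

    def fake_backend(name, delay):
        """Stand-in for one remote index: pretend it takes `delay` seconds."""
        def search(query):
            time.sleep(delay)
            return [f"{name}: result for '{query}'"]
        return search

    BACKENDS = {
        "index-a": fake_backend("index-a", 0.2),
        "index-b": fake_backend("index-b", 0.5),
        "index-c": fake_backend("index-c", 2.0),   # a slow service
    }

    def meta_search(query, timeout=1.0):
        """Query every back-end in parallel; drop those that miss the deadline."""
        merged = []
        with ThreadPoolExecutor(max_workers=len(BACKENDS)) as pool:
            futures = [pool.submit(fn, query) for fn in BACKENDS.values()]
            try:
                for future in as_completed(futures, timeout=timeout):
                    merged.extend(future.result())
            except FuturesTimeout:
                pass   # slow back-ends (index-c here) are simply left out
        return merged   # note: the pool still waits for stragglers on exit

    print(meta_search("latent semantic indexing"))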
A computer-built context: Oracle's ConText
ConText is aimed at giving the context of a document. According to Wired, it is not yet used in conjunction with any index, nor does it work well with poetry, literature, or metaphors...
Enjoy!
AS