No Nasty Robots! [Was: Full-text indexing for WWW conference avail. ]

Daniel W. Connolly (connolly@hal.com)
Thu, 13 Oct 1994 00:39:48 -0500


In message <9410122131.aa07321@paris.ics.uci.edu>, "Roy T. Fielding" writes:
>Nick Arnett wrote:
>
>> The spider will hit your server fairly hard. We have a real-time indexing
>> engine and a T-1...
>
>This is just plain irresponsible. You are not only affecting their server,
>you will also affect every network connection between your site and theirs.
>People pay good money for that bandwidth -- you should not attempt to hog it.
>
>Your spider should be running on their local net -- running at your site
>provides no added value.

That would presumably require installing Verity's Topic at each of the
information providers' sites, which I doubt is practical.

> At a minimum, the spider should be forced to
>delay between consecutive requests (about 15-30 seconds, depending on the
>network throughput and speed of the server).

When we at HaL built our CD-ROM of abstracts of 10,000 web documents
(with links to the documents themselves, and with our OLIAS browser on
the CD-ROM; ask jps@hal.com for details), we implemented a "spider"
that visited the various sites in an order such that no site was
visited more than once per minute.

It took only a few hours of head-scratching and testing to get it to
work. Vince Taluskie <taluskie@utpapa.ph.utexas.edu> did the
implementation. I'm sure he wouldn't mind helping you out a little.
We already paid him to do it once -- I don't think he'd make you
pay him again for the same info ;-)
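
To give a rough idea of what "an order such that no site was visited
more than once per minute" can look like, here is a minimal sketch in
Python. It is not Vince's implementation; the names polite_schedule,
min_interval and fetch_time are made up for illustration, and the
one-second cost per fetch is only an assumption. It plans start times
rather than fetching anything.

```python
import heapq
from collections import defaultdict, deque
from urllib.parse import urlsplit


def polite_schedule(urls, min_interval=60.0, fetch_time=1.0):
    """Plan (start_time, url) pairs so that two visits to the same host
    start at least min_interval seconds apart.

    fetch_time is an assumed, fixed cost per request (hypothetical).
    """
    # Group pending URLs by host, keeping the input order within a host.
    queues = defaultdict(deque)
    for url in urls:
        queues[urlsplit(url).netloc.lower()].append(url)

    # Heap entries: (earliest allowed start, tie-breaker, host).
    heap = [(0.0, i, host) for i, host in enumerate(queues)]
    heapq.heapify(heap)

    plan = []
    clock = 0.0
    while heap:
        ready_at, order, host = heapq.heappop(heap)
        clock = max(clock, ready_at)  # idle until some host is allowed again
        plan.append((clock, queues[host].popleft()))
        if queues[host]:
            # This host may not be visited again for min_interval seconds.
            heapq.heappush(heap, (clock + min_interval, order, host))
        clock += fetch_time
    return plan
```

With many sites in the list, the spider stays busy by moving on to
other hosts while each one "cools down"; it only sits idle when every
remaining host has been visited within the last minute.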

Vince consulted the published guidelines[1], I believe. You will not
please the net.folk if you blatantly disregard them.

Dan

[1] "Guidelines for Robot Writers"
Martijn Koster
http://web.nexor.co.uk/mak/doc/robots/guidelines.html