Bad RDF Crawlers

Some Web crawlers are badly designed, most often out of ignorance. Understanding how to write a [[Write_Web_Crawler|good Web crawler]] is essential for making all our work useful.

This page is intended as a list of poorly behaving crawlers that target RDF-publishing websites, unfortunately a recurring problem<ref>[http://lists.w3.org/Archives/Public/public-lod/2011Jun/0433.html Think before you write Semantic Web crawlers], public-lod post by Martin Hepp, 21 June 2011</ref>. The list allows publishers to defend themselves by blocking such crawlers, based on their user agent strings or IP ranges.
== Best practices for web crawlers ==

Dereferencing is a privilege, not a right. Crawlers that don't use server resources considerately abuse that privilege. It has bad consequences for the Web in general.

A well-behaved crawler …
* … uses reasonable limits for default crawling speed and re-crawling delay,
* … obeys [http://www.robotstxt.org/robotstxt.html robots.txt],
* … obeys crawling speed limitations in robots.txt ([http://help.yahoo.com/l/us/yahoo/search/webcrawler/slurp-03.html Crawl-Delay]),
* … identifies itself properly with the [http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.43 User-Agent HTTP request header], including contact information therein,
* … avoids excessive re-crawling,
* … respects [http://www.peej.co.uk/articles/http-caching.html HTTP cache headers] such as If-Modified-Since, Last-Modified and ETag when re-crawling (a minimal example follows this list).
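
Below is a minimal Python 3 sketch of a single polite fetch that follows these rules, using only the standard library: it parses robots.txt (including Crawl-Delay) before requesting a resource, identifies itself with a User-Agent string that carries contact information, and sends cache validators so an unchanged resource is not transferred again. The bot name, URLs and header values are hypothetical placeholders, not a reference implementation.

<pre>
# Minimal sketch of a single polite fetch, using only the Python standard
# library. Bot name, contact address, URLs and validator values are
# hypothetical placeholders.
import time
import urllib.robotparser
import urllib.request
from urllib.error import HTTPError

USER_AGENT = "ExampleRDFBot/0.1 (+http://example.org/bot; mailto:bot-admin@example.org)"
SITE = "http://example.org"

# Fetch and parse the site's robots.txt before requesting anything else.
rp = urllib.robotparser.RobotFileParser()
rp.set_url(SITE + "/robots.txt")
rp.read()

# Honour Crawl-Delay if the site declares one, e.g. "Crawl-Delay: 10";
# otherwise fall back to a conservative default pause between requests.
delay = rp.crawl_delay(USER_AGENT) or 5

url = SITE + "/data.rdf"   # hypothetical RDF resource
if rp.can_fetch(USER_AGENT, url):
    request = urllib.request.Request(url, headers={
        "User-Agent": USER_AGENT,   # identifies the crawler and its operator
        "Accept": "application/rdf+xml",
        # Validators remembered from the previous crawl, so the server can
        # answer "304 Not Modified" instead of resending the whole resource.
        "If-None-Match": '"etag-from-previous-crawl"',
        "If-Modified-Since": "Tue, 21 Jun 2011 00:00:00 GMT",
    })
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            body = response.read()
    except HTTPError as e:
        if e.code != 304:           # 304 means the cached copy is still valid
            raise
    time.sleep(delay)               # pause before the next request to this host
</pre>

A real crawler would remember the ETag and Last-Modified values from each response and send them back on the next visit, and would apply the delay to every request it makes against the same host.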
See [[Write Web Crawler]] for further guidelines.
== Defensive measures ==
  
 
If you run large web servers, you may want to consider [http://code.google.com/p/ldspider/wiki/ServerConfig defensive measures] against abuse and attacks.
 
On Apache web servers, [http://www.fleiner.com/bots/#banning mod_rewrite can be used] to block bad crawlers based on their IP address or User-Agent string.
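
As a rough sketch, mod_rewrite rules along the following lines (placed in the server configuration or an .htaccess file) refuse requests whose User-Agent or client IP address matches a pattern. The agent string and the address range below are placeholders, not actual known offenders.

<pre>
RewriteEngine On
# Refuse (403) requests whose User-Agent contains the placeholder string "BadCrawler" ...
RewriteCond %{HTTP_USER_AGENT} BadCrawler [NC,OR]
# ... or whose client address falls in the placeholder range 192.0.2.*
RewriteCond %{REMOTE_ADDR} ^192\.0\.2\.
RewriteRule .* - [F]
</pre>

Blocking by User-Agent is easy for a crawler to evade, so IP-based rules are often needed as well.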
There are several sites dedicated to collecting and sharing information about bad web crawlers in general (not RDF-specific):

* [http://www.bot-trap.de/ BotTrap.de] (in German)
* …
== Incidents ==
  
To report a poorly behaving crawler, please provide at least the following information:
  
* Date of incident:
* What the crawler did wrong:
* User agent string:
* IP address range:
* Access logs (if possible):
  
 
== References ==
 
<references/>
