Bad RDF Crawlers

This page is intended to hold a list of poorly behaving crawlers that target RDF-publishing websites, unfortunately a recurring problem.[1] The list allows publishers to defend themselves by blocking such crawlers.

Best practices for web crawlers

Dereferencing is a privilege, not a right. Crawlers that use server resources inconsiderately abuse that privilege, with bad consequences for the Web in general.

A well-behaved crawler …

  • … uses reasonable limits for default crawling speed and re-crawling delay,
  • … obeys robots.txt,
  • … obeys crawling speed limitations in robots.txt (Crawl-Delay),
  • … identifies itself properly with the User-Agent HTTP request header, including contact information therein,
  • … avoids excessive re-crawling,
  • … respects HTTP caching headers such as Last-Modified and ETag when re-crawling, by sending conditional requests with If-Modified-Since or If-None-Match.

See Write Web Crawler for further guidelines.
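
As an illustration of these practices, here is a minimal sketch in Python (standard library only). The user-agent string, contact address, default delay, and the polite_fetch helper are all hypothetical, and the Accept header merely reflects an RDF-oriented crawler; this is a sketch, not a complete crawler.

  import time
  import urllib.request
  import urllib.robotparser
  from urllib.error import HTTPError
  from urllib.parse import urljoin

  # Placeholder identity and contact details; adjust for your own crawler.
  USER_AGENT = "ExampleRDFBot/0.1 (+https://example.org/bot; mailto:ops@example.org)"
  DEFAULT_DELAY = 10  # seconds between requests when robots.txt gives no Crawl-delay

  def polite_fetch(url, last_modified=None, etag=None):
      """Fetch one URL while obeying robots.txt, Crawl-delay and HTTP caching headers."""
      robots = urllib.robotparser.RobotFileParser()
      robots.set_url(urljoin(url, "/robots.txt"))
      robots.read()
      if not robots.can_fetch(USER_AGENT, url):
          return None                                   # disallowed: do not dereference

      delay = robots.crawl_delay(USER_AGENT) or DEFAULT_DELAY
      time.sleep(delay)                                 # honour Crawl-delay, or a sane default

      headers = {"User-Agent": USER_AGENT,
                 "Accept": "application/rdf+xml, text/turtle"}
      if last_modified:
          headers["If-Modified-Since"] = last_modified  # conditional re-crawl
      if etag:
          headers["If-None-Match"] = etag
      request = urllib.request.Request(url, headers=headers)
      try:
          with urllib.request.urlopen(request) as response:
              return (response.read(),
                      response.headers.get("Last-Modified"),
                      response.headers.get("ETag"))
      except HTTPError as error:
          if error.code == 304:                         # unchanged since last crawl
              return None
          raise

A real crawler would cache the parsed robots.txt per host instead of re-reading it for every request, and would persist the returned Last-Modified and ETag values so the next re-crawl can be conditional.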

Defensive measures

If you run large web servers, you may want to consider defensive measures against abuse and attacks.

On Apache web servers, mod_rewrite can be used to block bad crawlers based on their IP address or User-Agent string.
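
For illustration, a minimal sketch of such rules as they might appear in a virtual host or .htaccess section, assuming mod_rewrite is enabled; the user-agent substring and the 192.0.2.0/24 address range below are placeholders, not actual offenders.

  # Sketch only: "BadRDFBot" and 192.0.2.x are placeholders, not known offenders.
  RewriteEngine On

  # Return 403 Forbidden to a crawler matching a User-Agent substring
  RewriteCond %{HTTP_USER_AGENT} BadRDFBot [NC]
  RewriteRule .* - [F,L]

  # Return 403 Forbidden to an IP address range (regex on REMOTE_ADDR)
  RewriteCond %{REMOTE_ADDR} ^192\.0\.2\.
  RewriteRule .* - [F,L]

Blocking by User-Agent only deters crawlers that identify themselves honestly; persistent offenders usually have to be blocked by IP range or at the firewall.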

WebID has been proposed as a stronger, simpler, and fairer defense against over-eager crawlers; see WebID and Crawlers.

There are several sites dedicated to collecting and sharing information about bad web crawlers in general (not RDF-specific).

Incidents

To report a poorly behaving crawler, please provide at least the following information:

  • Date of incident:
  • What the crawler did wrong:
  • User agent string:
  • IP address range:
  • Access logs (if possible):

References

  1. Martin Hepp, "Think before you write Semantic Web crawlers", post to the public-lod mailing list, 21 June 2011