Bad RDF Crawlers
This page is intended to hold a list of poorly behaved crawlers that target RDF-publishing websites, an unfortunately recurring problem. The list allows publishers to defend themselves by blocking such crawlers.
Best practices for web crawlers
Dereferencing is a privilege, not a right. Crawlers that use server resources inconsiderately abuse that privilege, with bad consequences for the Web as a whole.
A well-behaved crawler …
- … uses reasonable limits for default crawling speed and re-crawling delay,
- … obeys robots.txt,
- … obeys crawling speed limitations in robots.txt (Crawl-Delay),
- … identifies itself properly with the User-Agent HTTP request header, including contact information therein,
- … avoids excessive re-crawling,
- … respects HTTP caching headers such as Last-Modified and ETag when re-crawling, by sending conditional requests (If-Modified-Since, If-None-Match).
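The robots.txt checks above can be sketched with Python's standard library. This is a minimal illustration, not a complete crawler; the robots.txt content, crawler name and URLs are made-up examples:

```python
# Minimal sketch of robots.txt politeness checks, using only the
# standard library. All names and URLs below are placeholders.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A proper User-Agent identifies the crawler and gives contact info.
agent = "ExampleRDFBot/1.0 (+http://example.org/bot; bot-admin@example.org)"

# Check Disallow rules before fetching a URL.
allowed = rp.can_fetch(agent, "http://example.org/data/resource.rdf")
blocked = rp.can_fetch(agent, "http://example.org/private/secret.rdf")

# Honour Crawl-Delay between requests to the same host
# (e.g. time.sleep(delay) in the fetch loop).
delay = rp.crawl_delay(agent)

print(allowed, blocked, delay)  # True False 10
```

In a real crawler, the robots.txt file would be fetched per host (e.g. with `rp.set_url(...)` and `rp.read()`) and the returned delay used to throttle the fetch loop.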
See Write Web Crawler for further guidelines.
If you run a large web server, you may want to consider defensive measures against abuse and attacks.
On Apache web servers, mod_rewrite can be used to block bad crawlers based on their IP address or User-Agent string.
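As a sketch of such a rule, the following Apache configuration fragment returns 403 Forbidden for requests matching a hypothetical User-Agent string ("BadBot") or a placeholder address range (192.0.2.x, the documentation range); substitute the values of the actual offending crawler:

```apache
# Block requests whose User-Agent contains "BadBot" (placeholder name)
# or which originate from a placeholder IP range.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} BadBot [NC,OR]
RewriteCond %{REMOTE_ADDR} ^192\.0\.2\.
RewriteRule .* - [F,L]
```

The `[NC]` flag makes the User-Agent match case-insensitive, and `[F]` sends the 403 response without serving the resource.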
There are several sites dedicated to collecting and sharing information about bad web crawlers in general (not RDF-specific):
- BotTrap.de (in German)
To report a poorly behaving crawler, please provide at least the following information:
- Date of incident:
- What the crawler did wrong:
- User agent string:
- IP address range:
- Access logs (if possible):
- Think before you write Semantic Web crawlers, public-lod post by Martin Hepp, 21 June 2011