Bad RDF Crawlers
From W3C Wiki
This page is intended as a list of poorly behaving crawlers that target RDF-publishing websites.
This allows publishers to defend themselves by blocking such crawlers by user agent or IP range.
For background, see the public-lod mailing list thread "Think before you write Semantic Web crawlers".
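As a rough illustration of blocking by user agent or IP range, a publisher could run a check like the following before serving a request. This is only a sketch: the function name `is_blocked`, the agent substring `ExampleBadBot`, and the network `192.0.2.0/24` (a range reserved for documentation) are placeholders, not entries reported on this page. In practice the same rule is often expressed in the web server or firewall configuration instead.

```python
import ipaddress

# Hypothetical blocklist; real entries would be taken from reported incidents.
BLOCKED_AGENT_SUBSTRINGS = ["ExampleBadBot"]
BLOCKED_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]  # documentation-only range


def is_blocked(user_agent: str, remote_addr: str) -> bool:
    """Return True if the request matches a blocked user agent or IP range."""
    agent = user_agent.lower()
    # Case-insensitive substring match on the User-Agent header value.
    if any(fragment.lower() in agent for fragment in BLOCKED_AGENT_SUBSTRINGS):
        return True
    # Membership test of the client address against each blocked network.
    address = ipaddress.ip_address(remote_addr)
    return any(address in network for network in BLOCKED_NETWORKS)
```

Matching on the user agent string only helps against crawlers that identify themselves consistently; IP ranges are the fallback when they do not.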
Best practices
Dereferencing is a privilege, not a right. Crawlers that don't use server resources considerately abuse that privilege. A well-behaved crawler must:
- use reasonable limits for default crawling speed and re-crawling delay,
- obey robots.txt,
- obey crawling speed limitations in robots.txt (Crawl-Delay),
- identify itself properly with the User-Agent HTTP request header, including contact information therein,
- avoid excessive re-crawling.
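The rules above could be combined in a crawler roughly as follows. This is a minimal sketch using Python's standard `urllib.robotparser`, not a reference implementation: the class `PoliteFetcher`, the crawler name `ExampleRDFCrawler`, the contact details, and the default values (5 seconds between requests, one day before re-crawling a URL) are all assumptions chosen for illustration. Note that `urllib.robotparser` only recognizes integer Crawl-delay values.

```python
import time
import urllib.request
import urllib.robotparser

# Hypothetical crawler identity; the header carries contact information.
AGENT_TOKEN = "ExampleRDFCrawler"
USER_AGENT = AGENT_TOKEN + "/0.1 (+http://example.org/crawler; mailto:admin@example.org)"


class PoliteFetcher:
    """Fetches URLs from one host while obeying that host's robots.txt rules."""

    def __init__(self, robots_lines, default_delay=5.0, min_recrawl=86400.0):
        self.rules = urllib.robotparser.RobotFileParser()
        self.rules.parse(robots_lines)  # lines of the host's robots.txt
        # Use the site's Crawl-delay if it gives one, else a conservative default.
        delay = self.rules.crawl_delay(AGENT_TOKEN)
        self.delay = float(delay) if delay is not None else default_delay
        self.min_recrawl = min_recrawl  # seconds before a URL may be re-fetched
        self.last_request = None        # monotonic time of the previous request
        self.fetched_at = {}            # URL -> monotonic time of its last fetch

    def allowed(self, url):
        """True if robots.txt permits this crawler to fetch the URL."""
        return self.rules.can_fetch(AGENT_TOKEN, url)

    def fetch(self, url):
        """Return the response body, or None if the fetch is skipped."""
        if not self.allowed(url):
            return None
        now = time.monotonic()
        previous = self.fetched_at.get(url)
        if previous is not None and now - previous < self.min_recrawl:
            return None  # fetched recently; avoid excessive re-crawling
        if self.last_request is not None:
            wait = self.delay - (now - self.last_request)
            if wait > 0:
                time.sleep(wait)  # respect the crawl delay between requests
        request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        self.last_request = time.monotonic()
        with urllib.request.urlopen(request, timeout=30) as response:
            body = response.read()
        self.fetched_at[url] = self.last_request
        return body
```

A crawler would keep one such object per host, built from that host's robots.txt, for example `PoliteFetcher(robots_text.splitlines())`. Keeping the delay and re-crawl bookkeeping per host means a slow or restrictive site does not affect how other sites are crawled.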
Incidents
To report a poorly behaving crawler, please provide at least the following information:
- Date of incident:
- What the crawler did wrong:
- User agent string:
- IP address range:
