Bad RDF Crawlers

From W3C Wiki
Revision as of 21:41, 22 June 2011 by Rcygania2 (Talk | contribs)


This page is intended as a list of poorly behaving crawlers that target RDF-publishing websites.

This allows publishers to defend themselves by blocking such crawlers by user agent or IP range.
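For example, a publisher running Apache could block a misbehaving crawler in an .htaccess file. This is only a sketch: the user agent string "BadRDFBot" and the IP range 192.0.2.0/24 are placeholders, not entries from the list below.

```apache
# Sketch only: block a hypothetical crawler by User-Agent substring
# and by IP range (Apache 2.2-style directives, mod_setenvif).
SetEnvIfNoCase User-Agent "BadRDFBot" bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot
Deny from 192.0.2.0/24
```

Blocking by user agent only works against crawlers that identify themselves honestly; for the rest, IP-range blocking is the fallback.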

For background, see this public-lod thread: Think before you write Semantic Web crawlers

Best practices

Dereferencing is a privilege, not a right. Crawlers that don't use server resources considerately abuse that privilege. A well-behaved crawler must:

  • use reasonable limits for default crawling speed and re-crawling delay,
  • obey robots.txt,
  • obey crawling speed limitations in robots.txt (Crawl-Delay),
  • identify itself properly in the User-Agent HTTP request header, including contact information,
  • avoid excessive re-crawling.
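The robots.txt-related practices above can be sketched with Python's standard-library urllib.robotparser. The robots.txt content, bot name, and URLs here are illustrative assumptions, not taken from any real site; Crawl-delay handling works as shown in Python 3.6+.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt as an RDF publisher might serve it.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

# A well-behaved crawler identifies itself and includes contact information.
USER_AGENT = "ExampleRDFBot/1.0 (+http://example.org/bot; mailto:ops@example.org)"

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

def polite_fetch_plan(urls, default_delay=5.0):
    """Return (url, delay_seconds) pairs for the URLs the crawler may
    fetch, obeying Disallow rules and the Crawl-delay directive."""
    delay = rp.crawl_delay(USER_AGENT) or default_delay
    return [(u, delay) for u in urls if rp.can_fetch(USER_AGENT, u)]

plan = polite_fetch_plan([
    "http://example.org/data/resource1",
    "http://example.org/private/secret",
])
# The /private/ URL is excluded; the allowed URL carries the 10 s delay.
```

The actual fetch loop would then sleep for the returned delay between requests and send the User-Agent header on every request.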


To report a poorly behaving crawler, please provide at least the following information:

  • Date of incident:
  • What the crawler did wrong:
  • User agent string:
  • IP address range: