Bad RDF Crawlers

This page is intended as a list of poorly behaving crawlers that target RDF-publishing websites.

This allows publishers to defend themselves by blocking such crawlers by user agent or IP range.
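
In practice such blocking is usually configured at the web-server or firewall level; purely as an illustration, the following Python WSGI middleware sketch rejects requests by User-Agent fragment or IP prefix. The listed agent and address prefix are placeholders, not entries from this page.

  # Minimal WSGI middleware sketch: deny requests whose User-Agent or client
  # IP matches a block list. The block-list values below are placeholders.
  BLOCKED_AGENTS = ("BadExampleBot",)   # hypothetical offending user-agent fragment
  BLOCKED_PREFIXES = ("192.0.2.",)      # hypothetical offending IP range (TEST-NET-1)

  def block_bad_crawlers(app):
      def middleware(environ, start_response):
          agent = environ.get("HTTP_USER_AGENT", "")
          addr = environ.get("REMOTE_ADDR", "")
          if (any(a in agent for a in BLOCKED_AGENTS)
                  or any(addr.startswith(p) for p in BLOCKED_PREFIXES)):
              start_response("403 Forbidden", [("Content-Type", "text/plain")])
              return [b"Access denied to misbehaving crawlers.\n"]
          return app(environ, start_response)
      return middleware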

For background, see the public-lod mailing list thread "Think before you write Semantic Web crawlers".

Best practices

Dereferencing is a privilege, not a right. Crawlers that do not use server resources considerately abuse that privilege. A well-behaved crawler must do at least the following (a minimal sketch follows the list):

  • use reasonable limits for default crawling speed and re-crawling delay,
  • obey robots.txt,
  • obey crawling speed limitations in robots.txt (Crawl-Delay),
  • identify itself properly with the User-Agent HTTP request header, including contact information therein,
  • avoid excessive re-crawling.
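
As a rough illustration of these points, here is a minimal Python fetch loop using the standard library's urllib.robotparser; the user-agent string, contact address, and default delay are placeholder assumptions, and re-crawl scheduling is left out.

  # Minimal sketch of a polite fetch loop: it reads robots.txt, honours
  # Crawl-delay, identifies itself with contact details, and throttles requests.
  import time
  import urllib.request
  import urllib.robotparser

  USER_AGENT = "ExampleRDFCrawler/0.1 (+mailto:crawler-admin@example.org)"  # placeholder identity
  DEFAULT_DELAY = 5.0  # seconds between requests if robots.txt sets no Crawl-delay

  def fetch_politely(site_root, urls):
      rp = urllib.robotparser.RobotFileParser(site_root.rstrip("/") + "/robots.txt")
      rp.read()                                            # obey robots.txt
      delay = rp.crawl_delay(USER_AGENT) or DEFAULT_DELAY  # obey Crawl-Delay
      for url in urls:
          if not rp.can_fetch(USER_AGENT, url):            # skip disallowed paths
              continue
          req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
          with urllib.request.urlopen(req) as response:
              yield url, response.read()
          time.sleep(delay)                                # throttle between requests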

Incidents

To report a poorly behaving crawler, please provide at least the following information:

  • Date of incident:
  • What the crawler did wrong:
  • User agent string:
  • IP address range: