Bad RDF Crawlers

This page is intended as a list of poorly behaving crawlers, which are unfortunately a recurring problem[1]. The list allows publishers to defend themselves by blocking such crawlers.

Best practices for web crawlers

Dereferencing is a privilege, not a right. Crawlers that don't use server resources considerately abuse that privilege, with bad consequences for the Web in general.

A well-behaved crawler …

  • … uses reasonable limits for default crawling speed and re-crawling delay,
  • … obeys robots.txt,
  • … obeys crawling speed limitations in robots.txt (Crawl-Delay),
  • … identifies itself properly with the User-Agent HTTP request header, including contact information therein,
  • … avoids excessive re-crawling,
  • … respects HTTP cache headers such as If-Modified-Since, Last-Modified and ETag when re-crawling.

See Write Web Crawler for further guidelines.
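
As an illustration, here is a minimal sketch (Python 3, standard library only) of a polite fetch loop that obeys robots.txt and Crawl-Delay, sends a descriptive User-Agent, and re-crawls conditionally with If-None-Match / If-Modified-Since. The user-agent string, contact address and URLs are placeholders, not part of any real crawler.

  # Illustrative sketch only: a polite fetcher using just the Python 3
  # standard library. User agent, contact address and URLs are placeholders.
  import time
  import urllib.error
  import urllib.request
  import urllib.robotparser
  from urllib.parse import urlsplit

  USER_AGENT = "ExampleRDFBot/0.1 (+http://example.org/bot; bot-admin@example.org)"

  def fetch_politely(urls, default_delay=5.0):
      robots = {}   # robots.txt parser per site
      cache = {}    # url -> (ETag, Last-Modified) from the previous fetch
      for url in urls:
          parts = urlsplit(url)
          base = "%s://%s" % (parts.scheme, parts.netloc)
          rp = robots.get(base)
          if rp is None:
              rp = urllib.robotparser.RobotFileParser(base + "/robots.txt")
              rp.read()
              robots[base] = rp
          if not rp.can_fetch(USER_AGENT, url):
              continue                                  # obey robots.txt
          delay = rp.crawl_delay(USER_AGENT) or default_delay
          headers = {"User-Agent": USER_AGENT}
          etag, last_modified = cache.get(url, (None, None))
          if etag:
              headers["If-None-Match"] = etag           # conditional re-crawl
          if last_modified:
              headers["If-Modified-Since"] = last_modified
          request = urllib.request.Request(url, headers=headers)
          try:
              with urllib.request.urlopen(request, timeout=30) as response:
                  cache[url] = (response.headers.get("ETag"),
                                response.headers.get("Last-Modified"))
                  yield url, response.read()
          except urllib.error.HTTPError as e:
              if e.code != 304:                         # 304 Not Modified is fine
                  raise
          time.sleep(delay)                             # respect Crawl-Delay

Caching the parsed robots.txt per site avoids re-fetching it for every URL, and remembering the ETag and Last-Modified values means unchanged documents cost the server only a 304 response on re-crawl.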

Defensive measures

If you run large web servers, you may want to consider defensive measures against abuse and attacks.

On Apache web servers, mod_rewrite can be used to block bad crawlers based on their IP address or User-Agent string.
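
As an illustration, a mod_rewrite fragment along these lines might look as follows; the User-Agent pattern and address prefix are placeholders to be replaced with values taken from your own access logs.

  # Illustrative only: block requests whose User-Agent contains "BadBot"
  # or whose address falls in the (placeholder) 192.0.2.0/24 range.
  RewriteEngine On
  RewriteCond %{HTTP_USER_AGENT} BadBot [NC,OR]
  RewriteCond %{REMOTE_ADDR} ^192\.0\.2\.
  RewriteRule .* - [F,L]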

There are several sites dedicated to collecting and sharing information about bad web crawlers in general (not RDF-specific).

Stronger defences using WebID

The above measures have been around since the beginning of Web crawling, and suffer from a number of problems:

  • IP addresses are very bad identifiers
    • they can be faked
    • a large number of users can sit behind one IP address; in the early Web (1995–1998) most addresses came through AOL proxies
  • HTTP headers can be faked or simply omitted
  • robots.txt works by convention only - it has no enforcement mechanism
    • robot writers need to know about it, and that cannot be taken for granted
    • not everyone publishing on a site can edit its robots.txt, so in any case it is not a very flexible mechanism for setting access control

While these measures were perfectly fine in a world where there were few robot writers and the computing power for running such tools was expensive, they are no longer appropriate in a world where every laptop has more RAM and CPU than the largest machines search engines were running on in 1996. What is required is strong, automatic access control that works in a distributed manner. For this to work, global authentication is required; otherwise robots would need to find the login page for every web site and create themselves a username and password for that site, which is clearly an impossible task.

Global authentication tied into Linked Data is exactly what is enabled by foaf+ssl, also known as WebID. Both HTTP and HTTPS resources can be protected this way:

  • HTTPS resources request client-side certificates as per the usual WebID protocol.
  • HTTP resources can use cookies, redirecting clients to an HTTPS endpoint for authentication if the request carries no cookie. If the client does not have a WebID-enabled certificate, OpenID or other methods of authentication can be used. Once authenticated, clients (and hence robots) can be redirected back to the HTTP resources and proceed as usual. (A sketch of the server-side verification step follows this list.)
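
As a rough, non-normative sketch of the server-side verification step (assuming the third-party rdflib and cryptography Python packages, and the http://www.w3.org/ns/auth/cert# vocabulary used by foaf+ssl): take the WebID URI from the certificate's subjectAltName, dereference the profile document, and check that it lists the certificate's public key. Helper names below are illustrative.

  # Non-normative sketch of WebID (foaf+ssl) verification on the server side.
  # Assumes the third-party "cryptography" and "rdflib" packages.
  import rdflib
  from cryptography import x509
  from cryptography.hazmat.primitives.asymmetric import rsa

  CERT = rdflib.Namespace("http://www.w3.org/ns/auth/cert#")

  def _literal_to_int(lit):
      # Profiles publish key numbers as xsd:integer or xsd:hexBinary literals.
      s = str(lit).strip()
      try:
          return int(s)
      except ValueError:
          return int(s, 16)

  def verify_webid(der_cert_bytes):
      """Return an authenticated WebID URI, or None."""
      cert = x509.load_der_x509_certificate(der_cert_bytes)
      san = cert.extensions.get_extension_for_class(x509.SubjectAlternativeName)
      claimed = san.value.get_values_for_type(x509.UniformResourceIdentifier)
      public_key = cert.public_key()
      if not isinstance(public_key, rsa.RSAPublicKey):
          return None
      numbers = public_key.public_numbers()
      for webid in claimed:
          graph = rdflib.Graph()
          graph.parse(webid)                      # dereference the profile
          for key in graph.objects(rdflib.URIRef(webid), CERT.key):
              modulus = graph.value(key, CERT.modulus)
              exponent = graph.value(key, CERT.exponent)
              if (modulus is not None and exponent is not None
                      and _literal_to_int(modulus) == numbers.n
                      and _literal_to_int(exponent) == numbers.e):
                  return webid                    # key matches: authenticated
      return None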

The advantages of WebID are many:

  • robots and crawlers can identify themselves as such by describing themselves as :Crawler in their WebID profile document (ontology to be developed), and so get access to special resources more useful to robots, such as full dumps or RSS feeds
  • authentication is automatically enforced, so writers of badly behaved robots will find out about it very quickly, as they won't get access until they comply
  • WebIDs are distributed and can preserve anonymity whilst enabling authentication: WebIDs can be self-generated and throw-away, and there is no center of control
  • good WebID users can get better service over time, giving even anonymously identified robots an incentive to adopt a strategy of long-term good behavior
  • getting WebIDs is very easy, and most libraries support client-side certificates, so it should only be a few hours' work for robot writers to enable their crawlers with it (see the sketch after this list)
  • building WebID-enabled application servers is not that much work either
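
On the crawler side, presenting a client-side certificate is only a few lines of code. A minimal sketch using the Python standard library follows; the certificate file names and the URL are placeholders.

  # Illustrative sketch: a crawler presenting its WebID client certificate
  # when fetching a protected HTTPS resource. File names and URL are placeholders.
  import ssl
  import urllib.request

  context = ssl.create_default_context()
  context.load_cert_chain(certfile="crawler-webid.pem", keyfile="crawler-webid.key")
  opener = urllib.request.build_opener(urllib.request.HTTPSHandler(context=context))
  with opener.open("https://example.org/protected/dataset.rdf") as response:
      data = response.read()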

The WebID Incubator Group is very keen to work with robot writers and Linked Data publishers to help them WebID-enable their applications.

Incidents

To report a poorly behaving crawler, please provide at least the following information:

  • Date of incident:
  • What the crawler did wrong:
  • User agent string:
  • IP address range:
  • Access logs (if possible):

References

  1. Think before you write Semantic Web crawlers, public-lod post by Martin Hepp, 21 June 2011. http://lists.w3.org/Archives/Public/public-lod/2011Jun/0433.html