WebID and Crawlers

From W3C Wiki

WebID can be used as an effective defense against abusive Web crawlers.

Traditional countermeasures are insufficient

The traditional measures described on Bad RDF Crawlers have been around since the beginning of Web crawling and suffer from a number of problems:

  • IP addresses are very bad identifiers
    • they can be faked
    • a large number of users can sit behind a single IP address. In the early Web (1995-1998), most requests came through the few IP addresses of AOL's proxies
  • headers can be faked or forgotten
  • robots.txt works by convention only - it has no enforcement mechanism
    • robot writers need to know about it, which is not always obvious to newcomers
    • not everyone publishing on a site can edit its robots.txt (it lives at the site root), so in any case it is not a very flexible mechanism for setting access control
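For example, a typical robots.txt merely asks crawlers to stay away from certain paths; a crawler that ignores the convention suffers no technical consequence:

   User-agent: *
   Disallow: /private/
   Crawl-delay: 10

Nothing in the protocol verifies who the requesting agent is or blocks it from fetching /private/ anyway.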

These measures were perfectly fine in a world where few people wrote robots and the computing power to run them was expensive, but they are no longer appropriate in a world where every laptop has more RAM and CPU than the largest machines search engines were running on in 1996.

Requirement: Strong distributed access control

What is required is strong, automatic access control that works in a distributed manner. Global authentication is needed for this to work: otherwise a robot would have to find the login page of every web site and create a username and password there, which is clearly impractical at Web scale.

Global Authentication tied into Linked Data is enabled by FOAF+SSL, also known as WebID. Both HTTP and HTTPS resources can be protected this way:

  • HTTPS resources request client-side certificates as part of the standard TLS handshake.
  • HTTP resources can redirect clients that have no session cookie to an HTTPS endpoint for authentication. If the client does not present a WebID-enabled X.509 certificate, OpenID or other authentication methods can be used as a fallback. Robots will find WebID the easiest option, since HTTPS client libraries generally support client-side certificates. Once authenticated, clients (and hence robots) are redirected back to the HTTP resource with a cookie and proceed as usual.
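For a robot author, the client side of this flow amounts to presenting a certificate during the TLS handshake. A minimal sketch using Python's standard library (the certificate and key paths are placeholders a robot operator would supply):

```python
import ssl
import urllib.request

def make_webid_opener(cert_file, key_file):
    """Build a urllib opener that presents the given client certificate
    during the TLS handshake -- which is all WebID asks of a crawler."""
    context = ssl.create_default_context()
    context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return urllib.request.build_opener(
        urllib.request.HTTPSHandler(context=context))

# Hypothetical usage -- file names and URL are illustrative only:
# opener = make_webid_opener("webid-cert.pem", "webid-key.pem")
# with opener.open("https://example.org/protected-data") as response:
#     data = response.read()
```

The server sees the certificate's public key and WebID URI during the handshake; no per-site account creation is involved.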

Advantages of WebID

The advantages of WebID are many:

  • Robots and crawlers can identify and describe themselves in detail in their WebID Profile document (an ontology for this is still to be developed), and so gain access to special resources more useful to robots, such as full data dumps or feeds. For example:
 # doap: and foaf: are the real DOAP and FOAF vocabularies;
 # "linky:" and "web:" are placeholder prefixes (ontology to be developed).
 @prefix doap: <http://usefulinc.com/ns/doap#> .
 @prefix foaf: <http://xmlns.com/foaf/0.1/> .

 linky:spider a web:Crawler ;
     doap:project <http://code.google.com/p/ldspider/> ;
     foaf:name "LP Spider" ; ...
  • Authentication is enforced automatically, so badly behaved robot writers will find out about it very quickly: they get no access to the data until they present a valid WebID X.509 certificate.
  • WebIDs are distributed and can preserve anonymity while enabling authentication. WebIDs can be self-generated and/or throw-away. There is no center of control.
  • Good WebID users can receive better service over time than those who don't use WebID, which gives even anonymously identified robots an incentive to pursue a strategy of long-term good behavior.
  • Getting a WebID is very easy, and most software libraries support client-side certificates, so enabling a crawler should take a robot author only a few hours of work.
  • Building WebID-enabled application servers is not that much work either.
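On the server side, a WebID-enabled application must extract the claimed WebID URI from the subjectAltName of the certificate presented during the TLS handshake, then dereference that URI and compare the published public key against the one in the certificate. A minimal sketch of the extraction step, assuming the certificate dict shape returned by Python's ssl.SSLSocket.getpeercert():

```python
def webid_from_peercert(peercert):
    """Return the first URI entry in the certificate's subjectAltName,
    which is where a WebID certificate carries the profile URI."""
    for kind, value in peercert.get("subjectAltName", ()):
        if kind == "URI":
            return value
    return None

# Example with a hand-built dict of the shape getpeercert() returns
# (the profile URI here is purely illustrative):
cert = {"subjectAltName": (("URI", "https://bob.example/profile#me"),)}
print(webid_from_peercert(cert))  # → https://bob.example/profile#me
```

The remaining work is fetching the profile document and matching keys, for which existing RDF and TLS libraries do the heavy lifting.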

The WebID Incubator Group

The WebID Incubator Group is very keen to work with robot writers and linked data publishers to help them WebID-enable their applications.
