Write Web Crawler

From W3C Wiki

Simple guidelines to help you write a Good Web Crawler (HTTP).

  • Read HTTP specification (RFC 2616).
  • Respect HTTP cache information such as If-Modified-Since, Last-Modified, etc.
  • Identify your crawler in the User-Agent HTTP header.
  • Respect robots.txt.
    • Add a delay of at least 1s between each query.
    • Obey any crawling speed limitations in robots.txt (Crawl-Delay).
  • When you need a lot of resources, including bandwidth, negotiate with the site owner to find an agreeable solution.

If you do not follow these simple rules, you might end up being blocked as a Bad Crawler.