Write Web Crawler
From W3C Wiki
Simple guidelines to help you write a Good Web Crawler (HTTP).
- Read HTTP specification (RFC 2616).
- Respect HTTP cache information such as If-Modified-Since, Last-Modified, etc.
- Identify your crawler in the User-Agent HTTP header.
- Respect robots.txt.
- Add a delay of at least 1s between each query.
- Obey any crawling speed limitations in robots.txt (Crawl-Delay).
- When you need a lot of resources, including bandwidth, negotiate with the site owner to find an agreeable solution.
If you do not follow these simple rules, you might end up being blocked as a Bad Crawler.