Cloud Harvester


seeded from the DAML crawler, SchemaWeb, etc.

a page for adding URIs manually

any URI used in a search query

any URI encountered in harvested data, at progressively lower weight (see the frontier sketch below)

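a minimal sketch of that weighted frontier, assuming seeds enter at full weight and URIs discovered inside data inherit a decayed weight from the page they came from; the names, seed URLs, and decay factor are illustrative:

    import heapq

    DECAY = 0.5  # assumed decay factor for URIs discovered in data

    class Frontier:
        def __init__(self, seeds):
            # heapq is a min-heap, so weights are stored negated to
            # pop the highest-weighted URI first
            self._heap = [(-1.0, uri) for uri in seeds]
            heapq.heapify(self._heap)
            self._seen = set(seeds)

        def add_discovered(self, uri, parent_weight):
            """Queue a URI encountered in data at a reduced weight."""
            if uri not in self._seen:
                self._seen.add(uri)
                heapq.heappush(self._heap, (-parent_weight * DECAY, uri))

        def pop(self):
            """Return (weight, uri) for the next URI to fetch."""
            neg_weight, uri = heapq.heappop(self._heap)
            return -neg_weight, uri

    frontier = Frontier(["http://www.daml.org/", "http://www.schemaweb.info/"])
    weight, uri = frontier.pop()
    frontier.add_discovered("http://example.org/found-in-data.rdf", weight)
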
respect robots.txt, maybe also a "robots.rdf" analogue

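a sketch of the robots.txt check using the standard library's urllib.robotparser; "robots.rdf" has no standard parser, so only robots.txt is handled here, and the user-agent string is an assumption:

    from urllib import robotparser
    from urllib.parse import urljoin, urlparse

    USER_AGENT = "CloudHarvester"  # assumed user-agent string

    _parsers = {}  # cache one parser per site root

    def allowed(uri):
        """Return True if robots.txt permits fetching this URI."""
        root = "{0.scheme}://{0.netloc}/".format(urlparse(uri))
        rp = _parsers.get(root)
        if rp is None:
            rp = robotparser.RobotFileParser(urljoin(root, "robots.txt"))
            rp.read()  # fetch and parse the robots.txt file
            _parsers[root] = rp
        return rp.can_fetch(USER_AGENT, uri)
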
honor HTTP/1.1 caching (Expires, Cache-Control max-age)

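a sketch of the freshness test over stored headers, with Cache-Control max-age taking precedence over Expires as HTTP/1.1 specifies; how headers and fetch times are stored is assumed:

    import re
    import time
    from email.utils import parsedate_to_datetime

    def is_fresh(headers, fetched_at):
        """True if the stored copy is still fresh and need not be refetched."""
        cc = headers.get("Cache-Control", "")
        m = re.search(r"max-age=(\d+)", cc)
        if m:
            return time.time() < fetched_at + int(m.group(1))
        expires = headers.get("Expires")
        if expires:
            try:
                return time.time() < parsedate_to_datetime(expires).timestamp()
            except (TypeError, ValueError):
                return False  # malformed Expires date: treat as stale
        return False  # no caching headers: refetch
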
stores the raw data, a quick-load parsed form, and the response headers

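one possible on-disk layout for that store, keeping raw bytes, a pickled quick-load form, and headers per URI; directory scheme and field names are assumptions, not a fixed format:

    import hashlib
    import json
    import pickle
    from pathlib import Path

    STORE = Path("store")  # assumed layout: one directory per URI hash

    def save(uri, raw_bytes, parsed_triples, headers, fetched_at):
        d = STORE / hashlib.sha1(uri.encode()).hexdigest()
        d.mkdir(parents=True, exist_ok=True)
        (d / "raw").write_bytes(raw_bytes)  # raw data as fetched
        (d / "parsed.pickle").write_bytes(pickle.dumps(parsed_triples))  # quickload form
        (d / "meta.json").write_text(json.dumps(
            {"uri": uri, "headers": dict(headers), "fetched_at": fetched_at}))

    def load_parsed(uri):
        d = STORE / hashlib.sha1(uri.encode()).hexdigest()
        return pickle.loads((d / "parsed.pickle").read_bytes())
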
index of words and URIs that appear in the data
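
a sketch of that index, treating both plain words and URIs as terms mapping back to the documents that contain them; the in-memory structure is an assumption, and persistence is out of scope here:

    import re
    from collections import defaultdict

    index = defaultdict(set)  # term (word or URI) -> set of source URIs

    def index_document(source_uri, text, uris_in_data):
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(source_uri)
        for uri in uris_in_data:  # URIs are indexed verbatim
            index[uri].add(source_uri)

    def lookup(term):
        """Return the URIs of documents containing a word or URI."""
        return index.get(term, set())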