At WebCrawler we are pleased to see W3C take an active interest in the area of resource discovery. We believe that offering Web users an effective search experience requires increasingly sophisticated information exchange between information providers and indexing systems. Practices to accomplish this will only gain critical mass if they are standardised and backed by the industry as a whole, and W3C could play a catalysing role.
The current generation of Web-wide indexing robots [1] all face essentially the same set of issues, each of which would benefit from increased communication between information providers, indexing services, and end users:
Avoiding indexing "bad" documents
This is partly addressed by the Standard for Robots Exclusion (SRE) [2]. The SRE has some known problems, for which we would like to suggest solutions.
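For illustration, consider a robot honouring the SRE before fetching a URL. This is a minimal sketch only: the host, robot name, and exclusion rules are assumptions, and Python's standard robots.txt parser stands in for a robot's own implementation.

    import urllib.robotparser

    # Suppose the server's /robots.txt carries these illustrative rules:
    #
    #   User-agent: *
    #   Disallow: /tmp/
    #   Disallow: /cgi-bin/
    #
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://example.com/robots.txt")
    rp.read()

    # The SRE check: skip any URL the rules exclude for this robot.
    if rp.can_fetch("ExampleRobot/1.0", "http://example.com/tmp/draft.html"):
        pass  # allowed: fetch and index the document
    else:
        pass  # excluded: leave the document out of the index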
Finding "good" documents to index
This can be addressed by simple extensions to the SRE or by other server-centric mechanisms. At the document level, relationships between documents need to be identified.
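As a sketch of what such document-level relationships could look like, LINK elements in a document's HEAD can name its place in a larger work; the file names and rel values below are illustrative assumptions, not part of any proposal here.

    <!-- Illustrative HEAD of one chapter in a multi-part document -->
    <HEAD>
    <TITLE>Chapter 2</TITLE>
    <LINK REL="contents" HREF="toc.html">
    <LINK REL="previous" HREF="chapter1.html">
    <LINK REL="next" HREF="chapter3.html">
    </HEAD>

A robot seeing these relationships could, for example, index the work as one logical resource rather than as unrelated pages.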
Describing documents
The suggestion of a small standard set of document-level meta-data (using META or LINK tags [3]) is an obvious and effective step we would welcome. More elaborate rating schemes such as PICS could even address group ratings of resources, but are not readily deployed.
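A minimal sketch of what such a set could look like, using META elements; the property names follow common practice of the time and are assumptions rather than a ratified standard:

    <!-- Illustrative document-level meta-data in a page's HEAD -->
    <META NAME="description" CONTENT="Position paper on Web-wide indexing">
    <META NAME="keywords" CONTENT="robots, indexing, resource discovery">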
Users searching for documents

While differentiation is important in the marketplace, users would benefit from standard search mechanisms, such as common query language constructs.
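To make the idea concrete, the sketch below evaluates one hypothetical construct, required and excluded terms, against a document's term set; the syntax and semantics are invented for illustration and are not a proposal.

    def matches(required, excluded, doc_terms):
        # A document matches when every required term appears
        # and no excluded term does.
        return required <= doc_terms and not (excluded & doc_terms)

    doc_terms = {"web", "robots", "indexing"}
    print(matches({"robots", "indexing"}, {"harvest"}, doc_terms))  # True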
Efficient indexing

Finally, mechanisms to aid the mechanics of indexing (such as Harvest) would be beneficial, but they are likely to be slow to deploy world-wide, and they warrant separate consideration from the issues above.
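One small efficiency mechanism already available in HTTP, and distinct from Harvest itself, is the conditional GET: a robot re-fetches a page only if it has changed since the last visit. A minimal sketch, with an assumed URL and date:

    import urllib.error
    import urllib.request

    # Ask for the page only if it changed since our last visit.
    req = urllib.request.Request(
        "http://example.com/page.html",
        headers={"If-Modified-Since": "Sat, 01 Jun 1996 00:00:00 GMT"},
    )
    try:
        with urllib.request.urlopen(req) as resp:
            body = resp.read()  # changed: re-index the new copy
    except urllib.error.HTTPError as err:
        if err.code == 304:
            pass  # unchanged: keep the existing index entry
        else:
            raise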
We look forward to discussing these and other issues further at the workshop.

Martijn Koster, Software Engineer, WebCrawler