Microsoft is interested in working with the Internet community to create standard conventions, APIs and protocols for distributed indexing and distributed query. At present we'd like to focus on distributed indexing as we feel it is a more tractable problem.
Some issues that we feel are important to address at this conference are:
1. A crawler needs a way to find a list of documents that have changed since its last visit.
2. End users want rich query functionality using full text, sentence and paragraph proximity, tagging information, etc. Administrators want to minimize use of bandwidth and system resources. How can we balance these conflicting goals in designing an interchange format? Anything less than full text will cause indexers to lose information. Do we need formatting decoration? To what extent? To what standard?
3. How do we represent embedded information in an interchange format, and with what markings, if any, do we indicate that it was embedded (or linked)?
4. What is the minimum property set for an interchange format? We believe there should be one, and that an HTML syntax should be defined for it. The Dublin Core metadata set is a promising start.
5. Is HTML or a related DTD a suitable interchange format? SGML is attractive because it is well-known and parsers are available. Such a solution can take advantage of other work such as the proposal to represent the Dublin Core in HTML.
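To make issues 1 and 4 concrete, the following Python sketch shows one way a site might answer a crawler's "what changed since my last visit?" request and describe each changed document with a small property set written as HTML META elements. It is illustrative only, not a proposed standard: the `Document` fields, the function names, the choice of four properties, and the `DC.`-prefixed names (modeled loosely on the Dublin Core in HTML proposal) are all assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from html import escape


@dataclass
class Document:
    """Hypothetical record a site keeps for each indexable document."""
    url: str
    title: str
    creator: str
    modified: datetime


def changed_since(docs, last_visit):
    """Return documents modified strictly after last_visit, oldest first."""
    changed = [d for d in docs if d.modified > last_visit]
    return sorted(changed, key=lambda d: d.modified)


def to_meta_tags(doc):
    """Render an assumed minimal property set as HTML META elements."""
    props = {
        "DC.title": doc.title,
        "DC.creator": doc.creator,
        "DC.date": doc.modified.strftime("%Y-%m-%d"),
        "DC.identifier": doc.url,
    }
    # escape() also escapes double quotes, so values are safe in attributes.
    return "\n".join(
        f'<META NAME="{escape(name)}" CONTENT="{escape(value)}">'
        for name, value in props.items()
    )


if __name__ == "__main__":
    docs = [
        Document("http://example.com/a.html", "Page A", "Alice",
                 datetime(1996, 5, 1, tzinfo=timezone.utc)),
        Document("http://example.com/b.html", "Page B", "Bob",
                 datetime(1996, 5, 20, tzinfo=timezone.utc)),
    ]
    last_visit = datetime(1996, 5, 10, tzinfo=timezone.utc)
    for doc in changed_since(docs, last_visit):
        print(to_meta_tags(doc))
```

In this sketch the crawler supplies only the time of its last visit, so it transfers a short change list instead of refetching every document. Whether such a list should be pushed by the site or pulled by the crawler, and which properties belong in the minimum set, remain open questions.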
Is there interest in a web-crawling consortium? This could take the form of a non-profit corporation, a for-profit corporation, or an agreement to split the crawling problem among existing organizations. We are interested in discussing this.