A position paper for the Distributed Indexing/Searching Workshop
Submitted by Cecilia Preston (email@example.com)
I do not have a web site established where I could give this a URL as requested in the call.
With the explosive growth of networked information systems, it is becoming evident that, even in this brave new world, the automatic indexing of large collections of information can be handled somewhat effectively. Many major issues nonetheless arise from the simple fact that there is a vast array of domains, each with its own vocabulary, and that the same 'string' can represent very different concepts depending on the domain. For example, the Final Jeopardy category this evening was DATES. Which DATES: the fruit, a notation of time, or part of the courting ritual in North America?
To make web crawlers and other forms of automatic indexing systems more useful, these larger issues need to be taken into consideration.
1) English (American) is not the only language on the planet. The work of the IAB character set workshop in March of this year will frame some of the issues that can be addressed in Internet standards that should allow for a consistent method of specifying the language in use.
2) Given #1 above, how can knowledge of language be incorporated into crawlers and similar tools? In English, t-h-e would most likely be discarded as having no content, but in French the same three letters (thé, "tea," once the accent is lost, as often happens) do carry content. Even within a single language (the DATES example above), context is required to determine meaning.
3) Vocabulary is used as a shorthand within a discipline to perform a number of functions: a) to signify that you are a member of the community, and b) to reduce an entire concept or structure to a word or phrase.
4) Metadata is as specific to a discipline as its vocabulary. Very generalized metadata such as the Dublin Core provides a minimal set of elements that are found in almost all networked information systems, such as an author (someone, or some other legal entity, responsible for creating the data and putting it in this form), a title (the shorthand for the entire work), and a date of creation or production. All of these elements allow a quick scan for relevance without retrieving the entire object or document.
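As a rough illustration of points 1, 2, and 4 taken together, the sketch below shows how an indexer might use a language element carried in a record's metadata to decide which words to discard. It is only a sketch under stated assumptions: the record layout, the field names ("language", "text"), the function names, and the tiny stop-word lists are all hypothetical, and the code is not drawn from any existing crawler or from the Dublin Core specification itself.

```python
import unicodedata

# Hypothetical, deliberately tiny stop-word lists keyed by language code.
# Real lists would be much longer and maintained per language.
STOP_WORDS = {
    "en": {"the", "a", "an", "of", "and"},
    "fr": {"le", "la", "les", "de", "du", "et"},
}


def strip_accents(text):
    """Simulate the loss of accents that often occurs in transit."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def index_terms(record):
    """Return the content-bearing terms of a record.

    The record is an assumed dictionary with a "text" field and an
    optional "language" field. When no language is declared, nothing
    is discarded, since we cannot know which words lack content.
    """
    stop = STOP_WORDS.get(record.get("language"), set())
    terms = []
    for token in record["text"].lower().split():
        token = strip_accents(token.strip(".,;:!?\"'"))
        if token and token not in stop:
            terms.append(token)
    return terms


english = {"language": "en", "text": "The price of the tea"}
french = {"language": "fr", "text": "Le prix du thé"}

print(index_terms(english))  # "the" is dropped as an English stop word
print(index_terms(french))   # "the" (from "thé") is kept as a French content word
```

The design choice worth noting is that the declared language, not the string itself, determines whether t-h-e is discarded. Without a consistent way to specify the language in the metadata, the indexer has no basis for making that decision.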
These and other related issues must be taken into account in the standards to be developed, which will allow networked information systems to provide both humans and machines with the information being sought without overloading any system. No single standard or small group of standards will be capable of handling this task. The standards developed by the many constituencies best positioned to describe their own data must interoperate.