Subject Indexing on the Web

David W. Robertson

Member of Technical Staff, ConText Server Group

Oracle Corporation

Current distributed indexing and search mechanisms often return imprecise results for a query. Searches are beginning to be based on the concepts, as well as the words, present within a document. Concept-based searches increase the ratio of desired documents to irrelevant documents returned.

One way that precision has been improved is through the use of subject catalogs. Typically, the main themes of a document are determined by a human cataloger rather than automatically. Manually determining the subjects of a large percentage of the documents on the Web is a daunting proposition.

The ConText option included in Oracle7 allows automatic classification of documents by subject. Indices reside in a standard Oracle database, providing security and fault tolerance. Automatic classification is made possible by natural-language processing using an extensive dictionary, and helps to address both problems discussed above: imprecise search results and the cost of manual cataloging.

Another problem in obtaining relevant information is the stress that search engines place on Web servers and the network in building their databases. The Harvest System addresses efficiency concerns in Web crawling and searching through the concepts of Harvest Gatherers and Brokers.

ConText would be a useful, commercially available index/search back end for those using Harvest Brokers and for those using SOIF. Can Harvest be extended to provide adequate support for subject attributes, through SOIF and the Essence subsystem, for use with search engines such as ConText? For example, ConText can extract the major themes of a document. Provisions for attributes specifying multiple subject headings would be useful. Support for mapping subject headings from one classification system to another would also be helpful.
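
To make the proposal concrete, the following Python sketch shows one way a SOIF-style record might carry several subject headings, together with their equivalents in a second classification system. It is an illustration under stated assumptions, not part of Harvest or ConText: the attribute names (Subject-1, Subject-1-Mapped), the crosswalk table, and its entries are all hypothetical, and the record layout follows the general SOIF pattern of an attribute name, a byte count in braces, a colon, a tab, and the value.

```python
# Hypothetical crosswalk from headings in one classification scheme to
# identifiers in another. The entries are placeholders, not real codes.
CROSSWALK = {
    "Computer networks": "scheme-b/networks",
    "Information retrieval": "scheme-b/retrieval",
}


def soif_attribute(name, value):
    """Format one attribute as NAME{byte-count}:<TAB>VALUE."""
    size = len(value.encode("utf-8"))
    return f"{name}{{{size}}}:\t{value}\n"


def soif_record(template, url, attributes):
    """Assemble a SOIF-style record from (name, value) pairs."""
    parts = [f"@{template} {{ {url}\n"]
    for name, value in attributes:
        parts.append(soif_attribute(name, value))
    parts.append("}\n")
    return "".join(parts)


def subject_attributes(headings, crosswalk):
    """Emit one numbered attribute per heading, plus a mapped
    attribute when the crosswalk knows an equivalent heading.
    The numbered naming scheme is an assumption of this sketch."""
    attrs = []
    for i, heading in enumerate(headings, start=1):
        attrs.append((f"Subject-{i}", heading))
        mapped = crosswalk.get(heading)
        if mapped is not None:
            attrs.append((f"Subject-{i}-Mapped", mapped))
    return attrs
```

For example, soif_record("FILE", "http://example.com/doc.html", subject_attributes(["Computer networks", "Gardening"], CROSSWALK)) would yield a record with two subject attributes, only the first of which has a mapped counterpart. Headings without a crosswalk entry are passed through unchanged, so no subject information is lost when a mapping is incomplete.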

