Multilingual Issues in WWW Indexing and Searching

Position paper for the W3C Distributed Indexing/Searching Workshop

Philip Resnik and Gary Adams
Sun Microsystems Laboratories
philip.resnik@east.sun.com
gary.adams@east.sun.com

Introduction

The World Wide Web is an international phenomenon, yet its infrastructure is at present ill equipped to help users deal with languages other than those with which they are familiar. With the advent of Unicode, browsers that seamlessly support the display of multiple languages are not far off, but thus far little has been done to address the issue of multilingual content. As things stand, most of the popular Web search engines do have pages in multiple languages appearing in their indexes, but they provide no multilingual support to speak of, either at indexing time, at search time, or by way of helping the user cope when confronted with foreign-language text. This position paper is intended primarily to flag some of the issues that need to be addressed if standards for distributed Web searching and indexing are to take seriously the multilingual nature of the World Wide Web.

Indexing

Unless one adopts an IR framework based on character subsequences, indexing depends on the identification of meaningful units, typically word forms or word stems. Some key issues include the following:

Identifying the language of the text
Mixed-language documents: document-level vs. passage-level retrieval
How to segment text into words (e.g. Japanese)
Stemming and morphology (e.g. German compounds)
Punctuation conventions
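
As an illustration of the character-subsequence alternative mentioned above, the following minimal sketch indexes overlapping character bigrams, which sidesteps word segmentation for text such as Japanese. The function names (char_ngrams, build_index, search) and the two sample documents are hypothetical, chosen only for this example; a real system would also need to rank the candidates and verify them against the text.

```python
from collections import defaultdict

def char_ngrams(text, n=2):
    """Return the overlapping character n-grams of text, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]

def build_index(docs, n=2):
    """Map each n-gram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in char_ngrams(text, n):
            index[gram].add(doc_id)
    return index

def search(index, query, n=2):
    """Return ids of documents containing every n-gram of the query.

    These are candidates only: a document can contain all the n-grams
    without containing the query as a contiguous string.
    """
    grams = char_ngrams(query, n)
    if not grams:
        return set()
    result = set(index.get(grams[0], set()))
    for gram in grams[1:]:
        result &= index.get(gram, set())
    return result

# Hypothetical sample documents, one without spaces between words.
docs = {
    "d1": "情報検索の研究",
    "d2": "information retrieval research",
}
index = build_index(docs)
```

With this index, search(index, "検索") finds the Japanese document without any knowledge of where its word boundaries fall. The price is a larger index and looser matching than word-based indexing provides.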

Query Processing

In addition to the same set of issues that arises at indexing time, processing user queries also raises the following questions:

Character set issues when entering queries on forms
Restoring accents to query terms that omit them
Dealing with variant spellings
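
One common way to approach unaccented query terms is to fold both index terms and query terms to an accent-free key, then look up the accented forms that share that key. The sketch below uses Python's standard unicodedata module; the helper names and the sample term list are assumptions made for illustration. It does not handle language-specific conventions such as writing German "ü" as "ue".

```python
import unicodedata

def fold_accents(term):
    """Lowercase a term and strip combining marks (e.g. 'é' becomes 'e')."""
    decomposed = unicodedata.normalize("NFD", term.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def build_folded_index(terms):
    """Group index terms by their accent-folded key."""
    folded = {}
    for term in terms:
        folded.setdefault(fold_accents(term), set()).add(term)
    return folded

# Hypothetical index vocabulary, for illustration only.
index_terms = ["élève", "eleve", "résumé", "resume", "naïve"]
folded_index = build_folded_index(index_terms)

def candidates(query_term):
    """Return the index terms a possibly unaccented query term may stand for."""
    return folded_index.get(fold_accents(query_term), set())
```

A query for "resume" then yields both "résumé" and "resume" as candidates. Choosing among several accented candidates still requires knowledge of the language or of the surrounding context.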

Conceptual Matching

"Conceptual" is something of a recent buzzword in the information retrieval business. Within a single-language setting, the general issue is locating text that might not use exactly the same words found in the query; for example, a search involving "agriculture" might do well to turn up documents about "farming". Multilingual retrieval is in a sense a generalization of this problem: a search for "computer science", viewing that term as a concept, should turn up instances of that concept even when expressed in another language, e.g. as "l'informatique".

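The idea of treating a query term as a concept can be sketched as query expansion over a small concept table that lists the expressions of each concept in several languages. The table below, its concept identifiers, and the expand_query function are hypothetical and serve only to illustrate the examples in the preceding paragraph; building such a resource at scale is the hard part of the problem.

```python
# Hypothetical concept table: concept id -> language code -> expressions.
CONCEPTS = {
    "agriculture": {"en": ["agriculture", "farming"], "fr": ["agriculture"]},
    "computer-science": {"en": ["computer science"], "fr": ["informatique"]},
}

def expand_query(query, concepts=CONCEPTS):
    """Return all known expressions of the concept that the query names.

    If the query matches no concept, it is returned unchanged (lowercased)
    as a single-element list.
    """
    q = query.strip().lower()
    for entry in concepts.values():
        all_terms = [t for terms in entry.values() for t in terms]
        if q in all_terms:
            return sorted(set(all_terms))
    return [q]
```

Under these assumptions, a search for "computer science" is expanded to include "informatique", and a search for "farming" also retrieves documents that mention "agriculture".
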
Presentation Issues

If a search turns up hits in multiple languages, that still is not the end of the story: support must be provided for users who may not be familiar with all the languages they are faced with in response to a query.

Identifying the language of the hit
Optional filtering to exclude hits in unfamiliar languages
Alternatively, help for "getting the gist" in unfamiliar languages
Pointers to on-line solutions for translation

At Sun Labs, we have been working on a pilot project designed to help users "get the gist" of pages in unfamiliar languages, in order to decide whether to avail themselves of on-line opportunities for getting documents translated.

References

Andrew Pollack, "A Cyberspace Front in a Multicultural War", New York Times, August 7, 1995, page D1.
F. Yergeau, G. Nicol, G. Adams, and M. Duerst, "Internationalization of the Hypertext Markup Language", Internet draft draft-ietf-html-i18n-03.txt, February 13, 1996.
T. Berners-Lee and D. Connolly, "Hypertext Markup Language - 2.0", Request for Comments 1866, MIT/W3C, November 1995.
Conceptual Indexing Fiscal 1995 Project Portfolio Report, November 1995.
Sun Microsystems Laboratories Knowledge Technology Group -- Conceptual Indexing Project home page.

This page is part of the DISW 96 workshop.
Last modified: Thu Jun 20 18:20:11 EST 1996.