Multilingual Issues in WWW Indexing and Searching

Position paper for the W3C Distributed Indexing/Searching Workshop

Philip Resnik and Gary Adams
Sun Microsystems Laboratories


The World Wide Web is an international phenomenon, yet its infrastructure is at present ill-equipped to help users deal with languages other than those with which they are familiar. With the advent of Unicode, browsers that seamlessly support the display of multiple languages are not far off, but so far little has been done to address the issue of multilingual content. As things stand, most of the popular Web search engines do have pages in multiple languages appearing in their indexes, but they provide no multilingual support to speak of, whether at indexing time, at search time, or in helping the user cope when confronted with foreign-language text. This position paper is intended primarily to flag some of the issues that need to be addressed if standards for distributed Web searching and indexing are to take seriously the multilingual nature of the World Wide Web.


Unless one adopts an IR framework based on character subsequences, indexing depends on the identification of meaningful units, typically word forms or word stems. Some key issues include the following:

Query Processing

In addition to the same set of issues that arises at indexing time, processing user queries also raises the following questions:

Conceptual Matching

"Conceptual" is something of a recent buzzword in the information retrieval business. Within a single-language setting, the general issue is locating text that might not use exactly the same words found in the query; for example, a search involving "agriculture" might do well to turn up documents about "farming". Multilingual retrieval is in a sense a generalization of this problem: a search for "computer science", viewing that term as a concept, should turn up instances of that concept even when expressed in another language, e.g. as "l'informatique".

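As a rough illustration of the conceptual-matching idea described above, the following Python sketch indexes documents by language-independent concept identifiers rather than by surface words. The concept table, the document collection, and all function names are hypothetical and invented for this example; a real system would need proper tokenization, morphological analysis, and a far larger multilingual lexicon, and naive substring matching is used here only to keep the sketch short.

```python
from collections import defaultdict

# Hypothetical, hand-built concept table (an assumption for illustration):
# each concept ID lists surface terms that express it, in any language.
CONCEPT_TERMS = {
    "AGRICULTURE": ["agriculture", "farming"],
    "COMPUTER_SCIENCE": ["computer science", "informatique"],
}


def concepts_in(text):
    """Return the set of concept IDs whose terms occur in the text."""
    lowered = text.lower()
    return {
        concept_id
        for concept_id, terms in CONCEPT_TERMS.items()
        if any(term in lowered for term in terms)  # naive substring match
    }


def build_index(docs):
    """Map each concept ID to the set of document IDs that express it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for concept_id in concepts_in(text):
            index[concept_id].add(doc_id)
    return index


def search(index, query):
    """Return sorted IDs of documents sharing a concept with the query."""
    hits = set()
    for concept_id in concepts_in(query):
        hits |= index.get(concept_id, set())
    return sorted(hits)


# Toy collection (invented for this example).
DOCS = {
    "en1": "An introduction to computer science for beginners.",
    "fr1": "Cours d'introduction à l'informatique.",
    "en2": "Modern farming techniques.",
}

index = build_index(DOCS)
print(search(index, "computer science"))  # ['en1', 'fr1']
print(search(index, "agriculture"))       # ['en2']
```

Because both the query and the documents are reduced to concept identifiers before matching, the English query "computer science" retrieves the French page, and "agriculture" retrieves the page that mentions only "farming". The hard part, which this sketch sidesteps entirely, is building and disambiguating the mapping from words to concepts.
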
Presentation Issues

If a search turns up hits in multiple languages, that still is not the end of the story: support must be provided for users who may not be familiar with all the languages they are faced with in response to a query. At Sun Labs, we have been working on a pilot project designed to help users "get the gist" of pages in unfamiliar languages, in order to decide whether to avail themselves of on-line opportunities for getting documents translated.


