Multilingual Issues in WWW Indexing and Searching
Position paper for the
W3C Distributed Indexing/Searching Workshop
Philip Resnik and Gary Adams
Sun Microsystems Laboratories
The World Wide Web is an international phenomenon, yet its
infrastructure is at present ill equipped to help users deal with
languages other than those with which they are familiar. With the
advent of Unicode, browsers that seamlessly support the display of
multiple languages are not far off, but thus far little has been done
to address the issue of multilingual content. As things
stand, most of the popular Web search engines do have pages in
multiple languages appearing in their indexes, but they provide no
multilingual support to speak of, either at indexing time, at search
time, or by way of helping the user cope when confronted with
pages in unfamiliar languages.
This position paper is intended primarily to flag some of the issues
that need to be addressed if standards for distributed Web searching
and indexing are to take seriously the multilingual nature of the
World Wide Web.
Unless one adopts an IR framework based on character subsequences,
indexing depends on the identification of meaningful units, typically
word forms or word stems. Some key issues include the following:
- Identifying the language of the text
- Mixed-language documents: document-level vs. passage-level retrieval
- How to segment text into words (e.g. Japanese)
- Stemming and morphology (e.g. German compounds)
- Punctuation conventions
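
The character-subsequence alternative mentioned above can be illustrated
with a small sketch. It indexes overlapping character trigrams instead of
words, so it needs no language identification, word segmentation, or
stemming. The function names (char_ngrams, build_index, search) and the
choice of n=3 are assumptions made for illustration only; they do not
describe any particular search engine.

```python
from collections import defaultdict


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of text.

    The text is lowercased and runs of whitespace are collapsed to one
    space. Text shorter than n yields a single short gram.
    """
    normalized = " ".join(text.lower().split())
    if not normalized:
        return []
    if len(normalized) < n:
        return [normalized]
    return [normalized[i:i + n] for i in range(len(normalized) - n + 1)]


def build_index(docs, n=3):
    """Map each n-gram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in char_ngrams(text, n):
            index[gram].add(doc_id)
    return index


def search(index, query, n=3):
    """Return ids of documents that contain every n-gram of the query.

    This is a candidate filter: the grams need not be contiguous in the
    document, so a real system would verify or rank the candidates.
    """
    grams = char_ngrams(query, n)
    if not grams:
        return set()
    candidates = set(index.get(grams[0], set()))
    for gram in grams[1:]:
        candidates &= index.get(gram, set())
    return candidates
```

Because the unit of indexing is a character window, the same code handles
text with and without spaces between words. The price is a larger index
and matches that ignore word boundaries.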
In addition to the same set of issues that arises at indexing time,
processing user queries also raises the following questions:
- Character set issues when entering queries on forms
- Restoring accents to query terms that omit them
- Dealing with variant spellings
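
One way to approach accent restoration is to fold accents away when
building a lookup table, then map an unaccented query term back to every
accented form seen at indexing time. The sketch below assumes Unicode
text and uses Python's standard unicodedata module; the helper names
(fold_accents, build_accent_table, restore_accents) are hypothetical.
Letters that do not decompose into a base letter plus combining marks
(German "ß", for example) are left as they are.

```python
import unicodedata
from collections import defaultdict


def fold_accents(term):
    """Lowercase a term and strip its combining marks (accents)."""
    decomposed = unicodedata.normalize("NFD", term.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_accent_table(vocabulary):
    """Map each accent-folded form to the original forms that produce it."""
    table = defaultdict(set)
    for word in vocabulary:
        table[fold_accents(word)].add(word)
    return table


def restore_accents(term, table):
    """Return the known forms that match the term once accents are ignored.

    If the term is unknown, it is returned unchanged as the only candidate.
    """
    return sorted(table.get(fold_accents(term), {term}))
```

The lookup returns all candidates rather than a single answer. Choosing
among them (French "cote", "côte", "côté") requires context or frequency
information that this sketch does not model.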
"Conceptual" is something of a recent buzzword in the information
retrieval business. Within a
single-language setting, the general issue is locating text that might
not use exactly the same words found in the query; for example, a
search involving "agriculture" might do well to turn up documents
about "farming". Multilingual retrieval is in a sense a
generalization of this problem: a search for
"computer science", viewing that term as a concept, should turn up
instances of that concept even when expressed in another language,
e.g. as "l'informatique".
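
The idea of treating a query as a concept can be sketched as query
expansion against a concept table. The table below is a hand-built toy
containing only the two examples from the text; its structure, the
language codes, and the expand_query function are assumptions for
illustration, not a description of any existing conceptual indexing
system.

```python
# Toy concept table: concept id -> language code -> expressions.
# Purely illustrative; a real resource would be far larger.
CONCEPTS = {
    "computer-science": {
        "en": ["computer science"],
        "fr": ["informatique"],
    },
    "agriculture": {
        "en": ["agriculture", "farming"],
    },
}


def expand_query(query, concepts=CONCEPTS, languages=None):
    """Return every known expression of the concept named by the query.

    If languages is given, only expressions in those languages are
    returned. A query that matches no concept is returned unchanged.
    """
    key = query.strip().lower()
    for by_language in concepts.values():
        known = {t.lower() for terms in by_language.values() for t in terms}
        if key in known:
            return {
                term
                for lang, terms in by_language.items()
                if languages is None or lang in languages
                for term in terms
            }
    return {query}
```

Under these assumptions, a search for "farming" also looks for
"agriculture", and a search for "computer science" restricted to French
looks for "informatique". The sketch ignores ambiguity: a term that
belongs to several concepts would need disambiguation.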
If a search turns up hits in multiple languages, that still is not the
end of the story: support must be provided for users who may not be
familiar with all the languages they are faced with in response to a
query.
At Sun Labs, we have been working on a pilot
project designed to help users "get the gist" of pages in unfamiliar
languages, in order to decide whether to avail themselves of
on-line opportunities for getting documents translated.
- Identifying the language of the hit
- Optional filtering to exclude hits in unfamiliar languages
- Alternatively, help for "getting the gist" in unfamiliar languages
- Pointers to on-line solutions for translation
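
The filtering option above can be sketched as follows, assuming each hit
already carries a language tag produced by some language identification
step. The Hit record, its field names, and the partition_hits function
are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    url: str
    title: str
    language: str  # assumed tag such as "en" or "fr", set at indexing time


def partition_hits(hits, known_languages, exclude_unfamiliar=False):
    """Split hits into (familiar, unfamiliar) by the user's languages.

    With exclude_unfamiliar=True the second list is empty, which models
    the optional filter. Otherwise the unfamiliar hits are kept so that
    gisting or translation help can be offered for them.
    """
    known = {lang.lower() for lang in known_languages}
    familiar = [h for h in hits if h.language.lower() in known]
    if exclude_unfamiliar:
        return familiar, []
    unfamiliar = [h for h in hits if h.language.lower() not in known]
    return familiar, unfamiliar
```

Keeping the unfamiliar hits in a separate list, rather than discarding
them, leaves room for the alternative of helping the user get the gist
before deciding whether a translation is worth requesting.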
- Andrew Pollack, "A Cyberspace Front in a Multicultural War",
New York Times, August 7, 1995, page D1.
- F. Yergeau, G. Nicol, G. Adams, and M. Duerst, "Internationalization
of the Hypertext Markup Language", Internet draft
draft-ietf-html-i18n-03.txt, February 13, 1996.
- T. Berners-Lee and D. Connolly, "Hypertext Markup Language - 2.0",
Request for Comments 1866, MIT/W3C, November 1995.
- Conceptual Indexing Fiscal 1995 Project Portfolio Report, November 1995.
- Sun Microsystems Laboratories Knowledge Technology
Group -- Conceptual Indexing Project home page.
This page is part of the DISW 96 workshop.
Last modified: Thu Jun 20 18:20:11 EST 1996.