Serving Multilingual Online Documentation

Faith Zack
The Santa Cruz Operation, Inc
400 Encinal Street
Santa Cruz, CA 95061
Phone: (408) 427-7611
Fax: (408) 427-5417
email: faithz@sco.com

Abstract

Although the HTTP 1.1 Draft Internet Standard introduces the Accept-Language header to support serving multiple languages, the WWW community has been slow to implement this feature. Few browsers send Accept-Language because most servers don't do anything with it. Most servers don't recognize Accept-Language because it isn't supported by many browsers. We at SCO have implemented our help, online documentation and manual pages using WWW architecture, and because we control both the client and server sides, we were able to serve multilingual documentation despite the lack of standards at the time of our implementation. We were able to resolve many multilingual issues that might serve as lessons for a more general solution. This paper hopes to serve as motivation for the WWW at large to begin implementation of a true multilingual Web.

Introduction

The online documentation system provided with SCO's OpenServer system includes context-sensitive help for both graphical and character applications, reference documentation (man pages), and extensive online documentation such as user guides, tutorials, and programmer's guides. All documentation is delivered online in HTML.

SCO signed a source license agreement in 1993 with NCSA for both NCSA Mosaic and NCSA HTTPD. Both were modified to customize them for online documentation and context-sensitive help. Support for Unix Domain Sockets was added so the WWW model worked even on a standalone non-networked machine. We run Scohttpd on a separate well-known port (457) so it does not conflict with other WWW servers.

Backward compatibility for locating man pages is provided by a cgi-binary that encapsulates the traditional man page lookup heuristics. The man command line utility was rewritten to access this cgi-binary using the WWW. The man pages are also available from within the Scohelp browser.

Full-text search indices were created for each book or man page section, and shipped along with a "stanza" file that maps filenames to titles for the book. This allowed the search cgi-binary to return search results that are topic-oriented rather than filename oriented, without the overhead of opening each file to parse its title. A library stanza file serves as a database of books for full-text searching and is passed to the client on the first search request.

When we designed our online help system, the WWW was in its infancy, and many areas had not been fully specified. In particular, support for multiple languages was only sketchily described. Since then the standards have evolved to describe more details, but we still felt somewhat on our own in applying them to a full multilingual solution.

Multilingual extensions

Because of our worldwide market, it was essential that we be able to transparently deliver online documentation in the language given by the user's LANG environment variable. Because of the high cost of translation, not all the documentation would be translated, so we needed to provide one level of language fallback (to the English documentation). Different localizations of the product would translate different documentation, and generally only end-user documentation would be translated. A primary goal was to keep URLs language-independent. That is, all translated versions of a document would be referred to by the same URL. This would simplify hotlinks and allow more translated documentation to be added to the system without requiring cross-document references to change.

In order for this architecture to seamlessly integrate multiple languages, we made extensions and modifications to the Mosaic-based browser, the HTTP protocol, the HTTP server, and the CGI interface.

Accept-Language

Accept-Language is defined in the HTTP 1.1 [1] specification, although our implementation was based on earlier, less specific versions of this specification. Accept-Language is a new request header in the protocol and provides a mechanism for passing locale information to Scohttpd. The current specification for Accept-Language indicates that the language tags are defined by RFC 1766 [2], but at the time of our implementation this standard had not yet been published. We used the Unix locale names as our language tags, but provided a mapping mechanism for flexibility.

The user sets their preferred locale for their entire session with an environment variable, LANG, whose components control collation, currency notation, numeric formatting, and time preferences in addition to message language. The message language component of LANG is LC_MESSAGES. The user's LC_MESSAGES value is sent by Scohelp as part of each request to the server.
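The client side of this exchange can be sketched in a few lines of Python. This is an illustration only, not the Scohelp source: the function names are hypothetical, and the lookup order (LC_ALL, then LC_MESSAGES, then LANG) is the usual POSIX precedence, which we assume here rather than take from the Scohelp implementation.

```python
def message_locale(env):
    """Return the locale that governs message language, or None if unset.

    Assumes the usual POSIX precedence: LC_ALL, then LC_MESSAGES, then LANG.
    """
    for name in ("LC_ALL", "LC_MESSAGES", "LANG"):
        value = env.get(name)
        if value:
            return value
    return None


def request_headers(env):
    """Build the extra request headers for a single language preference."""
    locale = message_locale(env)
    # Send no Accept-Language at all when no locale is set, so that the
    # server falls back to its default document root.
    return {"Accept-Language": locale} if locale else {}
```

A caller would typically pass `os.environ` as `env`; a plain dictionary works equally well for experimentation.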

Scohelp sends only a single language preference. The HTTP 1.1 mechanism for associating an optional quality value with each language was not in the original specification, and was more refined than our needs dictated. Our implementation is consistent with the default behavior given in the HTTP 1.1 specification. Scohttpd will automatically fall back to a default (English) if the browser request doesn't include an Accept-Language, if the locale requested doesn't exist, or if the requested topic is not found within that locale.

We also made modifications in the server to send error messages back in the user's language, based on Accept-Language.

LanguageAlias

Because the Accept-Language values were not well defined at the time of our implementation, we built a mechanism for mapping Accept-Language values to a format understood by the server. LanguageAlias is a server configuration directive in the configuration file srm.conf that provides a simple mapping from language tags to server locales. The format of this tag is:

LanguageAlias:  <language tag>  <server locale>

We use the tag to map common locale names to a single naming convention on the server, as shown in this example:

LanguageAlias: french_france 	fr_FR.ISO8859-1 
LanguageAlias: fr_FR 		fr_FR.ISO8859-1 
LanguageAlias: german_germany 	de_DE.ISO8859-1 
LanguageAlias: de_DE 		de_DE.ISO8859-1
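As a rough sketch of how such directives might be read and applied, the following Python fragment parses LanguageAlias lines and expands a tag. It is not the Scohttpd code; the function names are hypothetical, and exact, case-sensitive matching of tags is an assumption.

```python
def parse_language_aliases(config_text):
    """Collect LanguageAlias directives into a {language tag: server locale} map."""
    aliases = {}
    prefix = "LanguageAlias:"
    for line in config_text.splitlines():
        line = line.strip()
        if not line.startswith(prefix):
            continue  # some other directive, a comment, or a blank line
        fields = line[len(prefix):].split()
        if len(fields) == 2:
            tag, server_locale = fields
            aliases[tag] = server_locale
    return aliases


def expand_alias(tag, aliases):
    """Map a language tag to a server locale; unknown tags pass through unchanged."""
    return aliases.get(tag, tag)
```

With the example configuration above, both `french_france` and `fr_FR` expand to `fr_FR.ISO8859-1`, so the rest of the server only needs to know one name per locale.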

LocalizedDocRoot

DocumentRoot is a server directive that gives the implicit root for locating URLs, and is used if no Accept-Language value is given. To allow reuse of URLs among different languages, we implemented parallel language trees for each translation. The translation trees that we deliver are always a sparse subset of the English default. For convenience, the translated trees are physically subdirectories of the English tree, but this is not required. A server configuration directive LocalizedDocRoot in srm.conf maps the Accept-Language tag to a locale-specific root. If the document requested does not exist in this locale, then the server will use the default document root to locate the file. For example, if Accept-Language=french_france, the URL will be located in /usr/lib/scohelp/fr, but if not found the server will try again to locate it in /usr/lib/scohelp before returning a Not Found error message. Only the first language tag is searched; any others are ignored in our current implementation. The format of this tag is:

LocalizedDocRoot:  <server locale>  <pathname>

Here is a sample configuration of LocalizedDocRoot:

DocumentRoot: /usr/lib/scohelp
LocalizedDocRoot: fr_FR.ISO8859-1 /usr/lib/scohelp/fr
LocalizedDocRoot: de_DE.ISO8859-1 /usr/lib/scohelp/de

All Language Aliases are expanded before the LocalizedDocRoot is searched.
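The lookup order described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the Scohttpd implementation: `locate_document` and its parameters are hypothetical names, the locale is assumed to have already been alias-expanded, and a real server would also have to reject paths that escape the document root.

```python
import os
import posixpath


def locate_document(url_path, locale, default_root, localized_roots,
                    exists=os.path.isfile):
    """Return the file that serves url_path, or None for a Not Found response.

    localized_roots maps a server locale to its LocalizedDocRoot pathname.
    """
    relative = url_path.lstrip("/")
    candidates = []
    localized_root = localized_roots.get(locale)
    if localized_root:
        # First choice: the translated tree for the requested locale.
        candidates.append(posixpath.join(localized_root, relative))
    # One level of fallback: the default (English) document root.
    candidates.append(posixpath.join(default_root, relative))
    for path in candidates:
        if exists(path):
            return path
    return None
```

Because the URL itself never names a language, adding a translation later only means placing a file in the localized tree; no links need to change.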

Common Gateway Interface Extensions

Two new environment variables were added to the Common Gateway Interface (CGI) so that the server can pass locale information to any CGI scripts that it executes (such as the search program or man).

LANG_PATH_TRANSLATED contains a translated version of PATH_INFO with any locale specific document roots expanded. It is essentially the same as PATH_TRANSLATED, except that it uses the locale specific document root (if it exists) instead of the default document root.

LANG gives us the ability to localize CGI script messages in the user's preferred language. LANG contains the locale specified in the Accept-Language header, with language aliases expanded.
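A CGI program might consume these two variables roughly as shown below. The sketch is illustrative only: the message catalog and function names are hypothetical, and falling back to PATH_TRANSLATED when LANG_PATH_TRANSLATED is absent is our assumption about how a script could stay usable under an unmodified server.

```python
import os

# Hypothetical message catalog, keyed by server locale.
NO_MATCHES = {
    "fr_FR.ISO8859-1": "Aucun document",
    "de_DE.ISO8859-1": "Keine Dokumente gefunden",
}
DEFAULT_NO_MATCHES = "No documents found"


def cgi_locale_context(environ=None):
    """Return (locale, translated path) as passed in by the server."""
    environ = os.environ if environ is None else environ
    locale = environ.get("LANG", "")
    # Prefer the locale-specific path; fall back to the standard CGI variable.
    path = (environ.get("LANG_PATH_TRANSLATED")
            or environ.get("PATH_TRANSLATED", ""))
    return locale, path


def no_matches_message(locale):
    """Pick a localized message, defaulting to English."""
    return NO_MATCHES.get(locale, DEFAULT_NO_MATCHES)
```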

Serving Multilingual Documents

Our documentation is structured with the translated directories as subdirectories of the default document root. Many books were not translated; because of the fallback mechanism, they are served in English, and translations can easily be added as more documentation is translated.

We currently have localized French and German interfaces and documentation, but can support any of the ISO8859-1 languages. Of course, the real cost of these localizations is the translation effort. Other ISO8859 languages can be supported if the appropriate font resources are installed, but the user cannot change the locale or codeset dynamically.

Documentation Scope

The user's locale determines the user interface language (we have localized Scohelp, libwww, and Scohttpd interfaces for French, German, Chinese, and Japanese) and in addition, defines a "documentation scope". This scope defines what documentation the user will see and have access to in the Table of Contents, Full-text Searching, and all navigation. Although we may ship and install multiple languages, the user sees only their language and the fallback language (which is English).

Figure 1 shows the Table of Contents presented to a user whose LANG is not set, so the English fallback text is displayed.

Figure 1: English Table of Contents

Our documentation uses a multi-tiered Table of Contents scheme. The top tier, shown in Figure 1, presents certain key books and "sets" that group the remainder of the books into primary subjects. The Table of Contents for some sets, such as The Graphical Environment Documentation Set, contains a mixture of both French and English book titles. Books themselves are always completely translated. We decided to translate all the set titles, but to only translate a book's title if the book itself was translated. All the key books presented at the top level Table of Contents were translated, so, for example, a French user sees an entirely translated Table of Contents, as shown in Figure 2.

Figure 2: French Table of Contents

The last line of Figure 2 informs the user that not all of the documentation has been translated. We originally considered marking translated books, but we found in practice that multilingual readers can tell by reading the titles if the book has been translated.

We use relative URLs in all of our documentation, because it is intended that the user be able to optionally move the entire documentation set off the desktop and onto a central documentation server, accessed via the WWW. The URLs on the Table of Contents pages for both languages are identical, as Figures 3 and 4 show.

Figure 3: English Table of Contents HTML Source

Figure 4: French Table of Contents HTML Source

Full-text Searching

Because our documentation is a heterogeneous mix of languages, we had to ensure that our full-text search engine searches only the books within the user's language scope.

Each language directory has a stanza file that describes the books and sets that have been translated and installed. This file is merged with the default library stanza file whenever any documentation is added or removed from the system. A merged library stanza file is maintained for each language on the system, but accessed only if Accept-Language is sent.
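The merge step can be pictured with a small Python sketch. The on-disk stanza format is not described here, so the sketch works on already-parsed dictionaries; the book identifiers, titles, and function name are all hypothetical.

```python
def merge_library_stanzas(default_titles, localized_titles):
    """Merge a language's stanza entries over the default (English) ones.

    Both arguments map a book or set identifier to its title. Translated
    entries replace the English ones; everything else falls through.
    """
    merged = dict(default_titles)
    merged.update(localized_titles)
    return merged


# Hypothetical example data: only one of the two books has been translated.
english = {"BOOK_A": "User's Guide", "BOOK_B": "Programmer's Guide"}
french = {"BOOK_A": "Guide de l'utilisateur"}
merged_french = merge_library_stanzas(english, french)
```

Here `merged_french` lists the translated title for `BOOK_A` and the English title for `BOOK_B`, which is the mix of titles a French user would see.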

The first time the search dialog is accessed, a request for the library stanza file is made to the server. The merged library stanza file appropriate to the Accept-Language value is sent and parsed to show the books and sets installed (see Figure 5). The list is kept by the browser for the entire session, and can be used to tailor the scope of full-text searching to specific books and/or sets.

Figure 5: English full-text search dialog

A French user will see the mix of translated and untranslated book titles, as shown in Figure 6.

Figure 6: French full-text search dialog

When the French user searches for a French word, she will get hits generally only in the French documentation. If she searches for an English word, it will generally be found in the English documentation. But many technical terms, and words that are common between both languages, will be found in both English and French documentation. The search results for such a word will contain both English and French topics, as shown in Figure 7.

Figure 7: Result of full-text search for http

Serving Asian Documents

Although there is a specification for an Accept-Charset in the HTTP 1.1 specification, this solution may not be adequate to solve the problem of multilingual WWW content, since there is no mechanism in the response to identify which character set the document is actually encoded in. Because of our closed model, we were able to sidestep the character set issue somewhat. Our Asian localization of Scohelp supports the multibyte encoding Extended Unix Code (EUC) in addition to the other multilingual extensions discussed above. We are able to make the assumption that a user viewing a non-ISO8859-1 document has a browser configured with the fonts necessary to display it. So a document is sent with no header fields or tags to indicate its character or code set, and the browser displays it in the current font. As long as the browser has the correct fonts installed, the document displays correctly, as shown in Figure 9.

Figure 9: Chinese Scohelp document

Until we devise a method to change fonts based on the character set of the document, this is not a truly multilingual solution. For example, it is impossible to read, say, a French document from a browser that is displaying Chinese fonts. This approach is barely acceptable for our needs, and we hope the standards will soon point the way to a more complete implementation.

The Challenges Ahead

A feature we would like to implement in a future version of Scohelp is a menu option so the user can change their language dynamically. Some users may prefer to read some documentation in their native language and other documentation in its original form. We would also like to extend the fallback mechanism to support more than one language, incorporating the quality values described in the HTTP 1.1 specification.
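To make the quality-value idea concrete, here is a minimal Python sketch of how a server might order the languages in an Accept-Language header before trying each localized root in turn. It is not part of our implementation; the function name is hypothetical, and treating a missing q as 1 and q=0 as "not acceptable" is our reading of the HTTP 1.1 specification.

```python
def parse_accept_language(header):
    """Return language tags ordered from most to least preferred."""
    ranked = []
    for position, item in enumerate(header.split(",")):
        parts = [part.strip() for part in item.split(";")]
        tag = parts[0]
        if not tag:
            continue  # skip empty list elements
        quality = 1.0  # assumed default when no q parameter is given
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # treat a malformed value as unacceptable
        if quality > 0:
            # Sort by descending quality; ties keep their header order.
            ranked.append((-quality, position, tag))
    ranked.sort()
    return [tag for _, _, tag in ranked]
```

For a header such as `de;q=0.5, fr, en;q=0.8`, the resulting order is fr, en, de, so a server could search the French tree first, then English, then German.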

Our focus is moving away from our closed model, to one where any browser can make use of our extensions. We are considering using a Netscape browser with its frames, plug-ins, and perhaps JavaScript to implement our user interface extensions. Without the benefit of a source license on other platforms, we need to rely more heavily on the WWW implementation of multilingual standards. Good standards are emerging, but our implementation reveals further needs.

References

[1] HTTP 1.1 Internet Draft, R. Fielding, H. Frystyk, T. Berners-Lee. January 19, 1996.

[2] "Tags for the Identification of Languages," RFC 1766, H. Alvestrand. UNINETT, March 1995.