Challenges of the Web to Libraries with Multilingual and Multiscript Collections

Randall K. Barry
Senior Network Standards Specialist
U.S. Library of Congress
Network Development and MARC Standards Office

2002-01-10

The Library Experience

Libraries have served as important sources of information for society for millennia. From the very beginning, librarianship has focused attention on the organization of the materials forming collections. In more recent times, the dissemination of information about those collections has grown in importance, to the point where it sometimes rivals the collections themselves. In our new millennium, librarians are witnessing the transformation of the information about their collections, the so-called "bibliographic metadata", into a commodity in and of itself.

Bibliographic metadata is particularly rich and diverse. More than almost any other body of information, it tends to be multilingual and requires the handling of multiple (nonlatin) scripts. This is especially true of bibliographic metadata created by large libraries such as the U.S. Library of Congress. The Library of Congress (LOC) has disseminated its bibliographic metadata (cataloging) since 1898, first in the form of printed cards and catalogs, and eventually in machine-readable form using a standard called MARC. Thanks to LOC's ability to share its bibliographic metadata effectively, libraries and other users of this data have been able to improve the control of their own collections while reducing the cost of doing so. For most of the last century, LOC bibliographic metadata has included data in the Latin script and many others. With the introduction of computers into libraries, the encoding of this bibliographic metadata had to be limited, initially because of computers' limited capacity to handle character sets. LOC mitigated this problem by deciding to transliterate nonlatin data into the Latin script for some languages, a measure introduced in the late 1970s as a temporary one, until technology would allow a return to collecting metadata in the vernacular (nonlatin) script.
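
To make the mechanics concrete, the following sketch (Python, purely illustrative) shows table-driven transliteration with a deliberately abbreviated Cyrillic-to-Latin mapping. The ALA-LC romanization tables libraries actually use are far more extensive and often context-sensitive; nothing below should be read as the real tables.

    # Table-driven transliteration sketch. The mapping is a small,
    # simplified subset invented for illustration, not the full
    # ALA-LC romanization table.
    CYRILLIC_TO_LATIN = {
        "а": "a", "б": "b", "в": "v", "г": "g", "д": "d",
        "ж": "zh", "з": "z", "и": "i", "к": "k", "л": "l",
        "м": "m", "н": "n", "о": "o", "п": "p", "р": "r",
        "с": "s", "т": "t", "у": "u", "ш": "sh", "щ": "shch",
    }

    def transliterate(text: str) -> str:
        # Replace each character found in the table; pass others through.
        return "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in text.lower())

    print(transliterate("Пушкин"))   # -> "pushkin"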

Information about materials in seven languages that do not use the Latin script (namely, Japanese, Arabic, Chinese, Korean, Persian, Hebrew, and Yiddish--also referred to collectively as "JACKPHY") has benefited from an interim library solution that encodes both the original nonlatin data AND a fully transliterated (Latin-script) alternate representation for those computer systems that cannot handle the original scripts. The character sets used for both Latin and nonlatin bibliographic metadata are largely unique to libraries (with the exception of ASCII, which forms the core of most library data). This situation has been less than ideal for integrating bibliographic metadata into a wider digital universe.
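
A rough sketch of that dual encoding follows, using Python dictionaries. In MARC 21, a regular field holds the transliterated data and a paired 880 field holds the original script, the two tied together by a linking subfield ($6); the record fragment below only approximates that structure and is a hypothetical example, not real cataloging.

    # Hypothetical record fragment: a romanized title field paired with
    # an 880 field carrying the same title in the original Arabic script.
    record = [
        {"tag": "245", "link": "880-01", "data": "al-Qāhirah"},   # Latin
        {"tag": "880", "link": "245-01", "data": "القاهرة"},       # Arabic
    ]

    def vernacular_for(record, tag):
        # Return the 880 (vernacular) field paired with the given tag.
        for field in record:
            if field["tag"] == "880" and field["link"].startswith(tag):
                return field["data"]
        return None

    print(vernacular_for(record, "245"))   # the Arabic form of the title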

The advent of the universal character set (ISO/IEC 10646 and Unicode) has finally presented libraries with a potential path for moving from the limiting bibliographic (library) character sets, and the systems that implement them, to a global standard embraced by virtually all sectors of automation.

Special Library Needs

Libraries such as the Library of Congress have large databases of bibliographic metadata, including Latin and nonlatin characters, that cannot always be handled well via the Web. Libraries have been active in creating Web-based interfaces to their internal databases, but even Latin data presents particular problems where interface with the Web is concerned. In library data, modified (e.g., accented) Latin letters are generally encoded with nonspacing characters representing the modifying marks. This is not typically the approach taken for other kinds of data. Since much of the currently popular Web technology embraces the alternative (non-library) approach of using precomposed letter-with-modifier characters for Latin data, libraries have found it more difficult to develop interfaces that can be integrated with non-library interfaces on the Web. Libraries must often choose between mapping their encodings of diacritical marks to other encodings, or dropping the diacritical marks altogether in their Web interfaces and OPACs (Online Public Access Catalogs).
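
The difference can be seen in a few lines of Python using the standard unicodedata module. Unicode's NFD form keeps the base letter followed by a combining mark (closest in spirit to library practice, although library sets such as ANSEL actually place the mark before the base letter), while NFC produces the single precomposed character common on the Web.

    import unicodedata

    # Decomposed: base letter plus a nonspacing combining mark.
    decomposed = "e\u0301"                # "e" + COMBINING ACUTE ACCENT

    # Precomposed: the single letter-with-modifier character.
    precomposed = unicodedata.normalize("NFC", decomposed)

    print(precomposed)                                    # "é" as one code point
    print(len(decomposed), len(precomposed))              # 2 1
    print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True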

Another typical characteristic of bibliographic metadata is the degree to which language and script can be mixed within a single document or record. This is less common in other kinds of data, where one or two languages and scripts per document or record are the norm. (8-bit environments tended to support one or two scripts, but no more.) The problem is seen most acutely in bibliographic metadata where parallel titles in a left-to-right (e.g., Latin) and a right-to-left (e.g., Arabic) script may occur alongside third and fourth scripts. Almost every research library faces this situation with multilingual dictionaries involving languages whose scripts differ in directionality.
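
The directionality problem can be illustrated with the Unicode bidirectional classes that a rendering engine consults when it reorders mixed-script text for display. The sketch below (Python, with a made-up dictionary title) simply reports those classes for each character.

    import unicodedata

    # Stored (logical) order is not display order: the Arabic word will
    # be rendered right-to-left even though it is stored in logical order.
    title = "Dictionary = قاموس"          # Latin plus Arabic ("dictionary")

    for ch in title:
        # "L" = left-to-right letter, "AL" = right-to-left Arabic letter,
        # "WS"/"ON" = whitespace and direction-neutral punctuation.
        print(repr(ch), unicodedata.bidirectional(ch))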

Information retrieval via the Web over data such as bibliographic metadata only compounds the problems created by the existence of multiple languages and scripts in the same database. It has led libraries to develop detailed normalization rules for Latin-based indexing, sorting, and searching. Work is currently underway to develop or adapt such rules for nonlatin scripts in library applications, but it would be to libraries' advantage for those rules to be as universal as possible and to apply to as many Web applications as possible.
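
As a simple illustration of what such rules do, the Python sketch below builds an index key by decomposing, stripping combining marks, folding case, and collapsing whitespace. The actual library rules (for example, the NACO normalization rules used for authority headings) are considerably more detailed; this is only the general idea.

    import unicodedata

    def index_key(heading: str) -> str:
        # Decompose so modifying marks become separate combining characters.
        decomposed = unicodedata.normalize("NFD", heading)
        # Drop the combining marks, then fold case and collapse whitespace.
        stripped = "".join(
            ch for ch in decomposed if not unicodedata.combining(ch)
        )
        return " ".join(stripped.casefold().split())

    print(index_key("Škola,  the  École"))   # -> "skola, the ecole"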

In any solution suggested for internationalization of the Web, the special nature of library bibliographic metadata needs to be considered. Bibliographic metadata generally proves to be a good test of the robustness of a system designed to handle data that is rich in terms of language and script. Libraries require that the Web be able to handle data of the richness they typically produce; even Latin-only bibliographic metadata puts that capability to the test.

High Expectations from Libraries

As long-term storehouses and providers of information, libraries and librarians have high expectations for the Web as a vehicle for dissemination of bibliographic metadata and beyond. The digitization of library materials is quickly allowing libraries to provide access not only to metadata, but also to the source materials for which the metadata was created. LOC and many other libraries, small and large, have noteworthy digitization projects underway to bring valuable materials to users who may not be able to travel to remote sites. Whatever Web technology is developed for delivering both metadata and digitized source materials, it MUST be able to preserve the linguistic and script content. Browsers, in particular, must be able to render the information in a way that is acceptable to the end user. Linguistic, graphic, and cultural acceptability are co-factors whose relative importance would be hard to prioritize.

Due to the volume and richness of the information with which libraries deal, cost, quality, and timeliness are also important factors to consider in developing solutions for internationalizing the Web. The technology people will have at their fingertips must be affordable, user-friendly (end users need to be able to configure a browser to work without difficulty), and available when needed. Despite all the hype about universal character sets and the Web, much remains to be done to realize the potential inherent in both technologies.

LOC's Contributions

The Library of Congress, as one of the richest repositories of multilingual and multiscript materials in the world, has a vested interest in technologies and standards that will allow it to perform its many-faceted mission to the U.S. Congress, the American people, and the global information universe. LOC has played, and continues to play, a leadership role in developing standards for the Web and implementing them. It serves as the maintenance agency for various American (ANSI) and international (ISO) standards, such as MARC (Z39.2/ISO 2709), information retrieval (Z39.50), and language codes (ISO 639), to name just a few. It helped to develop most of the important bibliographic character sets in use today and shared that expertise in the development of ISO/IEC 10646 and Unicode. It is committed to implementing Unicode at various levels internally and in its interfaces available to the public.

LOC has an ever-growing Web site that includes extensive documentation related to its own collections and the standards it maintains. The important MARC and Z39.50 Web sites are examples of those devoted to standardization. These sites already serve libraries as a primary source of information and provide links to many other sources of standards on the Web.

Conclusion

The challenges of the World Wide Web now touch virtually everyone with a computer. Since computers are usually found in libraries, and many libraries contribute to the body of information available via the Web, libraries are particularly well suited to participate in any discussion of Internationalization of the Web. The problems libraries have already faced have made most library professionals acutely aware of the challenges ahead, particularly those who deal with information in languages other than English.

The special nature of the information libraries collect, manage, and disseminate makes them key players in Web development. The Library of Congress, as one of the key players in the world of libraries--small, large, and specialized--takes part in many of the ongoing discussions with regard to the Web. Its staff are particularly well suited to address issues that, once dealt with effectively for the Library of Congress, can be dealt with adequately for other libraries with less complex requirements. Libraries are painfully aware that they do not yet possess all the answers with regard to making their information available on the Web. Although universal character sets such as Unicode hold the promise of facilitating dissemination of multilingual information, the use of Unicode and related technologies is not yet fully understood. A forum for examining the challenges posed by existing information, and by the infrastructure for disseminating it globally via the Web, is sorely needed. LOC hopes to play a role in a forum such as the W3C Workshop on Internationalization of the Web.


Randall K. Barry has worked at the U.S. Library of Congress since 1977. He is currently a standards specialist in the Network Development and MARC Standards Office. During his 14 years in that office he has specialized in data structures for bibliographic metadata and other kinds of data (MARC/ISO 2709, SGML/ISO 8879, HTML, and XML). He also specializes in standardization of character encodings. He is convener of ISO/TC46/SC4 Working Group 1 on character sets, and was project leader for the last six nonlatin character sets developed within ISO for library use. Since the publication of ISO/IEC 10646 and Unicode, Mr. Barry has led the effort to shift the focus of libraries from the ISO library sets to ISO/IEC 10646 and Unicode. This has included work on mappings of library and MARC character sets to Unicode. Prior to his work with standards, Mr. Barry spent nine years as a cataloger of Romance, Germanic, and Slavic language materials in various divisions of the Library of Congress.

Mr. Barry has a graduate degree in library and information science (M.A., Catholic University of America, 1980) and two undergraduate degrees in foreign languages, with concentrations in the Slavic, Romance, and Germanic languages (B.A., University of Maryland, 1976 and 1985). He is currently working toward a degree in Arabic language.