W3C

Language Tags and Locale Identifiers for the World Wide Web

W3C Working Draft TODO TODO TODO

This version:
TODO
Latest version:
http://www.w3.org/TR/ltli/
Previous version:
http://www.w3.org/TR/2006/WD-ltli-20060612/
Editor:
Felix Sasaki, W3C

This document is also available in these non-normative formats: XML.


Abstract

Based on [BCP 47], currently represented by [RFC 4646] and [RFC 4647], this document describes mechanisms for identifying or selecting the language of content or locale preferences used to process information using Web technologies. It describes how document formats, specifications, and implementations should handle language tags, as well as data structures that extend these tags to describe international preferences.

Status of this Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This is an updated Public Working Draft of "Language Tags and Locale Identifiers for the World Wide Web (LTLI)".

This document describes mechanisms for identifying or selecting the language of content or locale preferences used to process information using Web technologies. It describes how document formats, specifications, and implementations should handle language tags, as well as data structures that extend these tags to describe international preferences.

This document was developed by the Internationalization Core Working Group, part of the W3C Internationalization Activity. The Working Group does not expect to advance this Working Draft to Recommendation Status. A complete list of changes to this document is available.

Send your comments to www-i18n-comments@w3.org. Use "[Comments on ltli WD]" in the subject line of your email, followed by a brief subject. The archives for this list are publicly available.

Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

Table of Contents

Appendices

A Normative References
B References (Non-Normative)
C Referencing BCP 47 (Non-Normative)
D Revision Log (Non-Normative)

1 Introduction

This section is informative.

1.1 Scope of this Specification

This document describes mechanisms for identifying or selecting the language of content or locale preferences used to process information using Web technologies. It describes how document formats, specifications, and implementations should handle the language tags described by [BCP 47].

1.2 Out of Scope

This specification will not deal with formats for locale data or actual locale data. One possible source of locale data and data formats is [LDML].

1.3 Application Scenarios

Identification of language and locale has a broad range of applications within the World Wide Web. Existing standards that make use of language identification include the xml:lang attribute in [XML 1.0], the lang and hreflang attributes in [HTML 4.01], and the language property in [XSL 1.0]. Locale identification is used, for example, within the CLDR project; see [LDML].

This specification defines normative guidelines and Best Practices in four areas:

  • Definition of values for language tags

  • Definition of values for locale identifiers

  • Definition of matching schemes for language tags

  • Definition of matching schemes for locale identifiers

2 Notation and Terminology

This section is normative.

2.1 Language Tags and Matching of Language Tags

This document uses the terms language tag and subtag, which are defined in [RFC 4646] or its successor.

In addition, this document uses the following terms, which are defined in [RFC 4647] or its successor:

  • language range

  • basic language range (see sec. 2.1 of [RFC 4647])

  • extended language range (see sec. 2.2 of [RFC 4647])

  • language priority list (see sec. 2.3 of [RFC 4647])

For examples, see Section 4: Matching of Language Tags.

3 Language Tags and Locale Values

This section is informative.

3.1 What is a Locale?

[Ed. note: This text is taken from the LTLI wiki and was produced by Francois.]

BEST PRACTICE: Use language as the core of locale identifiers.

"Locale" is a fairly old concept coming from the field of software localization in the 1980's. Localization is understood to mean doing whatever it takes to adapt a piece of software to a given group of users; we're talking about large groups here, such as a whole country or all the speakers of a certain language. The "locale", then, is the set of "things" common to this group, from the point of view of the software being localized. The most important part of localization is the translation of all text to the language of the users, so that they can understand it. But there are other aspects:

  • Traditionally, translating to another language often meant using another character set, which in turn required adapting the software to deal with that character set. Therefore "charset" was deemed to be part of a locale, e.g. in the POSIX locale model.

  • Apart from static text, which simply gets translated, software often generates or interprets text by itself. Even primitive applications were often able to interpret the user-provided answers to Yes/No questions (the answer being either "Y" for Yes or "N" for No). Thus the single letters used for Yes and No in a given language became part of the locale data for that language. And similarly for things such as dates, which software would often generate from a binary data value or interpret from user input. The software then needs to know the conventional order of components (year, month, day) and maybe even the names of the months, etc.

  • Depending on the particular application, many other things may be subject to adaptation during localization, and may therefore be considered part of the "locale". There is general agreement that language is the core part of the locale and is always present, which is not the case for any other "aspect" of a locale.

  • In many systems the notion of locale allows for customization, and thus is not tied to a particular language/country combination. For example, many systems allow customized date or time formats, number formats, choice of measurement system, and so on.

  • The concept of locale sometimes has little to do with software localization; it is simply a general bundle of preferences or other information associated with a user, such as the country of residence, the country of citizenship, and so on.

3.2 Language Tags versus Locale Identifiers

BEST PRACTICE: If possible, protocols should have the same field for conveying language and locale information, except in cases where the notion of locale encompasses extended information (see Section 3.1: What is a Locale?).

Historically, natural language identifiers [RFC 4646] have been used as locale identifiers by some programming languages or operating environments, which is natural since locale identifiers usually share certain core features related to natural language and country/region.

A major difference between language tags and locale identifiers is the meaning of the region code. In both language tags and locales, the region code indicates variation in language (as with regional dialects) or presentation and format (such as number or date formats). In a locale, the region code is also sometimes used to indicate the physical location, market, legal, or other governing policies for the user.

A language tag may be available in several places. HTTP provides the Accept-Language header field, and MIME has a Content-Language header which contains a language tag. In XML, the xml:lang attribute can be specified on elements; it marks all the contents and attribute values of the corresponding element as belonging to the language identified. What that means for processing those contents varies from application to application.

For more detailed information on the behavior of xml:lang, see [XML 1.0].
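
The scoping behavior of xml:lang described above can be illustrated with a minimal Python sketch. It assumes the standard library's xml.etree.ElementTree parser; the function name effective_languages and the sample document are hypothetical and serve only as an illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def effective_languages(element, inherited=""):
    """Yield (element name, language tag) for an element and its descendants.

    An element without its own xml:lang takes the value of its nearest
    ancestor; an empty string means that no language is declared.
    """
    language = element.get(XML_LANG, inherited)
    yield element.tag, language
    for child in element:
        yield from effective_languages(child, language)


# Hypothetical sample document, for illustration only.
sample = ET.fromstring(
    '<doc xml:lang="en"><p>Hello</p>'
    '<p xml:lang="de-DE">Hallo <em>Welt</em></p></doc>'
)
for name, language in effective_languages(sample):
    print(name, language)
```

In this sketch the first p element takes "en" from its parent, while the em element takes "de-DE" from the second p element.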

Note: This document does not aim to identify the information which may be part of a locale in addition to language. For this purpose, the reader should rely on the standards mentioned in Section 3.3: Standards for Language Identifiers and Locale Identification.

3.3 Standards for Language Identifiers and Locale Identification

[Ed. note: Purpose: Propose BCP 47 as the base standard for locale identifiers.]

BEST PRACTICE: Use [BCP 47] for language identifiers and as the basis for locale identification.

A minimal requirement for locale identifiers is the ability to specify the natural language; thus there is industry convergence on the use of [RFC 4646] or its successor as the core of a locale identifier. For example, [CLDR] uses [RFC 4646] or its successor as the core of a locale identifier, and provides syntax for extensions for non-linguistic information, such as preferred currency or timezone.

[RFC 4646] or its successor refers to language identification only. Locales can be identified in several ways. One method is by inference from language tags. For example, an implementation could map a language tag from an existing protocol, such as HTTP's Accept-Language header, to its locale model. Locales may also be identified directly by using the language tag syntax in data items (elements, attributes, headers, etc.) that explicitly serve the purpose of locale identification.
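
Inference from a language tag could look roughly like the following Python sketch. It reads an Accept-Language value into an ordered list of language ranges and maps the first known one to a locale record. The LOCALE_MODEL table, its fields, and both function names are assumptions made for illustration; they are not defined by this document or by HTTP.

```python
def parse_accept_language(header):
    """Return the language ranges of an Accept-Language value, best first."""
    weighted = []
    for position, item in enumerate(header.split(",")):
        parts = [part.strip() for part in item.split(";")]
        language_range = parts[0]
        if not language_range:
            continue
        quality = 1.0  # default weight when no q parameter is present
        for parameter in parts[1:]:
            name, _, value = parameter.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # treat an unreadable weight as "not acceptable"
        if quality > 0:
            # Sort by descending weight, then by position in the header.
            weighted.append((-quality, position, language_range))
    return [language_range for _, _, language_range in sorted(weighted)]


# Hypothetical locale model: keys and fields are illustrative assumptions.
LOCALE_MODEL = {
    "de-de": {"language": "de", "region": "DE", "decimal_separator": ","},
    "en-us": {"language": "en", "region": "US", "decimal_separator": "."},
}


def infer_locale(header):
    """Return the first locale record whose key equals a requested range."""
    for language_range in parse_accept_language(header):
        locale = LOCALE_MODEL.get(language_range.lower())
        if locale is not None:
            return locale
    return None
```

The exact-key comparison keeps the sketch short; an implementation would more likely apply one of the matching schemes of [RFC 4647] at that step.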

3.4 Canonicalization Of Locale Identifiers

[Ed. note: PURPOSE: Describe Formats which are based upon BCP 47 but need canonicalization (e.g. underscore "_" to hyphen "-").]

BEST PRACTICE: For locale identification, create variants of [BCP 47] identifiers which can be canonicalized.
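
As a minimal sketch of such a canonicalization, the Python function below turns an underscore-separated variant such as "en_US" into hyphenated form and applies the casing conventions mentioned in [RFC 4646] (lowercase language, title case script, uppercase region). The function name is hypothetical, the casing is a convention rather than a requirement, and the sketch does not validate its input or handle POSIX-style suffixes such as a charset.

```python
def canonicalize_locale_id(identifier):
    """Rewrite an underscore-separated identifier in hyphenated tag form."""
    subtags = identifier.replace("_", "-").split("-")
    result = []
    seen_singleton = False
    for index, subtag in enumerate(subtags):
        lowered = subtag.lower()
        if len(subtag) == 1:
            # Single-character subtags introduce extensions or private use.
            seen_singleton = True
            result.append(lowered)
        elif index == 0 or seen_singleton:
            result.append(lowered)            # language, or anything after a singleton
        elif len(subtag) == 4 and subtag.isalpha():
            result.append(subtag.title())     # script, e.g. "Latn"
        elif len(subtag) == 2 and subtag.isalpha():
            result.append(subtag.upper())     # region, e.g. "DE"
        else:
            result.append(lowered)            # variants and numeric regions
    return "-".join(result)


print(canonicalize_locale_id("zh_hant_tw"))
```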

3.5 Legacy Locale Formats

[Ed. note: Purpose: Describe Legacy formats for locale identification like POSIX.]

BEST PRACTICE: Avoid the use of legacy formats where possible.

3.6 Specification and processing of Language and Locale on the Web

[Ed. note: Purpose: Describe best practices for different "users" (browser / other clients, server) how to process language identifiers or locale information.]

4 Matching of Language Tags

BEST PRACTICE: Use matching schemes which are based on basic and/or extended language ranges as defined in [RFC 4647] or its successor.

Many specifications already define operations using matching. An example is the language pseudo-class :lang defined in sec. 5.11.4 of [CSS 2.1]. It matches elements based on their language. This specification formulates requirements on such operations, based on [RFC 4647] or its successor. An example of matching of language ranges is given below.

Example 1: Basic versus extended language range and language priority list

de-de is a basic language range. It matches e.g. the language tag de-DE-1996, but not the language tag de-Deva.
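
The basic range comparison can be sketched in a few lines of Python. This is an illustrative reading of basic filtering in [RFC 4647], not a normative implementation, and the function name is hypothetical.

```python
def basic_filter_match(language_range, language_tag):
    """Return True if the tag equals the range or starts with it plus a hyphen."""
    prefix = language_range.lower()  # comparison is case-insensitive
    tag = language_tag.lower()
    if prefix == "*":
        return True  # the wildcard range matches every tag
    return tag == prefix or tag.startswith(prefix + "-")


print(basic_filter_match("de-de", "de-DE-1996"))
print(basic_filter_match("de-de", "de-Deva"))
```

The trailing hyphen in the prefix test is what keeps de-de from matching de-Deva: a range only matches on whole subtags.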

de-*-DE is an extended language range. It matches all of the following tags:

  • de-DE

  • de-DE-x-goethe

  • de-Latn-DE-1996
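
A sketch of how an extended language range might be compared against a tag, following the steps of extended filtering in [RFC 4647], is given here in Python. The function name is hypothetical and the code is an illustration rather than a reference implementation.

```python
def extended_filter_match(language_range, language_tag):
    """Return True if the tag matches the extended language range."""
    range_subtags = language_range.lower().split("-")
    tag_subtags = language_tag.lower().split("-")
    # The first subtags must agree unless the range starts with a wildcard.
    if range_subtags[0] != "*" and range_subtags[0] != tag_subtags[0]:
        return False
    r, t = 1, 1
    while r < len(range_subtags):
        if range_subtags[r] == "*":
            r += 1                      # a wildcard matches any run of subtags
        elif t >= len(tag_subtags):
            return False                # the range asks for more than the tag has
        elif range_subtags[r] == tag_subtags[t]:
            r += 1
            t += 1
        elif len(tag_subtags[t]) == 1:
            return False                # never skip past a singleton such as "x"
        else:
            t += 1                      # skip an unmatched subtag in the tag
    return True


for tag in ("de-DE", "de-DE-x-goethe", "de-Latn-DE-1996", "de-x-DE"):
    print(tag, extended_filter_match("de-*-DE", tag))
```

The last tag, de-x-DE, is included as a counterexample: the DE after the private-use singleton is not a region subtag, so the range does not match it.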

"en; fr; zh-Hant" is a language priority list. It would be read as "English before French before Chinese as written in the Traditional script". Note that the syntax shown is only an example, since it depends on the protocol, application, or implementation that uses the list.

5 Conformance

This section is normative.

This section explains the conditions that specifications have to fulfill to be able to claim conformance to this specification.

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC 2119].

6 Requirements for Language Tags and Locale Values

This section is normative.

The following requirements are formulated for specifications that deal with language tags and locale values or matching schemes.

  1. Specifications that make use of language tags or locale values MUST meet the conformance criteria defined for "well-formed" processors, as defined in sec. 2.2.9 of [RFC 4646].

  2. Specifications that make use of language tags or locale values MAY validate these values. If they do so, they MUST meet the conformance criteria defined for "validating" processors, as defined in sec. 2.2.9 of [RFC 4646].

  3. Specifications that define operations on language tags or locale values using matching MUST use either a basic language range or an extended language range as defined in sec. 2.1 and 2.2 of [RFC 4647].

  4. Specifications that define operations on language tags or locale values using matching MUST specify whether the matching operation returns a single result (lookup as defined in [RFC 4647]), or a possibly empty set of results (filtering as defined in [RFC 4647]).
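
The difference between the two result types in the last requirement can be sketched in Python as follows. Both functions are simplified readings of filtering and lookup in [RFC 4647] using basic language ranges; the function names, the default parameter, and the sample tags are assumptions made for illustration.

```python
def filter_tags(priority_list, available_tags):
    """Filtering: return every available tag matched by any range (may be empty)."""
    matches = []
    for language_range in priority_list:
        prefix = language_range.lower()
        for tag in available_tags:
            lowered = tag.lower()
            matched = (prefix == "*" or lowered == prefix
                       or lowered.startswith(prefix + "-"))
            if matched and tag not in matches:
                matches.append(tag)
    return matches


def lookup_tag(priority_list, available_tags, default=None):
    """Lookup: return the single best available tag, or the default."""
    by_lowercase = {tag.lower(): tag for tag in available_tags}
    for language_range in priority_list:
        if language_range == "*":
            continue  # the wildcard range alone cannot select one tag
        subtags = language_range.lower().split("-")
        while subtags:
            candidate = "-".join(subtags)
            if candidate in by_lowercase:
                return by_lowercase[candidate]
            subtags.pop()  # truncate the range from the end
            if subtags and len(subtags[-1]) == 1:
                subtags.pop()  # drop a trailing singleton such as "x" as well
    return default


available = ["de", "de-DE", "de-DE-1996", "en", "fr-CA"]
print(filter_tags(["de-DE"], available))
print(lookup_tag(["de-CH", "en"], available))
```

With these sample tags, filtering on de-DE yields the two tags de-DE and de-DE-1996, while lookup on de-CH falls back to the single tag de.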

Note: Many specifications which have been created before [RFC 4646] and [RFC 4647] are conformant to these criteria. The purpose of the criteria is to provide a stable source for requirements for language and locale identification.
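
As an illustration of the "well-formed" check named in the first requirement, the following Python sketch tests a tag against a simplified transcription of the [RFC 4646] grammar and rejects repeated extension singletons. The regular expressions are an approximation written for this example, the function name is hypothetical, and no subtag is checked against the registry, so this is not a "validating" check.

```python
import re

# Simplified transcription of the langtag grammar (an approximation).
_LANGTAG = re.compile(
    r"""^
    (?:[a-z]{2,3}(?:-[a-z]{3}){0,3}            # language plus optional extlang
      |[a-z]{4}                                # reserved
      |[a-z]{5,8})                             # registered language subtag
    (?:-[a-z]{4})?                             # script
    (?:-(?:[a-z]{2}|[0-9]{3}))?                # region
    (?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*   # variants
    (?:-[0-9a-wy-z](?:-[a-z0-9]{2,8})+)*       # extensions
    (?:-x(?:-[a-z0-9]{1,8})+)?                 # private use
    $""",
    re.VERBOSE | re.IGNORECASE,
)
_PRIVATE_USE = re.compile(r"^x(?:-[a-z0-9]{1,8})+$", re.IGNORECASE)
_GRANDFATHERED = re.compile(r"^[a-z]{1,3}(?:-[a-z0-9]{2,8}){1,2}$", re.IGNORECASE)


def is_well_formed(tag):
    """Return True if the tag fits the grammar and repeats no extension singleton."""
    if _PRIVATE_USE.match(tag) or _GRANDFATHERED.match(tag):
        return True
    if not _LANGTAG.match(tag):
        return False
    singletons = []
    for subtag in tag.lower().split("-")[1:]:
        if subtag == "x":
            break  # everything after "x" is private use
        if len(subtag) == 1:
            singletons.append(subtag)
    return len(singletons) == len(set(singletons))


for tag in ("de-DE-1996", "en_US", "en-a-bbb-a-ccc"):
    print(tag, is_well_formed(tag))
```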

A Normative References

BCP 47
Best Common Practice for the Identification of Language. At the time of writing this document, BCP 47 is represented by [RFC 4646] and [RFC 4647].
RFC 2119
S. Bradner. Key Words for use in RFCs to Indicate Requirement Levels. IETF March 1997. Available at http://www.ietf.org/rfc/rfc2119.txt.
RFC 4646
Addison Phillips, Mark Davis, editors. Tags for the Identification of Languages, IETF September 2006. Available at http://www.ietf.org/rfc/rfc4646.txt.
RFC 4647
Addison Phillips, Mark Davis, editors. Matching of Language Tags, IETF September 2006. Available at http://www.ietf.org/rfc/rfc4647.txt.
RFC 3987
Martin Dürst, Michel Suignard. Internationalized Resource Identifiers (IRIs). IETF January 2005. Available at http://www.ietf.org/rfc/rfc3987.txt.

B References (Non-Normative)

CLDR
Common Locale Data Repository (CLDR). Available at http://unicode.org/cldr/.
CSS 2.1
Bert Bos, Tantek Çelik, Ian Hickson, Håkon Wium Lie. Cascading Style Sheets, level 2 revision 1. W3C Working Draft 13 June 2005. Available at http://www.w3.org/TR/2005/WD-CSS21-20050613/. The latest version of CSS 2.1 is available at http://www.w3.org/TR/CSS21/.
HTML 4.01
Dave Raggett, Arnaud Le Hors, Ian Jacobs, eds. HTML 4.01 Specification. W3C Recommendation 24 December 1999. Available at http://www.w3.org/TR/1999/REC-html401-19991224/. The latest version of HTML 4.01 is available at http://www.w3.org/TR/html401/.
LDML
Mark Davis. Locale Data Markup Language (LDML), Unicode Technical Standard #35. Available at http://unicode.org/reports/tr35/tr35-5.html. The latest version of LDML is available at http://unicode.org/reports/tr35/.
RFC 3066
H. Alvestrand, editor. Tags for the Identification of Languages, IETF January 2001. Available at http://www.ietf.org/rfc/rfc3066.txt.
WS-I18N
Addison Phillips, Mary Trumble. Web Services Internationalization (WS-I18N). W3C Working Draft 14 September 2005. Available at http://www.w3.org/TR/2005/WD-ws-i18n-20050914/. The latest version of WS-I18N is available at http://www.w3.org/TR/ws-i18n/.
WS-I18N Req
Addison Phillips. Requirements for the Internationalization of Web Services. W3C Working Group Note 16 November 2004. Available at http://www.w3.org/TR/2004/NOTE-ws-i18n-req-20041116/. The latest version of WS-I18N Req is available at http://www.w3.org/TR/ws-i18n-req/.
WS-I18N Scenarios
Debasish Banerjee, Martin Dürst, Mike McKenna, Addison Phillips, Takao Suzuki, Tex Texin, Mary Trumble, Andrea Vine, Kentaro Noji. Web Services Internationalization Usage Scenarios. W3C Working Group Note 30 July 2004. Available at http://www.w3.org/TR/2004/NOTE-ws-i18n-scenarios-20040730/. The latest version of WS-I18N Scenarios is available at http://www.w3.org/TR/ws-i18n-scenarios/.
XML 1.0
Tim Bray, Jean Paoli, C.M. Sperberg-McQueen, et al., eds. Extensible Markup Language (XML) 1.0 (Third Edition), W3C Recommendation 04 February 2004. Available at http://www.w3.org/TR/2004/REC-xml-20040204/. The latest version of XML 1.0 is available at http://www.w3.org/TR/REC-xml/.
XSL 1.0
Sharon Adler et al., eds. Extensible Stylesheet Language (XSL) Version 1.0. W3C Recommendation 15 October 2001. Available at http://www.w3.org/TR/2001/REC-xsl-20011015/. The latest version of XSL 1.0 is available at http://www.w3.org/TR/xsl/.

C Referencing BCP 47 (Non-Normative)

At the time of writing of this document, many specifications refer to [RFC 3066]. The best practice when developing specifications for language identification is to refer to [BCP 47]. [BCP 47] is currently represented by [RFC 4646] and [RFC 4647]. This specification takes [RFC 4646] as the basis for language identification, and [RFC 4647] as the basis for matching of language identifiers ("tags").

D Revision Log (Non-Normative)

The following log records changes that have been made to this document since the publication in June 2006.

The following log records changes that have been made to this document since the publication in April 2006.