Abstract

This document describes the best practices for identifying or selecting the language of content as well as the locale preferences used to process or display data values and other information on the Web. It describes how document formats, specifications, and implementations should handle language tags, as well as extensions to language tags that describe the cultural or linguistic preferences referred to in internationalization as a "locale".

Status of This Document

This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.

This is an updated Public Working Draft of "Language Tags and Locale Identifiers for the World Wide Web". The Working Group expects this to become a Working Group Note.

Note

If you wish to make comments regarding this document, please raise a GitHub issue. You may also send email to the list www-international@w3.org (subscribe, archives) as mentioned below. Please include [ltli] at the start of your email's subject. To make it easier to track comments, please raise separate issues or send separate emails for each comment. All comments are welcome.

This document was published by the Internationalization Working Group as a Working Draft.

Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. The group does not expect this document to become a W3C Recommendation. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

This document is governed by the 1 August 2014 W3C Process Document.

1. Introduction

This section is informative.

Language tags, as defined in [BCP47], identify the natural language of content on the Web and in Internet protocols and formats, providing the ability to perform language-specific formatting or processing. For example, a user-agent might use the language to select the right font for displaying text or a Web page designer might style text differently in one language than in another.

In addition, language tags are used to identify cultural or linguistic preferences. These are usually related to the natural language or regional association of the end user. These preferences are applied to processes such as presenting numbers, dates, or times; sorting lists linguistically; providing default presentation for items such as a calendar, units of measurement, or 12- vs. 24-hour time presentation; and many other details that users might find too tedious to set individually. Collectively, these preferences are usually called a locale.

This document describes best practices for the adoption and use of BCP47 language tags for the identification of natural language content as well as the use of language tags to represent the locale preferences of the user or content author. It describes how document formats, specifications, and implementations should handle the language tags described by [BCP47], as well as data structures that extend these tags to describe international preferences (see also sec. 3.1 in [WS-I18N-SCENARIOS]).

Identification of language and locale has a broad range of applications within the World Wide Web. Existing standards which make use of language identification include the xml:lang attribute in [XML10], the lang and hreflang attributes in [HTML], the language property in [XSL10], and the :lang pseudo-class in CSS [CSS3-SELECTORS]. Language tags are also used to identify locales, such as in the Unicode Common Locale Data Repository or "CLDR" project [CLDR].

Locales can be identified in several ways, which generally are dependent on the programming language and operating environment of the user. One method is by inference from language tags. For example, an implementation could map a language tag from an existing protocol, such as HTTP's Accept-Language header, to its locale model. Locales may also be identified directly by using the language tag syntax in data items (elements, attributes, headers, etc.) that explicitly serve the purpose of locale identification.

1.1 Out of Scope

This specification does not deal with formats for locale data or actual locale data. One source of locale data and data formats is the Unicode Common Locale Data Repository project ([CLDR]).

2. Notation and Terminology

This section is normative.

2.1 Languages, Language Tags and Matching of Language Tags

This document uses the term language to refer to what is sometimes called a natural language: the spoken, written, or signed communications used by human beings.

There are many ways that languages might be identified and many reasons that software might need to identify the language of content on the Web. Document formats and protocols on the Web generally use the identifiers used in most other parts of the Internet, consisting of the language tags defined in [BCP47].

[BCP47] is a multipart document consisting, at the time this document was published, of two separate RFCs. The first part, called Tags for Identifying Languages [RFC5646], defines the grammar, form, and terminology of language tags. The second part, called Matching of Language Tags [RFC4647], describes several schemes for matching, comparing, and selecting content using language tags and includes useful terminology related to comparison of language preferences to tagged content.

A language tag is a string used as an identifier for a language. In this document, the term language tag always refers explicitly to a [BCP47] language tag. These language tags consist of one or more subtags.

A subtag is a sequence of ASCII letters or digits separated from other subtags by the hyphen-minus character and identifying a specific element of meaning within the overall language tag. In [BCP47], subtags can consist of upper or lowercase ASCII letters (the case carries no distinction) or ASCII digits. Subtags are limited to no more than eight characters (although additional length restrictions apply depending on the specific use of the subtag).
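
As an illustration of the subtag structure described above, the following Python sketch splits a tag into subtags and checks only the generic shape of each one (one to eight ASCII letters or digits). The names split_subtags and has_plausible_subtags are hypothetical, and the check is deliberately weaker than the full [BCP47] grammar: passing it does not make a tag well-formed or valid.

```python
import re

# Generic subtag shape: 1 to 8 ASCII letters or digits.
# This is an assumption-level check, not the full BCP 47 grammar.
SUBTAG = re.compile(r"[A-Za-z0-9]{1,8}")


def split_subtags(tag: str) -> list[str]:
    """Split a tag on hyphen-minus, lowercased because case carries no distinction."""
    return tag.lower().split("-")


def has_plausible_subtags(tag: str) -> bool:
    """Return True if every subtag has the generic shape (hypothetical helper)."""
    return all(SUBTAG.fullmatch(subtag) for subtag in tag.split("-"))
```

For instance, split_subtags("zh-Hant-TW") yields the three subtags zh, hant, and tw, while has_plausible_subtags rejects strings such as en_US (wrong separator) or en--US (empty subtag).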

Selecting content or behavior based on the language tag requires a few additional concepts defined by [RFC4647]. In this document, we adopt the following terminology:

A language range is a string similar in structure to a language tag that is used for "identifying sets of language tags that share specific attributes".

A language priority list is a collection of one or more language ranges identifying the user's language preferences for use in matching. As the name suggests, such lists are normally ordered or weighted according to the user's preferences. The HTTP [RFC2616] Accept-Language [RFC3282] header is an example of one kind of language priority list.
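
A weighted header of this kind could be turned into an ordered language priority list roughly as follows. This is a minimal sketch, not a complete HTTP header parser: the function name parse_accept_language is hypothetical, entries with an unparseable weight are simply skipped, and a weight of zero is not given any special treatment.

```python
def parse_accept_language(header: str) -> list[tuple[str, float]]:
    """Return (language range, weight) pairs, highest weight first."""
    entries = []
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        language_range, _, params = item.partition(";")
        weight = 1.0  # a range without a q parameter is treated as weight 1
        params = params.strip()
        if params.lower().startswith("q="):
            try:
                weight = float(params[2:])
            except ValueError:
                continue  # assumption: ignore entries with a malformed weight
        entries.append((language_range.strip(), weight))
    # sorted() is stable, so ranges with equal weights keep their header order.
    return sorted(entries, key=lambda pair: pair[1], reverse=True)
```

Given the header value "da, en-gb;q=0.8, en;q=0.7", the sketch produces the ranges da, en-gb, and en in that order.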

A basic language range is simply a language tag used to express a language preference. An extended language range allows a more expressive set of language preferences through the use of a wildcard subtag *.
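
The effect of the wildcard can be illustrated with a short sketch of how an extended language range might be compared to a language tag. It is intended to follow the extended filtering scheme of [RFC4647], which remains the authoritative description; the function name extended_filter_match is hypothetical.

```python
def extended_filter_match(language_range: str, tag: str) -> bool:
    """Return True if the tag matches the extended language range."""
    range_subtags = language_range.lower().split("-")
    tag_subtags = tag.lower().split("-")

    # The first subtags must match, unless the range begins with the wildcard.
    if range_subtags[0] != "*" and range_subtags[0] != tag_subtags[0]:
        return False

    r, t = 1, 1
    while r < len(range_subtags):
        if range_subtags[r] == "*":
            r += 1  # a wildcard matches zero or more subtags in the tag
        elif t >= len(tag_subtags):
            return False  # the range needs more subtags than the tag has
        elif range_subtags[r] == tag_subtags[t]:
            r += 1
            t += 1
        elif len(tag_subtags[t]) == 1:
            return False  # never skip past a singleton such as "x" or "u"
        else:
            t += 1  # skip a non-matching subtag in the tag
    return True
```

With this sketch, the range de-*-DE matches de-DE, de-Latn-DE, and de-DE-x-goethe, but not de, de-Deva, or de-x-DE.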

2.2 What are Internationalization and Localization?

Users who speak different languages or come from different cultural backgrounds usually require software and services that are adapted to correctly process information using their native languages, writing systems, measurement systems, calendars, and other linguistic rules and cultural conventions.

International Preferences: A user's particular set of cultural conventions, language, and formatting choices that software must employ to correctly process or present information exchanged with that user.

Internationalization: The design and development of a product that is enabled for target audiences that vary in culture, region, or language. Internationalization is sometimes abbreviated I18N because there are eighteen letters between the "i" and the "n" in the English word.

There are many kinds of international preferences that may be offered on the Web in order for the content or service to be considered usable and acceptable by users around the world. Some of these preferences might include:

... and many more.

Because there are a large number of preferences, software systems (operating environments and programming languages) often use an identifier that combines natural language and other information, such as region or country, as a shorthand indicator for collections of preferences that typify categories of users that share certain cultural preferences.

HTML, for example, uses the lang attribute to indicate the language of segments of content. XML uses the xml:lang attribute for the same purpose.

Java, POSIX, .NET and other software development technologies use a similar-looking (but not identical) construct known as a locale to activate certain internationalized capabilities in software.

Locale: A collection of international preferences, generally related to a language and geographic region, that a certain category of users requires. These are usually identified by a shorthand identifier or token, such as a language tag, that is passed from the environment to various processes to get culturally affected behavior.

Generally, systems that are internationalized can support a wide variety of languages and behaviors to meet the international preferences of many kinds of users. When a particular system can respond to changes in the locale by trying to load different resources or performing culturally appropriate formatting, we say that this system is locale-aware or enabled.

Localization: The tailoring of a system to the individual cultural expectations of a specific target market or group of individuals. Localization includes, but is not limited to, the translation of user-facing text and messages. Localization is sometimes abbreviated as L10N because there are ten letters between the "L" and the "N" in the English word.

When a particular set of content and preferences corresponding to a specific locale is operationally available, then the system is said to be localized.

Localized systems often need to perform matching between the end-user's international preferences (their "locale") and the resources, content, or processing available. This is called Language Negotiation. Language negotiation is, thus, the process of matching a user's preferences (in the form of a locale or language tag) to available localized resources. The system searches for matching content or logic, "falling back" from more-specific resources to more-general ones following a deterministic pattern.
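
The fall-back pattern described above can be sketched as follows. This is a minimal, non-authoritative illustration modeled on the lookup scheme in [RFC4647]: each language range is progressively truncated from the end until an available resource matches. The names fallback_chain and lookup, and the idea of passing an explicit default, are assumptions made for this example.

```python
from typing import Optional


def fallback_chain(language_range: str) -> list[str]:
    """List the range followed by each progressively truncated form."""
    subtags = language_range.split("-")
    chain = []
    while subtags:
        chain.append("-".join(subtags))
        subtags.pop()
        # A trailing single-character subtag (such as "x" or "u") is removed
        # together with the subtag that followed it.
        if subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return chain


def lookup(priority_list: list[str], available: list[str],
           default: Optional[str] = None) -> Optional[str]:
    """Return the single best available tag, or the default if nothing matches."""
    by_lowercase = {tag.lower(): tag for tag in available}
    for language_range in priority_list:
        if language_range == "*":
            continue  # the bare wildcard range is skipped in this sketch
        for candidate in fallback_chain(language_range):
            match = by_lowercase.get(candidate.lower())
            if match is not None:
                return match
    return default
```

For example, a user preference of fr-CA followed by en, matched against resources tagged en, fr, and de, falls back from fr-CA to fr and selects the French resource.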

Language tags can provide information about the language, script, region, and language variation using subtags. But sometimes there are international preferences that do not correlate directly with any of these. For example, many cultures have more than one way of sorting content items, so the appropriate sort ordering cannot always be inferred from the language tag by itself: German-language users might want to choose between the sort orderings used in a dictionary and in a phone book.

One way to indicate these preferences is via registered Extensions to [BCP47]. The Unicode Common Locale Data Repository project [CLDR] maintains two such extensions: [RFC6497] defines an extension that describes transformations (generally text transformations, such as transliteration between scripts). [RFC6067] defines Unicode locales, which provide the ability to specify in a language tag a number of the international preference variations that users or content authors might wish to specify directly (such as the German dictionary/phone book difference described above).
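
As an illustration, the German phone book ordering can be requested with the tag de-u-co-phonebk, where u introduces the Unicode locale extension, co is the key for collation, and phonebk is its value. The sketch below shows one way an implementation might read such key and value pairs out of a tag. The function name unicode_extension_keywords is hypothetical, and the parsing is simplified: it assumes a well-formed tag and ignores any attributes that precede the first key.

```python
def unicode_extension_keywords(tag: str) -> dict[str, str]:
    """Return the key/value pairs of the 'u' extension, if the tag has one."""
    keywords: dict[str, str] = {}
    in_u = False
    key = None
    for subtag in tag.lower().split("-")[1:]:
        if len(subtag) == 1:
            if in_u or subtag == "x":
                break  # end of the u extension, or start of private use
            in_u = subtag == "u"
            continue
        if not in_u:
            continue
        if len(subtag) == 2:
            key = subtag  # two-character subtags are keys
            keywords[key] = ""
        elif key is not None:
            # Longer subtags are the value; a value may span several subtags.
            keywords[key] = f"{keywords[key]}-{subtag}".lstrip("-")
    return keywords
```

Applied to de-u-co-phonebk, the sketch returns a mapping from co to phonebk; a tag without the extension, such as de, yields an empty mapping.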

Note

Some preferences are individual and are left to content authors, service providers, operating environments, or user agents to define and manage on behalf of the user.

3. Conformance

As well as sections marked as non-normative, all authoring guidelines, diagrams, examples, and notes in this specification are non-normative. Everything else in this specification is normative.

The key words MAY, MUST, RECOMMENDED, SHOULD, and SHOULD NOT are to be interpreted as described in [RFC2119].

3.1 How to Define Language Tags and How to Reference [BCP47] and Related Standards

Specifications for the Web that require language identification MUST refer to [BCP47].

Specifications SHOULD NOT refer to specific component RFCs. The "BCP" nomenclature refers to the current set of RFCs that form the "best current practice". At the time this document was published, [BCP47] consisted of [RFC5646] (Tags for Identifying Languages) and [RFC4647] (Matching of Language Tags).

Formulations such as "RFC 5646 or its successor" MAY be used, but only in cases where the specific document version is necessary. While this style of reference was once popular, using the BCP reference is more accurate. Since the grammar of language tags has been fixed since [RFC4646], referring to the BCP will not incur additional compliance risk to most implementations.

Some specifications might desire or require compliance with the older language tag grammar, such as found in [RFC1766] or [RFC3066]. This grammar was more permissive and is described in [BCP47] as the production obs-language-tag. Specifications MUST reference [BCP47] and, when backwards compatibility is required, SHOULD reference obs-language-tag rather than referencing one of the obsolete versions of [BCP47]. The obsolete versions referred to by obs-language-tag include [RFC1766] and [RFC3066]. [RFC4646], which introduced the current grammar for language tags, is also obsolete.

Specifications MAY also reference registered extensions to [BCP47] as necessary. In particular, [RFC6067] defines the BCP 47 Extension U, also known as "Unicode Locales". This extension to [BCP47] provides additional subtag sequences for selecting specific locale variations.

Issue 1

Ed.: Add discussion of when to use obs-language-tag. Add discussion and requirements related to the terms 'well-formed' and 'valid'.

4. Locale versus Natural Language

This document defines locale identifiers for use in Web technologies. Historically, language tags [BCP47] have been used as locale identifiers by many programming languages or operating environments, which is natural since locale identifiers usually share certain core features related to natural language and country/region. This specification defines locale identifiers that specific implementations can map to their own features and to Web standards in order to create functional, interoperable applications.

The minimal requirement is the ability to specify the natural language; thus there is industry convergence on the use of [BCP47] as the core of a locale identifier. For example, [CLDR] uses [BCP47] as the core of a locale identifier, and provides syntax for extensions for non-linguistic information, such as preferred currency. This extension [RFC6067] is defined as a formal extension of [BCP47], making Unicode locale identifiers synonymous with language tags in most contexts.

A major difference between language tags and locale identifiers is the meaning of the region code. In both language tags and locales, the region code indicates variation in language (as with regional dialects) or presentation and format (such as number or date formats). In a locale, the region code is also sometimes used to indicate the physical location, market, legal, or other governing policies for the user.

The language tag may be available in several places. In HTTP, the Accept-Language header field can be used. MIME has a Content-Language header which contains a language tag. In XML, the xml:lang attribute can be specified on elements; it marks all the contents and attribute values of the corresponding element as belonging to the language identified. What that means for processing those contents varies from application to application.

For more detailed information on the behavior of xml:lang, see XML 1.0 (Fifth Edition) [XML10].
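
The way an xml:lang value applies to the contents of an element can be sketched with Python's standard xml.etree.ElementTree module. The function name effective_languages and the sample document are made up for this illustration; the sketch only tracks which language is in effect for each element and does not interpret the tags.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the reserved xml: prefix to this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def effective_languages(element: ET.Element, inherited: str = "") -> list[tuple[str, str]]:
    """Return (element name, language in effect) pairs in document order."""
    # An element without its own xml:lang takes the value of its parent.
    language = element.get(XML_LANG, inherited)
    pairs = [(element.tag, language)]
    for child in element:
        pairs.extend(effective_languages(child, language))
    return pairs


# Hypothetical sample document: the second paragraph overrides the language.
doc = ET.fromstring(
    '<doc xml:lang="en"><p>Hello</p><p xml:lang="de">Guten Tag, Welt!</p></doc>'
)
```

Calling effective_languages(doc) reports en for the root element and the first paragraph, and de for the second paragraph.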

Issue 2

The I18N WG has existing best practices documentation which may or may not be appropriate to subsume into this document. These include:

5. Language Tags and Locale Values

This section is normative.

The following requirements are formulated for specifications which deal with language tags and locale values or matching schemes.

  1. Specifications that make use of language tags or locale values MUST meet the conformance criteria defined for "well-formed" processors, as defined in sec. 2.2.9 of [RFC5646].

  2. Specifications that make use of language tags or locale values MAY validate these values. If they do so, they MUST meet the conformance criteria defined for "validating" processors, as defined in sec. 2.2.9 of [RFC5646].

  3. Specifications that define operations on language tags or locale values using matching MUST use either a basic language range or an extended language range.

  4. Specifications that define operations on language tags or locale values using matching MUST specify whether the resulting language priority list contains a single result (lookup as defined in [RFC4647]), or a possibly empty set of results (filtering as defined in [RFC4647]).
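
To illustrate why the distinction in the last requirement matters, the following sketch shows filtering with basic language ranges, modeled on the basic filtering scheme in [RFC4647]. The function name basic_filter is hypothetical. Filtering returns every tag that the range matches exactly or as a prefix ending at a subtag boundary, so the result can contain several tags or none at all.

```python
def basic_filter(priority_list: list[str], available: list[str]) -> list[str]:
    """Return every available tag matched by any basic language range."""
    matches: list[str] = []
    for language_range in priority_list:
        prefix = language_range.lower()
        for tag in available:
            candidate = tag.lower()
            if (prefix == "*"
                    or candidate == prefix
                    or candidate.startswith(prefix + "-")):
                if tag not in matches:
                    matches.append(tag)  # keep each tag once, in match order
    return matches
```

Here the range de selects both de and de-CH from the available tags but not den. The range de-CH does not match a resource tagged only de, whereas lookup as defined in [RFC4647] would truncate the range and could return de.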

Note

Many specifications created before [RFC5646] and [RFC4647] are conformant to these criteria. The purpose of the criteria is to provide a stable source for requirements for language and locale identification.

6. Implementation of this Specification

This section is informative.

Issue 3

[Ed. note: This section will be written in a subsequent working draft.]

6.1 Choice of Language Tag

Note

General guidelines for the choice of language tags are described by articles maintained by the Internationalization Working Group.

When choosing a language tag or locale identifier, use the shortest tag that conveys useful meaning. For example, the phrase Guten Tag, Welt! could be tagged with either de or de-DE-1996-u-cu-EUR-nu-latn. While both tags are valid and identify the text, the latter is overly specific.

6.1.1 Use of Script subtags

Script subtags SHOULD NOT be used except where necessary to distinguish specific varieties of language that normally vary in script. For these languages, the use of the script subtag is RECOMMENDED. Some languages that are known to vary in script include:

Language            Language Tag Prefix    Common Scripts
Chinese languages   zh                     Hans, Hant
Serbian             sr                     Cyrl, Latn
Azerbaijani         az                     Cyrl, Latn, Arab
Bambara             bm                     Latn, Nkoo, Arab
Tamazight           tzm                    Latn, Tfng, Arab
Hausa               ha                     Latn, Arab
Uzbek               uz                     Cyrl, Latn
more needed
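
A tool that checks tags for languages such as these might need to find out whether a tag carries a script subtag. The sketch below is a simplified illustration: the function name script_subtag is hypothetical, it assumes a well-formed tag, and it recognizes a script subtag purely by position and shape (four ASCII letters following the language and any three-letter extended language subtags) without consulting the subtag registry.

```python
from typing import Optional


def script_subtag(tag: str) -> Optional[str]:
    """Return the script subtag in title case, or None if the tag has none."""
    subtags = tag.split("-")
    if len(subtags[0]) == 1:
        return None  # private use or grandfathered tag; not handled here
    for subtag in subtags[1:]:
        letters_only = subtag.isascii() and subtag.isalpha()
        if len(subtag) == 3 and letters_only:
            continue  # extended language subtag, e.g. "cmn" in zh-cmn-Hans
        if len(subtag) == 4 and letters_only:
            return subtag.title()
        break  # anything else (region, variant, singleton) ends the search
    return None
```

For example, the sketch returns Hant for zh-Hant-TW and Latn for sr-latn, and it returns None for de-DE and de-1996, which contain a region and a variant but no script.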


6.1.2 Use of Non-specific Language Tags

Use of mul

Use of zxx

Use of und

A. Application Scenario: Web Services Internationalization

Issue 4

This section was part of the original LTLI document and exists to address items identified in [WS-I18N-REQ].

In order to enable multi-locale operation of Web services and to create the ability for locale negotiation, this specification describes a standardized method for identifying locales and locale and/or language tags on the Web, including non-normative guidelines for implementation. This is called out in Requirement R005 of "Requirements for the Internationalization of Web Services" [WS-I18N-REQ]. The mechanism for language and locale identification which is defined in this specification will be used in a future version of the description of Web services Internationalization in "Web Services Internationalization (WS-I18N)" [WS-I18N].

Further application scenarios of this specification encompass, for example, the standards mentioned in Scope of this Specification. The scenarios can be divided into four areas:

As for matching of language tags, many specifications already define operations using matching. An example is the language pseudo-class :lang defined in sec. 5.11.4 of [CSS21]. It matches elements based on their language. This specification formulates requirements on such operations, based on [RFC4647].

B. Revision Log (Non-Normative)

The following changes were made since the revision of 2006-06-20.

The following log records changes that have been made to this document since the publication in April 2006.

C. Acknowledgements

The Internationalization Working Group would like to acknowledge the following contributors to this specification:

D. References

D.1 Normative references

[BCP47]
A. Phillips; M. Davis. Tags for Identifying Languages. September 2009. IETF Best Current Practice. URL: http://tools.ietf.org/html/bcp47
[CLDR]
Common Locale Data Repository. URL: http://cldr.unicode.org
[RFC2119]
S. Bradner. Key words for use in RFCs to Indicate Requirement Levels. March 1997. Best Current Practice. URL: https://tools.ietf.org/html/rfc2119
[RFC4647]
A. Phillips; M. Davis. Matching of Language Tags. September 2006. Best Current Practice. URL: https://tools.ietf.org/html/rfc4647
[RFC5646]
A. Phillips, Ed.; M. Davis, Ed. Tags for Identifying Languages. September 2009. Best Current Practice. URL: https://tools.ietf.org/html/rfc5646
[RFC6067]
M. Davis; A. Phillips; Y. Umaoka. BCP 47 Extension U. December 2010. Informational. URL: https://tools.ietf.org/html/rfc6067
[RFC6497]
M. Davis; A. Phillips; Y. Umaoka; C. Falk. BCP 47 Extension T - Transformed Content. February 2012. Informational. URL: https://tools.ietf.org/html/rfc6497

D.2 Informative references

[CSS21]
Bert Bos; Tantek Çelik; Ian Hickson; Håkon Wium Lie et al. Cascading Style Sheets Level 2 Revision 1 (CSS 2.1) Specification. 7 June 2011. W3C Recommendation. URL: http://www.w3.org/TR/CSS2
[CSS3-SELECTORS]
Tantek Çelik; Elika Etemad; Daniel Glazman; Ian Hickson; Peter Linss; John Williams et al. Selectors Level 3. 29 September 2011. W3C Recommendation. URL: http://www.w3.org/TR/css3-selectors/
[HTML]
Ian Hickson. HTML. Living Standard. URL: https://html.spec.whatwg.org/
[RFC1766]
H. Alvestrand. Tags for the Identification of Languages. March 1995. Proposed Standard. URL: https://tools.ietf.org/html/rfc1766
[RFC2616]
R. Fielding; J. Gettys; J. Mogul; H. Frystyk; L. Masinter; P. Leach; T. Berners-Lee. Hypertext Transfer Protocol -- HTTP/1.1. June 1999. Draft Standard. URL: https://tools.ietf.org/html/rfc2616
[RFC3066]
H. Alvestrand. Tags for the Identification of Languages. January 2001. Best Current Practice. URL: https://tools.ietf.org/html/rfc3066
[RFC3282]
H. Alvestrand. Content Language Headers. May 2002. Draft Standard. URL: https://tools.ietf.org/html/rfc3282
[RFC4646]
A. Phillips; M. Davis. Tags for Identifying Languages. September 2006. Best Current Practice. URL: https://tools.ietf.org/html/rfc4646
[WS-I18N]
Addison Phillips; Mary Trumble; Felix Sasaki. Web Services Internationalization (WS-I18N). 22 May 2012. W3C Note. URL: http://www.w3.org/TR/ws-i18n/
[WS-I18N-REQ]
Addison Phillips. Requirements for the Internationalization of Web Services. 16 November 2004. W3C Note. URL: http://www.w3.org/TR/ws-i18n-req/
[WS-I18N-SCENARIOS]
Debasish Banerjee; Martin Dürst; Michael McKenna; Addison Phillips; Takao Suzuki; Tex Texin; Mary Trumble; Andrea Vine; Kentaro Noji et al. Web Services Internationalization Usage Scenarios. 30 July 2004. W3C Note. URL: http://www.w3.org/TR/ws-i18n-scenarios/
[XML10]
Tim Bray; Jean Paoli; Michael Sperberg-McQueen; Eve Maler; François Yergeau et al. Extensible Markup Language (XML) 1.0 (Fifth Edition). 26 November 2008. W3C Recommendation. URL: http://www.w3.org/TR/xml
[XSL10]
Sharon Adler; Anders Berglund; Jeffrey Caruso; Stephen Deach; Tony Graham; Paul Grosso; Eduardo Gutentag; Alex Milowski; Scott Parnell; Jeremy Richman; Steve Zilles et al. Extensible Stylesheet Language (XSL) Version 1.0. 15 October 2001. W3C Recommendation. URL: http://www.w3.org/TR/xsl/