Language Tags and Locale Identifiers for the World Wide Web

Abstract

This document describes the best practices for identifying or selecting the language of content as well as the locale preferences used to process or display data values and other information on the Web. It describes how document formats, specifications, and implementations should handle language tags, as well as extensions to language tags that describe the cultural or linguistic preferences referred to in internationalization as a "locale".

1. Introduction

This section is informative.

Language tags, as defined in [BCP47], identify the natural language of content on the Web and in Internet protocols and formats, providing the ability to perform language-specific formatting or processing. For example, a user-agent might use the language to select the right font for displaying text or a Web page designer might style text differently in one language than in another.

Language tags are also used to identify cultural or linguistic preferences. These are usually related to the natural language or regional association of the end user. These preferences are applied to processes such as presenting numbers, dates, or times; sorting lists linguistically; providing default presentation for items such as a calendar, units of measurement, or 12- vs. 24-hour time presentation; and many other details that users might find too tedious to set individually. Collectively, these preferences are usually called a locale.

This document describes best practices for the adoption and use of BCP47 language tags for the identification of natural language content as well as the use of language tags to represent the locale preferences of the user or content author. It describes how document formats, specifications, and implementations should handle the language tags described by [BCP47], as well as data structures that extend these tags to describe international preferences (see also sec. 3.1 in [WS-I18N-SCENARIOS]).

Identification of language and locale has a broad range of applications within the World Wide Web. Existing standards that make use of language identification include the xml:lang attribute in [XML10], the lang and hreflang attributes in [HTML], the language property in [XSL10], and the :lang pseudo-class in CSS [CSS3-SELECTORS]. Language tags are also used to identify locales, such as in the Unicode Common Locale Data Repository or "CLDR" project [CLDR].

Locales can be identified in several ways, which generally are dependent on the programming language and operating environment of the user. One method is by inference from language tags. For example, an implementation could map a language tag from an existing protocol, such as HTTP's Accept-Language header, to its locale model. Locales may also be identified directly by using the language tag syntax in data items (elements, attributes, headers, etc.) that explicitly serve the purpose of locale identification.

1.1 Out of Scope

This specification does not deal with formats for locale data or actual locale data. One source of locale data and data formats is the Unicode Common Locale Data Repository project ([CLDR]).

2. Notation and Terminology

This section is normative.

2.1 Languages, Language Tags and Matching of Language Tags

This document uses the term language to refer to what is sometimes called a natural language: the spoken, written, or signed communications used by human beings.

There are many ways that languages might be identified and many reasons that software might need to identify the language of content on the Web. Document formats and protocols on the Web generally use the identifiers used in most other parts of the Internet, consisting of the language tags defined in [BCP47].

[BCP47] is a multipart document consisting, at the time this document was published, of two separate RFCs. The first part, called Tags for Identifying Languages [RFC5646], defines the grammar, form, and terminology of language tags. The second part, called Matching of Language Tags [RFC4647], describes several schemes for matching, comparing, and selecting content using language tags and includes useful terminology related to comparison of language preferences to tagged content.

A language tag is a string used as an identifier for a language. In this document, the term language tag always refers explicitly to a [BCP47] language tag. These language tags consist of one or more subtags.

A subtag is a sequence of ASCII letters or digits separated from other subtags by the hyphen-minus character and identifying a specific element of meaning within the overall language tag. In [BCP47], subtags can consist of upper or lowercase ASCII letters (the case carries no distinction) or ASCII digits. Subtags are limited to no more than eight characters (although additional length restrictions apply depending on the specific use of the subtag).
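The general shape rules above (hyphen-minus separators, ASCII letters or digits, at most eight characters, case carrying no distinction) can be illustrated with a minimal sketch. The function names `split_subtags` and `same_tag` are hypothetical and chosen only for this illustration; the sketch checks subtag shape only and does not validate the full [BCP47] grammar or the subtag registry.

```python
import re

# One subtag: 1 to 8 ASCII letters or digits (the general limit described above).
SUBTAG = re.compile(r"[A-Za-z0-9]{1,8}")


def split_subtags(tag: str) -> list[str]:
    """Split a language tag on hyphen-minus and lowercase each subtag.

    Raises ValueError if any subtag is empty, too long, or contains
    characters other than ASCII letters and digits.
    """
    subtags = tag.split("-")
    for subtag in subtags:
        if not SUBTAG.fullmatch(subtag):
            raise ValueError(f"malformed subtag {subtag!r} in {tag!r}")
    return [subtag.lower() for subtag in subtags]


def same_tag(a: str, b: str) -> bool:
    """Compare two tags without regard to case."""
    return split_subtags(a) == split_subtags(b)
```

Lowercasing is used here only as a convenient way to compare tags; any consistent case folding of ASCII letters would serve equally well.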

Selecting content or behavior based on the language tag requires a few additional concepts defined by [RFC4647]. In this document, we adopt the following terminology:

A language range is a string similar in structure to a language tag that is used for "identifying sets of language tags that share specific attributes".

A language priority list is a collection of one or more language ranges identifying the user's language preferences for use in matching. As the name suggests, such lists are normally ordered or weighted according to the user's preferences. The HTTP [RFC2616] Accept-Language [RFC3282] header is an example of one kind of language priority list.

A basic language range is simply a language tag used to express a language preference. An extended language range allows a more expressive set of language preference through the use of a wildcard subtag *.
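As an illustration of a weighted language priority list, the following sketch turns an HTTP Accept-Language value into an ordered list of language ranges. The function name `parse_accept_language` is hypothetical, and the parsing is deliberately lenient: it is not a complete implementation of the header grammar. It assumes the usual HTTP conventions that a missing weight means 1 and that a weight of 0 marks a range as not acceptable.

```python
def parse_accept_language(header: str) -> list[str]:
    """Return the language ranges of an Accept-Language value, most preferred first."""
    weighted = []
    for position, item in enumerate(header.split(",")):
        parts = [part.strip() for part in item.split(";")]
        language_range = parts[0]
        if not language_range:
            continue  # skip empty list items
        quality = 1.0  # default weight when no q parameter is given
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # sketch choice: drop ranges with unparsable weights
        if quality > 0:
            # Sort by descending weight, then by original position.
            weighted.append((-quality, position, language_range))
    weighted.sort()
    return [language_range for _, _, language_range in weighted]
```

Ranges with equal weight keep the order in which they appeared, so a list without explicit weights is simply read left to right.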

Example 1

Basic versus extended language range and language priority list

The string de-de is a basic language range. It matches, for example, the language tag de-DE-1996, but not the language tag de-Deva.

The string de-*-DE is an extended language range. It matches all of the following tags:

de-DE
de-DE-x-goethe
de-Latn-DE-1996
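The two matching schemes illustrated above can be sketched as follows. This is a minimal sketch of the basic and extended filtering schemes described in [RFC4647], not a reference implementation; the function names are hypothetical, and neither function checks that its arguments are well-formed.

```python
def basic_filter_match(language_range: str, tag: str) -> bool:
    """Basic filtering: the range equals the tag or is a prefix ending at a hyphen."""
    language_range, tag = language_range.lower(), tag.lower()
    if language_range == "*":
        return True  # the wildcard range matches any tag
    return tag == language_range or tag.startswith(language_range + "-")


def extended_filter_match(language_range: str, tag: str) -> bool:
    """Extended filtering: wildcards in the range may skip over subtags in the tag."""
    range_parts = language_range.lower().split("-")
    tag_parts = tag.lower().split("-")
    # The first subtags must match, unless the range starts with a wildcard.
    if range_parts[0] != "*" and range_parts[0] != tag_parts[0]:
        return False
    r, t = 1, 1
    while r < len(range_parts):
        if range_parts[r] == "*":
            r += 1  # a wildcard matches zero or more subtags
        elif t >= len(tag_parts):
            return False  # the range still has subtags but the tag has run out
        elif range_parts[r] == tag_parts[t]:
            r += 1
            t += 1
        elif len(tag_parts[t]) == 1:
            return False  # never skip past a singleton such as "x"
        else:
            t += 1  # skip a non-matching subtag in the tag
    return True
```

With these definitions, `basic_filter_match("de-de", "de-DE-1996")` is true and `basic_filter_match("de-de", "de-Deva")` is false, while `extended_filter_match("de-*-DE", ...)` accepts each of the three tags listed above.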

"en; fr; zh-Hant" is a language priority list. It would be read as "English before French before Chinese as written in the Traditional script". Note that the syntax shown is only an example, since it depends on the protocol, application, or implementation that uses the list.

2.2 What are Internationalization and Localization?

Users who speak different languages or come from different cultural backgrounds usually require software and services that are adapted to correctly process information using their native languages, writing systems, measurement systems, calendars, and other linguistic rules and cultural conventions.

International Preferences: A user's particular set of cultural conventions, language, and formatting choices that software must employ to correctly process or present information exchanged with that user.

Internationalization: The design and development of a product that is enabled for target audiences that vary in culture, region, or language. Internationalization is sometimes abbreviated I18N because there are eighteen letters between the "i" and the "n" in the English word.

There are many kinds of international preferences that may be offered on the Web in order for the content or service to be considered usable and acceptable by users around the world. Some of these preferences might include:

Natural language for text processing: parsing, spell checking, and grammar checking are examples of this
User interface language, which may include items like images, colors, sounds, formats, and navigational elements as well as the visible strings
Presentation (human-oriented formatting) of dates, times, numbers, lists, and other values
Collation, sorting, and organization of content (such as in a phone book or a dictionary)
Alternate time-keeping and calendars, which may include holidays, work rules, weekday/weekend distinctions, the number and organization of months, the numbering of years, and so forth
Tax or regulatory regime
Currency

... and many more.

Because there are a large number of preferences, software systems (operating environments and programming languages) often use an identifier that combines natural language and other information, such as region or country, as a shorthand indicator for collections of preferences that typify categories of users that share certain cultural preferences.

HTML, for example, uses the lang attribute to indicate the language of segments of content. XML uses the xml:lang attribute for the same purpose.

Java, POSIX, .NET and other software development technologies use a similar-looking (but not identical) construct known as a locale to activate certain internationalized capabilities in software.

Locale: A collection of international preferences, generally related to a language and geographic region, that (a certain category of) users require. These are usually identified by a shorthand identifier or token, such as a language tag, that is passed from the environment to various processes to get culturally affected behavior.

Generally, systems that are internationalized can support a wide variety of languages and behaviors to meet the international preferences of many kinds of users. When a particular system can respond to changes in the locale by trying to load different resources or performing culturally appropriate formatting we say that this system is locale-aware or enabled.

Localization: The tailoring of a system to the individual cultural expectations of a specific target market or group of individuals. Localization includes, but is not limited to, the translation of user-facing text and messages. Localization is sometimes abbreviated as L10N because there are ten letters between the "L" and the "N" in the English word.

When a particular set of content and preferences corresponding to a specific locale is operationally available, then the system is said to be localized.

Localized systems often need to perform matching between the end-user's international preferences (their "locale") and the resources, content, or processing available. This is called Language Negotiation. Language negotiation is, thus, the process of matching a user's preferences (in the form of a locale or language tag) to available localized resources. The system searches for matching content or logic, "falling back" from more-specific resources to more-general ones following a deterministic pattern.
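The fallback pattern described above can be sketched in the style of the lookup scheme in [RFC4647]: each language range is progressively truncated from the end until it matches an available tag. The function name `lookup` and its parameters are hypothetical. Dropping wildcard subtags before truncation is a simplification made for this sketch.

```python
def lookup(priority_list, available, default=None):
    """Return the first available tag matched by falling back from specific to general."""
    # Map each lowercased tag to its original spelling for case-insensitive matching.
    by_key = {tag.lower(): tag for tag in available}
    for language_range in priority_list:
        # Simplification: wildcard subtags are dropped rather than interpreted.
        parts = [p for p in language_range.lower().split("-") if p != "*"]
        while parts:
            candidate = "-".join(parts)
            if candidate in by_key:
                return by_key[candidate]
            parts.pop()  # fall back by removing the last subtag
            # A single-character subtag (such as "x") is removed together
            # with the subtag that followed it.
            while parts and len(parts[-1]) == 1:
                parts.pop()
    return default
```

For example, a user preferring `fr-CA` and then `en`, offered resources tagged `en`, `fr`, and `de`, would receive the `fr` resource: the range `fr-CA` finds no exact match and falls back to `fr` before the second preference is ever considered.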

Language tags can provide information about the language, script, region, and language variation using subtags. But sometimes there are international preferences that do not correlate directly with any of these. For example, many cultures have more than one way of sorting content items, and so the appropriate sort ordering cannot always be inferred from the language tag by itself: German language users might want to choose between the sort orderings used in a dictionary versus in a phone book.

One way to indicate these preferences is via registered Extensions to [BCP47]. The Unicode Common Locale Data Repository project [CLDR] maintains two such extensions: [RFC6497] defines an extension that describes transformations (generally text transformations, such as transliteration between scripts). [RFC6067] defines Unicode locales, which provide the ability to specify in a language tag a number of the international preference variations that users or content authors might wish to specify directly (such as the German dictionary/phone book difference described above).
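To make the Unicode locale extension concrete, the sketch below extracts the keywords of the "u" extension from a language tag. It assumes the layout defined in [RFC6067]: the singleton `u` is followed by two-character keys, each with zero or more type subtags of three to eight characters. The function name is hypothetical, and the tag `de-u-co-phonebk` (German with phone book collation) is used only as an illustration.

```python
def unicode_extension_keywords(tag: str) -> dict[str, str]:
    """Return the key/type pairs of the "u" extension in a language tag."""
    subtags = tag.lower().split("-")
    keywords: dict[str, str] = {}
    key = None
    in_extension = False
    for subtag in subtags[1:]:  # the first subtag is the primary language
        if len(subtag) == 1:
            if subtag == "x":
                break  # everything after "x" is private use
            in_extension = subtag == "u"  # any other singleton starts another extension
            key = None
        elif in_extension:
            if len(subtag) == 2:
                key = subtag  # a two-character subtag starts a new keyword
                keywords[key] = ""
            elif key is not None:
                # Longer subtags extend the type of the current key.
                keywords[key] = f"{keywords[key]}-{subtag}".lstrip("-")
            # Longer subtags that appear before any key are attributes;
            # this sketch ignores them.
    return keywords
```

Under these assumptions, `unicode_extension_keywords("de-u-co-phonebk")` yields a mapping from `co` (collation) to `phonebk`, which an implementation could pass to its own sorting facilities.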

Note

Some preferences are individual and are left to content authors, service providers, operating environments, or user agents to define and manage on behalf of the user.

3. Conformance

As well as sections marked as non-normative, all authoring guidelines, diagrams, examples, and notes in this specification are non-normative. Everything else in this specification is normative.

The key words MAY, MUST, RECOMMENDED, SHOULD, and SHOULD NOT are to be interpreted as described in [RFC2119].

3.1 How to Define Language Tags and How to Reference [BCP47] and Related Standards

Specifications for the Web that require language identification MUST refer to [BCP47].

Specifications SHOULD NOT refer to specific component RFCs. The "BCP" nomenclature refers to the current set of RFCs that form the "best current practice". At the time this document was published, [BCP47] consisted of [RFC5646] (Tags for Identifying Languages) and [RFC4647] (Matching of Language Tags).

Formulations such as "RFC 5646 or its successor" MAY be used, but only in cases where the specific document version is necessary. While this style of reference was once popular, using the BCP reference is more accurate. Since the grammar of language tags has been fixed since [RFC4646], referring to the BCP will not incur additional compliance risk to most implementations.

Some specifications might desire or require compliance with the older language tag grammar, such as found in [RFC1766] or [RFC3066]. This grammar was more permissive and is described in [BCP47] as the production obs-language-tag. Specifications MUST reference [BCP47] and, when backwards compatibility is required, SHOULD reference obs-language-tag rather than referencing one of the obsolete versions of [BCP47]. The obsolete versions referred to by obs-language-tag include [RFC1766] and [RFC3066]. [RFC4646], which introduced the current grammar for language tags, is also obsolete.
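As a rough illustration of how permissive the older grammar is, the following sketch checks a string against a pattern of the obs-language-tag shape. It assumes that the production is a primary subtag of one to eight ASCII letters followed by any number of subtags of one to eight ASCII letters or digits; that reading should be checked against [BCP47] before it is relied on. The names `OBS_LANGUAGE_TAG` and `matches_obs_language_tag` are hypothetical.

```python
import re

# Assumed shape of obs-language-tag: 1*8ALPHA *( "-" 1*8(ALPHA / DIGIT) )
OBS_LANGUAGE_TAG = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*")


def matches_obs_language_tag(tag: str) -> bool:
    """Return True if the whole string has the older, more permissive shape."""
    return OBS_LANGUAGE_TAG.fullmatch(tag) is not None
```

A pattern like this accepts many strings that the current grammar rejects, which is why it is suitable only where backwards compatibility is the goal.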

Specifications MAY also reference registered extensions to [BCP47] as necessary. In particular, [RFC6067] defines the BCP 47 Extension U, also known as "Unicode Locales". This extension to [BCP47] provides additional subtag sequences for selecting specific locale variations.

Issue 1

Ed.: Add discussion of when to use obs-language-tag. Add discussion and requirements related to the terms 'well-formed' and 'valid'.

4. Locale versus Natural Language

This document defines locale identifiers for use in Web technologies. Historically, language tags [BCP47] have been used as locale identifiers by many programming languages or operating environments, which is natural since locale identifiers usually share certain core features related to natural language and country/region. This specification defines locale identifiers that specific implementations can use to map their features to Web standards in order to create functional, interoperable applications.

The minimal requirement is the ability to specify the natural language; thus there is industry convergence on the use of [BCP47] as the core of a locale identifier. For example, [CLDR] uses [BCP47] as the core of a locale identifier, and provides syntax for extensions for non-linguistic information, such as preferred currency. This extension [RFC6067] is defined as a formal extension of [BCP47], making Unicode locale identifiers synonymous with language tags in most contexts.

A major difference between language tags and locale identifiers is the meaning of the region code. In both language tags and locales, the region code indicates variation in language (as with regional dialects) or presentation and format (such as number or date formats). In a locale, the region code is also sometimes used to indicate the physical location, market, legal, or other governing policies for the user.

The language tag may be available in several places. In HTTP, there is an Accept-Language header field which can be used. MIME has a Content-Language header which contains a language tag. In XML, there is an attribute which can be defined for elements called xml:lang. xml:lang marks all the contents and attribute values of the corresponding element as belonging to the language identified. What that means for processing those contents varies from application to application.

For more detailed information on the behavior of xml:lang, see XML 1.0 (Fifth Edition) [XML10].
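The way xml:lang applies to an element and its contents can be sketched with Python's standard `xml.etree.ElementTree` module. The sketch assumes the inheritance behavior described in [XML10]: a value applies to the element and its descendants until it is overridden, and an empty value removes the language information. The function name `effective_languages` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def effective_languages(element, inherited=""):
    """Yield (element name, language tag) pairs in document order.

    An element without xml:lang inherits the value of its parent;
    an empty string means no language is known.
    """
    language = element.get(XML_LANG, inherited)
    yield element.tag, language
    for child in element:
        yield from effective_languages(child, language)


SAMPLE = """<doc xml:lang="en">
  <p>Hello</p>
  <p xml:lang="fr-CA">Bonjour <em>tout le monde</em></p>
  <note xml:lang="">42</note>
</doc>"""

for name, language in effective_languages(ET.fromstring(SAMPLE)):
    print(name, repr(language))
```

What an application then does with the resolved language (font selection, spell checking, styling) remains application-specific, as noted above.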

Issue 2

The I18N WG has existing best practices documentation which may or may not be appropriate to subsume into this document. These include:

Language	Language Tag Prefix	Common Scripts
Chinese languages	zh	Hans, Hant
Serbian	sr	Cyrl, Latn
Azerbaijani	az	Cyrl, Latn, Arab
Bambara	bm	Latn, Nkoo, Arab
Tamazight	tzm	Latn, Tfng, Arab
Hausa	ha	Latn, Arab
Uzbek	uz	Cyrl, Latn
more needed

D. References

D.1 Normative references

[BCP47]: A. Phillips; M. Davis. Tags for Identifying Languages. September 2009. IETF Best Current Practice. URL: http://tools.ietf.org/html/bcp47
[CLDR]: Common Locale Data Repository. URL: http://cldr.unicode.org
[RFC2119]: S. Bradner. Key words for use in RFCs to Indicate Requirement Levels. March 1997. Best Current Practice. URL: https://tools.ietf.org/html/rfc2119
[RFC4647]: A. Phillips; M. Davis. Matching of Language Tags. September 2006. Best Current Practice. URL: https://tools.ietf.org/html/rfc4647
[RFC5646]: A. Phillips, Ed.; M. Davis, Ed. Tags for Identifying Languages. September 2009. Best Current Practice. URL: https://tools.ietf.org/html/rfc5646
[RFC6067]: M. Davis; A. Phillips; Y. Umaoka. BCP 47 Extension U. December 2010. Informational. URL: https://tools.ietf.org/html/rfc6067
[RFC6497]: M. Davis; A. Phillips; Y. Umaoka; C. Falk. BCP 47 Extension T - Transformed Content. February 2012. Informational. URL: https://tools.ietf.org/html/rfc6497

D.2 Informative references

[CSS21]: Bert Bos; Tantek Çelik; Ian Hickson; Håkon Wium Lie et al. Cascading Style Sheets Level 2 Revision 1 (CSS 2.1) Specification. 7 June 2011. W3C Recommendation. URL: http://www.w3.org/TR/CSS2
[CSS3-SELECTORS]: Tantek Çelik; Elika Etemad; Daniel Glazman; Ian Hickson; Peter Linss; John Williams et al. Selectors Level 3. 29 September 2011. W3C Recommendation. URL: http://www.w3.org/TR/css3-selectors/
[HTML]: Ian Hickson. HTML. Living Standard. URL: https://html.spec.whatwg.org/
[RFC1766]: H. Alvestrand. Tags for the Identification of Languages. March 1995. Proposed Standard. URL: https://tools.ietf.org/html/rfc1766
[RFC2616]: R. Fielding; J. Gettys; J. Mogul; H. Frystyk; L. Masinter; P. Leach; T. Berners-Lee. Hypertext Transfer Protocol -- HTTP/1.1. June 1999. Draft Standard. URL: https://tools.ietf.org/html/rfc2616
[RFC3066]: H. Alvestrand. Tags for the Identification of Languages. January 2001. Best Current Practice. URL: https://tools.ietf.org/html/rfc3066
[RFC3282]: H. Alvestrand. Content Language Headers. May 2002. Draft Standard. URL: https://tools.ietf.org/html/rfc3282
[RFC4646]: A. Phillips; M. Davis. Tags for Identifying Languages. September 2006. Best Current Practice. URL: https://tools.ietf.org/html/rfc4646
[WS-I18N]: Addison Phillips; Mary Trumble; Felix Sasaki. Web Services Internationalization (WS-I18N). 22 May 2012. W3C Note. URL: http://www.w3.org/TR/ws-i18n/
[WS-I18N-REQ]: Addison Phillips. Requirements for the Internationalization of Web Services. 16 November 2004. W3C Note. URL: http://www.w3.org/TR/ws-i18n-req/
[WS-I18N-SCENARIOS]: Debasish Banerjee; Martin Dürst; Michael McKenna; Addison Phillips; Takao Suzuki; Tex Texin; Mary Trumble; Andrea Vine; Kentaro Noji et al. Web Services Internationalization Usage Scenarios. 30 July 2004. W3C Note. URL: http://www.w3.org/TR/ws-i18n-scenarios/
[XML10]: Tim Bray; Jean Paoli; Michael Sperberg-McQueen; Eve Maler; François Yergeau et al. Extensible Markup Language (XML) 1.0 (Fifth Edition). 26 November 2008. W3C Recommendation. URL: http://www.w3.org/TR/xml
[XSL10]: Sharon Adler; Anders Berglund; Jeffrey Caruso; Stephen Deach; Tony Graham; Paul Grosso; Eduardo Gutentag; Alex Milowski; Scott Parnell; Jeremy Richman; Steve Zilles et al. Extensible Stylesheet Language (XSL) Version 1.0. 15 October 2001. W3C Recommendation. URL: http://www.w3.org/TR/xsl/

Language Tags and Locale Identifiers for the World Wide Web

W3C Working Draft 23 April 2015

Abstract

Status of This Document

Table of Contents

1. Introduction

1.1 Out of Scope

2. Notation and Terminology

2.1 Languages, Language Tags and Matching of Language Tags

2.2 What are Internationalization and Localization?

3. Conformance

3.1 How to Define Language Tags and How to Reference [BCP47] and Related Standards

4. Locale versus Natural Language

5. Language Tags and Locale Values

6. Implementation of this Specification

6.1 Choice of Language Tag

6.1.1 Use of Script subtags

6.1.2 Use of Non-specific Language Tags

Use of `mul`

Use of `zxx`

Use of `und`

A. Application Scenario: Web Services Internationalization

B. Revision Log (Non-Normative)

C. Acknowledgements

D. References

D.1 Normative references

D.2 Informative references
