Copyright © 2015 W3C® (MIT, ERCIM, Keio, Beihang). W3C liability, trademark and document use rules apply.
This document describes the best practices for identifying or selecting the language of content as well as the locale preferences used to process or display data values and other information on the Web. It describes how document formats, specifications, and implementations should handle language tags, as well as extensions to language tags that describe the cultural or linguistic preferences referred to in internationalization as a "locale".
This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at http://www.w3.org/TR/.
This is an updated Public Working Draft of "Language Tags and Locale Identifiers for the World Wide Web". The Working Group expects this to become a Working Group Note.
If you wish to make comments regarding this document, please raise a GitHub issue. You may also send email to the list www-international@w3.org (subscribe, archives) as mentioned below. Please include [ltli] at the start of your email's subject. To make it easier to track comments, please raise separate issues or send separate emails for each comment. All comments are welcome.
This document was published by the Internationalization Working Group as a Working Draft. If you wish to make comments regarding this document, please send them to www-international@w3.org (subscribe, archives). All comments are welcome.
Publication as a Working Draft does not imply endorsement by the W3C Membership. This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.
This document was produced by a group operating under the 5 February 2004 W3C Patent Policy. The group does not expect this document to become a W3C Recommendation. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.
This document is governed by the 1 August 2014 W3C Process Document.
This section is informative.
Language tags, as defined in [BCP47], identify the natural language of content on the Web and in Internet protocols and formats, providing the ability to perform language-specific formatting or processing. For example, a user-agent might use the language to select the right font for displaying text or a Web page designer might style text differently in one language than in another.
In addition, language tags are used to identify cultural or linguistic preferences. These are usually related to the natural language or regional association of the end user. These preferences are applied to processes such as presenting numbers, dates, or times; sorting lists linguistically; providing default presentation for items such as a calendar, units of measurement, or 12- vs. 24-hour time presentation; and many other details that users might find too tedious to set individually. Collectively, these preferences are usually called a locale.
This document describes best practices for the adoption and use of BCP47 language tags for the identification of natural language content as well as the use of language tags to represent the locale preferences of the user or content author. It describes how document formats, specifications, and implementations should handle the language tags described by [BCP47], as well as data structures that extend these tags to describe international preferences (see also sec. 3.1 in [WS-I18N-SCENARIOS]).
Identification of language and locale has a broad range of applications within the World Wide Web. Existing standards which make use of language identification include the xml:lang attribute in [XML10], the lang and hreflang attributes in [HTML], the language property in [XSL10], and the :lang pseudo-class in CSS [CSS3-SELECTORS]. Language tags are also used to identify locales, such as in the Unicode Common Locale Data Repository or "CLDR" project [CLDR].
Locales can be identified in several ways, which generally are dependent on the programming language and operating environment of the user. One method is by inference from language tags. For example, an implementation could map a language tag from an existing protocol, such as HTTP's Accept-Language header, to its locale model. Locales may also be identified directly by using the language tag syntax in data items (elements, attributes, headers, etc.) that explicitly serve the purpose of locale identification.
This specification does not deal with formats for locale data or actual locale data. One source of locale data and data formats is the Unicode Common Locale Data Repository project ([CLDR]).
This section is normative.
This document uses the term language to refer to what is sometimes called a natural language: the spoken, written, or signed communications used by human beings.
There are many ways that languages might be identified and many reasons that software might need to identify the language of content on the Web. Document formats and protocols on the Web generally use the identifiers used in most other parts of the Internet, consisting of the language tags defined in [BCP47].
[BCP47] is a multipart document consisting, at the time this document was published, of two separate RFCs. The first part, called Tags for Identifying Languages [RFC5646], defines the grammar, form, and terminology of language tags. The second part, called Matching of Language Tags [RFC4647], describes several schemes for matching, comparing, and selecting content using language tags and includes useful terminology related to comparison of language preferences to tagged content.
A language tag is a string used as an identifier for a language. In this document, the term language tag always refers explicitly to a [BCP47] language tag. These language tags consist of one or more subtags.
A subtag is a sequence of ASCII letters or digits separated from other subtags by the hyphen-minus character and identifying a specific element of meaning within the overall language tag. In [BCP47], subtags can consist of upper or lowercase ASCII letters (the case carries no distinction) or ASCII digits. Subtags are limited to no more than eight characters (although additional length restrictions apply depending on the specific use of the subtag).
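The general shape of a subtag described above can be sketched in Python. This is only an illustration: the function name `split_subtags` is hypothetical, and the check covers the general subtag shape (one to eight ASCII letters or digits), not the full well-formedness or validity rules of [BCP47].

```python
import re

# General subtag shape: 1 to 8 ASCII letters or digits.
SUBTAG = re.compile(r"[A-Za-z0-9]{1,8}")


def split_subtags(tag: str) -> list[str]:
    """Split a tag on hyphen-minus and return its subtags in lowercase.

    Case carries no distinction in a language tag, so lowercasing gives a
    convenient form for comparison. Raises ValueError on a malformed subtag.
    """
    subtags = tag.split("-")
    for subtag in subtags:
        if not SUBTAG.fullmatch(subtag):
            raise ValueError(f"malformed subtag {subtag!r} in {tag!r}")
    return [subtag.lower() for subtag in subtags]
```

For example, `split_subtags("zh-Hant-TW")` yields the three subtags `zh`, `hant`, and `tw`, while a tag containing an empty or overlong subtag is rejected.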
Selecting content or behavior based on the language tag requires a few additional concepts defined by [RFC4647]. In this document, we adopt the following terminology:
A language range is a string similar in structure to a language tag that is used for "identifying sets of language tags that share specific attributes".
A language priority list is a collection of one or more language ranges identifying the user's language preferences for use in matching. As the name suggests, such lists are normally ordered or weighted according to the user's preferences. The HTTP [RFC2616] Accept-Language [RFC3282] header is an example of one kind of language priority list.
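As an illustration of a weighted language priority list, the following Python sketch turns an Accept-Language header value into an ordered list of language ranges. The function name `parse_accept_language` is hypothetical, and treating an unparsable weight as zero is an assumption of this sketch rather than a requirement of any specification.

```python
def parse_accept_language(header: str) -> list[tuple[str, float]]:
    """Return (language range, weight) pairs, highest weight first."""
    entries = []
    for position, item in enumerate(header.split(",")):
        parts = [part.strip() for part in item.split(";")]
        language_range = parts[0]
        if not language_range:
            continue  # skip empty list items
        weight = 1.0  # a range without a q parameter has weight 1
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    weight = float(value)
                except ValueError:
                    weight = 0.0  # assumption: unparsable weight means "not wanted"
        entries.append((language_range, weight, position))
    # Sort by descending weight; ranges with equal weight keep header order.
    entries.sort(key=lambda entry: (-entry[1], entry[2]))
    return [(language_range, weight) for language_range, weight, _ in entries]
```

Given the header value `da, en-gb;q=0.8, en;q=0.7`, the sketch produces the ranges `da`, `en-gb`, and `en` in that order.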
A basic language range is simply a language tag used to express a language preference. An extended language range allows a more expressive set of language preferences through the use of a wildcard subtag (*).
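A minimal sketch of how a basic language range selects tagged content, following the basic filtering scheme of [RFC4647]: a range matches a tag when the two are equal, ignoring case, or when the range is a prefix of the tag followed by a hyphen-minus; the range `*` matches any tag. The function name `basic_filter` is hypothetical. Extended language ranges, which allow the wildcard in other positions, need the more elaborate extended filtering scheme and are not covered here.

```python
def basic_filter(language_range: str, tags: list[str]) -> list[str]:
    """Return the tags matched by a basic language range, in input order."""
    wanted = language_range.lower()
    matches = []
    for tag in tags:
        candidate = tag.lower()
        if (
            wanted == "*"  # the wildcard range matches every tag
            or candidate == wanted  # exact, case-insensitive match
            or candidate.startswith(wanted + "-")  # prefix ending on a subtag boundary
        ):
            matches.append(tag)
    return matches
```

Note that the range `de` matches `de` and `de-CH` but not `den`, because the prefix has to end on a subtag boundary.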
Users who speak different languages or come from different cultural backgrounds usually require software and services that are adapted to correctly process information using their native languages, writing systems, measurement systems, calendars, and other linguistic rules and cultural conventions.
International Preferences A user's particular set of cultural conventions, language, and formatting choices that software must employ to correctly process or present information exchanged with that user.
Internationalization The design and development of a product that is enabled for target audiences that vary in culture, region, or language. Internationalization is sometimes abbreviated I18N because there are eighteen letters between the "i" and the "n" in the English word.
There are many kinds of international preferences that may be offered on the Web in order for the content or service to be considered usable and acceptable by users around the world. Some of these preferences might include:
Because there are a large number of preferences, software systems (operating environments and programming languages) often use an identifier that combines natural language and other information, such as region or country, as a shorthand indicator for collections of preferences that typify categories of users that share certain cultural preferences.
HTML, for example, uses the lang attribute to indicate the language of segments of content. XML uses the xml:lang attribute for the same purpose.
Java, POSIX, .NET and other software development technologies use a similar-looking (but not identical) construct known as a locale to activate certain internationalized capabilities in software.
Locale A collection of international preferences, generally related to a language and geographic region, that a (certain category of) users requires. These are usually identified by a shorthand identifier or token, such as a language tag, that is passed from the environment to various processes to get culturally affected behavior.
Generally, systems that are internationalized can support a wide variety of languages and behaviors to meet the international preferences of many kinds of users. When a particular system can respond to changes in the locale by trying to load different resources or performing culturally appropriate formatting we say that this system is locale-aware or enabled.
Localization The tailoring of a system to the individual cultural expectations of a specific target market or group of individuals. Localization includes, but is not limited to, the translation of user-facing text and messages. Localization is sometimes abbreviated as L10N because there are ten letters between the "L" and the "N" in the English word.
When a particular set of content and preferences corresponding to a specific locale is operationally available, then the system is said to be localized.
Localized systems often need to perform matching between the end-user's international preferences (their "locale") and the resources, content, or processing available. This is called Language Negotiation. Language negotiation is, thus, the process of matching a user's preferences (in the form of a locale or language tag) to available localized resources. The system searches for matching content or logic, "falling back" from more-specific resources to more-general ones following a deterministic pattern.
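The fallback pattern can be sketched along the lines of the lookup scheme in [RFC4647]: each language range is truncated one subtag at a time from the end until it matches an available resource. The function name `lookup` and its parameters are hypothetical, and skipping the `*` range and returning a caller-supplied default are simplifying assumptions of this sketch.

```python
def lookup(priority_list: list[str], available: list[str], default: str) -> str:
    """Return the available tag that best matches the priority list."""
    by_lower = {tag.lower(): tag for tag in available}
    for language_range in priority_list:
        subtags = language_range.lower().split("-")
        if subtags == ["*"]:
            continue  # assumption: the wildcard range selects nothing by itself
        while subtags:
            candidate = "-".join(subtags)
            if candidate in by_lower:
                return by_lower[candidate]
            subtags.pop()  # fall back to the next more general range
            # A single-character subtag introduces an extension or private-use
            # sequence and cannot end a tag, so it is removed as well.
            while subtags and len(subtags[-1]) == 1:
                subtags.pop()
    return default
```

With available resources `en`, `de`, and `fr`, the range `de-CH-1996` falls back through `de-CH` to `de`; a range with no match at any level yields the default.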
Language tags can provide information about the language, script, region, and language variation using subtags. But sometimes there are international preferences that do not correlate directly with any of these. For example, many cultures have more than one way of sorting content items, and so the appropriate sort ordering cannot always be inferred from the language tag by itself. So, for example, German language users might want to choose between the sort orderings used in a dictionary versus in a phone book.
One way to indicate these preferences is via registered Extensions to [BCP47]. The Unicode Common Locale Data Repository project [CLDR] maintains two such extensions: [RFC6497] defines an extension that describes transformations (generally text transformations, such as transliteration between scripts). [RFC6067] defines Unicode locales, which provide the ability to specify in a language tag a number of the international preference variations that users or content authors might wish to specify directly (such as the German dictionary/phone book difference described above).
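As an illustration, the phone book preference can be carried in a tag such as `de-u-co-phonebk`, where `u` introduces the Unicode locale extension, `co` is the collation key, and `phonebk` is its value. The following Python sketch extracts such key/value pairs. The function name `unicode_keywords` is hypothetical, and the parsing is simplified: it assumes a well-formed tag and ignores attributes that precede the first key.

```python
def unicode_keywords(tag: str) -> dict[str, str]:
    """Return the key/value pairs of the "u" extension of a language tag."""
    keywords: dict[str, str] = {}
    key = None
    in_unicode_extension = False
    for subtag in tag.lower().split("-")[1:]:  # skip the primary language subtag
        if len(subtag) == 1:
            if subtag == "x":
                break  # private use: everything after it is opaque
            # Any other singleton starts an extension; only "u" is of interest.
            in_unicode_extension = subtag == "u"
            key = None
        elif not in_unicode_extension:
            continue
        elif len(subtag) == 2:
            key = subtag  # two-character subtags are keys
            keywords[key] = ""
        elif key is not None:
            # Longer subtags are the value of the preceding key.
            keywords[key] = f"{keywords[key]}-{subtag}" if keywords[key] else subtag
    return keywords
```

Applied to `de-u-co-phonebk`, the sketch returns a single pair mapping `co` to `phonebk`; a tag without the extension, such as `de-DE`, returns an empty mapping.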
Some preferences are individual and are left to content authors, service providers, operating environments, or user agents to define and manage on behalf of the user.
As well as sections marked as non-normative, all authoring guidelines, diagrams, examples, and notes in this specification are non-normative. Everything else in this specification is normative.
The key words MAY, MUST, RECOMMENDED, SHOULD, and SHOULD NOT are to be interpreted as described in [RFC2119].
Specifications for the Web that require language identification MUST refer to [BCP47].
Specifications SHOULD NOT refer to specific component RFCs. The "BCP" nomenclature refers to the current set of RFCs that form the "best current practice". At the time this document was published, [BCP47] consisted of [RFC5646] (Tags for Identifying Languages) and [RFC4647] (Matching of Language Tags).
Formulations such as "RFC 5646 or its successor" MAY be used, but only in cases where the specific document version is necessary. While this style of reference was once popular, using the BCP reference is more accurate. Since the grammar of language tags has been fixed since [RFC4646], referring to the BCP will not incur additional compliance risk to most implementations.
Some specifications might desire or require compliance with the older language tag grammar, such as found in [RFC1766] or [RFC3066]. This grammar was more permissive and is described in [BCP47] as the production obs-language-tag. Specifications MUST reference [BCP47] and, when backwards compatibility is required, SHOULD reference obs-language-tag rather than referencing one of the obsolete versions of [BCP47]. The obsolete versions referred to by obs-language-tag include [RFC1766] and [RFC3066]. [RFC4646], which introduced the current grammar for language tags, is also obsolete.
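For illustration only, the permissive older grammar can be approximated with a regular expression. The pattern below assumes that obs-language-tag is a primary subtag of one to eight ASCII letters followed by any number of subtags of one to eight ASCII letters or digits; consult [BCP47] for the authoritative production. The names `OBS_LANGUAGE_TAG` and `is_obs_language_tag` are hypothetical.

```python
import re

# Assumed shape of the older grammar: 1*8ALPHA *( "-" 1*8(ALPHA / DIGIT) ).
OBS_LANGUAGE_TAG = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*")


def is_obs_language_tag(tag: str) -> bool:
    """Return True if the tag fits the assumed older, more permissive grammar."""
    return OBS_LANGUAGE_TAG.fullmatch(tag) is not None
```

Under this assumption, `en-US` and `i-klingon` both fit the older grammar, while a tag with a trailing hyphen or a numeric primary subtag does not.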
Specifications MAY also reference registered extensions to [BCP47] as necessary. In particular, [RFC6067] defines the BCP 47 Extension U, also known as "Unicode Locales". This extension to [BCP47] provides additional subtag sequences for selecting specific locale variations.
Ed.: Add discussion of when to use obs-language-tag. Add discussion and requirements related to the terms 'well-formed' and 'valid'.
This document defines locale identifiers for use in Web technologies. Historically, language tags [BCP47] have been used as locale identifiers by many programming languages or operating environments, which is natural since locale identifiers usually share certain core features related to natural language and country/region. This specification defines locale identifiers for Web standards that specific implementations can map to their own features in order to create functional, interoperable applications.
The minimal requirement is the ability to specify the natural language; thus there is industry convergence on the use of [BCP47] as the core of a locale identifier. For example, [CLDR] uses [BCP47] as the core of a locale identifier, and provides syntax for extensions for non-linguistic information, such as preferred currency. This extension [RFC6067] is defined as a formal extension of [BCP47], making Unicode locale identifiers synonymous with language tags in most contexts.
A major difference between language tags and locale identifiers is the meaning of the region code. In both language tags and locales, the region code indicates variation in language (as with regional dialects) or presentation and format (such as number or date formats). In a locale, the region code is also sometimes used to indicate the physical location, market, legal, or other governing policies for the user.
The language tag may be available in several places. In HTTP, there is an Accept-Language header field which can be used. MIME has a Content-Language header which contains a language tag. In XML, there is an attribute which can be defined for elements called xml:lang. xml:lang marks all the contents and attribute values of the corresponding element as belonging to the language identified. What that means for processing those contents varies from application to application.
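Because xml:lang applies to all the contents of an element, the language of an element without its own xml:lang is the one declared on its nearest ancestor. The following Python sketch resolves this with the standard library's ElementTree module. The function name `languages_by_element` and the sample document are hypothetical; the attribute key is the expanded name that ElementTree uses for attributes in the predefined xml namespace.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the expanded XML namespace name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def languages_by_element(element, inherited=""):
    """Yield (element name, effective language) pairs in document order.

    An empty string means that no language has been declared.
    """
    language = element.get(XML_LANG, inherited)
    yield element.tag, language
    for child in element:
        yield from languages_by_element(child, language)


# Hypothetical sample document: the second paragraph overrides the language.
doc = ET.fromstring(
    '<doc xml:lang="en"><p>Hello</p>'
    '<p xml:lang="de">Guten Tag, <em>Welt</em>!</p></doc>'
)
for name, language in languages_by_element(doc):
    print(name, language)
```

In the sample, the first paragraph inherits `en` from the document element, while the second paragraph and the emphasis inside it are both `de`.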
For more detailed information on the behavior of xml:lang, see XML 1.0 (Fifth Edition) [XML10].
The I18N WG has existing best practices documentation which may or may not be appropriate to subsume into this document. These include:
This section is informative.
[Ed. note: This section will be written in a subsequent working draft.]
General guidelines for the choice of language tags are described by articles maintained by the Internationalization Working Group.
When choosing a language tag or locale identifier, use the shortest tag that conveys useful meaning. For example, the phrase Guten Tag, Welt! could use either the tag de or the tag de-DE-1996-u-cu-EUR-nu-latn. While either tag is valid and identifies the text, the latter tag is overly specific.
mul
zxx
und
This section was part of the original LTLI document and exists to address items identified in [WS-I18N-REQ].
In order to enable multi-locale operation of Web services and to create the ability for locale negotiation, this specification describes a standardized method for identifying locales and locale and/or language tags on the Web, including non-normative guidelines for implementation. This is called out in Requirement R005 of "Requirements for the Internationalization of Web Services" [WS-I18N-REQ]. The mechanism for language and locale identification which is defined in this specification will be used in a future version of the description of Web services Internationalization in "Web Services Internationalization (WS-I18N)"[WS-I18N].
Further application scenarios of this specification encompass, for example, the standards mentioned in Scope of this Specification. The scenarios can be divided into four areas:
Definition of values for language tags
Definition of values for locale identifiers
Definition of matching schemes for language tags
Definition of matching schemes for locale identifiers
As for matching of language tags, many specifications already define operations using matching. An example is the language pseudo-class :lang defined in sec. 5.11.4 of [CSS21]. It matches elements based on their language. This specification formulates requirements on such operations, based on [RFC4647].
The following changes were made since the revision of 2006-06-20.
The following log records changes that have been made to this document since the publication in April 2006.
The informative introductory section has been rewritten thoroughly, including the description of the scope of the document, of application scenarios and of the separation of locale versus natural language.
Terms which rely on [BCP47] are no longer defined here; this document only references their definitions in [BCP47]. In addition, examples for these terms were created.
The requirements for language and locale values have been taken out of the conformance section and are now placed in the body of the document.
A revision log has been created.
The Internationalization Working Group would like to acknowledge the following contributors to this specification: