Intended audience: XHTML/HTML coders (using editors or scripting), script developers (PHP, JSP, etc.), CSS coders, schema developers (DTDs, XML Schema, RelaxNG, etc.), XSLT developers, Web project managers, and anyone who needs guidance on how to construct language tags.
Updated 2011-02-22 20:40
Which language tag is right for me? How do I choose language and other subtags?
In HTML and XML documents a language tag is used to indicate the language of content.
A language tag is composed of one or more subtags separated by hyphens. Subtags can be of various types.
BCP stands for 'Best Current Practice', and is a persistent name for a series of RFCs whose numbers change as they are updated. The latest RFC describing language tag syntax is RFC 5646, Tags for the Identification of Languages, and it obsoletes the older RFCs 4646, 3066 and 1766.
Language tag syntax is defined by the IETF's BCP 47. In the past it was necessary to consult lists of codes in various ISO standards to find the right subtags, but now you only need to look in the IANA Language Subtag Registry. We will describe the new registry below.
This article provides advice on how to choose the components of a language tag. For an overview of the concepts defined in BCP 47, see Language tags in HTML and XML.
Addison Phillips and Mark Davis, authors of BCP 47, provided guidance during the writing of this article.
All the subtags you will need to create a language tag are found in one place, the IANA Language Subtag Registry. The registry is a long text file, containing nearly 8,000 entries.
The notes on this page provide guidance that is sufficient for most people wanting to use language tags. There are links to relevant sections of BCP 47 in this margin for people who want to read the full text of the specification.
Note, also, that some environments or systems may dictate choices that are different from what you would otherwise expect. For example, in Java you must use "iw" (deprecated in BCP 47) in place of "he" (recommended in BCP 47).
The first (and often only) subtag in a language tag always designates a language. It is referred to in BCP 47 as the primary language subtag. We will use that term in this document to refer to the subtag that represents a language, to more clearly make the distinction from 'language tag', which refers to the whole thing.
To find a primary language subtag, search the registry for the name of that language. For example, if you want to label something as French, searching for 'French' in the registry will bring you to a record that looks like this:
%%
Type: language
Subtag: fr
Description: French
Added: 2005-10-16
Suppress-Script: Latn
%%
Your search will have matched against the Description field. Check that the type of this record is language. What you are looking for is the value in the Subtag field, ie. fr.
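As a rough illustration of the record layout, the following Python sketch splits a registry-style record into its fields. The function name parse_record and the embedded sample text are assumptions made for this example; the real registry also wraps long field values across lines, which this sketch does not handle.

```python
def parse_record(text):
    """Parse one registry-style record into a dict of field name -> list of values."""
    fields = {}
    for line in text.strip().splitlines():
        line = line.strip()
        if not line or line == "%%":
            continue  # skip record separators and blank lines
        name, _, value = line.partition(":")
        # A field such as Prefix may occur more than once, so collect values in a list.
        fields.setdefault(name.strip(), []).append(value.strip())
    return fields


# Sample record copied from the French example in this article.
SAMPLE = """\
%%
Type: language
Subtag: fr
Description: French
Added: 2005-10-16
Suppress-Script: Latn
%%
"""

record = parse_record(SAMPLE)
if record["Type"] == ["language"] and "French" in record["Description"]:
    print(record["Subtag"][0])  # prints: fr
```

Values are kept as lists because some fields can be repeated within a single record.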
The rest of this article will provide advice for choosing primary language subtags and, where needed, other types of subtag. Note that not all the decisions about how to create a language tag are straightforward. There are circumstances where usage will dictate which of various possibilities you should follow.
There are tools available which provide additional help while searching the registry, such as Richard Ishida's Language Subtag Lookup tool.
Think about letter-case. By convention, primary language subtags are lowercase, script subtags begin with an uppercase letter, and continue with lowercase, and region subtags are uppercase. This is only a convention, however, and you are free to use whatever letter-casing you like.
On the other hand, you may be using language tags in a context where letter-case is important, such as file names on some systems. In such cases, you should ensure that you follow a consistent policy for letter-case; for any new system that is case-sensitive, it is recommended that you follow the BCP 47 conventions.
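If you do need a consistent letter-case policy, the conventions can be applied mechanically. The sketch below is one possible approach, not a definitive implementation: the function name conventional_case is hypothetical, the input is assumed to be a well-formed tag, and subtags are judged purely by length and position.

```python
def conventional_case(tag):
    """Apply the conventional letter-case to a well-formed language tag."""
    result = []
    after_singleton = False
    for position, subtag in enumerate(tag.split("-")):
        if position > 0 and not after_singleton and len(subtag) == 4 and subtag.isalpha():
            result.append(subtag.capitalize())  # script: initial capital, e.g. Latn
        elif position > 0 and not after_singleton and len(subtag) == 2:
            result.append(subtag.upper())       # region: uppercase, e.g. TW
        else:
            result.append(subtag.lower())       # everything else: lowercase
        if len(subtag) == 1:
            # Subtags after a singleton (extensions, private use) stay lowercase.
            after_singleton = True
    return "-".join(result)


print(conventional_case("ZH-latn-tw-PINYIN"))  # prints: zh-Latn-TW-pinyin
print(conventional_case("EN-us-X-TWAIN"))      # prints: en-US-x-twain
```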
You always start by choosing a primary language subtag, and often this is all you'll need for your language tag.
Always bear in mind that the golden rule is to keep your language tag as short as possible. Only add further subtags to your language tag if they are needed to distinguish the language from something else in the context where your content is used.
When looking for a primary language subtag, there are a number of things to bear in mind.
You could look up language information in the SIL Ethnologue and cross-reference that information with Wikipedia. The Ethnologue uses the same three-letter codes as BCP 47, but you'll need to convert BCP 47 2-letter codes to their ISO 639-3 counterparts to look up a language by code. (Richard Ishida's tool does this for you.)
There are a small number of cases where different language codes are available for what many people would regard as the same language, eg. Filipino and Tagalog, or Twi and Akan. There is no indication in the registry as to which you should use, but you should try to ensure that within a single application or context you are consistent.
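Some subtags in the registry represent a collection of languages rather than an individual language. Where you can, you should look for a more specific subtag for the language you are interested in. Unfortunately, the subtag registry doesn't provide any pointers for this.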
You can use these collection subtags if there is no more specific subtag available, and it is always preferable to use one of these rather than the subtags mul (multiple languages) or und (undetermined).
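Some language subtags have a Scope field set to macrolanguage in the registry, which means that they encompass a number of more specific language subtags. For example, ku (Kurdish) is a macrolanguage that encompasses ckb (Central Kurdish), kmr (Northern Kurdish), and sdh (Southern Kurdish).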
You can find the more specific (ie. the encompassed) subtags by searching the registry for Macrolanguage: <subtag_name>. Alternatively, Richard Ishida's lookup tool will automatically list these for a given macrolanguage.
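The registry search described above can be sketched in Python as follows. The RECORDS list is a hand-written excerpt containing only subtags mentioned in this article, and the function name encompassed_by is hypothetical; a real application would load the full registry instead.

```python
# Hypothetical excerpt: only the fields needed here, for subtags named in this article.
RECORDS = [
    {"Type": "language", "Subtag": "ku", "Scope": "macrolanguage"},
    {"Type": "language", "Subtag": "ckb", "Macrolanguage": "ku"},
    {"Type": "language", "Subtag": "kmr", "Macrolanguage": "ku"},
    {"Type": "language", "Subtag": "sdh", "Macrolanguage": "ku"},
    {"Type": "language", "Subtag": "cmn", "Macrolanguage": "zh"},
    {"Type": "language", "Subtag": "hak", "Macrolanguage": "zh"},
]


def encompassed_by(macrolanguage, records):
    """List the language subtags whose Macrolanguage field names the given subtag."""
    return sorted(
        r["Subtag"]
        for r in records
        if r.get("Type") == "language" and r.get("Macrolanguage") == macrolanguage
    )


print(encompassed_by("ku", RECORDS))  # prints: ['ckb', 'kmr', 'sdh']
```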
As we recommended for the collection subtags mentioned above, in most cases you should try to use the more specific subtags, but there are a small number of important exceptions. These are situations where you should continue using a macrolanguage subtag for reasons of backward compatibility.
For example, BCP 47 explains that zh (the macrolanguage subtag for Chinese) does not specify which of the many, sometimes mutually unintelligible, dialects of Chinese is meant. In practice, however, convention overwhelmingly associates the macrolanguage subtag with the predominant language among the encompassed subtags: in this case, cmn (Mandarin Chinese). If your application identified Mandarin Chinese in the past using the language tag zh-CN (Chinese as used in Mainland China), or even just zh, you can continue to use zh in this way. Using cmn or cmn-CN may cause serious compatibility problems if the software or users expect a tag such as zh.
If, on the other hand, you are using zh to refer to another Chinese dialect such as Hakka, you should use the language subtag hak instead.
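Avoid deprecated subtags. For example, the registry record for iw, the deprecated subtag for Hebrew, contains the following fields:
Deprecated: 1989-01-01
Preferred-Value: he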
This indicates that you should use the subtag he for Hebrew instead.
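A replacement of this kind is easy to automate once the Preferred-Value fields have been read from the registry. The sketch below hard-codes only the Hebrew case (deprecated iw, preferred he) discussed in this article; the table PREFERRED_VALUE and the function replace_deprecated are hypothetical names, and a real application would build the table from the full registry.

```python
# Illustrative excerpt only: deprecated subtag -> Preferred-Value.
PREFERRED_VALUE = {"iw": "he"}


def replace_deprecated(subtag, preferred=PREFERRED_VALUE):
    """Return the Preferred-Value for a deprecated subtag, or the subtag unchanged."""
    return preferred.get(subtag.lower(), subtag)


print(replace_deprecated("iw"))  # prints: he
print(replace_deprecated("fr"))  # prints: fr
```

As noted earlier, some environments still expect the deprecated form, so a mapping like this should only be applied where the target system accepts the preferred value.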
In the past, when dealing with lists of ISO codes, there were sometimes multiple codes for a given language - there could be a 2-letter code and one or two 3-letter codes. This ambiguity is resolved by the IANA Subtag Registry: only one code is listed per language. (If an ISO 2-letter code exists, that will be the code, otherwise it will be a three-letter code.) The registry maintainer also coordinates the ongoing evolution of the registry with developments in the ISO world.
The BCP 47 specification allows for an additional, 3-letter subtag immediately after the initial primary language subtag. This is called an extended language subtag (abbreviated to extlang). Only a relatively small number of extended language subtags are defined, and they each need to be used with a specific primary language subtag (given in the Prefix field of the entry for the extended language subtag in the registry).
Currently only seven primary language subtags can be used with extended language subtags. Six of those have a Scope field set to macrolanguage in the registry (ar, kok, ms, sw, uz, and zh), and the other is sgn.
Consider the following:
Where possible, use a single language subtag, rather than the language+extlang pair. There is always a 3-letter subtag that is equivalent to any language+extlang pairing, and it is always the same as the extlang subtag. For example, zh-yue (Cantonese Chinese) can also be expressed with the single subtag yue.
The only significant exception is where the language+extlang sequence is established practice for the system you are working with; that is, where zh-yue would be preferred rather than yue to maintain backwards compatibility.
Similarly, when dealing with the predominant language in the set, it is generally better for backwards compatibility if you replace the language+extlang sequence by just dropping the extlang, rather than using the extlang code as a primary language subtag. For example, reducing ms-zsm to ms (Malay macrolanguage subtag) may sometimes be better than replacing it with zsm (Standard Malay).
As an example of usage, Unicode's CLDR database uses macrolanguages zh to represent Mandarin Chinese and ku to represent Kurdish. Thus for Mandarin Chinese you would use zh, not cmn, and for Northern Kurdish you would use ku-Latn, not kmr-Latn. The CLDR database, however, does not use extended language subtags, so you would need to use yue for Cantonese, not zh-yue.
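The two ways of simplifying a language+extlang pair described above can be sketched as follows. This is a minimal illustration that recognises an extended language subtag purely by its shape (three letters directly after the primary language subtag); the function name simplify_extlang and its keep parameter are hypothetical, and the input is assumed to be a well-formed, non-grandfathered tag.

```python
def simplify_extlang(tag, keep="extlang"):
    """Collapse a language+extlang pair to a single primary language subtag.

    keep="extlang"       -> zh-yue becomes yue (the usual recommendation)
    keep="macrolanguage" -> ms-zsm becomes ms (for backwards compatibility)
    """
    subtags = tag.split("-")
    has_extlang = (
        len(subtags) >= 2
        and subtags[0].isalpha() and len(subtags[0]) in (2, 3)
        and subtags[1].isalpha() and len(subtags[1]) == 3
    )
    if not has_extlang:
        return tag  # nothing to simplify
    if keep == "extlang":
        return "-".join(subtags[1:])
    return "-".join(subtags[:1] + subtags[2:])


print(simplify_extlang("zh-yue"))                        # prints: yue
print(simplify_extlang("ms-zsm", keep="macrolanguage"))  # prints: ms
```

Which option is appropriate depends on the conventions of the system you are working with, as described above.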
Script subtags should only be used as part of a language tag when the script adds some useful distinguishing information to the tag. Usually this is because a language is written in more than one script or because the content has been transcribed into a script that is unusual to the language (so one might tag Russian transcribed into the Latin script with a tag such as ru-Latn).
Script subtags are always 4 letters, and must come after any language or extended language subtag, but before any other subtags.
Here are things to look out for when choosing a script subtag.
The script subtag Zxxx could be used for non-written content, eg. uz-Zxxx, as Zxxx is the Code for unwritten documents, but again this is only useful if such a distinction has to be made clear.
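Some language subtag records contain a Suppress-Script field. For example, the record for en (English) contains Suppress-Script: Latn, meaning that you should not use the Latn (Latin) script subtag with this language.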
This is because nearly all English documents are written in the Latin script and it adds no distinguishing information. However, if a document were written in English mixing Latin script with another script such as Braille (Brai), then it might be appropriate to indicate both scripts to aid in content selection (eg. for the application of style rules).
Note, however, that not all language subtags that are strongly associated with a given script have suppress-script fields. You should not assume that you need to use a script if a suppress-script field is absent.
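The Suppress-Script check can be sketched as below. The table holds only the two Suppress-Script values mentioned in this article (for en and fr); the names SUPPRESS_SCRIPT and drop_suppressed_script are hypothetical, and the script is assumed to be the second subtag, so tags with an extended language subtag are not handled.

```python
# Illustrative excerpt of Suppress-Script values; the full registry has many more.
SUPPRESS_SCRIPT = {"en": "Latn", "fr": "Latn"}


def drop_suppressed_script(tag, suppress=SUPPRESS_SCRIPT):
    """Remove the script subtag if it equals the language's Suppress-Script value."""
    subtags = tag.split("-")
    suppressed = suppress.get(subtags[0].lower())
    if suppressed and len(subtags) > 1 and subtags[1].lower() == suppressed.lower():
        del subtags[1]
    return "-".join(subtags)


print(drop_suppressed_script("en-Latn-GB"))  # prints: en-GB
print(drop_suppressed_script("ru-Latn"))     # prints: ru-Latn
```

A language with no entry in the table is left untouched, which matches the advice above: the absence of a Suppress-Script field says nothing either way about whether a script subtag is needed.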
Region subtags associate the language subtag you have chosen with a particular region of the world. Region subtags must come after any language or script subtags.
Like script subtags, you should only use a region subtag if it contributes information needed in a particular context to distinguish this language tag from another one; otherwise leave it out.
For example, en-GB might be a useful distinction for spell-checking, but the region subtag in ja-JP is unlikely to be useful unless you are intentionally contrasting it with Japanese spoken in other parts of the world.
There are two types of region subtag: 2-letter codes and 3-digit codes. The latter tend to identify multinational regions, rather than specific countries. For example, es-ES means Spanish as spoken in Spain, whereas es-419 means Spanish as spoken in Latin America.
Avoid deprecated subtags. Check that the subtag you intend to use isn't deprecated. In the same way as for other types of subtag, the registry will normally tell what the replacement should be via the Preferred-Value field.
In some cases there is no Preferred-Value field in a deprecated record, but sometimes the Comments field contains advice. For example, under YU (Yugoslavia) you will find:
Deprecated: 2003-07-23
Comments: see BA, HR, ME, MK, RS, or SI
Again, only use variant subtags when there is a need to distinguish this language tag from another similar one in the context in which your content is used.
Variant subtags describe additional distinctions not captured by the other subtags. Typically these are dialects, written variations (such as spelling reforms), transcriptions, and the like. A variant subtag is usually five to eight characters long and can contain letters and/or digits. A few four digit subtags (usually representing a year) are also registered. Variant subtags must come after any language, script, and region subtags.
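The ordering described so far (language, optional extended language, script, region, then variants) can be illustrated with a small Python sketch that splits a tag into its parts. It judges each subtag by its shape only and does not consult the registry; the function name split_tag and the result keys are hypothetical, and the input is assumed to be a well-formed, non-grandfathered tag.

```python
import re


def split_tag(tag):
    """Split a well-formed tag into parts, judging each subtag by its shape only."""
    subtags = tag.split("-")
    parts = {"language": subtags[0].lower(), "extlang": None, "script": None,
             "region": None, "variants": [], "rest": []}
    i = 1
    if i < len(subtags) and re.fullmatch(r"[A-Za-z]{3}", subtags[i]):
        parts["extlang"] = subtags[i].lower()   # 3 letters right after the language
        i += 1
    if i < len(subtags) and re.fullmatch(r"[A-Za-z]{4}", subtags[i]):
        parts["script"] = subtags[i].capitalize()  # always 4 letters
        i += 1
    if i < len(subtags) and re.fullmatch(r"[A-Za-z]{2}|[0-9]{3}", subtags[i]):
        parts["region"] = subtags[i].upper()    # 2 letters or 3 digits
        i += 1
    while i < len(subtags) and re.fullmatch(
            r"[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}", subtags[i]):
        parts["variants"].append(subtags[i].lower())  # 5-8 characters, or 4 starting with a digit
        i += 1
    parts["rest"] = subtags[i:]                 # extensions and private use, if any
    return parts


print(split_tag("zh-Latn-TW-pinyin"))
```

For zh-Latn-TW-pinyin this yields language zh, script Latn, region TW and the single variant pinyin, with nothing left over.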
The key thing to look out for when using variant subtags is the order in which they are used.
Check the context and ordering for variant subtags. Most variant subtag records in the registry have one or more Prefix fields. The prefixes indicate with which subtags it is usually appropriate to use this variant. For example, pinyin should generally be used in a language tag that also contains either the subtags zh and Latn or the subtags bo and Latn, since the entry for pinyin contains the following:
Prefix: zh-Latn
Prefix: bo-Latn
If you have a good reason, you could use a variant subtag with different subtags, eg. cmn-Latn-pinyin would be a perfectly legal way to say Mandarin Chinese written with pinyin.
Although zh, bo and Latn are specified, this is a minimum requirement. It is also possible to include other subtags, such as a region subtag, in the language tag (where appropriate), eg. zh-Latn-TW-pinyin.
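A check against the Prefix fields might be sketched like this. It treats a prefix as satisfied when all of its subtags appear, in order, somewhere before the variant, which allows extra subtags such as a region in between. The function name prefix_satisfied is hypothetical, and the matching rule is a simplification chosen for this example rather than a quotation from BCP 47.

```python
def prefix_satisfied(tag, variant, prefixes):
    """True if the subtags before the variant contain all subtags of some prefix, in order."""
    subtags = [s.lower() for s in tag.split("-")]
    if variant.lower() not in subtags:
        return False
    before = subtags[:subtags.index(variant.lower())]
    for prefix in prefixes:
        wanted = [s.lower() for s in prefix.split("-")]
        remaining = iter(before)
        # Subsequence test: each prefix subtag must be found after the previous one.
        if all(s in remaining for s in wanted):
            return True
    return False


PINYIN_PREFIXES = ["zh-Latn", "bo-Latn"]  # from the registry entry quoted above

print(prefix_satisfied("zh-Latn-TW-pinyin", "pinyin", PINYIN_PREFIXES))  # prints: True
print(prefix_satisfied("cmn-Latn-pinyin", "pinyin", PINYIN_PREFIXES))    # prints: False
```

A False result does not mean the tag is illegal. As noted above, cmn-Latn-pinyin is legal; the check only flags tags that depart from the registered prefixes so that you can confirm the departure is intentional.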
Amongst other prefix fields, the entry for variant subtag 1994 contains one which indicates that it should be used in a language tag that already contains two other variant subtags, rozaj and biske. Any variant subtag specified in a prefix field should come before the variant you have just looked up.
There are some variant subtags that have no prefix field, eg. fonipa (International Phonetic Alphabet). Such variants should appear after any other variant subtags with prefix information.
If you plan to use more than one variant without a prefix, order them in terms of decreasing significance. If they are equally significant, order them alphabetically. This will aid interoperability.
These single-character subtags allow for extensions to the language tag. To date, only one extension subtag has been registered. The subtag u was registered by the Unicode Consortium to add information about language or locale behavior. Many locale identifiers require additional "tailorings" or options for specific values within a language, culture, region, or other variation. This extension provides a mechanism for using these additional tailorings within language tags for general interchange.
For example, a tag such as de-DE-u-co-phonebk indicates that phonebook collation order should be used by an application, that sorted data in a document is sorted according to this collation, and so on.
The u- extension is defined in RFC 6067, which points to the Unicode Consortium's Common Locale Data Repository (CLDR) for details on the subtags that follow it. It is not defined by BCP 47.
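To show how the subtags after the u singleton are structured, here is a minimal Python sketch that reads them as key and type pairs, with two-character keys followed by longer type subtags. The example tag de-DE-u-co-phonebk, with the CLDR collation key co and type phonebk, is an assumption chosen for illustration, as is the function name unicode_extension. The sketch ignores attributes that precede the first key and assumes that no private-use section comes before the extension.

```python
def unicode_extension(tag):
    """Return {key: [type subtags]} for the u extension of a tag, or {} if there is none."""
    subtags = tag.lower().split("-")
    if "u" not in subtags:
        return {}
    result = {}
    key = None
    for subtag in subtags[subtags.index("u") + 1:]:
        if len(subtag) == 1:
            break  # another singleton starts a different extension or private use
        if len(subtag) == 2:
            key = subtag            # two-character subtags are keys
            result[key] = []
        elif key is not None:
            result[key].append(subtag)  # longer subtags belong to the current key
    return result


print(unicode_extension("de-DE-u-co-phonebk"))  # prints: {'co': ['phonebk']}
```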
Private-use subtags do not appear in the subtag registry, and are chosen and maintained by private agreement between the parties that use them. They are introduced by a single letter subtag, or 'singleton'. The singleton for private use is x. Note that each subtag after the singleton can be at most 8 characters long, though you can use multiple subtags.
Private use subtags should be used with great care, and avoided whenever possible, since they interfere with the interoperability that BCP 47 exists to promote.
As an example of private use, the tag en-US-x-twain may identify a specific type of US English, but only within a closed community. Outside of that private agreement, its meaning cannot be relied upon.
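The length restriction on private-use subtags can be checked with a few lines of Python. This is a sketch under simple assumptions: the function name private_use_subtags is hypothetical, and only the subtags after the x singleton are examined, not the rest of the tag.

```python
import re


def private_use_subtags(tag):
    """Return the subtags after the x singleton, or [] if the tag has none."""
    subtags = tag.split("-")
    lowered = [s.lower() for s in subtags]
    if "x" not in lowered:
        return []
    private = subtags[lowered.index("x") + 1:]
    for subtag in private:
        # Each private-use subtag is limited to 8 letters or digits.
        if not re.fullmatch(r"[A-Za-z0-9]{1,8}", subtag):
            raise ValueError(
                f"private-use subtag {subtag!r} must be 1 to 8 letters or digits")
    return private


print(private_use_subtags("en-US-x-twain"))  # prints: ['twain']
```

The function only confirms that the subtags are well-formed. Their meaning still depends entirely on the private agreement between the parties using them.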
Grandfathered tags are special cases, provided for backwards compatibility. They are tags that were registered before RFC 4646 that cannot be completely composed from the subtags in the current registry, or do not fit the syntax currently defined for language tags.
Nearly all grandfathered tags have been superseded by subtags or combinations of subtags in the registry. Such grandfathered tags are now deprecated, and usually contain a Preferred-Value field that indicates how you ought to represent that language instead. For instance, the entry in the registry for the grandfathered tag art-lojban indicates that you should use the jbo language subtag instead.
Note that you should not use additional subtags with a grandfathered tag.
Content first published 2009-12-03. Last substantive update 2011-02-22 20:40 GMT. This version 2011-02-22 20:40 GMT
For the history of document changes, search for qa-choosing-language-tags in the i18n blog.
Copyright © 2009-2011 W3C® (MIT, ERCIM, Keio, Beihang), All Rights Reserved. W3C liability, trademark, document use and software licensing rules apply. Your interactions with this site are in accordance with our public and Member privacy statements.