Intended audience: HTML coders (using editors or scripting), script developers (PHP, JSP, etc.), CSS coders, schema developers (DTDs, XML Schema, RelaxNG, etc.), XSLT developers, Web project managers, and anyone who needs guidance on how to construct language tags.
Which language tag is right for me? How do I choose language and other subtags?
In HTML and XML documents a language tag is used to indicate the language of content.
A language tag is composed of one or more subtags separated by hyphens. Subtags can be of various types.
BCP stands for 'Best Current Practice', and is a persistent name for a series of RFCs whose numbers change as they are updated. The latest RFC describing language tag syntax is RFC 5646, Tags for the Identification of Languages, and it obsoletes the older RFCs 4646, 3066, and 1766.
Language tag syntax is defined by the IETF's BCP 47. In the past it was necessary to consult lists of codes in various ISO standards to find the right subtags, but now you only need to look in the IANA Language Subtag Registry. We will describe the new registry below.
This article provides advice on how to choose the components of a language tag. For an overview of the concepts defined in BCP 47, see Language tags in HTML and XML.
All the subtags you will need to create a language tag are found in one place, the IANA Language Subtag Registry. The registry is a long text file, containing nearly 8,000 entries.
The first (and often only) subtag in a language tag always designates a language. It is referred to in BCP 47 as the primary language subtag. We will use that term in this document to refer to the subtag that represents a language, to more clearly make the distinction from 'language tag', which refers to the whole thing.
To find a primary-language subtag, search the page for the name of that language. For example, if you want to label something as French, searching for 'French' in the registry will bring you to a record that looks like this:
%%
Type: language
Subtag: fr
Description: French
Added: 2005-10-16
Suppress-Script: Latn
%%
Your search will have matched against the Description field. Check that the type of this record is language. What you are looking for is the value in the Subtag field (fr in the record above).
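The lookup described above can be sketched in a few lines of Python. This is only an illustration: the sample text is the single French record shown earlier, the function names are hypothetical, and the parser is deliberately simplified (it assumes one field per line and ignores the folded continuation lines that the real registry file can contain).

```python
# Sample record copied from the article; a real script would read the
# full registry file instead of this one-record string.
SAMPLE_REGISTRY = """\
%%
Type: language
Subtag: fr
Description: French
Added: 2005-10-16
Suppress-Script: Latn
%%
"""


def parse_records(text):
    """Split registry text into a list of {field name: [values]} dicts.

    Simplified: assumes one field per line and no folded lines.
    """
    records = []
    for chunk in text.split("%%"):
        record = {}
        for line in chunk.strip().splitlines():
            name, sep, value = line.partition(":")
            if sep:
                record.setdefault(name.strip(), []).append(value.strip())
        if record:
            records.append(record)
    return records


def find_language_subtags(records, name):
    """Return subtags of 'language' records whose Description equals name."""
    wanted = name.lower()
    return [
        rec["Subtag"][0]
        for rec in records
        if rec.get("Type") == ["language"]
        and any(d.lower() == wanted for d in rec.get("Description", []))
    ]


print(find_language_subtags(parse_records(SAMPLE_REGISTRY), "French"))  # ['fr']
```

Checking the Type field before reading Subtag mirrors the advice above: a Description match alone is not enough, because other record types can carry similar descriptions.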
The rest of this article will provide advice for choosing primary language subtags and, where needed, other types of subtag. Note that not all the decisions about how to create a language tag are straightforward. There are circumstances where usage will dictate which of various possibilities you should follow.
There are tools available which provide additional help while searching the registry, such as the Language Subtag Lookup tool.
Think about letter-case. By convention, primary language subtags are lowercase, script subtags begin with an uppercase letter, and continue with lowercase, and region subtags are uppercase. This is only a convention, however, and you are free to use whatever letter-casing you like.
On the other hand, you may be using language tags in a context where letter-case is important, such as file names on some systems. In such cases, you should ensure that you follow a consistent policy for letter-case; for any new system that is not case-insensitive, it is recommended that you follow the BCP 47 conventions.
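Where a consistent letter-case policy is needed, the conventions above can be applied mechanically. The following is a minimal sketch, assuming the tag is already syntactically valid; the function name is hypothetical. It follows the rule in RFC 5646 that two-letter and four-letter subtags are specially cased only when they are not at the start of the tag and do not follow a singleton.

```python
def normalize_case(tag):
    """Apply the BCP 47 letter-case conventions to a language tag.

    Assumes the tag is already well-formed; no validation is done.
    """
    result = []
    after_singleton = False
    for index, subtag in enumerate(tag.split("-")):
        if index == 0 or after_singleton:
            # First subtag, and everything after a singleton: lowercase.
            result.append(subtag.lower())
        elif len(subtag) == 4:
            # Four characters: script-style title case (digits unchanged).
            result.append(subtag.capitalize())
        elif len(subtag) == 2:
            # Two characters: region-style uppercase.
            result.append(subtag.upper())
        else:
            result.append(subtag.lower())
        if len(subtag) == 1:
            after_singleton = True
    return "-".join(result)


print(normalize_case("ZH-hans-cn"))     # zh-Hans-CN
print(normalize_case("EN-us-X-TWAIN"))  # en-US-x-twain
```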
You always start by choosing a primary language subtag, and often this is all you'll need for your language tag.
Always bear in mind that the golden rule is to keep your language tag as short as possible. Only add further subtags to your language tag if they are needed to distinguish the language from something else in the context where your content is used.
When looking for a primary language subtag, there are a number of things to bear in mind.
You could look up language information in the SIL Ethnologue and cross-reference that information with Wikipedia. The Ethnologue uses the same three-letter codes as BCP 47, but you'll need to convert BCP 47 two-letter codes to their ISO 639-3 counterparts to look up a language by code. (The Language Subtag Lookup tool does this for you.)
There are a small number of cases where different language codes are available for what many people would regard as the same language, eg. Filipino and Tagalog, or Twi and Akan. There is no indication in the registry as to which you should use, but you should try to ensure that within a single application or context you are consistent.
If a record's Scope field is set to collection, the subtag represents a group of languages that are descended from a common ancestor, are spoken in the same geographical area, or are otherwise related.
You should look for a more specific subtag for the language you are interested in. Unfortunately, the subtag registry doesn't provide any pointers for this.
You can use these subtags if there is no more specific subtag available, and it is always preferable to use one of these rather than a catch-all subtag such as mul (multiple languages).
If a record has a Scope field set to macrolanguage, the primary language subtag encompasses a number of more specific primary language subtags in the registry.
For example, ku (Kurdish) is a macrolanguage that encompasses ckb (Central Kurdish), kmr (Northern Kurdish), and sdh (Southern Kurdish).
You can find the more specific (ie. the encompassed) subtags by searching the registry for Macrolanguage: <subtag_name>. Alternatively, the Language Subtag Lookup tool will automatically list these for a given macrolanguage.
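The registry search described above could be scripted roughly as follows. This is a sketch under stated assumptions: the sample records are abridged (only the fields needed here are shown), the function names are hypothetical, and the parser assumes one field per line with no folded lines.

```python
# Abridged sample records; the real registry entries carry more fields.
SAMPLE_REGISTRY = """\
%%
Type: language
Subtag: ckb
Description: Central Kurdish
Macrolanguage: ku
%%
Type: language
Subtag: fr
Description: French
%%
Type: language
Subtag: kmr
Description: Northern Kurdish
Macrolanguage: ku
%%
Type: language
Subtag: sdh
Description: Southern Kurdish
Macrolanguage: ku
%%
"""


def parse_records(text):
    """Parse registry text into a list of {field name: value} dicts."""
    records = []
    for chunk in text.split("%%"):
        fields = {}
        for line in chunk.strip().splitlines():
            name, sep, value = line.partition(":")
            if sep:
                fields[name.strip()] = value.strip()
        if fields:
            records.append(fields)
    return records


def encompassed_by(records, macrolanguage):
    """List subtags whose Macrolanguage field names the given subtag."""
    return [
        rec["Subtag"]
        for rec in records
        if "Subtag" in rec and rec.get("Macrolanguage") == macrolanguage
    ]


print(encompassed_by(parse_records(SAMPLE_REGISTRY), "ku"))
# ['ckb', 'kmr', 'sdh']
```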
As we recommended for the collection subtags mentioned above, in most cases you should try to use the more specific subtags, but there are a small number of important exceptions. These are situations where you should continue using a macrolanguage subtag for reasons of backward compatibility.
For example, although BCP 47 explains that zh (the macrolanguage subtag for Chinese) doesn't actually specify which of the many, sometimes mutually unintelligible, dialects of Chinese is meant, in practice convention overwhelmingly associates the macrolanguage subtag with the predominant language among the encompassed subtags: in this case, cmn (Mandarin Chinese). If your application identified Mandarin Chinese in the past using the language tag zh-CN (Chinese as used in Mainland China), or even just zh, you can continue to use zh in this way. Using cmn-CN may cause serious compatibility problems if the software or users expect a tag such as zh-CN.
If, on the other hand, you are using zh to refer to another Chinese dialect such as Hakka, you should use the more specific language subtag for that dialect instead.
If a record has a Deprecated field, you shouldn't use that subtag. Usually the registry will indicate which alternative you should use in the Preferred-Value field. For example, the subtag record for iw (Hebrew) contains the following two fields:
Deprecated: 1989-01-01
Preferred-Value: he
This indicates that you should use the subtag he for Hebrew instead.
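The deprecation check can be expressed as a small lookup. The sketch below assumes the relevant registry records have already been loaded into a dictionary keyed by subtag; the RECORDS table and the function name are hypothetical, and only the iw and he entries mentioned above are included.

```python
# Hypothetical, hand-built excerpt of registry data keyed by subtag.
RECORDS = {
    "iw": {
        "Description": "Hebrew",
        "Deprecated": "1989-01-01",
        "Preferred-Value": "he",
    },
    "he": {"Description": "Hebrew"},
}


def preferred_subtag(subtag, records=RECORDS):
    """Return the subtag to use in place of a possibly deprecated one.

    Returns the subtag itself when it is not deprecated, the
    Preferred-Value when one is given, and None when the record is
    deprecated without a stated replacement.
    """
    key = subtag.lower()
    record = records[key]  # raises KeyError for unknown subtags
    if "Deprecated" not in record:
        return key
    return record.get("Preferred-Value")


print(preferred_subtag("iw"))  # he
print(preferred_subtag("he"))  # he
```

Returning None for a deprecated record without a Preferred-Value leaves room for a human to consult the Comments field, which is where the registry sometimes puts its advice.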
In the past, when dealing with lists of ISO codes, there were sometimes multiple codes for a given language - there could be a 2-letter code and one or two 3-letter codes. This ambiguity is resolved by the IANA Subtag Registry: only one code is listed per language. (If an ISO 2-letter code exists, that will be the code, otherwise it will be a three-letter code.) The registry maintainer also coordinates the ongoing evolution of the registry with developments in the ISO world.
The BCP 47 specification allows for an additional, 3-letter subtag immediately after the initial primary language subtag. This is called an extended language subtag (abbreviated to extlang). Only a relatively small number of extended language subtags are defined, and they each need to be used with a specific primary language subtag (given in the Prefix field of the entry for the extended language subtag in the registry).
Currently only seven primary language subtags can be used with extended language subtags. Six of those, zh among them, have a Scope field set to macrolanguage in the registry; the seventh does not.
Consider the following:
Where possible, use a single language subtag, rather than the language+extlang pair.
There is always a 3-letter subtag that is equivalent to any language+extlang pairing, and it is always the same as the extlang subtag. For example, zh-yue (Cantonese Chinese) can also be expressed with the single subtag yue.
The only significant exception is where the language+extlang sequence is established practice for the system you are working with; that is, where zh-yue would be preferred rather than yue to maintain backwards compatibility.
ar (the Arabic macrolanguage subtag) may be more appropriate for Standard Arabic than arb (the more specific, encompassed subtag that means Standard Arabic).
Similarly, when dealing with the predominant language in the set, it is generally better for backwards compatibility if you replace the language+extlang sequence by just dropping the extlang, rather than using the extlang code as a primary language subtag. For example, reducing the sequence to ms (Malay macrolanguage subtag) may sometimes be better than replacing it with zsm (Standard Malay).
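The two ways of collapsing a language+extlang pair discussed above can be sketched as follows. The EXTLANG_PREFIX table is a small hand-picked excerpt (each extlang mapped to the primary language subtag in its Prefix field), and the function name and keep_macrolanguage flag are hypothetical.

```python
# Hand-picked excerpt: extlang subtag -> required primary language subtag.
EXTLANG_PREFIX = {
    "yue": "zh",
    "cmn": "zh",
    "zsm": "ms",
    "arb": "ar",
}


def simplify_extlang(tag, keep_macrolanguage=False):
    """Collapse a language+extlang pair into a single language subtag.

    By default the extlang becomes the primary language subtag
    (zh-yue -> yue). With keep_macrolanguage=True the extlang is dropped
    instead (ms-zsm -> ms), the backwards-compatible choice for the
    predominant language of a macrolanguage.
    """
    subtags = tag.split("-")
    if (
        len(subtags) >= 2
        and EXTLANG_PREFIX.get(subtags[1].lower()) == subtags[0].lower()
    ):
        if keep_macrolanguage:
            del subtags[1]
        else:
            del subtags[0]
    return "-".join(subtags)


print(simplify_extlang("zh-yue"))                           # yue
print(simplify_extlang("zh-yue-HK"))                        # yue-HK
print(simplify_extlang("ms-zsm", keep_macrolanguage=True))  # ms
```

Which option is appropriate remains a judgement call that depends on what the consuming system expects; the code only shows the mechanics.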
As an example of usage, Unicode's CLDR database uses the macrolanguage subtags zh to represent Mandarin Chinese and ku to represent Kurdish. Thus for Mandarin Chinese you would use a tag based on zh rather than cmn, and for Northern Kurdish you would use a tag based on ku rather than kmr-Latn. The CLDR database, however, does not use extended language subtags, so you would need to use yue for Cantonese, not zh-yue.
Script subtags should only be used as part of a language tag when the script adds some useful distinguishing information to the tag. Usually this is because a language is written in more than one script or because the content has been transcribed into a script that is unusual for the language (for example, Russian transcribed into the Latin script).
Script subtags are always 4 letters, and must come after any language or extended language subtag, but before any other subtags.
Here are things to look out for when choosing a script subtag.
Written content might be tagged uz-Arab, for example, but the Arab script subtag would not be relevant for an audio track.
The script subtag Zxxx could be used for non-written content (Zxxx is the 'Code for unwritten documents'), but again this is only useful if such a distinction has to be made clear.
Some language records have a Suppress-Script field set to a given script subtag. For example, the entry in the registry for en (English) contains:
Suppress-Script: Latn
meaning that you should not use the Latn (Latin) script subtag with this language.
This is because nearly all English documents are written in the Latin script and it adds no distinguishing information. However, if a document were written in English mixing Latin script with another script such as Braille (Brai), then it might be appropriate to indicate both scripts to aid in content selection (eg. for the application of style rules).
Note, however, that not all language subtags that are strongly associated with a given script have suppress-script fields. You should not assume that you need to use a script if a suppress-script field is absent.
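A tool that generates language tags could apply the Suppress-Script advice automatically. The following is a minimal sketch: the SUPPRESS_SCRIPT table holds only the two entries quoted in this article, the function name is hypothetical, and the code assumes the script subtag, if present, directly follows the primary language subtag (no extlang in between).

```python
# Excerpt of Suppress-Script values quoted in this article.
SUPPRESS_SCRIPT = {
    "en": "Latn",
    "fr": "Latn",
}


def drop_suppressed_script(tag):
    """Remove a script subtag that the registry says adds no information.

    Assumes the script subtag, if any, is the second subtag.
    """
    subtags = tag.split("-")
    if len(subtags) >= 2:
        suppressed = SUPPRESS_SCRIPT.get(subtags[0].lower())
        if suppressed and subtags[1].lower() == suppressed.lower():
            del subtags[1]
    return "-".join(subtags)


print(drop_suppressed_script("en-Latn-GB"))  # en-GB
print(drop_suppressed_script("en-Brai"))     # en-Brai
```

Languages missing from the table are left untouched, in line with the note above that the absence of a Suppress-Script field says nothing either way.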
Region subtags associate the language subtag you have chosen with a particular region of the world. Region subtags must come after any language or script subtags.
Like script subtags, you should only use a region subtag if it contributes information needed in a particular context to distinguish this language tag from another one; otherwise leave it out.
For example, en-GB might be a useful distinction for spell-checking, but the region subtag in ja-JP is unlikely to be useful unless you are intentionally contrasting it with Japanese spoken in other parts of the world.
There are two types of region subtag: 2-letter codes and 3-digit codes. The latter tend to identify multinational regions, rather than specific countries. For example, es-ES means Spanish as spoken in Spain, whereas es-419 means Spanish as spoken in Latin America.
Avoid deprecated subtags.
Check that the subtag you intend to use isn't deprecated. In the same way as for other types of subtag, the registry will normally tell you what the replacement should be via the Preferred-Value field.
In some cases there is no Preferred-Value field in a deprecated record, but sometimes the Comments field contains advice. For example, under YU (Yugoslavia) you will find:
Deprecated: 2003-07-23
Comments: see BA, HR, ME, MK, RS, or SI
Again, only use variant subtags when there is a need to distinguish this language tag from another similar one in the context in which your content is used.
Variant subtags describe additional distinctions not captured by the other subtags. Typically these are dialects, written variations (such as spelling reforms), transcriptions, and the like. A variant subtag is usually five to eight characters long and can contain letters and/or digits. A few four-digit subtags (usually representing a year) are also registered. Variant subtags must come after any language, script, and region subtags.
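The ordering and shape rules given so far (language, then optional extlang, script, region, and variants) can be captured in a regular expression. The sketch below is a simplification rather than a full BCP 47 parser: it covers only these subtag types, ignores extension and private-use subtags and grandfathered tags, and does not check subtags against the registry. The names TAG_PATTERN and parse_tag are hypothetical.

```python
import re

# Simplified shape of a language tag: language[-extlang][-script][-region]
# followed by any number of variants.
TAG_PATTERN = re.compile(
    r"^(?P<language>[a-z]{2,3})"
    r"(?:-(?P<extlang>[a-z]{3}))?"
    r"(?:-(?P<script>[a-z]{4}))?"
    r"(?:-(?P<region>[a-z]{2}|[0-9]{3}))?"
    r"(?P<variants>(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*)$",
    re.IGNORECASE,
)


def parse_tag(tag):
    """Split a tag into its component subtags, or return None."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        return None
    parts = match.groupdict()
    # Turn "-rozaj-biske" into ["rozaj", "biske"].
    parts["variants"] = [v for v in parts["variants"].split("-") if v]
    return parts


print(parse_tag("zh-Latn-TW-pinyin"))
print(parse_tag("es-419")["region"])  # 419
```

Because each subtag type has a distinctive length or character class, position and shape are enough to tell them apart without consulting the registry.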
The key thing to look out for when using variant subtags is the order in which they are used.
Check the context and ordering for variant subtags.
Most variant subtag records in the registry have one or more Prefix fields. The prefixes indicate with which subtags it is usually appropriate to use this variant. For example, pinyin should generally be used in a language tag that also contains either the subtags zh and Latn or the subtags bo and Latn, since the entry for pinyin contains the following:
Prefix: zh-Latn
Prefix: bo-Latn
If you have a good reason, you could use a variant subtag with different subtags, eg. cmn-Latn-pinyin would be a perfectly legal way to say Mandarin Chinese written with pinyin.
Where a prefix specifies only certain subtags, such as zh and Latn, this is a minimum requirement. It is also possible to include other subtags, such as a region subtag, in the language tag where appropriate.
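A check of a variant against its registered prefixes might look like the sketch below. The VARIANT_PREFIXES table contains only the pinyin entry quoted above, and the function name is hypothetical. The prefix subtags are treated as a minimum requirement: they must all appear, in order, before the variant, but other subtags may sit between them.

```python
# Excerpt of registry data: variant subtag -> its Prefix field values.
VARIANT_PREFIXES = {
    "pinyin": ["zh-Latn", "bo-Latn"],
}


def matches_registered_prefix(tag, variant):
    """Report whether the subtags before the variant satisfy a Prefix.

    A False result does not make the tag illegal; it only means the
    variant is used outside its registered prefixes.
    """
    subtags = [s.lower() for s in tag.split("-")]
    variant = variant.lower()
    if variant not in subtags:
        return False
    before = subtags[: subtags.index(variant)]
    for prefix in VARIANT_PREFIXES.get(variant, []):
        remaining = iter(before)
        # Each prefix subtag must be found, in order, among the
        # subtags that precede the variant.
        if all(part in remaining for part in prefix.lower().split("-")):
            return True
    return False


print(matches_registered_prefix("zh-Latn-TW-pinyin", "pinyin"))  # True
print(matches_registered_prefix("cmn-Latn-pinyin", "pinyin"))    # False
```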
Amongst other prefix fields, the entry for one variant subtag has a prefix that itself contains two other variant subtags (the second being biske), which indicates that it should be used in a language tag that already contains those two variant subtags. Any variant subtag specified in a prefix field should come before the variant you have just looked up.
There are some variant subtags that have no prefix field, eg. fonipa (International Phonetic Alphabet). Such variants should appear after any other variant subtags with prefix information.
If you plan to use more than one variant without a prefix, order them in terms of decreasing significance. If they are equally significant, order them alphabetically. This will aid interoperability.
Single-character extension subtags allow for extensions to the language tag. To date, only one extension subtag has been registered. The subtag u was registered by the Unicode Consortium to add information about language or locale behavior. Many locale identifiers require additional "tailorings" or options for specific values within a language, culture, region, or other variation. This extension provides a mechanism for using these additional tailorings within language tags for general interchange.
For example, a tag carrying a u- extension for collation can indicate that phonebook collation order should be used by an application, that sorted data in a document is sorted according to this collation, and so on.
The u- extension is defined in RFC 6067, which points to the Unicode Consortium's Common Locale Data Repository (CLDR) for details on the subtags that follow it. It is not defined by BCP 47.
Private-use subtags do not appear in the subtag registry, and are chosen and maintained by private agreement between the parties that use them. They are introduced by a single-letter subtag, or 'singleton'. The singleton for private use is x. Note that each subtag after the singleton can be at most 8 characters long, though you can use multiple subtags.
Private-use subtags should be used with great care, and avoided whenever possible, since they interfere with the interoperability that BCP 47 exists to promote.
As an example of a private-use subtag, en-US-x-twain may identify a specific type of US English, but only within a closed community. Outside of that private agreement, its meaning cannot be relied upon.
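The length rule for private-use subtags is easy to check in code. The sketch below uses hypothetical function names and checks only the part of the tag that follows the x singleton (one to eight letters or digits per subtag); it does not validate the rest of the tag.

```python
import re


def private_use_subtags(tag):
    """Return the subtags that follow the 'x' singleton, or [] if none."""
    subtags = tag.split("-")
    lowered = [s.lower() for s in subtags]
    if "x" not in lowered:
        return []
    return subtags[lowered.index("x") + 1:]


def has_valid_private_use(tag):
    """True if the tag has a private-use section and each of its
    subtags is 1 to 8 letters or digits long."""
    parts = private_use_subtags(tag)
    return bool(parts) and all(
        re.fullmatch(r"[A-Za-z0-9]{1,8}", part) for part in parts
    )


print(private_use_subtags("en-US-x-twain"))               # ['twain']
print(has_valid_private_use("en-US-x-twain"))             # True
print(has_valid_private_use("en-x-averyverylongsubtag"))  # False
```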
Grandfathered tags are special cases, provided for backwards compatibility. They are tags that were registered before RFC 4646 that cannot be completely composed from the subtags in the current registry, or do not fit the syntax currently defined for language tags.
Nearly all grandfathered tags have been superseded by subtags or combinations of subtags in the registry. Such grandfathered tags are now deprecated, and usually contain a Preferred-Value field that indicates how you ought to represent that language instead. For instance, the entry in the registry for the grandfathered tag art-lojban indicates that you should use the jbo language subtag instead.
Note that you should not use additional subtags with a grandfathered tag.