Choosing a language tag

Answer

Accessing the subtag registry

All the subtags you will need to create a language tag are found in one place, the IANA Language Subtag Registry. The registry is a long text file, containing nearly 8,000 entries.

The first (and often only) subtag in a language tag always designates a language. It is referred to in BCP 47 as the primary language subtag. We will use that term in this document to refer to the subtag that represents a language, to more clearly make the distinction from 'language tag', which refers to the whole thing.

To find a primary-language subtag, search the page for the name of that language. For example, if you want to label something as French, searching for 'French' in the registry will bring you to a record that looks like this:

%%
Type: language
Subtag: fr
Description: French
Added: 2005-10-16
Suppress-Script: Latn
%%

Your search will have matched against the Description field. Check that the type of this record is language. What you are looking for is the value in the Subtag field, ie. fr.
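If you would rather script this lookup than search by eye, the record format is simple enough to parse. The following is a minimal sketch, not an official tool: `SAMPLE` is a small hand-copied excerpt standing in for the full registry file, and the function names `parse_records` and `find_language_subtags` are hypothetical. It also ignores the folded continuation lines that long field values use in the real file.

```python
# SAMPLE is a tiny illustrative excerpt, not the full registry.
SAMPLE = """\
%%
Type: language
Subtag: fr
Description: French
Added: 2005-10-16
Suppress-Script: Latn
%%
Type: region
Subtag: FR
Description: France
Added: 2005-10-16
"""


def parse_records(text):
    """Split registry text on '%%' separators and return one dict per record."""
    records = []
    for chunk in text.split("%%"):
        record = {}
        for line in chunk.strip().splitlines():
            name, sep, value = line.partition(":")
            if sep:
                # Fields such as Description can repeat, so collect values in lists.
                record.setdefault(name.strip(), []).append(value.strip())
        if record:
            records.append(record)
    return records


def find_language_subtags(records, name):
    """Return the Subtag of every 'language' record whose Description equals name."""
    wanted = name.lower()
    return [
        rec["Subtag"][0]
        for rec in records
        if rec.get("Type") == ["language"]
        and any(desc.lower() == wanted for desc in rec.get("Description", []))
    ]


print(find_language_subtags(parse_records(SAMPLE), "French"))
```

Filtering on the Type field mirrors the manual check described above: a search for 'France' matches a region record, not a language record, so it returns nothing here.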

The rest of this article will provide advice for choosing primary language subtags and, where needed, other types of subtag. Note that not all the decisions about how to create a language tag are straightforward. There are circumstances where usage will dictate which of various possibilities you should follow.

There are tools available which provide additional help while searching the registry, such as the Language Subtag Lookup tool.

Think about letter-case. By convention, primary language subtags are lowercase, script subtags begin with an uppercase letter, and continue with lowercase, and region subtags are uppercase. This is only a convention, however, and you are free to use whatever letter-casing you like.

On the other hand, you may be using language tags in a context where letter-case is important, such as file names on some systems. In such cases, you should ensure that you follow a consistent policy for letter-case; for any new case-sensitive system, it is recommended that you follow the BCP 47 conventions.
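The casing conventions can be applied mechanically. The sketch below uses a hypothetical helper, `conventional_case`, and decides what each subtag is purely from its position and length. It assumes the input is already a well-formed tag, and it lowercases everything after a single-letter subtag, which is an assumption that goes beyond the three conventions listed above.

```python
def conventional_case(tag):
    """Apply the conventional BCP 47 letter-casing to a hyphen-separated tag."""
    result = []
    after_singleton = False
    for index, subtag in enumerate(tag.split("-")):
        if index == 0 or after_singleton or len(subtag) not in (2, 4):
            # Primary language, variants, and anything after a singleton.
            result.append(subtag.lower())
        elif len(subtag) == 2:
            result.append(subtag.upper())        # region, e.g. 'GB'
        else:
            result.append(subtag.capitalize())   # script, e.g. 'Latn'
        if len(subtag) == 1:
            # A single-letter subtag starts an extension or private-use sequence.
            after_singleton = True
    return "-".join(result)


print(conventional_case("ZH-hant-tw"))
print(conventional_case("EN-us-X-Twain"))
```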

Decision 1: The primary language subtag

You always start by choosing a primary language subtag, and often this is all you'll need for your language tag.

Read more in the BCP 47 spec:

2.2.1 Primary Language Subtag

4.1 Choice of Language Tag

Always bear in mind that the golden rule is to keep your language tag as short as possible. Only add further subtags to your language tag if they are needed to distinguish the language from something else in the context where your content is used.

When looking for a primary language subtag, there are a number of things to bear in mind.

Ensure you have the right language. Sometimes, it pays to check a few alternatives. Mark Davis, co-author of BCP 47, writes: "Often it is not clear which language identifier to use. For example, what most people call Punjabi in Pakistan actually has the code 'lah', and formal name 'Lahnda'. There are many other cases where the same name is used for different languages, or where the name that people search for is not listed in the IANA registry."
You could look up language information in the SIL Ethnologue and cross-reference that information with Wikipedia. The Ethnologue uses the same 3-letter codes as BCP 47, but you'll need to convert BCP 47 2-letter codes to their ISO 639-3 counterparts to look up a language by code. (The Language Subtag Lookup tool does this for you.)

There are a small number of cases where different language codes are available for what many people would regard as the same language, eg. Filipino and Tagalog, or Twi and Akan. There is no indication in the registry as to which you should use, but you should try to ensure that within a single application or context you are consistent.
Avoid collections. If the record you found has a field Scope: collection, this subtag represents a group of languages that are descended from a common ancestor, are spoken in the same geographical area, or are otherwise related.
You should look for a more specific subtag for the language you are interested in. Unfortunately, the subtag registry doesn't provide any pointers for this.

You can use these subtags if there is no more specific subtag available, and it is always preferable to use one of these rather than the subtags mul (multiple languages) or und (undetermined).
Use macrolanguages with care. Some language subtags have a Scope field set to macrolanguage, ie. this primary language subtag encompasses a number of more specific primary language subtags in the registry.
For example, ku (Kurdish) is a macrolanguage that encompasses ckb (Central Kurdish), kmr (Northern Kurdish), and sdh (Southern Kurdish).

You can find the more specific (ie. the encompassed) subtags by searching the registry for Macrolanguage: <subtag_name>. Alternatively, the Language Subtag Lookup tool will automatically list these for a given macrolanguage (example).

As we recommended for the collection subtags mentioned above, in most cases you should try to use the more specific subtags, but there are a small number of important exceptions. These are situations where you should continue using a macrolanguage subtag for reasons of backward compatibility.

For example, BCP 47 explains that zh (the macrolanguage subtag for Chinese) doesn't specify which of the many, sometimes mutually unintelligible, dialects of Chinese is meant. In practice, however, convention overwhelmingly associates the macrolanguage subtag with the predominant language among the encompassed subtags, in this case cmn (Mandarin Chinese). If your application identified Mandarin Chinese in the past using the language tag zh-CN (Chinese as used in Mainland China), or even just zh, you can continue to use zh in this way. Using cmn or cmn-CN may cause serious compatibility problems if the software or users expect a tag such as zh.

If, on the other hand, you are using zh to refer to another Chinese dialect such as Hakka, you should use the language subtag hak instead.
Avoid deprecated subtags. If the subtag record contains a Deprecated field you shouldn't use this subtag. Usually the registry will indicate which alternative you should use in the Preferred-Value field. For example, the subtag record for iw (Hebrew) contains the two following fields:
```
Deprecated: 1989-01-01
Preferred-Value: he
```
This indicates that you should use the subtag he for Hebrew instead.

In the past, when dealing with lists of ISO codes, there were sometimes multiple codes for a given language - there could be a 2-letter code and one or two 3-letter codes. This ambiguity is resolved by the IANA Subtag Registry: only one code is listed per language. (If an ISO 2-letter code exists, that will be the code, otherwise it will be a three-letter code.) The registry maintainer also coordinates the ongoing evolution of the registry with developments in the ISO world.
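Two of the checks in the list above, finding the languages encompassed by a macrolanguage and replacing a deprecated subtag, amount to simple lookups on registry fields. The following sketch assumes the records have already been read into dictionaries; `RECORDS` is a hand-copied excerpt covering only the examples in this section, and the function names are hypothetical.

```python
# Illustrative excerpt only; a real application would load the full registry.
RECORDS = [
    {"Type": "language", "Subtag": "iw",
     "Deprecated": "1989-01-01", "Preferred-Value": "he"},
    {"Type": "language", "Subtag": "he"},
    {"Type": "language", "Subtag": "ku", "Scope": "macrolanguage"},
    {"Type": "language", "Subtag": "ckb", "Macrolanguage": "ku"},
    {"Type": "language", "Subtag": "kmr", "Macrolanguage": "ku"},
    {"Type": "language", "Subtag": "sdh", "Macrolanguage": "ku"},
]


def preferred_language(subtag, records=RECORDS):
    """Return the subtag to use, following Preferred-Value if it is deprecated."""
    for rec in records:
        if rec["Type"] == "language" and rec["Subtag"] == subtag.lower():
            if "Deprecated" in rec:
                # None means the record names no replacement.
                return rec.get("Preferred-Value")
            return rec["Subtag"]
    raise KeyError(subtag)


def encompassed_by(macrolanguage, records=RECORDS):
    """List the language subtags whose Macrolanguage field names this subtag."""
    return sorted(
        rec["Subtag"] for rec in records
        if rec.get("Macrolanguage") == macrolanguage
    )


print(preferred_language("iw"))
print(encompassed_by("ku"))
```

Whether to use a macrolanguage or one of its encompassed subtags remains a judgment call, as the discussion of zh above shows; the lookup only tells you what the options are.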

Decision 2: Extended language subtags

The BCP 47 specification allows for an additional, 3-letter subtag immediately after the initial primary language subtag. This is called an extended language subtag (abbreviated to extlang). Only a relatively small number of extended language subtags are defined, and they each need to be used with a specific primary language subtag (given in the Prefix field of the entry for the extended language subtag in the registry).

Read more in the BCP 47 spec:

2.2.2 Extended Language Subtags

4.1.2 Using Extended Language Subtags

4.1.1 Tagging Encompassed Languages

Currently only seven primary language subtags can be used with extended language subtags. Six of those have a Scope field set to macrolanguage in the registry (ar, kok, ms, sw, uz, and zh), and the other is sgn.

Consider the following:

Where possible, use a single language subtag, rather than the language+extlang pair. There is always a 3-letter subtag that is equivalent to any language+extlang pairing, and it is always the same as the extlang subtag. For example, zh-yue (Cantonese Chinese) can also be expressed with the single subtag yue.

The only significant exception is where the language+extlang sequence is established practice for the system you are working with; that is, where zh-yue would be preferred rather than yue to maintain backwards compatibility.
Take into account legacy usage for predominant languages. In the section about primary language subtags we talked about the predominant language in the set of languages encompassed by a macrolanguage. We said that, to support legacy usage for a given application, it is generally better to use the macrolanguage subtag for the predominant language, rather than the more specific subtag. For example, in such cases ar (the Arabic macrolanguage subtag) may be more appropriate for Standard Arabic than arb (the more specific, encompassed subtag that means Standard Arabic).
Similarly, when dealing with the predominant language in the set, it is generally better for backwards compatibility if you replace the language+extlang sequence by just dropping the extlang, rather than using the extlang code as a primary language subtag. For example, reducing ms-zsm to ms (Malay macrolanguage subtag) may sometimes be better than replacing it with zsm (Standard Malay).

As an example of usage, Unicode's CLDR database uses the macrolanguage subtags zh to represent Mandarin Chinese and ku to represent Kurdish. Thus for Mandarin Chinese you would use zh, not cmn, and for Northern Kurdish you would use ku-Latn, not kmr-Latn. The CLDR database, however, does not use extended language subtags, so you would need to use yue for Cantonese, not zh-yue.
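The two ways of shortening a language+extlang pair described above can be sketched as follows. `EXTLANG_PREFIX` is a small assumed excerpt mapping an extlang subtag to the value of its Prefix field, and both function names are hypothetical. Which function is appropriate depends on the legacy considerations discussed in this section; the code cannot make that choice for you.

```python
# Assumed excerpt: extlang subtag -> the primary language in its Prefix field.
EXTLANG_PREFIX = {"yue": "zh", "cmn": "zh", "zsm": "ms", "arb": "ar"}


def _has_extlang(subtags):
    """True if the second subtag is an extlang registered for the first."""
    return (
        len(subtags) >= 2
        and EXTLANG_PREFIX.get(subtags[1].lower()) == subtags[0].lower()
    )


def use_single_subtag(tag):
    """Rewrite language+extlang to the extlang alone, e.g. zh-yue -> yue."""
    subtags = tag.split("-")
    if _has_extlang(subtags):
        return "-".join([subtags[1].lower()] + subtags[2:])
    return tag


def use_macrolanguage(tag):
    """Drop the extlang and keep the macrolanguage, e.g. ms-zsm -> ms."""
    subtags = tag.split("-")
    if _has_extlang(subtags):
        return "-".join([subtags[0]] + subtags[2:])
    return tag


print(use_single_subtag("zh-yue-HK"))
print(use_macrolanguage("ms-zsm"))
```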

Decision 3: Script subtags

Script subtags should only be used as part of a language tag when the script adds some useful distinguishing information to the tag. Usually this is because a language is written in more than one script or because the content has been transcribed into a script that is unusual for the language (so one might tag Russian transcribed into the Latin script with a tag such as ru-Latn).

Read more in the BCP 47 spec:

2.2.3 Script Subtag

4.1 Choice of Language Tag

Script subtags are always 4 letters, and must come after any language or extended language subtag, but before any other subtags.

Here are things to look out for when choosing a script subtag.

Don't automatically use for non-written content. Content such as audio recordings should not need one of the usual script subtags. For example, a movie that is subtitled in Uzbek using the Arabic (rather than Latin) script might label the subtitles uz-Arab, but the Arab script subtag would not be relevant for an audio track.
The script subtag Zxxx could be used for non-written content, eg. uz-Zxxx, as Zxxx is the Code for unwritten documents, but again this is only useful if such a distinction has to be made clear.
Check for Suppress-Script fields in the language subtag. Some language subtags have a Suppress-Script field set to a given script subtag. For example, the entry in the registry for en (English) contains:
Suppress-Script: Latn

meaning that you should not use the Latn (Latin) script subtag with this language.

This is because nearly all English documents are written in the Latin script and it adds no distinguishing information. However, if a document were written in English mixing Latin script with another script such as Braille (Brai), then it might be appropriate to indicate both scripts to aid in content selection (eg. for the application of style rules).

Note, however, that not all language subtags that are strongly associated with a given script have Suppress-Script fields. You should not assume that you need to use a script subtag if a Suppress-Script field is absent.
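The Suppress-Script check can be sketched as a small clean-up step. `SUPPRESS_SCRIPT` below is an assumed excerpt holding only the two Suppress-Script values quoted in this article, and `drop_suppressed_script` is a hypothetical helper. For simplicity it assumes the script subtag, if present, directly follows the primary language subtag (that is, no extended language subtag is used).

```python
# Assumed excerpt: language subtag -> value of its Suppress-Script field.
SUPPRESS_SCRIPT = {"en": "Latn", "fr": "Latn"}


def drop_suppressed_script(tag):
    """Remove a script subtag that the registry marks as redundant."""
    subtags = tag.split("-")
    if len(subtags) >= 2 and len(subtags[1]) == 4 and subtags[1].isalpha():
        language = subtags[0].lower()
        script = subtags[1].capitalize()
        if SUPPRESS_SCRIPT.get(language) == script:
            del subtags[1]
    return "-".join(subtags)


print(drop_suppressed_script("en-Latn-GB"))
print(drop_suppressed_script("ru-Latn"))
```

A language that is missing from the table is left untouched, which matches the caution above: the absence of a Suppress-Script field says nothing about whether a script subtag is needed.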

Decision 4: Region subtags

Region subtags associate the language subtag you have chosen with a particular region of the world. Region subtags must come after any language or script subtags.

Like script subtags, you should only use a region subtag if it contributes information needed in a particular context to distinguish this language tag from another one; otherwise leave it out.

Read more in the BCP 47 spec:

2.2.4 Region Subtag

4.1 Choice of Language Tag

For example, en-GB might be a useful distinction for spell-checking, but the region subtag in ja-JP is unlikely to be useful unless you are intentionally contrasting it with Japanese spoken in other parts of the world.

There are two types of region subtag: 2-letter codes and 3-digit codes. The latter tend to identify multinational regions, rather than specific countries. For example, es-ES means Spanish as spoken in Spain, whereas es-419 means Spanish as spoken in Latin America.
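The two shapes of region subtag are easy to tell apart programmatically. This sketch, with the hypothetical helper `region_kind`, checks only the shape of the subtag; it does not confirm that the code is actually registered.

```python
import re


def region_kind(subtag):
    """Classify a region subtag by shape; returns None if it is neither shape."""
    if re.fullmatch(r"[A-Za-z]{2}", subtag):
        return "2-letter code"
    if re.fullmatch(r"[0-9]{3}", subtag):
        return "3-digit code"
    return None


for tag in ("es-ES", "es-419"):
    region = tag.split("-")[1]
    print(tag, region_kind(region))
```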

Avoid deprecated subtags. Check that the subtag you intend to use isn't deprecated. In the same way as for other types of subtag, the registry will normally tell you what the replacement should be via the Preferred-Value field.

In some cases there is no Preferred-Value field in a deprecated record, but sometimes the Comments field contains advice. For example, under YU (Yugoslavia) you will find:
```
Deprecated: 2003-07-23
Comments: see BA, HR, ME, MK, RS, or SI
```

Decision 5: Variant subtags

Again, only use variant subtags when there is a need to distinguish this language tag from another similar one in the context in which your content is used.

Read more in the BCP 47 spec:

2.2.5 Variant Subtags

4.1 Choice of Language Tag

Variant subtags describe additional distinctions not captured by the other subtags. Typically these are dialects, written variations (such as spelling reforms), transcriptions, and the like. A variant subtag is usually five to eight characters long and can contain letters and/or digits. A few four-digit subtags (usually representing a year) are also registered. Variant subtags must come after any language, script, and region subtags.

The key thing to look out for when using variant subtags is the order in which they are used.

Check the context and ordering for variant subtags. Most variant subtag records in the registry have one or more Prefix fields. The prefixes indicate with which subtags it is usually appropriate to use this variant. For example, pinyin should generally be used in a language tag that also contains either the subtags zh and Latn or the subtags bo and Latn, since the entry for pinyin contains the following:
```
Prefix: zh-Latn
Prefix: bo-Latn
```
If you have a good reason, you could use a variant subtag with different subtags, eg. cmn-Latn-pinyin would be a perfectly legal way to say Mandarin Chinese written with pinyin.

Although zh, bo and Latn are specified, this is a minimum requirement. It is also possible to include other subtags, such as a region subtag, in the language tag (where appropriate), eg. zh-Latn-CN-pinyin.

Amongst other prefix fields, the entry for variant subtag 1994 contains
```
Prefix: sl-rozaj-biske
```
which indicates that it should be used in a language tag that already contains two other variant subtags, rozaj and biske. Any variant subtag specified in a prefix field should come before the variant you have just looked up.

There are some variant subtags that have no prefix field, eg. fonipa (International Phonetic Alphabet). Such variants should appear after any other variant subtags with prefix information.

If you plan to use more than one variant without a prefix, order them in terms of decreasing significance. If they are equally significant, order them alphabetically. This will aid interoperability.
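The Prefix guidance above can be turned into a simple check. In this sketch `VARIANT_PREFIXES` is an assumed excerpt containing only the Prefix values quoted in this section (the real entry for 1994 has others), and `follows_prefix` is a hypothetical helper. It treats a prefix as satisfied when all of its subtags appear, in order, somewhere before the variant, so extra subtags such as a region are allowed. A False result is a prompt to double-check the tag, not proof that it is invalid.

```python
# Assumed excerpt: variant subtag -> values of its Prefix fields.
VARIANT_PREFIXES = {
    "pinyin": ["zh-Latn", "bo-Latn"],
    "1994": ["sl-rozaj-biske"],
    "fonipa": [],  # no Prefix field
}


def _is_subsequence(needle, haystack):
    """True if every item of needle occurs in haystack in the same order."""
    remaining = iter(haystack)
    return all(item in remaining for item in needle)


def follows_prefix(tag, variant, prefixes=VARIANT_PREFIXES):
    """Check that the subtags before the variant satisfy one of its prefixes."""
    subtags = [s.lower() for s in tag.split("-")]
    variant = variant.lower()
    if variant not in subtags:
        return False
    before = subtags[:subtags.index(variant)]
    candidates = prefixes[variant]
    if not candidates:
        return True  # nothing to check for variants without a Prefix field
    return any(
        _is_subsequence(prefix.lower().split("-"), before)
        for prefix in candidates
    )


print(follows_prefix("zh-Latn-CN-pinyin", "pinyin"))
print(follows_prefix("cmn-Latn-pinyin", "pinyin"))
```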

Decision 6: Extension subtags

Read more in the BCP 47 spec:

2.2.6 Extension Subtags

4.1 Choice of Language Tag

These single-character subtags allow for extensions to the language tag. There are currently two extension subtags registered.

The subtag u was registered by the Unicode Consortium to add information about language or locale behavior. Many locale identifiers require additional "tailorings" or options for specific values within a language, culture, region, or other variation. This extension provides a mechanism for using these additional tailorings within language tags for general interchange.

For example, the following indicates that phonebook collation order should be used by an application, that sorted data in a document is sorted according to this collation, and so on.

de-DE-u-co-phonebk

The u- extension is defined in RFC 6067, which points to the Unicode Consortium's Common Locale Data Repository (CLDR) for details on the subtags that follow it. It is not defined by BCP 47.

The subtag t was also registered by the Unicode Consortium and is used for transformed content. It allows for the specification of content that has been transliterated, transcribed, or translated, or in some other way influenced by the source language. For example, the following indicates Ukrainian text that has been transcribed from Cyrillic to Latin:

uk-Latn-t-uk-cyrl

The t- extension is defined in RFC 6497 and, like the u- extension, the details of the subtags that follow it are provided by the Unicode Consortium's CLDR.

Additionally, the field separator subtag m0 can be used before certain subtags to denote a specific version or variant. For example, the following indicates a dated version of the UNGEGN transliteration specification for Hebrew to Latin:

he-Latn-t-he-hebr-m0-UNGEGN-2007
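To see how the extension examples above are structured, it can help to split a tag at its single-character subtags. The following is a rough sketch using a hypothetical helper, `split_extensions`; it assumes a well-formed tag and does not interpret the subtags that follow u or t, whose meaning is defined by CLDR.

```python
def split_extensions(tag):
    """Return (base tag, {singleton: [subtags that follow it]})."""
    base, extensions, current = [], {}, None
    for index, subtag in enumerate(tag.split("-")):
        if index > 0 and len(subtag) == 1 and current != "x":
            # A single-character subtag opens a new extension sequence.
            # After 'x', everything belongs to the private-use sequence.
            current = subtag.lower()
            extensions[current] = []
        elif current is None:
            base.append(subtag)
        else:
            extensions[current].append(subtag)
    return "-".join(base), extensions


print(split_extensions("de-DE-u-co-phonebk"))
print(split_extensions("he-Latn-t-he-hebr-m0-UNGEGN-2007"))
```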

Decision 7: Private Use subtags

Read more in the BCP 47 spec:

2.2.7 Private Use Subtags

4.1 Choice of Language Tag

Private-use subtags do not appear in the subtag registry, and are chosen and maintained by private agreement between the parties that use them. They are introduced by a single letter subtag, or 'singleton'. The singleton for private use is x. Note that each subtag after the singleton can be at most 8 characters long, though you can use multiple subtags.

Private use subtags should be used with great care, and avoided whenever possible, since they interfere with the interoperability that BCP 47 exists to promote.

For example, the tag en-US-x-twain may identify a specific type of US English, but only within a closed community. Outside of that private agreement, its meaning cannot be relied upon.
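The length limit on private-use subtags is straightforward to enforce. This sketch uses a hypothetical helper, `private_use_subtags`, which returns the subtags after the x singleton and raises an error if any of them is empty, longer than 8 characters, or contains something other than ASCII letters and digits. It assumes x is not also used as an ordinary subtag earlier in the tag.

```python
import re

_PRIVATE_SUBTAG = re.compile(r"[A-Za-z0-9]{1,8}")


def private_use_subtags(tag):
    """Return the subtags after the 'x' singleton, or [] if there is none."""
    parts = tag.split("-")
    lowered = [part.lower() for part in parts]
    if "x" not in lowered:
        return []
    tail = parts[lowered.index("x") + 1:]
    if not tail or not all(_PRIVATE_SUBTAG.fullmatch(part) for part in tail):
        raise ValueError(f"malformed private-use sequence in {tag!r}")
    return tail


print(private_use_subtags("en-US-x-twain"))
```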

Grandfathered tags

Read more in the BCP 47 spec:

4.1 Choice of Language Tag

Grandfathered tags are special cases, provided for backwards compatibility. They are tags that were registered before RFC 4646 that cannot be completely composed from the subtags in the current registry, or do not fit the syntax currently defined for language tags.

Nearly all grandfathered tags have been superseded by subtags or combinations of subtags in the registry. Such grandfathered tags are now deprecated, and usually contain a Preferred-Value field that indicates how you ought to represent that language instead. For instance, the entry in the registry for the grandfathered tag art-lojban indicates that you should use the jbo language subtag instead.

Note that you should not use additional subtags with a grandfathered tag.
