Choosing a Language Tag

Intended audience: XHTML/HTML coders (using editors or scripting), script developers (PHP, JSP, etc.), CSS coders, schema developers (DTDs, XML Schema, RelaxNG, etc.), XSLT developers, Web project managers, and anyone who needs guidance on how to construct language tags.

Question

Which language tag is right for me? How do I choose language and other subtags?

Background

In HTML and XML documents a language tag is used to indicate the language of content.

A language tag is composed of one or more subtags separated by hyphens. Subtags can be of various types.

Language tag syntax is defined by the IETF's BCP 47. In the past it was necessary to consult lists of codes in various ISO standards to find the right subtags, but now you only need to look in the IANA Language Subtag Registry, which is described below.

BCP stands for 'Best Current Practice', and is a persistent name for a series of RFCs whose numbers change as they are updated. The latest RFC describing language tag syntax is RFC 5646, Tags for the Identification of Languages, and it obsoletes the older RFCs 4646, 3066 and 1766.

This article provides advice on how to choose the components of a language tag. For an overview of the concepts defined in BCP 47, see Language tags in HTML and XML.

Addison Phillips and Mark Davis, authors of BCP 47, provided guidance during the writing of this article.

Answer

Accessing the subtag registry

All the subtags you will need to create a language tag are found in one place, the IANA Language Subtag Registry. The registry is a long text file, containing nearly 8,000 entries.

The notes on this page provide guidance that is sufficient for most people wanting to use language tags. For people who want to read the full text of the specification, pointers to the relevant sections of BCP 47 are given throughout the article.

Note, also, that some environments or systems may dictate choices that are different from what you would otherwise expect. For example, Java has traditionally required iw (deprecated in BCP 47) in place of he (recommended in BCP 47).

The first (and often only) subtag in a language tag always designates a language. It is referred to in BCP 47 as the primary language subtag. We will use that term in this document for the subtag that represents a language, to distinguish it clearly from 'language tag', which refers to the whole thing.

To find a primary language subtag, search the registry for the name of the language. For example, if you want to label something as French, searching for 'French' in the registry will bring you to a record that looks like this:

%%
Type: language
Subtag: fr
Description: French
Added: 2005-10-16
Suppress-Script: Latn
%%

Your search will have matched against the Description field. Check that the Type field of this record is language. What you are looking for is the value in the Subtag field, i.e. fr.
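The lookup described above can also be scripted. The following is a minimal sketch, not an official tool: it assumes the registry text uses the layout shown in the French record (records separated by %% lines, one "Field: value" pair per line, continuation lines indented), and the function names parse_records and find_subtag are hypothetical.

```python
# Sketch only: parse records laid out like the French example above.
SAMPLE = """\
%%
Type: language
Subtag: fr
Description: French
Added: 2005-10-16
Suppress-Script: Latn
%%
"""

def parse_records(text):
    """Return one dict per record; each field maps to a list of values."""
    records, current, last_key = [], {}, None
    for line in text.splitlines():
        if line.strip() == "%%":
            # A separator line closes the record collected so far.
            if current:
                records.append(current)
            current, last_key = {}, None
        elif line[:1] in (" ", "\t") and last_key:
            # Indented line: continuation of the previous field's value.
            current[last_key][-1] += " " + line.strip()
        elif ":" in line:
            key, _, value = line.partition(":")
            last_key = key.strip()
            current.setdefault(last_key, []).append(value.strip())
    if current:
        records.append(current)
    return records

def find_subtag(records, description, record_type="language"):
    """Return the Subtag of the first record of the given type whose
    Description matches exactly (case-insensitive), or None."""
    wanted = description.lower()
    for record in records:
        if record_type not in record.get("Type", []):
            continue
        if any(d.lower() == wanted for d in record.get("Description", [])):
            return record.get("Subtag", [None])[0]
    return None

records = parse_records(SAMPLE)
print(find_subtag(records, "French"))  # fr
```

Checking the Type field in find_subtag mirrors the advice above: the same description can appear in records of other types, so the match on Description alone is not enough.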

The rest of this article will provide advice for choosing primary language subtags and, where needed, other types of subtag. Note that not all the decisions about how to create a language tag are straightforward. There are circumstances where usage will dictate which of various possibilities you should follow.

There are tools available which provide additional help while searching the registry, such as the Language Subtag Lookup tool.

Decision 1: The primary language subtag

You always start by choosing a primary language subtag, and often this is all you'll need for your language tag.

Always bear in mind that the golden rule is to keep your language tag as short as possible. Only add further subtags to your language tag if they are needed to distinguish the language from something else in the context where your content is used.

When looking for a primary language subtag, there are a number of things to bear in mind.

In the past, when dealing with lists of ISO codes, there were sometimes multiple codes for a given language: there could be a 2-letter code and one or two 3-letter codes. This ambiguity is resolved by the IANA Language Subtag Registry, which lists only one code per language. (If an ISO 2-letter code exists, that will be the code; otherwise it will be a 3-letter code.) The registry maintainer also coordinates the ongoing evolution of the registry with developments in the ISO world.

Decision 2: Extended language subtags

The BCP 47 specification allows for an additional, 3-letter subtag immediately after the initial primary language subtag. This is called an extended language subtag (abbreviated to extlang). Only a relatively small number of extended language subtags are defined, and they each need to be used with a specific primary language subtag (given in the Prefix field of the entry for the extended language subtag in the registry).

Currently only seven primary language subtags can be used with extended language subtags. Six of those have a Scope field set to macrolanguage in the registry (ar, kok, ms, sw, uz, and zh), and the other is sgn.

Consider the following:

As an example of usage, Unicode's CLDR database uses the macrolanguage subtag zh to represent Mandarin Chinese and ku to represent Kurdish. Thus for Mandarin Chinese you would use zh, not cmn, and for Northern Kurdish you would use ku-Latn, not kmr-Latn. The CLDR database, however, does not use extended language subtags, so you would need to use yue for Cantonese, not zh-yue.
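For systems that, like CLDR, do not use extended language subtags, the conversion from a form such as zh-yue to yue is mechanical. The sketch below assumes a tiny hand-copied excerpt of extlang subtags and their Prefix values (the table and the function name drop_extlang_prefix are hypothetical; a real implementation would read the Prefix fields from the registry). It does not decide whether zh or cmn is the better choice for Mandarin; that remains a matter of usage, as described above.

```python
# Hypothetical excerpt: extlang subtag -> the primary language subtag it
# must follow (the Prefix field of its registry entry).
EXTLANG_PREFIX = {"cmn": "zh", "yue": "zh", "arb": "ar"}

def drop_extlang_prefix(tag):
    """Rewrite 'prefix-extlang-...' as 'extlang-...' when the pair is known."""
    parts = tag.split("-")
    if len(parts) >= 2:
        primary, candidate = parts[0].lower(), parts[1].lower()
        if EXTLANG_PREFIX.get(candidate) == primary:
            # Keep the extlang as the new primary language subtag.
            return "-".join(parts[1:])
    return tag

print(drop_extlang_prefix("zh-yue"))       # yue
print(drop_extlang_prefix("zh-cmn-Hans"))  # cmn-Hans
print(drop_extlang_prefix("zh-Hans"))      # zh-Hans (no extlang present)
```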

Decision 3: Script subtags

Script subtags should only be used as part of a language tag when the script adds some useful distinguishing information to the tag. Usually this is because a language is written in more than one script, or because the content has been transcribed into a script that is unusual for the language (so one might tag Russian transcribed into the Latin script with a tag such as ru-Latn).

Read more in the BCP 47 spec:

2.2.3 Script Subtag

4.1 Choice of Language Tag

Script subtags are always 4 letters, and must come after any language or extended language subtag, but before any other subtags.
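One way to avoid redundant script subtags is to use the Suppress-Script field seen in the French record earlier, which names the script that is assumed by default for a language. The following sketch assumes a small hand-copied table of Suppress-Script values and tags without extended language subtags; the table and the function name strip_redundant_script are hypothetical.

```python
# Hypothetical excerpt of Suppress-Script values taken from registry records.
SUPPRESS_SCRIPT = {"fr": "Latn", "en": "Latn", "ru": "Cyrl"}

def strip_redundant_script(tag):
    """Drop a script subtag that only repeats the language's default script.

    Assumes the script, if present, is the second subtag (no extlang).
    """
    parts = tag.split("-")
    # A script subtag is exactly four letters.
    if len(parts) >= 2 and len(parts[1]) == 4 and parts[1].isalpha():
        if SUPPRESS_SCRIPT.get(parts[0].lower()) == parts[1].title():
            del parts[1]
    return "-".join(parts)

print(strip_redundant_script("fr-Latn-CA"))  # fr-CA
print(strip_redundant_script("ru-Latn"))     # ru-Latn (useful distinction)
print(strip_redundant_script("ru-Cyrl"))     # ru
```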

Here are things to look out for when choosing a script subtag.

Decision 4: Region subtags

Region subtags associate the language subtag you have chosen with a particular region of the world. Region subtags must come after any language or script subtags.

As with script subtags, you should only use a region subtag if it contributes information needed in a particular context to distinguish this language tag from another one; otherwise leave it out.

Read more in the BCP 47 spec:

2.2.4 Region Subtag

4.1 Choice of Language Tag

For example, en-GB might be a useful distinction for spell-checking, but the region subtag in ja-JP is unlikely to be useful unless you are intentionally contrasting it with Japanese spoken in other parts of the world.

There are two types of region subtag: 2-letter codes and 3-digit codes. The latter tend to identify multinational regions, rather than specific countries. For example, es-ES means Spanish as spoken in Spain, whereas es-419 means Spanish as spoken in Latin America.

Decision 5: Variant subtags

Again, only use variant subtags when there is a need to distinguish this language tag from another similar one in the context in which your content is used.

Read more in the BCP 47 spec:

2.2.5 Variant Subtags

4.1 Choice of Language Tag

Variant subtags describe additional distinctions not captured by the other subtags. Typically these are dialects, written variations (such as spelling reforms), transcriptions, and the like. A variant subtag is usually five to eight characters long and can contain letters and/or digits. A few four-digit subtags (usually representing a year) are also registered. Variant subtags must come after any language, script, and region subtags.
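The ordering and length rules described so far (language, then extended language, script, region, and variants) can be sketched as a single pattern. This is a simplified illustration rather than the full BCP 47 grammar: it assumes at most one extended language subtag, ignores extensions, private use and grandfathered tags, and checks only the shape of each subtag, not whether it is registered. The names LANGTAG and split_tag are hypothetical.

```python
import re

# Simplified shape check: language[-extlang][-script][-region][-variant]*
LANGTAG = re.compile(
    r"(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<extlang>[A-Za-z]{3}))?"            # 3 letters
    r"(?:-(?P<script>[A-Za-z]{4}))?"             # 4 letters
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"    # 2 letters or 3 digits
    # Variants: 5-8 alphanumerics, or 4 characters starting with a digit.
    r"(?P<variants>(?:-(?:[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}))*)"
)

def split_tag(tag):
    """Return a dict of the tag's parts, or None if the shape is not accepted."""
    match = LANGTAG.fullmatch(tag)
    if match is None:
        return None
    parts = match.groupdict()
    parts["variants"] = [v for v in parts["variants"].split("-") if v]
    return parts

print(split_tag("de-CH-1996"))
# {'language': 'de', 'extlang': None, 'script': None, 'region': 'CH', 'variants': ['1996']}
print(split_tag("sl-nedis-IT"))  # None: the region must come before the variant
```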

The key thing to look out for when using variant subtags is the order in which they are used.

Decision 6: Extension subtags

Read more in the BCP 47 spec:

2.2.6 Extension Subtags

4.1 Choice of Language Tag

These single-character subtags allow for extensions to the language tag. To date, only one extension subtag has been registered. The subtag u was registered by the Unicode Consortium to add information about language or locale behavior. Many locale identifiers require additional "tailorings" or options for specific values within a language, culture, region, or other variation. This extension provides a mechanism for using these additional tailorings within language tags for general interchange.

For example, a tag such as de-DE-u-co-phonebk indicates that phonebook collation order should be used by an application, that sorted data in a document is sorted according to this collation, and so on.

The u- extension is defined in RFC 6067, which points to the Unicode Consortium's Common Locale Data Repository (CLDR) for details on the subtags that follow it. It is not defined by BCP 47.
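As a rough illustration of how the subtags after u can be read, the sketch below pulls key/value pairs out of a tag. It uses de-DE-u-co-phonebk as an illustrative tag, where co is the Unicode key for collation and phonebk the value for phonebook order. It assumes the common layout of two-character keys followed by longer value subtags, ignores attributes, and is not a complete implementation of the Unicode locale extension; the function name is hypothetical.

```python
def unicode_extension_keywords(tag):
    """Return {key: value} for the u extension of a tag, or {} if absent."""
    subtags = tag.lower().split("-")
    start = None
    for i, subtag in enumerate(subtags[1:], start=1):
        if subtag == "x":   # everything after x is private use
            break
        if subtag == "u":
            start = i
            break
    if start is None:
        return {}
    keywords, key = {}, None
    for subtag in subtags[start + 1:]:
        if len(subtag) == 1:        # next singleton ends the u extension
            break
        if len(subtag) == 2:        # two characters: a new key
            key = subtag
            keywords[key] = []
        elif key is not None:       # longer subtag: part of the key's value
            keywords[key].append(subtag)
    return {k: "-".join(v) for k, v in keywords.items()}

print(unicode_extension_keywords("de-DE-u-co-phonebk"))  # {'co': 'phonebk'}
```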

Decision 7: Private Use subtags

Private-use subtags do not appear in the subtag registry, and are chosen and maintained by private agreement between the parties that use them. They are introduced by a single-letter subtag, or 'singleton'. The singleton for private use is x. Note that each subtag after the singleton can be at most 8 characters long, though you can use multiple subtags.
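The length limit is easy to check mechanically. The following sketch assumes that private-use subtags consist of 1 to 8 ASCII letters or digits each, and that everything after the first x belongs to the private-use sequence; the function name private_use_subtags and the example tag en-US-x-twain are illustrative only.

```python
import re

# Each private-use subtag: 1 to 8 ASCII letters or digits.
PRIVATE_SUBTAG = re.compile(r"[A-Za-z0-9]{1,8}")

def private_use_subtags(tag):
    """Return the subtags after the x singleton, or [] if there is none.

    Raises ValueError if the x singleton is present but what follows it
    is empty or breaks the length rule.
    """
    subtags = tag.split("-")
    lowered = [s.lower() for s in subtags]
    if "x" not in lowered:
        return []
    private = subtags[lowered.index("x") + 1:]
    if not private:
        raise ValueError("x must be followed by at least one subtag")
    for subtag in private:
        if not PRIVATE_SUBTAG.fullmatch(subtag):
            raise ValueError(f"invalid private-use subtag: {subtag!r}")
    return private

print(private_use_subtags("en-US-x-twain"))  # ['twain']
```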

Grandfathered tags

Read more in the BCP 47 spec:

4.1 Choice of Language Tag

Grandfathered tags are special cases, provided for backwards compatibility. They are tags that were registered before RFC 4646 that cannot be completely composed from the subtags in the current registry, or do not fit the syntax currently defined for language tags.

Nearly all grandfathered tags have been superseded by subtags or combinations of subtags in the registry. Such grandfathered tags are now deprecated, and usually contain a Preferred-Value field that indicates how you ought to represent that language instead. For instance, the entry in the registry for the grandfathered tag art-lojban indicates that you should use the jbo language subtag instead.

Note that you should not use additional subtags with a grandfathered tag.
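Replacing a deprecated grandfathered tag amounts to a whole-tag lookup of its Preferred-Value. The sketch below assumes a small hand-copied table: art-lojban comes from the example above, and the other two entries are further registry examples included for illustration. The table and the function name modernize are hypothetical; a real implementation would read the Preferred-Value fields from the registry.

```python
# Hypothetical excerpt: grandfathered tag -> Preferred-Value from the registry.
PREFERRED_VALUE = {
    "art-lojban": "jbo",
    "i-klingon": "tlh",
    "zh-min-nan": "nan",
}

def modernize(tag):
    """Return the preferred replacement for a grandfathered tag.

    The lookup is on the whole tag (case-insensitive), because a
    grandfathered tag is not meant to be combined with other subtags.
    Unknown tags are returned unchanged.
    """
    return PREFERRED_VALUE.get(tag.lower(), tag)

print(modernize("art-lojban"))  # jbo
print(modernize("en-GB"))       # en-GB
```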