Language tags in HTML and XML

If you want step-by-step guidance for choosing a language tag, you should read Choosing a language tag. What follows here provides more of a high-level overview of the syntax and concepts involved in language tags, as described by BCP 47.

Overview

Terminology

In this article we refer to the value of a language attribute such as fr-CA as a language tag. The fr and CA parts are referred to as subtags when described as parts of a tag. When described as members of an ISO list of languages or countries, fr and CA are referred to as codes.

Language tags are used to indicate the language of text or other items in HTML and XML documents. Use the lang attribute to specify language tags in HTML, and the xml:lang attribute for XML.

In both cases, language information is inherited by elements inside the one where the declaration was made, unless one of those elements declares a different language (in the same way).

RFCs are what the IETF calls its specifications. Each RFC has a unique number. Unfortunately, when reading RFC 1766 or RFC 3066, it is not possible to tell that these specifications have been obsoleted and replaced by others.

Language tag syntax is defined by the IETF's BCP 47. BCP stands for 'Best Current Practice', and is a persistent name for a series of RFCs whose numbers change as they are updated. The latest RFC describing language tag syntax is RFC 5646, Tags for the Identification of Languages, and it obsoletes the older RFCs 4646, 3066 and 1766 (see the By the way section for more information).

You used to find subtags by consulting the lists of codes in various ISO standards, but now you can find all subtags in the IANA Language Subtag Registry. We will describe the new registry below.

Most language tags consist of a two- or three-letter language subtag. Often this is followed by a two-letter or three-digit region subtag. RFC 5646 also allows for a number of additional subtags, where needed. These will be explained briefly in the next section, and include extended language, script, variant, extension and private-use subtags.

The golden rule when creating language tags is to keep the tag as short as possible. Avoid region, script or other subtags except where they add useful distinguishing information. For instance, use ja for Japanese and not ja-JP, unless there is a particular reason that you need to say that this is Japanese as spoken in Japan, rather than elsewhere.

Examples:

Tag	Language	Subtags
`en`	English	language
`mas`	Maasai	language
`fr-CA`	French as used in Canada	language+region
`es-419`	Spanish as used in Latin America	language+region
`zh-Hans`	Chinese written with Simplified script	language+script

The entries in the registry follow certain conventions with regard to upper and lower case. For example, language subtags are lower case, alphabetic region subtags are upper case, and script subtags begin with an initial capital. This is only a convention! When you use these subtags you are free to do as you like, unless you are constrained by the rules of the system you are working with. For HTML and XML language markup, the case should not matter.
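The casing conventions can be applied mechanically. The following Python sketch shows one way to do so. The function name conventional_case is invented for this illustration, and the rule it applies (lower case by default, two-letter subtags after the first in upper case, four-letter alphabetic subtags with an initial capital, everything after a singleton in lower case) is a simplified reading of the conventions described above, not a full implementation of the specification.

```python
def conventional_case(tag: str) -> str:
    """Return the tag with the registry's conventional letter-casing applied."""
    result = []
    after_singleton = False
    for index, subtag in enumerate(tag.split("-")):
        if index == 0 or after_singleton:
            # Primary language subtag, or anything inside an extension
            # or private-use sequence: lower case.
            result.append(subtag.lower())
        elif len(subtag) == 1:
            # A singleton starts an extension or private-use sequence.
            after_singleton = True
            result.append(subtag.lower())
        elif len(subtag) == 2:
            # Two-letter region subtag: upper case.
            result.append(subtag.upper())
        elif len(subtag) == 4 and subtag.isalpha():
            # Four-letter script subtag: initial capital.
            result.append(subtag.title())
        else:
            result.append(subtag.lower())
    return "-".join(result)


def same_tag(a: str, b: str) -> bool:
    """Compare two tags without regard to case."""
    return a.lower() == b.lower()
```

With this sketch, conventional_case("ZH-hant-hk") gives zh-Hant-HK, and same_tag("en-gb", "EN-GB") is true.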

HTML and XML also provide a means to prevent inheritance of language using the empty string, i.e. xml:lang="". Essentially, this says: I do not want to associate any language with this information.
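The inheritance behaviour, including the empty-string reset, can be sketched in a few lines of Python using the standard library's xml.etree.ElementTree module. The function name effective_languages and the sample document are invented for this illustration; the sketch only shows how a processor might resolve the language that applies to each element.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def effective_languages(element, inherited=None):
    """Yield (element, language) pairs; language is None when none applies."""
    declared = element.get(XML_LANG)
    if declared is None:
        language = inherited      # no declaration: inherit from the parent
    elif declared == "":
        language = None           # empty string: explicitly no language
    else:
        language = declared       # a new declaration overrides the parent
    yield element, language
    for child in element:
        yield from effective_languages(child, language)


doc = ET.fromstring(
    '<doc xml:lang="en">'
    "<p>Hello</p>"
    '<p xml:lang="fr-CA">Bonjour <em>tout le monde</em></p>'
    '<code xml:lang="">x = 1</code>'
    "</doc>"
)
resolved = [(el.tag, lang) for el, lang in effective_languages(doc)]
```

Here the first p inherits en from doc, the em inherits fr-CA from its parent p, and the code element ends up with no language at all.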

The remainder of this article provides additional detail on how to construct language tags.

Constructing language tags

The list below shows the various types of subtag that are available. We will work our way through these and how they are used in the sections that follow.

language-extlang-script-region-variant-extension-privateuse
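As a rough illustration of this ordering, the following Python sketch splits a tag into its parts using a regular expression. The names LANGTAG and split_tag are invented here, and the pattern is a simplification: it checks only the shape of each subtag (its length and whether it uses letters or digits), it allows at most one extlang subtag, it ignores grandfathered tags, and it does not check that any subtag actually exists in the registry.

```python
import re

LANGTAG = re.compile(
    r"""^
    (?P<language>[a-z]{2,3})
    (?:-(?P<extlang>[a-z]{3}))?
    (?:-(?P<script>[a-z]{4}))?
    (?:-(?P<region>[a-z]{2}|[0-9]{3}))?
    (?P<variants>(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*)
    (?P<extensions>(?:-[0-9a-wy-z](?:-[a-z0-9]{2,8})+)*)
    (?:-x-(?P<privateuse>[a-z0-9]{1,8}(?:-[a-z0-9]{1,8})*))?
    $""",
    re.IGNORECASE | re.VERBOSE,
)


def split_tag(tag):
    """Return a dict of the tag's parts, or None if the shape is not recognized."""
    match = LANGTAG.match(tag)
    if match is None:
        return None
    parts = match.groupdict()
    # Variants may repeat, so return them as a list.
    parts["variants"] = [v for v in parts["variants"].split("-") if v]
    # Extensions are returned as one raw string, singleton included.
    parts["extensions"] = parts["extensions"].lstrip("-") or None
    return parts
```

For example, split_tag("zh-Hant-HK") reports zh as the language, Hant as the script and HK as the region, while split_tag("de-DE-u-co-phonebk") reports u-co-phonebk as the extension part.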

The primary language subtag

Language subtags

en

ast

Read more in the BCP 47 spec:

2.2.1 Primary Language Subtag

4.1 Choice of Language Tag

4.1.1 Tagging Encompassed Languages

All language tags must begin with a primary language subtag.

Examples of simple, language-only language tags include:

en (English)
ast (Asturian - no two-letter code exists for Asturian in the ISO lists)

These codes come from, and are kept up to date with, ISO 639 language codes.

Because RFC 3066 didn't provide a list of valid subtags and just referred users to ISO 639, there was sometimes confusion about how to tag languages when the ISO code lists contained both two-letter and three-letter codes (and sometimes more than one three-letter code). Now all valid subtags are listed in a single IANA registry, which adopts only one value from the ISO lists per language. If a two-letter ISO code is available, this will be the one in the registry. Otherwise the registry will contain one three-letter code. This should make things simpler.

When RFC 5646 was published, over 7,000 new ISO 639-3 three-letter codes were added to the Subtag Registry.

This is an example of the primary language subtag for Spanish, es, in the registry:

%%
Type: language
Subtag: es
Description: Spanish
Description: Castilian
Added: 2005-10-16
Suppress-Script: Latn
%%
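Records in this format are straightforward to read by program. The Python sketch below parses text laid out like the example above into a list of dictionaries. The function name parse_records is invented for this illustration. Two assumptions are built in: every field value is stored in a list, because a field such as Description can occur more than once, and a line that begins with white space is treated as a continuation of the previous field.

```python
def parse_records(text):
    """Parse '%%'-separated registry records into a list of dicts of lists."""
    records = []
    current = {}
    last_field = None
    for line in text.splitlines():
        if not line.strip():
            continue                      # skip blank lines
        if line.strip() == "%%":
            if current:
                records.append(current)   # a separator closes the record
            current, last_field = {}, None
        elif line[:1] in (" ", "\t") and last_field:
            # Continuation line: append to the most recent field value.
            current[last_field][-1] += " " + line.strip()
        elif ":" in line:
            name, _, value = line.partition(":")
            last_field = name.strip()
            current.setdefault(last_field, []).append(value.strip())
    if current:
        records.append(current)
    return records


sample = """%%
Type: language
Subtag: es
Description: Spanish
Description: Castilian
Added: 2005-10-16
Suppress-Script: Latn
%%"""
records = parse_records(sample)
```

After this runs, records[0]["Description"] holds both Spanish and Castilian, and records[0]["Suppress-Script"] holds Latn.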

The codes are case insensitive. They are commonly written in lower case, but this is merely a convention.

The extended language subtag

Extlang subtags

zh-yue

ar-afb

Read more in the BCP 47 spec:

2.2.2 Extended Language Subtags

4.1.2 Using Extended Language Subtags

We will refer to extended language subtags as extlang subtags. An extlang subtag must always be preceded by a specific primary language subtag. Only one extlang subtag can appear in a language tag, and it comes before any other subtags.

Examples of language tags including extlang subtags are:

zh-yue (Cantonese Chinese)
ar-afb (Gulf Arabic)

Language+extlang combinations are provided to accommodate legacy language tag forms. However, there is a single language subtag available for every language+extlang combination. That language subtag should be used rather than the language+extlang combination, where possible. For example, use yue rather than zh-yue for Cantonese, and afb rather than ar-afb for Gulf Arabic, if you can.

Extlang subtags are always three letters long. Each extlang entry in the registry contains a Prefix field that specifies the language that must precede the extlang subtag. Entries also include a Preferred-Value field that indicates the equivalent language tag.

This is an example of the extlang code for Gulf Arabic, afb, in the registry:

%%
Type: extlang
Subtag: afb
Description: Gulf Arabic
Added: 2009-07-29
Preferred-Value: afb
Prefix: ar
Macrolanguage: ar
%%
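The Prefix and Preferred-Value fields make it possible to rewrite a language+extlang combination as the single preferred language subtag. The Python sketch below illustrates the idea. The function name prefer_language_subtag and the table EXTLANG_SAMPLE are invented for this illustration; the table holds only the two extlang subtags mentioned in this article, whereas a real application would load the full set from the registry.

```python
# Sample data only: the two extlang entries discussed in this article.
EXTLANG_SAMPLE = {
    "afb": {"Prefix": "ar", "Preferred-Value": "afb"},
    "yue": {"Prefix": "zh", "Preferred-Value": "yue"},
}


def prefer_language_subtag(tag, extlangs=EXTLANG_SAMPLE):
    """Replace a leading language+extlang pair with its Preferred-Value."""
    subtags = tag.split("-")
    if len(subtags) >= 2:
        entry = extlangs.get(subtags[1].lower())
        # Only rewrite when the extlang follows the language named in Prefix.
        if entry and entry["Prefix"] == subtags[0].lower():
            return "-".join([entry["Preferred-Value"]] + subtags[2:])
    return tag
```

With the sample table, ar-afb becomes afb and zh-yue-HK becomes yue-HK, while a tag such as zh-Hans is returned unchanged.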

Macrolanguages

The primary language subtags used with an extlang subtag are known as macrolanguages, and encompass a number of languages with more specific primary language subtags. The macrolanguage subtag can be used on its own, but unless there is some convention about its meaning in the context where it is used, it is not necessarily precise enough.

For example, zh means Chinese, but it covers many Chinese dialects, often mutually incomprehensible. When zh is used on its own, it is usually used to mean the predominant language in the encompassed range, although this is not explicitly specified in BCP 47. For example, conventionally zh is considered to represent the predominant, Mandarin form of Chinese. Where absolute clarity is needed you can use cmn instead, as long as that doesn't break interoperability. However, if you are using zh to represent a language which is not Mandarin, such as Hakka Chinese, you are better off using the explicit code (in that case, hak).

On the other hand, zh-Hans uses zh in its generic sense. This is a useful way to describe writing in Simplified Chinese, since Chinese tends to be written in the same way, regardless of the dialect of the reader.
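The relationship between a macrolanguage and the languages it encompasses is recorded in the registry's Macrolanguage field, as in the afb record shown earlier. The Python sketch below shows how that field could be used to list the more specific choices for a macrolanguage. The function name encompassed_by and the table MACROLANGUAGE_SAMPLE are invented for this illustration, and the table contains only the subtags mentioned in this article.

```python
# Sample data only: language subtag -> macrolanguage that encompasses it.
MACROLANGUAGE_SAMPLE = {
    "cmn": "zh",  # Mandarin Chinese
    "hak": "zh",  # Hakka Chinese
    "yue": "zh",  # Cantonese
    "afb": "ar",  # Gulf Arabic
}


def encompassed_by(macrolanguage, table=MACROLANGUAGE_SAMPLE):
    """Return the more specific language subtags listed under a macrolanguage."""
    wanted = macrolanguage.lower()
    return sorted(code for code, macro in table.items() if macro == wanted)
```

With the sample table, encompassed_by("zh") returns cmn, hak and yue, and a subtag that is not a macrolanguage, such as en, returns an empty list.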

The script subtag

Script subtags

zh-Hans

az-Latn

Read more in the BCP 47 spec:

2.2.3 Script Subtag

4.1 Choice of Language Tag

Examples of language tags including script subtags are:

zh-Hans (Simplified Chinese)
az-Latn (Azerbaijani, written in Latin script - since Azerbaijani can also be written using the Arabic or Cyrillic script)

The script subtag was first introduced in RFC 4646. The subtags come from, and are kept up to date with, the list of ISO 15924 script codes.

Only one script subtag can appear in a language tag, and it must immediately follow the language or any extlang subtag. It is always four letters long.

You should only use script subtags if they are necessary to make a distinction you need. As RFC 4646 co-author, Addison Phillips, writes, "For virtually any content that does not use a script subtag today, it remains the best practice not to use one in the future".

If you specifically want to indicate that content is not written, there is a subtag for that. For example, you could use en-Zxxx to make it clear that an audio recording in English is not written content.

In fact, many language subtag entries in the registry strongly discourage the use of script subtags by including a Suppress-Script field. There is such a field in the Spanish example above, which indicates that Spanish is normally written using Latin script, and so the Latn subtag should normally not be used with es.
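A tool that generates or tidies language tags could use the Suppress-Script field to drop a script subtag that adds nothing. The Python sketch below illustrates this. The function name drop_suppressed_script and the table SUPPRESS_SCRIPT_SAMPLE are invented for this illustration, the table holds only the Spanish entry quoted in this article, and the sketch assumes the script subtag directly follows the language subtag (no extlang subtag in between).

```python
# Sample data only: language subtag -> its Suppress-Script value.
SUPPRESS_SCRIPT_SAMPLE = {"es": "Latn"}


def drop_suppressed_script(tag, suppress=SUPPRESS_SCRIPT_SAMPLE):
    """Remove the script subtag when it matches the language's Suppress-Script."""
    subtags = tag.split("-")
    if len(subtags) >= 2:
        script = subtags[1]
        default = suppress.get(subtags[0].lower())
        # Script subtags are four letters long; compare without regard to case.
        if default and len(script) == 4 and script.lower() == default.lower():
            del subtags[1]
    return "-".join(subtags)
```

With the sample table, es-Latn-419 becomes es-419, while az-Latn is left alone because the table has no entry for az.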

This example shows the registry entry for Cyrillic script, Cyrl, used for languages such as Russian:

%%
Type: script
Subtag: Cyrl
Description: Cyrillic
Added: 2005-10-16
%%

Although for common uses of language tags it is not likely that you will need to specify the script, there are one or two situations that have been crying out for it for some time. One such example is Chinese. There are many Chinese dialects, often mutually unintelligible, but these dialects are all written using either Simplified or Traditional Chinese script. People typically want to label Chinese text as either Simplified or Traditional, but until recently there was no way to do so. People had to bend something like zh-CN (meaning Chinese as spoken in China) to mean Simplified Chinese, even in Singapore, and zh-TW (meaning Chinese as spoken in Taiwan) for Traditional Chinese. (Other people, however, use zh-HK for Traditional Chinese.) The availability of zh-Hans and zh-Hant for Chinese written in Simplified and Traditional scripts should improve consistency and accuracy, and these tags are already becoming widely used.

The region subtag

Region subtags

en-GB

es-005

zh-Hant-HK

Read more in the BCP 47 spec:

2.2.4 Region Subtag

4.1 Choice of Language Tag

Examples of language tags including region subtags include:

en-GB (British English)
es-005 (South American Spanish)
zh-Hant-HK (Traditional Chinese as used in Hong Kong)

The region subtag in RFC 3066 took its values from the ISO 3166 country codes. These two-letter codes are still available from the new registry, but the registry also lists 3-digit UN M.49 region codes. The advantage of these codes is that they can represent more than just countries. For example, localization groups have for some time wanted to label their carefully crafted translations as Latin-American Spanish, rather than the Spanish of any particular country. With RFC 5646 this is possible; the appropriate language tag is es-419.

Only one region subtag can appear in a language tag, and it must appear after the language subtag and any extlang and script subtags. It is a two-letter alpha or 3-digit numeric code. You can have a language code immediately followed by a region code, just as you are used to for language tags such as en-US.

Once again, you should only use region subtags if they are necessary to make a distinction you need. Unless you specifically need to highlight that you are talking about Italian as spoken in Italy, you should use the tag it for Italian, and not it-IT. The same goes for any other possible combination.

These examples from the registry show the codes for Austria, AT, and Northern Africa, 015:

%%
Type: region
Subtag: AT
Description: Austria
Added: 2005-10-16
%%
Type: region
Subtag: 015
Description: Northern Africa
Added: 2005-10-16
%%

Variant subtags

sl-nedis

sl-IT-nedis

de-CH-1901

Read more in the BCP 47 spec:

2.2.5 Variant Subtags

4.1 Choice of Language Tag

Variant subtags are values used to indicate dialects or script variations not already covered by combinations of language, script and region subtag. The variant subtags must appear after any language, script or region subtags, but are not necessarily preceded by a script or region subtag.

It is unlikely that you will need to use variant subtags unless you are working in a specialized area.

The following examples may help you understand what these subtags do.

sl-nedis (the Nadiza dialect of Slovenian)
sl-rozaj (the Rezijan dialect of Slovenian)
sl-IT-nedis (the specific variant of the Nadiza dialect of Slovenian that is spoken in Italy)
de-CH-1901 (the variant of German orthography dating from the 1901 reforms, as seen in Switzerland)

This example from the registry shows the code for the Nadiza dialect of Slovenian, nedis:

%%
Type: variant
Subtag: nedis
Description: Natisone dialect
Description: Nadiza dialect
Added: 2005-10-16
Prefix: sl
%%

In the registry these subtags are tied to a specific language (and possibly additional subtags between this subtag and the primary language subtag) by the Prefix field. The nedis example shown above should only be used with Slovenian.
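A simple check of a variant against its Prefix field might look like the Python sketch below. The function name variants_fit_prefixes and the table VARIANT_PREFIXES_SAMPLE are invented for this illustration, and the table holds only two sample entries. The matching rule used here is a simplified interpretation: the subtags of the prefix must all appear, in order, before the variant, and the primary language subtag must be the same.

```python
# Sample data only: variant subtag -> list of Prefix values.
VARIANT_PREFIXES_SAMPLE = {"nedis": ["sl"], "1901": ["de"]}


def _in_order(needle, haystack):
    """True if every item of needle occurs in haystack, in the same order."""
    remaining = iter(haystack)
    return all(item in remaining for item in needle)


def variants_fit_prefixes(tag, prefixes=VARIANT_PREFIXES_SAMPLE):
    """Check each known variant in the tag against its registered prefixes."""
    subtags = tag.lower().split("-")
    for position, subtag in enumerate(subtags):
        allowed = prefixes.get(subtag)
        if position == 0 or allowed is None:
            continue  # not a variant this table knows about
        before = subtags[:position]
        fits = any(
            prefix.lower().split("-")[0] == before[0]
            and _in_order(prefix.lower().split("-"), before)
            for prefix in allowed
        )
        if not fits:
            return False
    return True
```

With the sample table, sl-nedis, sl-IT-nedis and de-CH-1901 all pass, while fr-nedis does not.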

If you need to express a particular dialectal or script nuance that is not currently available, you should propose a variant subtag or subtags for inclusion in the registry using the registration procedure outlined in RFC 5646.

Extension and private-use subtags

Extension subtags

de-DE-u-co-phonebk

Private use subtags

en-US-x-twain

Read more in the BCP 47 spec:

2.2.7 Private Use Subtags

2.2.6 Extension Subtags

4.1 Choice of Language Tag

If you feel you really need to use these subtags, you should read the specification, rather than this article.

Extension and private use subtags are introduced by a single-letter subtag, or 'singleton'. An organization can propose a singleton for an extension. Its intended use must be described by an RFC (IETF specification). The singleton will be added to the registry if it successfully passes a review. The singleton x is reserved for private use. Multiple subtags are allowed after the singleton; however, as for all subtags, they must each be 8 characters or fewer in length.

A locale is an identifier (such as a language tag) for a set of international preferences. Usually this identifier indicates the preferred language of the user and possibly includes other information, such as a geographic region (such as a country). A locale is passed in APIs or set in the operating environment to obtain culturally-affected behavior within a system or process.

Extension subtags allow for extensions to the language tag. For example, the extension subtag u has been registered by the Unicode Consortium to add information about language or locale behavior. Many locale identifiers require additional "tailorings" or options for specific values within a language, culture, region, or other variation. This extension provides a mechanism for using these additional tailorings within language tags for general interchange.

For example, in the following tag, the u-co-phonebk extension indicates that phonebook collation order should be used by an application, that sorted data in a document is sorted according to this collation, and so on.

de-DE-u-co-phonebk

The u- extension is defined in RFC 6067, which points to the Unicode Consortium's Common Locale Data Repository (CLDR) for details on the subtags that follow it. It is not defined by BCP 47.
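To give a feel for how an application might read the u extension, the Python sketch below pulls key and value pairs such as co and phonebk out of a tag. The function name unicode_keywords is invented for this illustration. The sketch assumes that two-character subtags after the u singleton are keys and that the longer subtags following a key form its value; the authoritative rules are in the CLDR documentation, and details such as attributes are ignored here.

```python
def unicode_keywords(tag):
    """Return {key: value} pairs found in the tag's u extension, if any."""
    keywords = {}
    in_u = False
    key = None
    # Skip the primary language subtag; it is never a singleton here.
    for subtag in tag.lower().split("-")[1:]:
        if len(subtag) == 1:
            if in_u or subtag == "x":
                break  # another singleton ends the u extension
            in_u = subtag == "u"
        elif in_u and len(subtag) == 2:
            key = subtag
            keywords[key] = []
        elif in_u and key is not None:
            keywords[key].append(subtag)
    # A key with no value subtags maps to the empty string.
    return {k: "-".join(v) for k, v in keywords.items()}
```

For the tag de-DE-u-co-phonebk this returns a single entry with the key co and the value phonebk. Anything after an x singleton is private use and is not examined.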

Private-use subtags do not appear in the subtag registry, and are chosen and maintained by private agreement amongst parties.

Because these subtags are only meaningful within private agreements and cannot be used interoperably across the Web, they should be used with great care, and avoided whenever possible.

The following example of a private use subtag may identify a specific type of US English, but only within a closed community. Outside of that private agreement, its meaning cannot be relied upon.

en-US-x-twain

Grandfathered and redundant subtags

Read more in the BCP 47 spec:

2.2.8 Grandfathered and Redundant Registrations

Grandfathered tags are special cases, provided for backwards compatibility. They are tags that were registered before RFC 4646 that cannot be completely composed from the subtags in the current registry, or do not fit the syntax currently defined for language tags.

Redundant tags are language tags composed of a sequence of subtags and registered before RFC 4646 that can now be formed by combining separate subtags from the current registry. The original registrations remain in the registry mostly 'as a matter of historical curiosity'.

Many grandfathered tags have been replaced by subtags or combinations of subtags in the registry. Such grandfathered tags are now deprecated, and usually contain a Preferred-Value field that indicates how you ought to represent that language instead. For instance, the following example of a grandfathered tag indicates that you should use the jbo language subtag instead of art-lojban.

%%
Type: grandfathered
Tag: art-lojban
Description: Lojban
Added: 2001-11-11
Deprecated: 2003-09-02
Preferred-Value: jbo
%%
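Because a grandfathered tag is registered as a whole tag rather than as a sequence of subtags, replacing it is a lookup on the complete tag. The Python sketch below illustrates this. The function name replace_grandfathered and the table GRANDFATHERED_SAMPLE are invented for this illustration, and the table contains only the art-lojban entry shown above.

```python
# Sample data only: grandfathered tag -> its Preferred-Value.
GRANDFATHERED_SAMPLE = {"art-lojban": "jbo"}


def replace_grandfathered(tag, table=GRANDFATHERED_SAMPLE):
    """Return the Preferred-Value for a deprecated grandfathered tag."""
    # The lookup uses the whole tag, compared without regard to case.
    return table.get(tag.lower(), tag)
```

With the sample table, art-lojban becomes jbo, and any tag not in the table is returned as it was.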
