Language tags in HTML and XML

Intended audience: XHTML/HTML coders (using editors or scripting), script developers (PHP, JSP, etc.), schema developers (DTDs, XML Schema, RelaxNG, etc.), XSLT developers, Web project managers, standards implementers, and anyone who needs an overview of how language tags are constructed using BCP47.

Updated

Overview

Terminology

In this article we refer to the value of a language attribute such as fr-CA as a language tag. The fr and CA parts are referred to as subtags when described as parts of a tag. When described as members of an ISO list of languages or countries, fr and CA are referred to as codes.

Language tags are used to indicate the language of text or other items in HTML and XML documents. Use the lang attribute to specify language tags in HTML, and the xml:lang attribute for XML

In both cases, language information is inherited by elements inside the one where the declaration was made, unless one of those elements declares a different language (in the same way).

RFCs are what the IETF calls its specifications. Each RFC has a unique number. Unfortunately, it is not possible to tell, when reading RFC 1766 or RFC 3066 that these specifications have been obsoleted and replaced by other specifications.

Language tag syntax is defined by the IETF's BCP 47. BCP stands for 'Best Current Practice', and is a persistent name for a series of RFCs whose numbers change as they are updated. The latest RFC describing language tag syntax is RFC 5646, Tags for the Identification of Languages, and it obsoletes the older RFCs 4646, 3066 and 1766.

You used to find subtags by consulting the lists of codes in various ISO standards, but now you can find all subtags in the IANA Language Subtag Registry. We will describe the new registry below.

Note! If you want step-by-step guidance for choosing a language tag, you should read Choosing a language tag. What follows here provides more of a high-level overview of the syntax and concepts involved in language tags, as described by BCP 47.

Most language tags consist of a two- or three-letter language subtag. Often this is followed by a two-letter or three-digit region subtag. RFC 5646 also allows for a number of additional subtags, where needed. These will be explained briefly in the next section, and include extended language, script, variant, extension and private-use subtags.

The golden rule when creating language tags is to keep the tag as short as possible. Avoid region, script or other subtags except where they add useful distinguishing information. For instance, use ja for Japanese and not ja-JP, unless there is a particular reason that you need to say that this is Japanese as spoken in Japan, rather than elsewhere.

Examples:

Code Language Subtags
en English language
mas Masai language
fr-CA French as used in Canada language+region
es-419 Spanish as used in Latin America language+region
zh-Hans Chinese written with Simplified script language+script

HTML and XML also provide a means to prevent inheritance of language using the empty string, ie. xml:lang="". Essentially, this says: I do not want to associate any language with this information.

The remainder of this article provides additional detail on how to construct language tags.

Constructing language tags

Some of the key differences between RFC 5646 and earlier specifications such as RFC 3066 are:

  1. there is just one place to look for valid subtags, the new IANA registry
  2. subtags have fixed positions and lengths, which makes for easier matching of language tags
  3. there is more flexibility around the potential components of a language tag.

RFC 3066 essentially allowed you to compose language tags that were either a language code on its own, a language code plus a country code, or one of a small number of specially registered values in the IANA language tag registry.

RFC 5646 caters for more types of subtag, and allows you to combine them in various ways. While this may appear to make life much more complicated, generally speaking choosing language tags will continue to be a simple matter - however, where you need additional power it will be available to you. In fact, for most people, RFC 5646 should actually make life simpler in a number of ways – for one thing, there is only one place you need to look now for valid subtags.

Although it provides some additional options for identifying common language variations, RFC 5646 includes all of the tags that were previously valid. If you have been using RFC 1766, RFC 3066, or RFC 4646 you do not need to make any changes to your tags.

The list below shows the various types of subtag that are available. We will work our way through these and how they are used in the sections that follow.

language-extlang-script-region-variant-extension-privateuse

The entries in the registry follow certain conventions with regard to upper and lower letter-casing. For example, language tags are lower case, alphabetic region subtags are upper case, and script tags begin with an initial capital. This is only a convention! When you use these subtags you are free to do as you like, unless you are constrained by the rules of the system you are working with. For HTML and XML language markup, the case should not matter.

Using the subtag registry

As mentioned above, you used to find subtags by consulting the lists of codes in various ISO standards, but now you can find all subtags in one place. The IANA registry looks a little complicated at first, compared to the ISO code lists, but it is easy enough to use once you understand its structure.

The registry is a long text file. To find a language subtag, search the page for the name of that language, in English. If we search for 'French', we find a record that looks like this:

%%
Type: language
Subtag: fr
Description: French
Added: 2005-10-16
Suppress-Script: Latn
%%

Note that the type of this record is language. What you are looking for is the code labeled Subtag, which indicates a value of fr.

You can find other tags in the same way. For example, to create a tag fr-CA (French as used in Canada), you would next search for Canada, and check that you had found a tag of type region.

There are, however, some additional things you need to bear in mind when choosing subtags. For example, you should avoid subtags that are described in the registry as redundant or deprecated, and you need to use variant subtags in combination with certain other prescribed subtags. For more information about choosing subtags, read Choosing a language tag.

There is also an unofficial, user-friendly tool for searching the registry.

The following sections will give you more detail about specific subtags.

The primary language subtag

All language tags must begin with a primary language subtag.

Examples of simple, language-only language tags include:

These codes come from, and are kept up to date with, ISO 639 language codes.

Because RFC 3066 didn't provide a list of valid subtags and just referred users to ISO 639, there was sometimes confusion about how to tag languages when the ISO code lists contained both two-letter and three-letter codes (and sometimes more than one three-letter code). Now all valid subtags are listed in a single IANA registry, which adopts only one value from the ISO lists per language. If a two-letter ISO code is available, this will be the one in the registry. Otherwise the registry will contain one three-letter code. This should make things simpler.

When RFC 5646 was published, over 7,000 new ISO 639-3 three-letter codes were added to the Subtag Registry.

This is an example of the primary language subtag for Spanish, es, in the registry:

%%
Type: language
Subtag: es
Description: Spanish
Description: Castilian
Added: 2005-10-16
Suppress-Script: Latn
%%

Although the codes are case insensitive, they are commonly written lowercased, but this is merely a convention.

The extended language subtag

Extlang subtags

zh-yue
ar-afb

Read more in the BCP 47 spec:

2.2.2 Extended Language Subtags

4.1.2 Using Extended Language Subtags

We will refer to extended language subtags as extlang subtags. An extlang subtag must always be preceded by a specific primary language subtag, there can only be one in a language tag, and it comes before any other subtags.

Examples of language tags including extlang subtags are:

Language+extlang combinations are provided to accommodate legacy language tag forms, however, there is a single language subtag available for every language+extlang combination. That language subtag should be used rather than the language+extlang combination, where possible. For example, use yue rather than zh-yue for Cantonese, and afb rather than ar-afb for Gulf Arabic, if you can.

Extlang subtags are always three letters long. Each extlang entry in the registry contains a Prefix field that specifies the language that must precede the extlang subtag. Entries also include a Preferred-Value field that indicates the equivalent language tag.

This is an example of the extlang code for Gulf Arabic, afb, in the registry:

%%
Type: extlang
Subtag: afb
Description: Gulf Arabic
Added: 2009-07-29
Preferred-Value: afb
Prefix: ar Macrolanguage: ar
%%
			

Macrolanguages The primary language subtags used with an extlang subtag are known as macrolanguages, and encompass a number of languages with more specific primary language subtags. The macrolanguage subtag can be used on its own, but unless there is some convention about its meaning in the context where it is used, it is not necessarily precise enough.

For example, zh means Chinese, but it covers many Chinese dialects, often mutually incomprehensible. When zh is used on its own, it is usually used to mean the predominant language in the encompassed range, although this is not explicitly specified in BCP 47. For example, conventionally zh is considered to represent the predominant, Mandarin form of Chinese. Where absolute clarity is needed you can use cmn instead as long as that doesn't break interoperability, however, if you are using zh to represent a language which is not Mandarin, such as Hakka Chinese, you are better off using the explicit code (in that case, hak).

On the other hand, zh-Hans uses zh in its generic sense. This is a useful way to describe writing in Simplified Chinese, since Chinese tends to be written in the same way, regardless of the dialect of the reader.

The script subtag

Script subtags

zh-Hans
az-Latn

Read more in the BCP 47 spec:

2.2.3 Script Subtag

4.1 Choice of Language Tag

Examples of language tags including script subtags are:

The script subtag was first introduced in RFC 4646. The subtags come from, and are kept up to date with, the list of ISO 15924 script codes.

Only one script subtag can appear in a language tag, and it must immediately follow the language or any extlang subtag. It is always four letters long.

You should only use script tags if they are necessary to make a distinction you need. As RFC 4646 co-author, Addison Phillips, writes, "For virtually any content that does not use a script tag today, it remains the best practice not to use one in the future".

If you specifically want to indicate that content is not written, there is a subtag for that. For example, you could use en-Zxxx to make it clear that an audio recording in English is not written content.

Actually, many language subtag entries in the registry strongly discourage the use of script tags by including a Suppress script field. There is such a field in the Spanish example above, which indicates that Spanish is normally written using Latin script, and so the Latn subtag should normally not be used with es.

This example shows the registry entry for Cyrillic script, Cyrl, used for languages such as Russian:

%%
Type: script
Subtag: Cyrl
Description: Cyrillic
Added: 2005-10-16
%%

Although for common uses of language tags it is not likely that you will need to specify the script, there are one or two situations that have been crying out for it for some time. One such example is Chinese. There are many Chinese dialects, often mutually unintelligible, but these dialects are all written using either Simplified or Traditional Chinese script. People typically want to label Chinese text as either Simplified or Traditional, but until recently there was no way to do so. People had to bend something like zh-CN (meaning Chinese as spoken in China) to mean Simplified Chinese, even in Singapore, and zh-TW (meaning Chinese as spoken in Taiwan) for Traditional Chinese. (Other people, however, use zh-HK for Traditional Chinese.) The availability of zh-Hans and zh-Hant for Chinese written in Simplified and Traditional scripts should improve consistency and accuracy, and is already becoming widely used, although of course you may need to continue to use the old language tags in some cases for consistency.

The region subtag

Region subtags

en-GB
es-005
zh-Hant-HK

Read more in the BCP 47 spec:

2.2.4 Region Subtag

4.1 Choice of Language Tag

Examples of language tags including region subtags include:

The region subtag in RFC 3066 took its values from the ISO 3166 country codes. These two-letter codes are still available from the new registry, but the registry also lists 3-digit UN M.49 region codes. The advantage of these codes is that they can represent more than just countries. For example, localization groups have for some time wanted to label their carefully crafted translations as Latin-American Spanish, rather than the Spanish of any particular country. With RFC 5646 this is possible; the appropriate language tag is es-419.

Only one region subtag can appear in a language tag, and it must appear after the language subtag and any extlang and script tags. It is a two-letter alpha or 3-digit numeric code. You can have a language code immediately followed by a region code, just as you are used to for language tags such as en-US.

Once again, you should only use region subtags if they are necessary to make a distinction you need. Unless you specifically need to highlight that you are talking about Italian as spoken in Italy you should use it for Italian, and not it-IT. The same goes for any other possible combination.

These examples from the registry show the codes for Austria, AT, and Northern Africa, 015:

%%
Type: region
Subtag: AT
Description: Austria
Added: 2005-10-16
%%
Type: region
Subtag: 015
Description: Northern Africa
Added: 2005-10-16
%%

Variant subtags

Variant subtags

sl-nedis
sl-IT-nedis
de-CH-1901

Read more in the BCP 47 spec:

2.2.5 Variant Subtags

4.1 Choice of Language Tag

Variant subtags are values used to indicate dialects or script variations not already covered by combinations of language, script and region subtag. The variant subtags must appear after any language, script or region subtags, but script and region subtags do not need to precede them.

It is unlikely that you will need to use variant subtags unless you are working in a specialized area.

The following examples may help you understand what these subtags do.

This example from the registry shows the code for the Nadiza dialect of Slovenian, nedis:

%%
Type: variant
Subtag: nedis
Description: Natisone dialect
Description: Nadiza dialect
Added: 2005-10-16
Prefix: sl
%%

In the registry these subtags are tied to a specific language (and possibly additional subtags between this subtag and the primary language subtag) by the 'Prefix' field. The nedis example shown above should only be used with Slovenian.

If you need to express a particular dialectal or script nuance that is not currently available, you should propose a variant subtag or subtags for inclusion in the registry using the registration procedure outlined in RFC 5646.

Extension and private-use subtags

Extension subtags

de-DE-u-co-phonebk

Private use subtags

en-US-x-twain

Read more in the BCP 47 spec:

2.2.7 Private Use Subtags

2.2.6 Extension Subtags

4.1 Choice of Language Tag

If you feel you really need to use these subtags, you should read the specification, rather than this article.

Extension and private use subtags are introduced by a single letter tag, or 'singleton'. An organization can propose a singleton for an extension. It's intended use must be described by an RFC (IETF specification). The singleton will be added to the registry if it successfully passes a review. The singleton x is reserved for private use. Multiple subtags are allowed after the singleton; however, as for all subtags, they must each be 8 or less characters in length.

Extension subtags allow for extensions to the language tag. For example, the extension subtag u has been registered by the Unicode Consortium to add information about language or locale behavior. Many locale identifiers require additional "tailorings" or options for specific values within a language, culture, region, or other variation. This extension provides a mechanism for using these additional tailorings within language tags for general interchange.

For example, the following indicates that phonebook collation order should be used by an application, that sorted data in a document is sorted according to this collation, and so on.

The u- extension is defined in RFC 6067, which points to the Unicode Consortium's Common Locale Data Repository (CLDR) for details on the subtags that follow it. It is not defined by BCP 47.

Private-use subtags do not appear in the subtag registry, and are chosen and maintained by private agreement amongst parties.

Because these subtags are only meaningful within private agreements and cannot be used interoperably across the Web, they should be used with great care, and avoided whenever possible.

The following example of a private use subtag may identify a specific type of US English, but only within a closed community. Outside of that private agreement, its meaning cannot be relied upon.

Grandfathered and redundant subtags

Read more in the BCP 47 spec:

2.2.8 Grandfathered and Redundant Registrations

Grandfathered tags are special cases, provided for backwards compatibility. They are subtags that were registered before RFC 4646 that cannot be completely composed from the subtags in the current registry, or do not fit the syntax currently defined for language tags.

Redundant tags are language tags composed of a sequence of subtags and registered before RFC 4646 that can now be formed by combining separate subtags from the current registry. The original registrations remain in the registry mostly 'as a matter of historical curiosity'.

Many grandfathered tags have been superceded by subtags or combinations of subtags in the registry. Such grandfathered tags are now deprecated, and usually contain a Preferred-Value field that indicates how you ought to represent that language instead. For instance, the following example of a grandfathered tag indicates that you should use the jbo language subtag instead of art-lojban.

%%
Type: grandfathered 
Tag: art-lojban 
Description: Lojban 
Added: 2001-11-11 
Deprecated: 2003-09-02 
Preferred-Value: jbo 
%%

Matching language tags

Matching different language tags is important for a number of applications. According to BCP 47 en can be said to match en-GB. For example, the following CSS code colors all English text red in browsers that support the pseudo-attribute :lang.

:lang(en) { color: red; }

In the following code, the text described as lang="en-GB" will be red.

<p>En janvier, toutes les boutiques de Londres affichent des panneaux 
<span lang="en-GB">SALE</span>, mais en fait ces magasins sont bien propres!</p>

On the other hand, given the following CSS declaration,

:lang(en-GB) { color: red; }

the word 'SALE' should not be red in the following code.

<p>En janvier, toutes les boutiques de Londres affichent des panneaux 
<span lang="en">SALE</span>, mais en fait ces magasins sont bien propres!</p>

With the availability of additional tags in RFC 5646, matching is a little more complicated. In addition, its companion, RFC 4647 Matching of Language Tags, describes more than one possible approach to matching. Matching will be described in another article.

By the way

Language tags for HTML were first formally defined in RFC 2070, F. Yergeau, et.al. Internationalization of the Hypertext Markup Language. RFC 2070 was incorporated into HTML 4, and has been reclassified as historic.

Note there have been changes to ISO language codes. In 1989 iw, in, and ji were withdrawn and replaced by he, id, and yi. More recently, the ISO country code cs, that used to represent Czechoslovakia, was changed to represent Serbia and Montenegro. Such changes can lead to confusion when comparing codes that were assigned to text over a long period. The new IANA subtag registry allows for tags to be deprecated and superseded by new tags, but will never remove or change the meaning of a subtag. It is expected that ISO will also follow a similar policy for the future.

Many other W3C and Web-related specifications use language tags:

Note also that language information can be attached to objects such as images and included audio files.