Intended audience: XHTML/HTML coders (using editors or scripting), script developers (PHP, JSP, etc.), schema developers (DTDs, XML Schema, RelaxNG, etc.), XSLT developers, Web project managers, standards implementers, and anyone who needs an overview of how language tags are constructed using BCP47.
Language tags are used to indicate the language of text in HTML and XML documents. Use the lang attribute to specify language tags in HTML, and the xml:lang attribute for XML.
In both cases, language information is inherited by elements inside the one where the declaration was made, unless one of those elements declares a different language (in the same way).
Terminology
In this article we refer to the value of a language attribute such as fr-CA as a language
tag. The fr and CA parts are referred to as subtags when described as parts of a tag.
When described as members of an ISO list of languages or countries, fr and CA are referred to as codes.
Language tag syntax is defined by the IETF's BCP 47. BCP stands for 'Best Current Practice', and is a persistent name for a series of RFCs whose numbers change as they are updated. The latest RFC describing language tag syntax is RFC 5646, Tags for the Identification of Languages, and it obsoletes the older RFCs 4646, 3066 and 1766.
You used to find subtags by consulting the lists of codes in various ISO standards, but now you can find all subtags in the IANA Language Subtag Registry. We will describe the new registry below.
If you want to know how to choose a language tag, you should read Choosing a language tag. The rest of this article provides an overview of the syntax for language tags as described in BCP 47.
Most language tags consist of a two- or three-letter language subtag. Often this is followed by a two-letter or three-digit region subtag. RFC 5646 also allows for a number of additional subtags, where needed. These will be explained briefly in the next section, and include extended language, script, variant, extension and private-use subtags.
The golden rule when creating language tags is to keep the tag as short as possible. Avoid region, script or other
subtags except where they add useful distinguishing information. For instance, use ja for Japanese and not ja-JP, unless
there is a particular reason that you need to say that this is Japanese as spoken in Japan.
Examples:
| Code | Language | Subtags |
|---|---|---|
| en | English | language |
| mas | Masai | language |
| fr-CA | French as used in Canada | language+region |
| es-419 | Spanish as used in Latin America | language+region |
| zh-Hans | Chinese written with Simplified script | language+script |
XML also provides a means to prevent inheritance of language using the empty string, i.e. xml:lang="". Essentially, this says: I do not want to associate any language with this information.
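As a rough illustration of the inheritance rule and of the empty-string reset, the following Python sketch walks a small XML document and reports the language in effect for each element. The sample document and the function name effective_languages are invented for this example; only the standard library's xml.etree.ElementTree is used.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def effective_languages(element, inherited=None):
    """Yield (element name, language in effect) for an element and its descendants."""
    # An element's own declaration wins; otherwise it inherits from its parent.
    lang = element.get(XML_LANG, inherited)
    # xml:lang="" means "no language is associated with this content".
    if lang == "":
        lang = None
    yield element.tag, lang
    for child in element:
        yield from effective_languages(child, lang)


# Hypothetical sample document, not taken from any specification.
doc = ET.fromstring(
    '<doc xml:lang="en">'
    "<p>Hello</p>"
    '<p xml:lang="fr">Bonjour <em>le monde</em></p>'
    '<code xml:lang="">x = 1</code>'
    "</doc>"
)

for name, lang in effective_languages(doc):
    print(name, lang)
```

Here the first p inherits en from doc, the em inherits fr from its parent p, and the code element ends up with no language at all.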
The remainder of this article provides additional detail on how to construct language tags.
Note that the HTML 4.01 specification still recommends the use of RFC 1766 for identifying language but you should use RFC 5646 despite what the HTML specification currently says.
Some of the key differences between RFC 5646 and earlier specifications such as RFC 3066 are:
RFC 3066 essentially allowed you to compose language tags that were either a language code on its own, a language code plus a country code, or one of a small number of specially registered values in the IANA language tag registry.
RFC 5646 caters for more types of subtag, and allows you to combine them in various ways. While this may appear to make life much more complicated, generally speaking choosing language tags will continue to be a simple matter - however, where you need additional power it will be available to you. In fact, for most people, RFC 5646 should actually make life simpler in a number of ways - for one thing, there is only one place you need to look now for valid subtags.
The list below shows the various types of subtag that are available. We will work our way through these and how they are used in the sections that follow.
language-extlang-script-region-variants-extensions-privateuse
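To make the ordering above concrete, here is a minimal Python sketch that splits a tag into those subtag types using a regular expression. It is a simplification written for this article, not a full BCP 47 parser: it ignores grandfathered tags, allows only one extlang subtag, checks shape rather than registry membership, and the names LANGTAG and parse_tag are made up for the example.

```python
import re

# Simplified pattern following the order
# language-extlang-script-region-variants-extensions-privateuse.
# Assumption: shape only; no subtag is checked against the registry.
LANGTAG = re.compile(
    r"""
    ^(?P<language>[a-z]{2,3})
    (?:-(?P<extlang>[a-z]{3})(?![a-z0-9]))?
    (?:-(?P<script>[a-z]{4})(?![a-z0-9]))?
    (?:-(?P<region>[a-z]{2}|[0-9]{3})(?![a-z0-9]))?
    (?P<variants>(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3})(?![a-z0-9]))*)
    (?P<extensions>(?:-[0-9a-wy-z](?:-[a-z0-9]{2,8})+)*)
    (?:-x(?P<privateuse>(?:-[a-z0-9]{1,8})+))?
    $
    """,
    re.IGNORECASE | re.VERBOSE,
)


def parse_tag(tag):
    """Return a dict of subtag types for a tag this simplified pattern accepts."""
    match = LANGTAG.match(tag)
    if match is None:
        raise ValueError(f"not handled by this simplified pattern: {tag!r}")
    parts = match.groupdict()
    # Turn the repeatable parts into lists of subtags.
    for key in ("variants", "extensions", "privateuse"):
        parts[key] = [s for s in (parts[key] or "").split("-") if s]
    return parts


for example in ("en", "fr-CA", "es-419", "zh-Hant-HK", "sl-IT-nedis", "en-US-x-twain"):
    print(example, parse_tag(example))
```

Because subtag types differ in length and in their mix of letters and digits, position and shape alone are enough to tell a script subtag (four letters) from a region subtag (two letters or three digits).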
The entries in the registry follow certain conventions with regard to upper and lower case. For example, language subtags are lower case, alphabetic region subtags are upper case, and script subtags begin with an initial capital. This is only a convention: language tags are case insensitive, so when you use these subtags you are free to use whatever case you like.
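If you want to present tags using the registry's case conventions, a small helper along the following lines may be enough. This is a sketch: the function name conventional_case is invented here, and it relies on shape alone (two-character subtags after the first are treated as regions, four-character ones as scripts, and everything from a singleton onwards is lowercased).

```python
def conventional_case(tag):
    """Rewrite a tag using the conventional casing of registry entries."""
    result = []
    after_singleton = False
    for position, subtag in enumerate(tag.split("-")):
        if position == 0 or after_singleton:
            result.append(subtag.lower())       # language, or extension/private use
        elif len(subtag) == 2:
            result.append(subtag.upper())       # alphabetic region subtag
        elif len(subtag) == 4:
            result.append(subtag.capitalize())  # script subtag (digits are unaffected)
        else:
            result.append(subtag.lower())
        if len(subtag) == 1:
            after_singleton = True
    return "-".join(result)


def same_tag(a, b):
    """Compare two tags the way the syntax requires: ignoring case."""
    return a.lower() == b.lower()


print(conventional_case("ZH-hant-hk"))
print(same_tag("en-gb", "EN-GB"))
```

The casing is purely cosmetic, so any comparison between tags should ignore it, as same_tag does.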
As mentioned above, you used to find subtags by consulting the lists of codes in various ISO standards, but now you can find all subtags in one place. The IANA registry looks a little complicated at first, compared to the ISO code lists, but it is easy enough to use once you understand its structure.
The registry is a long text file. To find a language subtag, search the page for the name of that language, in English. If we search for 'French', we find a record that looks like this:
%%
Type: language
Subtag: fr
Description: French
Added: 2005-10-16
Suppress-Script: Latn
%%
Note that the type of this record is 'language'. What you are looking for is the code labelled 'Subtag', which indicates a value of 'fr'.
You can find other subtags in the same way. For example, to create the tag fr-CA (French as used in Canada), you would next search for 'Canada', and check that you had found a record of type 'region'.
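The same lookup can be automated. The Python sketch below parses records in the registry's layout (one "Field: value" pair per line, records separated by lines containing %%) and searches them by description and type. SAMPLE holds only the records quoted in this article rather than the real registry file, and parse_records and find_subtag are names invented for the example.

```python
# A few records quoted in this article, in the registry's own layout.
SAMPLE = """\
%%
Type: language
Subtag: fr
Description: French
Added: 2005-10-16
Suppress-Script: Latn
%%
Type: language
Subtag: es
Description: Spanish
Description: Castilian
Added: 2005-10-16
Suppress-Script: Latn
%%
Type: region
Subtag: AT
Description: Austria
Added: 2005-10-16
%%
Type: region
Subtag: 015
Description: Northern Africa
Added: 2005-10-16
%%
"""


def parse_records(text):
    """Split registry text into records; each field maps to a list of values."""
    records, current, last_field = [], {}, None
    for line in text.splitlines():
        if line.strip() == "%%":
            if current:
                records.append(current)
            current, last_field = {}, None
        elif line[:1].isspace() and last_field:
            # A line starting with whitespace continues the previous value.
            current[last_field][-1] += " " + line.strip()
        elif ":" in line:
            field, _, value = line.partition(":")
            last_field = field.strip()
            current.setdefault(last_field, []).append(value.strip())
    if current:
        records.append(current)
    return records


def find_subtag(records, description, record_type):
    """Return the subtag of the given type whose description matches, else None."""
    wanted = description.lower()
    for record in records:
        if record_type in record.get("Type", []) and any(
            d.lower() == wanted for d in record.get("Description", [])
        ):
            return record["Subtag"][0]
    return None


records = parse_records(SAMPLE)
print(find_subtag(records, "French", "language"))
print(find_subtag(records, "Austria", "region"))
```

Checking the Type field matters for the reason given above: a description can match a record of the wrong type, so the search has to confirm that it found a language or a region as intended.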
There are, however, some additional things you need to bear in mind when choosing subtags. For example, you should avoid subtags that are described in the registry as redundant or deprecated, and you need to use variant subtags in combination with certain other prescribed subtags. For more information about choosing subtags, read Choosing a language tag.
Richard Ishida has also created a user-friendly tool for searching the registry.
The following sections will give you more detail about specific subtags.
Language subtags
All language tags must begin with a language subtag.
Examples of simple, language-only language tags include:
en
ast
These codes come from, and are kept up to date with, ISO 639 language codes.
Because RFC 3066 didn't provide a list of valid subtags and just referred users to ISO 639, there was sometimes confusion about how to tag languages when the ISO code lists contained both two-letter and three-letter codes (and sometimes more than one three-letter code). Now all valid subtags are listed in a single IANA registry, which adopts only one value from the ISO lists per language. If a two-letter ISO code is available, this will be the one in the registry. Otherwise the registry will contain one three-letter code. This should make things simpler.
When RFC 5646 was published, over 7,000 new ISO 639-3 three-letter codes were added to the Subtag Registry.
This is an example of the language code for Spanish, es, in the registry:
%% Type: language Subtag: es Description: Spanish Description: Castilian Added: 2005-10-16 Suppress-Script: Latn %%
Although the codes are case insensitive, they are commonly written lowercased, but this is merely a convention.
Extended language subtags
We will refer to extended language subtags as extlang subtags. An extlang subtag must always be preceded by a specific language subtag, there can only be one in a language tag, and it comes before any other subtags.
Examples of language tags including extlang subtags are zh-cmn (Mandarin Chinese) and ar-afb (Gulf Arabic).
Language+extlang combinations are provided to accommodate legacy language tag forms; however, there is a single language subtag available for every language+extlang combination. That language subtag should be used rather than the language+extlang combination, where possible. For example, use cmn rather than zh-cmn for Mandarin Chinese, and afb rather than ar-afb for Gulf Arabic.
Extlang subtags are always three letters long. Each extlang entry in the registry contains a Prefix field that specifies the language that must precede the extlang subtag. Entries also include a Preferred-Value field that indicates the equivalent language tag.
This is an example of the extlang code for Gulf Arabic, afb, in the registry:
%%
Type: extlang
Subtag: afb
Description: Gulf Arabic
Added: 2009-07-29
Preferred-Value: afb
Prefix: ar
Macrolanguage: ar
%%
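The Prefix and Preferred-Value fields are enough to rewrite a language+extlang combination as the single language subtag. The following Python sketch shows the idea. The EXTLANG table holds only the two extlang subtags mentioned in this article, and prefer_language_subtag is a name invented for the example; a real implementation would load every extlang record from the registry.

```python
# Assumption: a tiny hand-copied table; the full data lives in the registry.
EXTLANG = {
    "afb": {"Prefix": "ar", "Preferred-Value": "afb"},
    "cmn": {"Prefix": "zh", "Preferred-Value": "cmn"},
}


def prefer_language_subtag(tag):
    """Replace a leading language+extlang pair with the preferred language subtag."""
    subtags = tag.split("-")
    if len(subtags) >= 2:
        entry = EXTLANG.get(subtags[1].lower())
        # Only rewrite when the extlang follows the language named in Prefix.
        if entry and entry["Prefix"] == subtags[0].lower():
            return "-".join([entry["Preferred-Value"]] + subtags[2:])
    return tag


print(prefer_language_subtag("ar-afb"))
print(prefer_language_subtag("zh-cmn-Hans"))
```

Tags without an extlang subtag, such as zh-Hans, pass through unchanged.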
Macrolanguages
The language subtags used with an extlang subtag are known as macrolanguages, and encompass a number of languages with more specific language subtags. The macrolanguage subtag can be used on its own, but unless there is some convention about its meaning in the context where it is used, it is not necessarily precise enough.
For example, zh means Chinese, but it covers many Chinese dialects, often mutually incomprehensible. It is only where a convention is applied that zh or zh-CN can be considered to represent the predominant, Mandarin form of Chinese. That said, in practice, most implementations will interpret zh as Mandarin. Where absolute clarity is needed you can use cmn instead. However, if you are using zh to represent a language which is not Mandarin, such as Hakka Chinese, you are better off using the explicit code hak.
On the other hand, zh-Hans is a useful way to describe writing in Simplified Chinese, since Chinese tends to be written in the same way, regardless of the dialect of the reader.
Script subtags
Examples of language tags including script subtags are:
zh-Hans
az-Latn
The script subtag was first introduced in RFC 4646. The subtags come from, and are kept up to date with, the list of ISO 15924 script codes.
Only one script subtag can appear in a language tag, and it must immediately follow the language or any extlang subtag. It is always four letters long.
You should only use script tags if they are necessary to make a distinction you need. As RFC 4646 co-author, Addison Phillips, writes, "For virtually any content that does not use a script tag today, it remains the best practice not to use one in the future".
If you specifically want to indicate that content is not written, there is a subtag for that. For example, you could use en-Zxxx to make it clear that an audio recording in English is not written content.
Actually, many language subtag entries in the registry strongly discourage the use of script subtags by including a 'Suppress-Script' field. There is such a field in the Spanish example above, which indicates that Spanish is normally written using Latin script, and so the Latn subtag should normally not be used with es.
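A tool that generates tags could apply this rule mechanically. The sketch below removes a script subtag when it matches the Suppress-Script value for the language. The SUPPRESS_SCRIPT table contains only the two values quoted in this article (for fr and es), and drop_suppressed_script is a name invented for the example.

```python
# Assumption: hand-copied from the fr and es records shown in this article.
SUPPRESS_SCRIPT = {"fr": "Latn", "es": "Latn"}


def drop_suppressed_script(tag):
    """Remove the script subtag if the registry says to suppress it for this language."""
    subtags = tag.split("-")
    language = subtags[0].lower()
    for position, subtag in enumerate(subtags[1:], start=1):
        if len(subtag) == 1:
            break  # singleton: extensions or private use follow
        if len(subtag) == 4 and subtag.isalpha():
            # A four-letter subtag in this position is the script subtag.
            if SUPPRESS_SCRIPT.get(language, "").lower() == subtag.lower():
                del subtags[position]
            break
    return "-".join(subtags)


print(drop_suppressed_script("es-Latn-419"))
print(drop_suppressed_script("zh-Hans"))
```

A script subtag that is not suppressed for the language, such as Hans with zh, is left in place because it carries real information.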
This example shows the registry entry for Cyrillic script, Cyrl, used for languages such as Russian:
%%
Type: script
Subtag: Cyrl
Description: Cyrillic
Added: 2005-10-16
%%
Although for common uses of language tags it is not likely that you will need to specify the script, there are one or two situations that have been crying out for it for some time. One such example is Chinese. There are many Chinese dialects, often mutually unintelligible, but these dialects are all written using either Simplified or Traditional Chinese script. People typically want to label Chinese text as either Simplified or Traditional, but until recently there was no way to do so. People had to bend something like zh-CN (meaning Chinese as spoken in China) to mean Simplified Chinese, even in Singapore, and zh-TW (meaning Chinese as spoken in Taiwan) for Traditional Chinese. (Other people, however, use zh-HK for Traditional Chinese.) The availability of zh-Hans and zh-Hant for Chinese written in Simplified and Traditional scripts should improve consistency and accuracy. These tags are already becoming widely used, although of course you may need to continue to use the old language tags in some cases for consistency.
Region subtags
Examples of language tags including region subtags include:
en-GB
es-005
zh-Hant-HK
The region subtag in RFC 3066 took its values from the ISO 3166 country codes. These two-letter codes are still available from the new
registry, but the registry also lists 3-digit UN M.49 region codes. The advantage of these codes is that they can represent more than just countries.
For example, localization groups have for some time wanted to label their carefully crafted translations as Latin-American Spanish, rather than the
Spanish of any particular country. With RFC 5646 this is possible; the appropriate language tag is es-419.
Only one region subtag can appear in a language tag, and it must appear after the language subtag and any extlang and script tags. It is a two-letter alpha or 3-digit numeric code. You can have a language code immediately followed by a region code, just as you are
used to for language tags such as en-US.
Once again, you should only use region subtags if they are necessary to make a distinction you need. Unless you specifically need to highlight that you are talking about Italian as spoken in Italy, you should tag Italian as it, and not it-IT. The same goes for any other possible combination.
These examples from the registry show the codes for Austria, AT, and Northern Africa, 015:
%%
Type: region
Subtag: AT
Description: Austria
Added: 2005-10-16
%%
Type: region
Subtag: 015
Description: Northern Africa
Added: 2005-10-16
%%
Variant subtags
Variant subtags are values used to indicate dialects or script variations not already covered by combinations of language, script and region subtags. The variant subtags must appear after any language, script or region subtags, but script and region subtags do not need to precede them.
It is unlikely that you will need to use variant subtags unless you are working in a specialised area.
The following examples may help you understand what these subtags do:
sl-nedis (the Nadiza dialect of Slovenian)
sl-IT-nedis (the Nadiza dialect of Slovenian as spoken in Italy)
de-CH-1901 (German as used in Switzerland, written in the traditional orthography of 1901)
This example from the registry shows the code for the Nadiza dialect of Slovenian, nedis:
%%
Type: variant
Subtag: nedis
Description: Natisone dialect
Description: Nadiza dialect
Added: 2005-10-16
Prefix: sl
%%
In the registry these subtags are tied to a specific language (and possibly additional subtags between this subtag and the language subtag) by the 'Prefix' field. The nedis example shown above can
only be used with Slovenian.
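A checker for the Prefix constraint might look like the following Python sketch. The VARIANT_PREFIXES table holds only the nedis entry quoted above, and variant_prefix_ok is a name invented for the example; variants that are not in the table are simply not checked.

```python
# Assumption: hand-copied from the nedis record; real data comes from the registry.
VARIANT_PREFIXES = {"nedis": ["sl"]}


def variant_prefix_ok(tag):
    """Return False if a known variant subtag appears without one of its prefixes."""
    subtags = tag.lower().split("-")
    for position, subtag in enumerate(subtags):
        if len(subtag) == 1:
            break  # singleton: what follows is extension or private-use data
        prefixes = VARIANT_PREFIXES.get(subtag)
        if position == 0 or prefixes is None:
            continue
        before = subtags[:position]
        # The subtags before the variant must start with one of its prefixes;
        # other subtags (such as a region) may sit in between.
        if not any(before[: len(p.split("-"))] == p.split("-") for p in prefixes):
            return False
    return True


print(variant_prefix_ok("sl-IT-nedis"))
print(variant_prefix_ok("it-nedis"))
```

Comparing whole subtags rather than raw strings avoids accepting a different language subtag that merely begins with the same letters as the prefix.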
If you need to express a particular dialectal or script nuance that is not currently available, you should propose a variant subtag or subtags for inclusion in the registry using the registration procedure outlined in RFC 5646.
Extension and private-use subtags
We will mention these other subtags in passing, but if you feel you really need to use them, you should read the specification rather than this article.
Extension subtags allow for future extensions to the language tag. There are no such registered tags at the moment.
Private-use subtags do not appear in the subtag registry, and are chosen and maintained by private agreement amongst parties.
Extension and private-use subtags are introduced by a single-letter subtag, or 'singleton'. The singleton for private use is x. Note that each subtag after the singleton can be at most 8 characters long, though you can use multiple subtags.
Private-use subtags should be used with great care, and avoided whenever possible, since they interfere with the interoperability that RFC 5646 exists to promote.
The following example of a private-use tag may identify a specific type of US English, but only within a closed community. Outside of that private agreement, its meaning cannot be relied upon.
en-US-x-twain
Grandfathered and redundant tags
Grandfathered tags are special cases, provided for backwards compatibility. They are tags that were registered before RFC 4646 and that cannot be completely composed from the subtags in the current registry, or do not fit the syntax currently defined for language tags.
Redundant tags are language tags composed of a sequence of subtags and registered before RFC 4646 that can now be formed by combining separate subtags from the current registry. The original registrations remain in the registry mostly 'as a matter of historical curiosity'.
Many grandfathered tags have been superseded by subtags or combinations of subtags in the registry. Such grandfathered tags are now deprecated, and usually contain a Preferred-Value field that indicates how you ought to represent that language instead. For instance, the following example of a grandfathered tag indicates that you should use the jbo language subtag instead of art-lojban.
%%
Type: grandfathered
Tag: art-lojban
Description: Lojban
Added: 2001-11-11
Deprecated: 2003-09-02
Preferred-Value: jbo
%%
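In code, honouring Preferred-Value for grandfathered tags can be as simple as a lookup on the whole tag. This Python sketch uses a table containing only the art-lojban record shown above; the names GRANDFATHERED_PREFERRED and preferred_tag are invented for the example.

```python
# Assumption: a single hand-copied entry; the registry lists the others.
GRANDFATHERED_PREFERRED = {"art-lojban": "jbo"}


def preferred_tag(tag):
    """Return the preferred replacement for a deprecated grandfathered tag."""
    # Grandfathered tags are matched as whole tags, ignoring case.
    return GRANDFATHERED_PREFERRED.get(tag.lower(), tag)


print(preferred_tag("art-lojban"))
print(preferred_tag("fr-CA"))
```

Tags that are not in the table are returned unchanged.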
Matching different language tags is important for a number of applications. According to BCP 47, 'en' can be said to match 'en-GB'. For example, the following CSS code colors all English text red in browsers that support the :lang pseudo-class.
:lang(en) { color: red; }
In the following code, the text described as lang="en-GB" will be red.
<p>En janvier, toutes les boutiques de Londres affichent des panneaux <span lang="en-GB">SALE</span>, mais en fait ces magasins sont bien propres!</p>
On the other hand, given the following CSS declaration,
:lang(en-GB) { color: red; }
the word 'SALE' should not be red in the following code.
<p>En janvier, toutes les boutiques de Londres affichent des panneaux <span lang="en">SALE</span>, mais en fait ces magasins sont bien propres!</p>
With the availability of additional tags in RFC 5646, matching is a little more complicated. In addition, its companion, RFC 4647 Matching of Language Tags, describes more than one possible approach to matching. Matching will be described in another article.
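As a rough illustration of the prefix behaviour seen in the two CSS examples above, the following Python sketch implements the simplest scheme, in which a language range matches a tag if it equals the tag or is a prefix of it that ends at a subtag boundary. The function name basic_filter is invented for this example, and the sketch does not cover the other matching schemes that RFC 4647 describes.

```python
def basic_filter(language_range, tag):
    """Return True if the range matches the tag exactly or as a subtag-aligned prefix."""
    language_range = language_range.lower()
    tag = tag.lower()
    if language_range == "*":
        return True  # the wildcard range matches any tag
    # Require a hyphen after the prefix so that "en" does not match "eng".
    return tag == language_range or tag.startswith(language_range + "-")


print(basic_filter("en", "en-GB"))
print(basic_filter("en-GB", "en"))
```

The asymmetry is the point: a range of en matches the tag en-GB, but a range of en-GB does not match the tag en.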
Language tags for HTML were first formally defined in RFC 2070, F. Yergeau, et al., Internationalization of the Hypertext Markup Language. RFC 2070 was incorporated into HTML 4, and has been reclassified as historic.
Note there have been changes to ISO language codes. In 1989 iw, in, and ji were withdrawn and replaced by he, id, and yi. More recently, the ISO country code cs, that used to represent Czechoslovakia, was changed to represent Serbia and Montenegro. Such changes can lead to confusion when comparing codes that were assigned to text over a long period. The new IANA subtag registry allows for tags to be deprecated and superseded by new tags, but will never remove or change the meaning of a subtag. It is expected that ISO will also follow a similar policy for the future.
Many other W3C and Web-related specifications use language tags, for example through the HTML lang attribute and the XML xml:lang attribute, as well as the hreflang attribute. Note also that language information can be attached to objects such as images and included audio files.
Content first published 2006-09-20. Last substantive update 2009-09-10 14:17 GMT. This version 2009-09-10 14:17 GMT
For the history of document changes, search for article-language-tags in the i18n blog.
Copyright © 2006-2009 W3C® (MIT, ERCIM, Keio), All Rights Reserved. W3C liability, trademark, document use and software licensing rules apply. Your interactions with this site are in accordance with our public and Member privacy statements.