Understanding the New Language Tags

Intended audience: users, XHTML/HTML coders (using editors or scripting), script developers (PHP, JSP, etc.), CSS coders, schema developers (DTDs, XML Schema, RelaxNG, etc.), XSLT developers, Web project managers, and anyone who is likely to use language tags.

Warning: This article is only of historical interest, since the proposed new approach it refers to was published as RFC 4646 and RFC 4647 (collectively known as BCP 47) in September 2006, and those documents have since been revised. The article is now out of date.

For up-to-date information about how to construct language tags see:

As the new language tags and software based on RFC 3066bis begin to appear, Addison Phillips, co-editor of RFC 3066bis with Mark Davis, reviews how the new standard changes language tags, and what remains to be done.

Introduction

Language tags are identifiers used in protocols or document formats to indicate the natural language of the content, or to express a user's preference for a specific language or set of languages. Language tags can be used by a computer system to apply specific processing or formatting to the text in a language sensitive manner. For example, a language tag might be used to assist in default font selection or to select which dictionary to use in the spell checker.

Language tags can also be used to identify the audience for a document or in a language negotiation mechanism to select which content should be displayed. Web-based applications often use language tags to infer the user's locale preference (affecting number or date formats in a Web site, for example).

The standard for language tags on the Internet, which includes Web technologies, email, protocol headers, HTML, XML, IMAP, LDAP, RDF, RSS, and a potpourri of other acronyms, is something called "BCP 47".

In November of 2005 the IETF, which is the standards body that defines these tags, approved a set of documents to update BCP 47, making changes in the structure of language tags and their use. Never heard of BCP 47? You might have heard of RFC 1766 (which was the original BCP 47) or its successor, RFC 3066, which was BCP 47 until this past November. These were the documents that defined the use of ISO 639 and ISO 3166 codes to form language tags like "fr", "en-US", "de-CH", or "ja", as well as a registry for other, more specialized, tags.

BCP stands for Best Current Practice. RFC stands for Request for Comments. RFCs are stable and official IETF specifications, having moved past the Internet Draft stage.

Overview of the new approach

RFC 3066bis places additional restrictions on the format of a language tag; that is, language tags cannot vary as widely under RFC 3066bis as they could under either predecessor. Since only a few tags were actually registered, this doesn't impose a new burden on users or software. Any tag that was valid under RFC 3066 is still a valid RFC 3066bis tag. It is usually the right tag too: users will rarely want to choose a different tag today in place of the RFC 3066 tag of yesterday.

Any implementation that could handle the registered tags of RFC 3066 should be able to handle the tags generated by RFC 3066bis, since all of the new tags were valid to register under RFC 3066. Indeed, more than half the registry consisted of tags that anticipated the addition of script subtags using the RFC 3066bis structure by the time the new rules were adopted.

Language tags still consist of a sequence of "subtags" separated by hyphens. A subtag can be between one and eight characters in length and is restricted to the ASCII letters and digits (that is, a-z, A-Z, and 0-9). Upper and lowercase letters are not distinguished, so the tag "EN" is considered to be the same as the tag "en" or "eN".
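The basic shape described above (hyphen-separated subtags, one to eight ASCII letters or digits each, compared without regard to case) can be sketched in a few lines of Python. The function names `split_subtags` and `same_tag` are invented for this illustration and are not part of any standard library or of RFC 3066bis; the check covers only this basic shape, not the full syntax.

```python
import re

# One subtag: 1 to 8 ASCII letters or digits, as described above.
SUBTAG = re.compile(r"[A-Za-z0-9]{1,8}")


def split_subtags(tag):
    """Split a tag on hyphens and check the basic shape of each subtag."""
    subtags = tag.split("-")
    for subtag in subtags:
        if not SUBTAG.fullmatch(subtag):
            raise ValueError(f"malformed subtag {subtag!r} in {tag!r}")
    return subtags


def same_tag(a, b):
    """Compare two tags without regard to case."""
    return a.lower() == b.lower()


print(split_subtags("de-CH-1901"))
print(same_tag("EN", "eN"))
```

Lowercasing both sides is the simplest way to get a case-insensitive comparison here, because subtags are limited to ASCII.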

RFC 3066bis defines each type of subtag according to its position and size in the tag. The complete syntax for these tags is shown here:

A Language-Tag consists of:
                   langtag                ; generated tag
              -or- private-use            ; a private use tag
              -or- grandfathered          ; grandfathered registrations

langtag       = (language
                 ["-" script]
                 ["-" region]
                 *("-" variant)
                 *("-" extension)
                 ["-" privateuse])

language      = "en", "ale", or a registered value

script        = "Latn", "Cyrl", "Hant", or other ISO 15924 codes

region        = "US", "CS", "FR", or other ISO 3166 codes;
                "419", "019", or other UN M.49 codes

variant       = "rozaj", "nedis", "1996", multiple subtags can be used in a tag

extension     = single letter followed by additional subtags; more than one extension
                may be used in a language tag

private-use   = "x-" followed by additional subtags, as many as are required
                Note that these can start a tag or appear at the end (but not
                in the middle)

grandfathered = tags listed in the old registry that are not otherwise redundant (a closed list)

As noted above, any valid RFC 3066 language tag is also valid under the new scheme. Most tags are now composed of a sequence of subtags using the generative syntax.

The few exceptions are "grandfathered" tags: these are tags registered under RFC 3066 that don't fit the syntax above (or were obsolete before its adoption). Any existing content or software can thus continue to use these tags. There are 34 of these, of which eight are obsolete and ten more will be made obsolete in the near future. Four grandfathered tags fit the RFC 3066bis pattern, but were not made redundant initially.

The big difference with RFC 3066bis is that, excepting a few grandfathered registrations, all tags are now generative. Because ISO code lists were not always free and because they change over time, a key idea was to create a permanent, stable registry for all of the subtags valid in a language tag. This means that instead of five separate lists of codes, there is now just one table containing the values, located in a single place. The IANA Language Subtag Registry still tracks the ISO standards, except that subtags are never withdrawn and there are clear rules for dealing with conflicting assignments if or when these arise.

Each type of subtag has unique length and content restrictions. The tag always begins with a language subtag - either one of the ISO 639 codes or a registered value. It can then, optionally, be followed by various subtags. Today there are five kinds of subtags that follow the language identifier: scripts, regions, variants, extensions, and private use. The order, length, and content of each subtag type are fixed, so a tag processor can always identify exactly which type of subtag it has, even if the processor doesn't have that subtag in its copy of the registry (or has no copy of the registry at all!).

Script. Script subtags we've already met. These are based on ISO 15924 and indicate the writing system. The script subtag can occur at most once (it may be omitted) and must appear directly after the language. Some languages have a field in the registry indicating that a particular script code should be "suppressed". For example, "zh-Hant" and "zh-Hans" represent Chinese written in Traditional and Simplified scripts respectively, while the language subtag "en" has a "Suppress-Script" field in the registry indicating that most English texts are written in the Latin script, discouraging a tag such as "en-Latn-US".

Region. Region subtags we've also met: these are based mostly on ISO 3166-1 codes and indicate the country or regional variation. The region code can also include selected UN M.49 region codes. UN M.49 codes cover larger areas of the earth or provide for conflict resolution should ISO 3166 reassign a code already in the registry. In fact, ISO 3166 depends on UN M.49 to define what is or is not a "country" or region worthy of a code. The region code can occur at most once (it may be omitted) and must follow any language and script codes. For example, the tag "es-419" represents "Spanish as used in Latin America and the Caribbean" while "es-CO" represents "Spanish as used in Colombia".

Variants. Variant subtags are not based on an external standard. They are all individually registered values, mostly indicating particular dialects or other language variations not covered by scripts or regions. Multiple variant subtags can be included in a tag. Each variant has fields in the registry, though, indicating which subtags it is intended for use with. For example, the "nedis" subtag has a prefix of "sl" (Slovenian) since it represents a dialect of Slovenian. Variants shouldn't be used together unless one variant lists the other in its "prefix". For example, the tag "sl-IT-nedis" identifies the Nadiza dialect of Slovenian as used in Italy.

Extensions. Extensions are a mechanism whereby future additions to language tags can be standardized. Each extension has a single character subtag (a "singleton") that identifies it. Various restrictions apply to extensions and how they are formed, used, and administered. Extensions form the basis for future addition of features to language tags.

Private use. Private use subtags are not based on any standard at all. They are for use by individuals or groups that need to identify something language related that might not rise to the level of standardization. RFC 3066 included private use tags, but the whole tag was private use (this is still valid, of course). Now private use and generative subtags can be used together. The single-letter subtag "x" identifies where the private use subtags begin. For example: en-US-x-twain might identify writing by Mark Twain between two colleagues studying American literature. One benefit of this ability to mix the two is that vendors who extend language tags for proprietary reasons in the future can do so while preserving the maximum amount of interoperability between their system and others.

The new tag syntax uses length and content to distinguish each type of subtag, making it easier than ever to validate the contents of a tag, even without a copy of the registry. The following table shows a number of examples of the new tags:

Tag               Form                            Meaning
en                language                        English
de-AT             language-region                 German as used in Austria
es-419            language-region (UN M.49)       Spanish as used in Latin America and the Caribbean
de-CH-1901        language-region-variant         German as used in Switzerland, orthography of 1901
sr-Cyrl           language-script                 Serbian as written in Cyrillic
sr-Cyrl-CS        language-script-region          Serbian as written in Cyrillic as used in Serbia and Montenegro
sl-Latn-IT-rozaj  language-script-region-variant  Slovenian as written in Latin as used in Italy, Resian dialect
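The idea that position and length alone identify each kind of subtag, with no registry lookup, can be illustrated with a single regular expression. This is a simplified sketch, not the normative grammar: the exact lengths used below (four letters for a script, two letters or three digits for a region, and so on) follow my reading of the RFC 4646 ABNF rather than anything spelled out in this article, and the pattern deliberately ignores grandfathered tags, tags that consist only of private use subtags, and the reserved extended language subtags. The names `LANGTAG` and `classify` are invented for the example.

```python
import re

# Simplified pattern for generative tags. Lengths are an assumption based
# on the RFC 4646 grammar; grandfathered tags, whole-tag private use, and
# extended language subtags are not covered.
LANGTAG = re.compile(
    r"""
    (?P<language>[a-z]{2,8})
    (?:-(?P<script>[a-z]{4}))?
    (?:-(?P<region>[a-z]{2}|[0-9]{3}))?
    (?P<variants>(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*)
    (?P<extensions>(?:-[a-wy-z0-9](?:-[a-z0-9]{2,8})+)*)
    (?:-(?P<privateuse>x(?:-[a-z0-9]{1,8})+))?
    """,
    re.IGNORECASE | re.VERBOSE,
)


def classify(tag):
    """Return a dict of the subtag types found in the tag, or None if the
    tag does not fit the simplified pattern."""
    match = LANGTAG.fullmatch(tag)
    if match is None:
        return None
    # Drop groups that did not take part in the match or matched nothing.
    parts = {name: value for name, value in match.groupdict().items() if value}
    for name in ("variants", "extensions"):
        if name in parts:
            parts[name] = parts[name].lstrip("-")
    return parts


for example in ("en", "de-CH-1901", "sr-Cyrl-CS", "sl-Latn-IT-rozaj", "en-US-x-twain"):
    print(example, classify(example))
```

Note that the pattern never asks whether "rozaj" or "Cyrl" is actually registered. It only reports what kind of subtag each one must be, which is the property the new syntax was designed to provide.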

The great script debate

A critical point of debate during the development of the new language tags was the positioning of the script subtag after the language subtag but before the region subtag. Nothing in RFC 1766 or RFC 3066 guaranteed that the region subtag would appear in the second position and, prior to 2003, when this effort started, no registered tags existed that would clarify whether it was valid to assume that the region code, if it existed, would always appear second (it was quite clear that other values, such as script, could appear second).

Some people felt that putting scripts into the second position presented some problems. In particular, some feared that the script subtag would interfere with common language selection or language negotiation mechanisms. These mechanisms, such as the one described in RFC 2616 (HTTP 1.1), use a prefix called a "language range", which is specified by the user in order to select content.

This form of matching assumes that the user's preference "matches" a piece of content if the user's language tag is a prefix for that of the content. This selection mechanism relies on the assumption that languages which share a prefix are usually "mutually intelligible". (Note that this assumption is often wrong.) Here are some examples of prefix matching:

Language Range  ... matches                                   ... does not match
de              de, de-CH, de-AT, de-DE, de-1901, de-AT-1901  en, fr-CH
de-CH           de-CH, de-CH-1901, de-CH-1996                 de, de-DE, de-1901, de-AT, etc.
zh-TW           zh-TW                                         zh-Hant-TW, zh-Hans-TW
zh-Hant         zh-Hant, zh-Hant-TW, zh-Hant-HK               zh, zh-Hans, zh-TW
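Prefix matching as described above is easy to express in code. The following is a minimal sketch in Python; the function name `range_matches` is made up for this example, and the rule it implements (the range equals the tag, or is a prefix of it that ends at a hyphen) is my reading of the behaviour described here, not text quoted from RFC 2616.

```python
def range_matches(language_range, tag):
    """True if the range equals the tag, or is a prefix of the tag that
    ends exactly at a hyphen. Comparison ignores case."""
    r = language_range.lower()
    t = tag.lower()
    return t == r or t.startswith(r + "-")


content = ["de", "de-CH", "de-AT-1901", "zh-TW", "zh-Hant-TW", "zh-Hans-TW"]

for user_range in ("de", "de-CH", "zh-TW", "zh-Hant"):
    matches = [tag for tag in content if range_matches(user_range, tag)]
    print(user_range, "->", matches)
```

Appending the hyphen before the prefix test is what keeps a range like "de" from matching an unrelated tag that merely starts with the same letters.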

Inserting the script subtag between language and region might have a negative effect on existing user requests or on content that doesn't use a script subtag. Instead of the expected match, the user might receive no content or a less accurate match. This is shown by the last two examples above.

On the other hand, script is usually more closely associated with language than regional variations are. Prefix matching produces more sensible results when the script subtag is closer to the language subtag than the region is. In order to work, users who require script subtags must use (or omit) them in a consistent fashion, in both their requests and their content.

Another problem was the ambiguity of RFC 3066 regarding the generative syntax. The idea of "language-dash-region" language tags was easy enough to grasp; most users didn't read RFC 3066 directly or consider the unstated-but-realized implication that other subtags might sometimes occur in the second position.

Ultimately it was decided that the closer relationship between script and language made the second position a better choice than artificially placing it last. This decision was partially guided by recognition that another type of subtag might be necessary in the future (which we'll get to later).

In addition to the subtags themselves, the new subtag registry contains information to help users select the best combination to identify a particular language. Critical to the acceptance of the position of the script subtag was the inclusion of information in the registry to make clear the need to avoid script subtags except where they add useful distinguishing information. Thus, the registry entry for the language subtag "en" (English) has a field called "Suppress-Script" indicating that the script subtag "Latn" should be avoided with that language, since virtually all English documents use the Latin script.

Note that this doesn't mean that "en-Latn" tags will never be used. There are cases where the script will provide information that distinguishes content. For example, a document that contains both Latin script and Braille might need to distinguish the two forms. However, these are unusual cases and the exception will be sensible (and even obvious) in those cases.

In any case, for virtually any content that does not use a script subtag today, it remains the best practice not to use one in the future. Languages that do use more than one script or are undergoing a script transition - such as those listed above - can and should benefit from identifying content using script subtags. Just over a year from its registration, a quick look at a search engine shows over 8000 pages in Simplified Chinese mentioning the tag "zh-Hans" alone. The generative syntax will greatly assist the use and acceptance of script subtags for languages that need them.

The IANA Language Subtag Registry

The new IANA Language Subtag Registry contains the information about each subtag which is valid for use in a language tag. The registry is a text file in a special, machine-readable, format called "record-jar". Each subtag has its own record, consisting of several lines of text, which identifies the subtag, describes its use, and provides some information useful in selecting which subtags are right for specific circumstances.

Here are some examples of some "language" subtag records:

   %%
   Type: language
   Subtag: cs
   Description: Czech
   Added: 2005-10-16
   Suppress-Script: Latn
   %%
   Type: language
   Subtag: cu
   Description: Church Slavic
   Description: Old Slavonic
   Description: Church Slavonic
   Description: Old Bulgarian
   Description: Old Church Slavonic
   Added: 2005-10-16
   %%
   Type: language
   Subtag: cv
   Description: Chuvash
   Added: 2005-10-16
   %%

Each record contains the subtag itself, its type ("language", in this case), a description (or set of descriptions), and the date that the record was added to the registry. All of the initial records have the date "2005-10-16" as shown above.

Additional information is sometimes available. For example, in the record for the Czech language (cs) above, you'll notice a field called "Suppress-Script". This field indicates that most texts in Czech are written in the Latin script and that the "Latn" script code is inappropriate for most language tags identifying content in Czech. That is, a tag like "cs-CZ" is recommended, while a tag such as "cs-Latn-CZ" is strongly discouraged.
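Because the registry is plain text, records like the ones shown above are straightforward to read programmatically. The sketch below is illustrative only: `parse_records` is a name invented here, the sample text is an abridged copy of the records above, and the parser assumes one field per line, so it does not handle any line folding that the real registry file may use.

```python
def parse_records(text):
    """Parse record-jar text into a list of dicts. Each dict maps a field
    name to the list of values seen for it, since a field such as
    Description can repeat within one record."""
    records, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if line == "%%":
            # A separator line closes the record collected so far.
            if current:
                records.append(current)
            current = {}
        elif line:
            name, _, value = line.partition(":")
            current.setdefault(name.strip(), []).append(value.strip())
    if current:
        records.append(current)
    return records


SAMPLE = """\
%%
Type: language
Subtag: cs
Description: Czech
Added: 2005-10-16
Suppress-Script: Latn
%%
Type: language
Subtag: cu
Description: Church Slavic
Description: Old Slavonic
Added: 2005-10-16
%%
"""

records = parse_records(SAMPLE)

# Map each language subtag to the script that should normally be left out.
suppress = {
    r["Subtag"][0]: r["Suppress-Script"][0]
    for r in records
    if "Suppress-Script" in r
}
print(suppress)
```

A tool that builds or checks tags could consult a table like `suppress` to flag a tag such as "cs-Latn-CZ" as carrying a script subtag that adds no information.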

Other fields that can appear include a "Deprecated" field that shows a date on which a particular code was deprecated. This almost always appears with another field called "Preferred-Value", which indicates a more appropriate subtag to use for that value. For example, the code "TP" was deprecated by ISO 3166 when that country changed its administration and name in 2002:

   %%
   Type: region
   Subtag: TP
   Description: East Timor
   Added: 2005-10-16
   Preferred-Value: TL
   Deprecated: 2002-11-15
   %%

The registration process can still be used to add information to or update information about specific records, as well as adding entire new subtags. Records cannot be removed and there are rules to prevent the meaning of a subtag from being "mutated" to mean something completely different.

The file itself contains a "File-Date" record, showing the last time the registry was updated. Combined with the various date fields in the records themselves, it is possible to validate any particular tag or its subtags for any given date, past or present.
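As a rough illustration of how the date fields might be used, the sketch below picks the subtag to use on a given date from a single record, based on its "Deprecated" and "Preferred-Value" fields. The record is the "TP" example from above written as a Python dict, and `preferred_subtag` is a hypothetical helper, not a function defined by the registry or by RFC 3066bis. It looks only at the "Deprecated" date and ignores "Added", because all of the initial records share the same "Added" date.

```python
from datetime import date

# The "TP" record shown earlier, as a plain dict (one value per field).
TP = {
    "Type": "region",
    "Subtag": "TP",
    "Description": "East Timor",
    "Added": "2005-10-16",
    "Preferred-Value": "TL",
    "Deprecated": "2002-11-15",
}


def preferred_subtag(record, when):
    """Return the subtag to use on the given date: the Preferred-Value if
    the record was already deprecated by then, otherwise the subtag."""
    deprecated = record.get("Deprecated")
    if (
        deprecated
        and "Preferred-Value" in record
        and when >= date.fromisoformat(deprecated)
    ):
        return record["Preferred-Value"]
    return record["Subtag"]


print(preferred_subtag(TP, date(2001, 6, 1)))
print(preferred_subtag(TP, date(2006, 1, 1)))
```

Since records are never removed from the registry, a lookup like this keeps working for old content: the deprecated subtag stays in the file alongside the pointer to its replacement.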

Current status & remaining work

Current status

RFC 3066bis actually consists of three parts. First, there is the document that describes the syntax of language tags and the registry, as well as how language tags are maintained and so forth. This document is an Internet-Draft called "draft-ietf-ltru-registry-14" and is about 62 pages long. Then there are the initial contents of the IANA Language Subtag Registry, which, confusingly, are contained in an Internet-Draft called "draft-ietf-ltru-initial-05". This document was edited and maintained by Doug Ewell.

The IANA Language Subtag Registry is now up and running, and has even been receiving registrations.

The last piece of the puzzle is an Internet-Draft on matching of language tags. This document was being worked on at the time this was written and its current name is "draft-ietf-ltru-matching-12".

The IETF website hosts all of these documents, or you can find the latest versions of them all listed on my personal website and on the W3C site.

Matching

Matching, as noted earlier, is fairly well understood in its simplest, "prefix matching" form, which is described above in the section on scripts. However, there are some intriguing applications for RFC 3066bis style tags in matching, as well as some well-known matching schemes that were not well documented in RFC 3066. This work is, at the time of writing, awaiting completion of the Last Call process.

ISO 639-3 and Macro Languages

Despite the changes in how language tags are formed and maintained, a few cases remain which the new design does not fully address.

A notable problem is that of identifying variations of a language or within a family of languages. While variant or region subtags are often useful for this purpose, some languages exhibit long-lived, stable, well-described variations that are not particularly well-described by national boundaries. In addition, ISO 639 has occasionally assigned codes to "macro-languages", which are language families that contain a number of recognizably related (but not necessarily mutually intelligible) languages.

An excellent example is once again Chinese. The ISO 639-1 code 'zh' identifies "Chinese", but the concept of Chinese encloses a number of distinct languages or dialects that share certain traits. While these languages are written very similarly (making tags such as "zh-Hant" and "zh-Hans" useful), spoken content is very different indeed. And, again, the available regional options are poor proxies for the spoken dialects (many of which are confined to mainland China).

RFC 3066bis provides part of the solution to this conundrum by reserving space for yet another kind of specialized subtag, called an "extended language subtag". These are three-letter codes that follow the primary language subtag but occur before the script subtag. There are very clear rules for when one of these subtags can be used (they must be used only with the specified prefix), and it is anticipated that a very small revision to RFC 3066bis will take place in mid-2006 to make these available.

"If the requirements for these codes exist now and we know what they are, why weren't the codes just incorporated into RFC 3066bis?" one might ask. The reason for the delay is that the basis for defining the extended language subtags is expected to be ISO 639-3. Just as ISO 639-2 is a superset of ISO 639-1, ISO 639-3 defines an even larger set of language codes, based originally on the codes in the SIL Ethnologue (SIL, in fact, is the "Registration Authority" for ISO 639-3, that is, the folks who will maintain the code list in the future). ISO 639-3 also defines which languages are enclosed by which Macro Languages. Thus Mandarin Chinese (a spoken variation) will be identified by the ISO 639-3 code 'cmn' and rules will require that code, when used as a subtag, to always appear with its macrolanguage "zh" (Chinese). This will finally make it possible to tag Chinese content accurately in all dimensions:

Tag             Meaning
zh-cmn          Mandarin Chinese
zh-cmn-Hant     Mandarin Chinese as written in Traditional script
zh-cmn-Hans     Mandarin Chinese, Simplified script
zh-cmn-Hant-HK  Mandarin Chinese, Traditional script, as used in Hong Kong SAR
zh-cmn-Hans-CN  Mandarin Chinese, Simplified script, as used in China
zh-gan          Gan Chinese
zh-hak          Hakka Chinese
zh-yue          Yue Chinese (Cantonese)
zh-hsn          Xiang Chinese
zh-yue-Hant-HK  Cantonese, Traditional script, Hong Kong SAR

There are about forty different languages other than Chinese that are defined as Macro Languages in the prototype for ISO 639-3. Most of these are minority languages and it is possible that the ability to accurately identify these language variations in content may have an impact on their preservation amongst the living languages.

In any case, extended language subtags are already fully specified and are merely waiting for ISO 639-3 to finally be official and complete before being included in the list of language subtags. Note that implementers merely need to update their copy of the registry when ISO 639-3 is added, as long as they have followed the implementation requirements already in RFC 3066bis.

Conclusion

The new version of BCP 47 provides the ability to accurately tag or request content using stable, well-defined tags. These tags address a number of long standing problems with language identification, leading, hopefully, to richer language-aware features in our software and better support for language in our documents. Understanding these tags and their format will help users adopt them and use them wisely and consistently.