From W3C Wiki

GEO WG Collaborative editing page

Status: Initial Draft, i.e., please focus on technical content rather than wordsmithing at this stage.

Author: Addison Phillips


xml:lang in XML document schemas

When should I use xml:lang and when should I define my own element or attribute for passing language values in an XML document schema (DTD)?


Sometimes documents contain or reference different types of natural language content. At other times they need to store a natural language value as data or meta-data about something external to the document. Because these different applications use similar formats, schema designers are sometimes unsure when to use xml:lang and when to define their own language-related element or attribute.

For example, in XHTML 1.0, there is an hreflang attribute in the <a> element and also an xml:lang (or lang attribute, in the case of HTML 4.0) for the content of the <a> element:

<a xml:lang="en" href="xyz" hreflang="de">
   Click for German
</a>

The xml:lang attribute describes the language of the content of the <a> element ("Click for German"), while the hreflang attribute is meta-data: in this case it describes the language of content external to this Web page, namely the resource the link points to.


When to use xml:lang

Content directly associated with the XML document (either contained within the document directly or considered part of the document when it is processed or rendered) should use the xml:lang attribute to indicate the language. xml:lang should be reserved for content authors to directly label any natural language content they may have.

xml:lang is defined by XML 1.0 as a common attribute that can be used to indicate the language of any element's contents. This includes any human readable text, as well as other content (such as embedded objects like images or sound files) contained by the element in which it appears. The xml:lang value applies to any sub-elements contained by the element. It also applies to attribute values associated with the element and sub-elements (though using natural language in attributes is not best practice). The value of the xml:lang attribute is a language tag defined by RFC 3066 or its successor.

For example, here is xml:lang on an element <t>:

<t xml:lang="en">
   This is some text contained by the &lt;t&gt; element. The use
   of the xml:lang attribute indicates the language so that, for
   example, the correct font could be applied when rendered or
   the correct spell-checker could be used when proofing the
   document. If we didn't have xml:lang, we might have problems
   with embedded content, such as the phrase <span xml:lang="fr">
   C'est la vie</span>, which is in another language.
</t>
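
Since this page is about document schemas, it may help to see how a DTD could permit xml:lang on these elements. The following is only a minimal sketch: the content models for <t> and <span> are assumptions made for illustration, not part of any published schema. XML 1.0 does require that xml:lang be declared, like any other attribute, if it is used in a valid document.

```xml
<?xml version="1.0"?>
<!DOCTYPE t [
  <!-- Assumed content model: text mixed with span elements -->
  <!ELEMENT t (#PCDATA | span)*>
  <!ELEMENT span (#PCDATA)>
  <!-- xml:lang declared as an optional attribute on both elements -->
  <!ATTLIST t    xml:lang CDATA #IMPLIED>
  <!ATTLIST span xml:lang CDATA #IMPLIED>
]>
<t xml:lang="en">The phrase <span xml:lang="fr">C'est la vie</span> is French.</t>
```

Declaring the attribute as #IMPLIED leaves it optional, so authors can rely on a value inherited from an enclosing element instead of repeating it.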

This example from XHTML 1.0 shows how xml:lang applies to an attribute:

<abbr title="simple object access protocol" xml:lang="en">SOAP</abbr>

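Having xml:lang apply to attribute values has drawbacks: there is no way to supply the title attribute in more than one language, or to label the language of the attribute value separately from the language of the element's content. Consider:

<p xml:lang="fr"><span title="anglais"><a href="qa-css-charset.en.html" lang="en" xml:lang="en">English</a></span></p>

Here the title text "anglais" is French while the link text "English" is English, so the title has to sit on an extra <span> element that inherits French from the <p> element rather than on the <a> element itself.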
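
When designing a new schema, the limitation described above can be avoided by putting natural language text in element content rather than in attribute values. A minimal sketch, assuming hypothetical element names (link and label) that are not part of XHTML or any other published schema:

```xml
<link href="qa-css-charset.en.html">
  <!-- Each label is element content, so it can carry its own xml:lang -->
  <label xml:lang="en">English</label>
  <label xml:lang="fr">anglais</label>
</link>
```

Because each piece of text is its own element, it can be labeled independently and supplied in as many languages as needed.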

When to use your own element or attribute

When the language value is really an attribute of, or metadata about, some external content, then xml:lang is not an appropriate choice. In these cases you want to store language information, but the language does not describe the content of the XML document itself (nor external content, such as images, that is processed as part of the document). In this case you should define an element or attribute with a different name and not use the xml:lang attribute. The value of the element or attribute should use RFC 3066 (or its successor), just like xml:lang.

Some examples of this might include:

  • an element in an XML document describing your DVD collection to indicate which languages are available on the soundtrack
  • an element in a customer database with a field for the customer's language preference
  • an attribute of a link element (such as <a> in XHTML) pointing to a translation of this document into another language

The reason you would choose to create your own element (or attribute) is to convey the language as a value--as part of a data structure or as meta-data about an external document--rather than to indicate the language of a specific piece of content. Keeping xml:lang out of this role leaves it free for content authors, who need it to label content for processing purposes.

For example, an XML document might look like this:

<item type="DVD">
  <title xml:lang="en">Casablanca</title>    <!-- indicates the language of the text 'Casablanca' -->
  <runningTime value="137" />                <!-- not language affected -->
  <dialogue>zh-HK</dialogue>                 <!-- indicates the language of the dialogue -->
  <subtitles track="1" language="zh-Hant" /> <!-- this track contains Traditional Chinese subtitles -->
  <subtitles track="2" language="zh-Hans" /> <!-- this track contains Simplified Chinese subtitles -->
</item>

In this example, the xml:lang attribute conveys information about the natural language of text appearing in this document. The dialogue element and the language attribute of the subtitles element are defined in the XML document schema and convey natural language values associated with the item being described. For example, the language attribute conveys that the subtitles on track 1 are written or displayed in Traditional Chinese ("zh-Hant").
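
One way such a schema could be declared in a DTD is sketched below. The element and attribute names are taken from the example; the content models, and the choice of which attributes are required, are assumptions made for illustration.

```xml
<?xml version="1.0"?>
<!DOCTYPE item [
  <!ELEMENT item (title, runningTime, dialogue, subtitles*)>
  <!ATTLIST item type CDATA #REQUIRED>
  <!-- xml:lang labels the language of the title text itself -->
  <!ELEMENT title (#PCDATA)>
  <!ATTLIST title xml:lang CDATA #IMPLIED>
  <!ELEMENT runningTime EMPTY>
  <!ATTLIST runningTime value CDATA #REQUIRED>
  <!-- dialogue holds a language tag as data, so it is plain text content -->
  <!ELEMENT dialogue (#PCDATA)>
  <!-- language is a schema-defined attribute, distinct from xml:lang -->
  <!ELEMENT subtitles EMPTY>
  <!ATTLIST subtitles
            track    CDATA #REQUIRED
            language CDATA #REQUIRED>
]>
<item type="DVD">
  <title xml:lang="en">Casablanca</title>
  <runningTime value="137" />
  <dialogue>zh-HK</dialogue>
  <subtitles track="1" language="zh-Hant" />
  <subtitles track="2" language="zh-Hans" />
</item>
```

A DTD cannot check that these values are well-formed language tags, so that validation would have to happen in the application or in a richer schema language.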

By the way

It's important to remember that xml:lang has scope: it applies to the element on which it appears and to everything that element contains, unless a nested element supplies its own xml:lang. This can be used to identify the language for a lot of content without having redundant language tags on every element. For example, it is good practice to put xml:lang on the <html> element at the start of an XHTML document. For more information, see [1].
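
As an illustration of scope, here is a small XHTML sketch (the page content is invented for the example). The language is declared once on the <html> element and overridden only where the content actually changes language:

```xml
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head><title>Language scope</title></head>
  <body>
    <!-- Inherits "en" from the html element; no attribute needed here -->
    <p>This paragraph is in English.</p>
    <!-- A nested xml:lang overrides the inherited value for this element only -->
    <p xml:lang="fr" lang="fr">Ce paragraphe est en français.</p>
  </body>
</html>
```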