Tagging text with no language

This article gives advice on how to use language markup in HTML or XML when you don't know the language of the content, or when the content is non-linguistic.

In HTML you should always identify the human language of the text, when known, using the lang attribute, so that applications such as voice browsers, style sheets, and the like can process the text in an appropriate way. The same goes for XML-based formats, where you would use the xml:lang attribute.

Suppose, however, you have some text that is not in any language, such as type samples, part numbers, or illustrations of binary data. How would you indicate that it is in no particular language? And what about text that was extracted from a database and came with no language information?

For information about how to set language in HTML, see Declaring language in HTML.

When the text is non-linguistic

Use the subtag zxx when the text is known not to be in any language.

This applies to text such as type samples, part numbers, and illustrations of binary data. The definition of zxx in the IANA Language Subtag Registry is 'no linguistic content'.

For example:

<p>Here is a list of part numbers: <span lang="zxx">9RUI34 8XOS12 3TYY85</span>.</p>

When the language is undetermined

In HTML, use lang="". In XML, use xml:lang="" if the format you are using supports it; otherwise use xml:lang="und".

These values indicate that, for one reason or another, we cannot determine what the appropriate language information is, or whether the text is non-linguistic. For example, you might use an empty value for the language attribute if database text is included in a document but the database doesn't provide language information and you can't be reasonably sure what the language is. The effect is to prevent any language information declared higher up the hierarchy of elements in the document from applying to the included text.
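
As a rough illustration of the database case, the sketch below assumes an English page that embeds a product label whose language is unknown. The page content, the label text, and the class name are invented for this example.

```html
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Product details</title>
</head>
<body>
<!-- This paragraph inherits lang="en" from the html element. -->
<p>The supplier describes this item as:
  <!-- Hypothetical text pulled from a database with no language metadata.
       lang="" stops the inherited "en" from applying to it. -->
  <span class="db-text" lang="">Kolo 7 TR</span>.</p>
</body>
</html>
```

Without the empty lang value, an application such as a voice browser would be likely to treat the embedded text as English.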

However, you should only tag text as undetermined if you can't simply leave it untagged. In practice, this means you should only use this markup if the undetermined text is embedded in content that has already been labeled for language in some way, or if its use at the document level is required by the format you are using.

Advanced topics

Implications for XHTML 1.0

Legacy pages that use XHTML 1.0, and cannot be updated to HTML5 or XHTML5, should use xml:lang="und" if there is a need to mark some text embedded in a document as being in an undetermined language, because XHTML 1.0 does not allow xml:lang="". On the very rare occasion when the whole document is in an undetermined language, it is better simply not to declare the default language of the document.
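
A minimal sketch of the XHTML 1.0 case follows. It assumes a legacy English page served as text/html, where the XHTML 1.0 compatibility guidelines suggest giving lang and xml:lang the same value; repeating "und" in lang is that assumption, not something this article prescribes. The content is invented for illustration.

```html
<p xml:lang="en" lang="en">The label on the crate reads
  <!-- Language unknown: "und" is used because the XHTML 1.0 DTDs
       do not accept an empty value for these attributes. -->
  <span xml:lang="und" lang="und">Kolo 7 TR</span>.</p>
```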

XML schema considerations

xml:lang="" only works if the schema that describes the format of your document allows an empty string as a value of xml:lang. For example, because the XHTML 1.0 DTDs define xml:lang in such a way that an empty string value for the xml:lang attribute is disallowed, you can't use the empty string in XHTML 1.0.

For those who are aware of how DTDs and other schemas work: the XHTML 1.0 DTDs declare the xml:lang attribute as NMTOKEN, and an NMTOKEN value cannot be empty. In your own XML DTD, if possible, declare xml:lang as CDATA so that an empty value is allowed. If you use XML Schema, rely on the schema document for the XML namespace, which permits the empty string as a value of xml:lang.
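
The DTD advice could look something like the following sketch, which assumes a hypothetical catalog format with item elements; neither name comes from a real schema. Declaring xml:lang as CDATA in the internal subset lets a validating parser accept the empty value.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE catalog [
  <!-- Hypothetical format: a catalog containing zero or more items. -->
  <!ELEMENT catalog (item*)>
  <!ELEMENT item (#PCDATA)>
  <!-- CDATA, unlike NMTOKEN, permits an empty attribute value. -->
  <!ATTLIST catalog xml:lang CDATA #IMPLIED>
  <!ATTLIST item xml:lang CDATA #IMPLIED>
]>
<catalog xml:lang="en">
  <item>Left-handed widget</item>
  <!-- Language undetermined: the empty value overrides the inherited "en". -->
  <item xml:lang="">Kolo 7 TR</item>
</catalog>
```

If the declarations used NMTOKEN instead of CDATA, a validating parser would report the empty xml:lang value as an error, and xml:lang="und" would be the fallback.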

Martin Dürst points out that you can redefine the XHTML format within the document to create an XHTML page that validates while using lang="" or xml:lang="". This is not recommended for widespread use, however, because such a document is no longer strictly conforming in the sense of XHTML 1.0.

By the way

This is a summary of a discussion in a thread on www-international@w3.org, and a later reprise of those ideas to which several people contributed.
