Tagging text with no language information

Question

How do I mark up HTML or XML content for language when I don't know the language, or the content is non-linguistic?

Background

In HTML, or in any format based on XML, you should always identify the human language of the text with an attribute on the highest possible element of the document, so that applications such as voice browsers and style sheets can process that text appropriately. (See Declaring Language in XHTML and HTML for details about language tagging in HTML.) In XML-based formats you would usually use the xml:lang attribute; in XHTML and HTML you would use the lang attribute, the xml:lang attribute, or both.

You can override that initial language setting for a part of the document that is in a different language, e.g. a French quotation in an English document, by using the same attribute(s) on an element around the relevant bit of text.
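
As a rough sketch of such an override, the XHTML fragment below declares English on the html element and French on a single quotation. The sentence, the quotation, and the choice of the q element are assumptions made for illustration; any element that encloses the foreign-language text would do.

Example:

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <!-- head and other content omitted -->
 <p>The motto of the French Republic is 
  <q xml:lang="fr" lang="fr">Liberté, égalité, fraternité</q>.</p>
</html>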

Suppose you have some text that is not in any language, such as type samples, part numbers, or perhaps program code. How would you indicate that it is in no language in particular? And what about text that you extracted from a database and that came with no language information?

Answer

There are two parts to the above question.

When the text is non-linguistic

Use the subtag zxx when the text is not in any language.

This applies to text such as type samples, part numbers, and perhaps program code. The definition of zxx in the Language Subtag Registry is 'no linguistic content'.

Example:

<p>Here is a list of part numbers: 
 <span xml:lang="zxx" lang="zxx">9RUI34 8XOS12 8JOS09 3TYY85</span>.</p>

When the language is undetermined

You should only tag text as undetermined if you cannot simply leave it untagged. If the XML format you are using supports it, use xml:lang=""; otherwise use the subtag und.

These values indicate that the text is in a language of some sort, but we’re just not sure which. This is different from 'this is not a language'. xml:lang="" might be used, for example, if text is included in a document from a database that doesn't provide language information with the text, and you can't be reasonably sure what the language is.
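
The database case might look something like the sketch below. It assumes a hypothetical XML format whose schema permits an empty xml:lang value; the element names catalog and entry are invented for illustration and do not come from any real vocabulary. The document as a whole is declared to be in English, and the empty value on the second entry cancels that inherited declaration for the imported text.

Example:

<catalog xml:lang="en">
 <entry>Replacement parts for the 2005 model</entry>
 <entry xml:lang=""><!-- text imported from the database, language unknown --></entry>
</catalog>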

Note, however, that these constructs should only be used where the format you are using requires a language value, or where you have a particular need to indicate that the language is undetermined. Otherwise, simply leave out the markup. In RFC 4646, Section 4.1 (Choice of Language Tag), item 4 says:

The 'und' (Undetermined) primary language subtag SHOULD NOT be used to label content, even if the language is unknown. Omitting the language tag altogether is preferred to using a tag with a primary language subtag of 'und'. The 'und' subtag MAY be useful for protocols that require a language tag to be provided. The 'und' subtag MAY also be useful when matching language tags in certain situations.

Note that xml:lang="" only works if it is permitted by the schema or DTD that describes the format of your document. It is not appropriate for XHTML because, as defined in the DTDs, an empty string value for the xml:lang attribute makes your code invalid. (The DTDs declare the xml:lang attribute as taking an NMTOKEN value, and an NMTOKEN cannot be empty.)

You cannot leave the lang attribute empty in HTML, either.

For XHTML and HTML, then, you should use und if you need to express that the language of some text embedded in a document is undetermined. If the language of the whole document is undetermined, it is better simply not to declare a language at the top.
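
A fragment along the following lines shows how und could be applied to embedded text in XHTML. The surrounding sentence and the placeholder comment are assumptions made for illustration; in a real page the span would contain the imported text itself.

Example:

<p xml:lang="en" lang="en">The database returned this label: 
 <span xml:lang="und" lang="und"><!-- imported label, language unknown --></span>.</p>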

By the way

Martin Dürst points out that you can redefine the XHTML/HTML format within the document to create an HTML/XHTML page that validates while using lang="" or xml:lang="". This is not recommended for widespread use, however.

Asmus Freytag provides a list of scenarios which can apply when language tagging.

Acknowledgements

This is an attempt to summarise and move forward some ideas in a thread on www-international@w3.org by Christophe Strobbe, Martin Duerst, Bjoern Hoermann and Tex Texin.

It also draws on a thread at www-international to which the following people contributed: