Tagging text with no language

Intended audience: XML and XHTML/HTML coders (using editors or scripting), script developers (PHP, JSP, etc.), schema developers (DTDs, XML Schema, RelaxNG, etc.), and anyone who needs to know how to mark up content with language information when no language applies.

Updated 2007-10-30 18:42

Question

How do I use language markup in HTML or XML content when I don't know the language, or the content is non-linguistic?

Background

You should always identify the human language of the text, when known, in HTML or a format based on XML, so that applications such as voice browsers, style sheets, and the like can process that text. In XML-based formats you would usually use the xml:lang attribute, and in XHTML/HTML the lang and/or xml:lang attributes. (See Declaring Language in XHTML and HTML for details about language tagging in HTML.)

You can override that initial language setting for any part of the document that is in a different language, e.g. a French quotation in an English document, by using the same attribute(s) on an element that surrounds the relevant text.
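
As a minimal sketch of such an override, the XHTML document below declares English as the default on the html element and French on the element that wraps the quotation. The sentence and the quoted phrase are invented for illustration.

```xml
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head><title>Language override example</title></head>
  <body>
    <!-- The q element overrides the inherited "en" for its own content only -->
    <p>The sign on the door read <q xml:lang="fr" lang="fr">Défense d'entrer</q>.</p>
  </body>
</html>
```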

Suppose you have some text that is not in any language, such as type samples, part numbers, illustrations of binary data, etc. How would you say that this was in no language in particular? Or how about a situation where you extracted the text from a database and it came with no linguistic information?

Answer

There are two parts to the above question. This article uses examples with xml:lang, which is the recommended way of declaring language in XML-based formats.

When the text is non-linguistic

Use the subtag zxx when the text is known not to be in any language.

This would apply for text such as type samples, part numbers, illustrations of binary data, etc. The definition of zxx in the Language Subtag Registry is 'no linguistic content'.

For example:

<p>Here is a list of part numbers: <span xml:lang="zxx" lang="zxx">9RUI34 8XOS12 3TYY85</span>.</p>

When the language is undetermined

If the XML format you are using supports it, use xml:lang=""; otherwise, use xml:lang="und".

These values indicate that we cannot determine, for one reason or another, what the appropriate language information is, or whether the text is non-linguistic. For example, you might use xml:lang="" if database text is included in a document but the database doesn't provide language information and you can't be reasonably sure what the language is. The effect is to prevent any language information declared higher up the hierarchy of elements in the document from applying to the included text.

However, you should only tag text as undetermined if you can't just leave it as is. In practice, this means you should only use this xml:lang markup if the undetermined text is embedded in some content that has already been labeled for language in some way, or if its use at the document level is required by the format you are using.
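
A minimal sketch of the embedded case, assuming a hypothetical XML format whose schema permits an empty xml:lang value. The element names catalog, item, title and note, and the text they contain, are invented for illustration.

```xml
<catalog xml:lang="en">
  <item>
    <title>Garden hose, 15 m</title>
    <!-- Database text with no language metadata: the empty value
         stops the "en" declared on catalog from being inherited -->
    <note xml:lang="">Aurora 12</note>
  </item>
</catalog>
```

Without the attribute on note, a processor would treat its content as English because of the declaration on catalog.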

xml:lang="" only works if the schema that describes the format of your document allows an empty string as a value of xml:lang. For example, the XHTML DTDs define xml:lang in a way that disallows an empty string value, so you can't use the empty string in XHTML.

For those who are aware of how DTDs and other schemas work: the xml:lang attribute takes NMTOKEN values in the HTML schema, so its value cannot be empty. In your own XML DTD, if possible, declare xml:lang as CDATA so that an empty value is allowed. For XML Schema users, rely on the XML schema document for the XML namespace.
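
The CDATA declaration can be sketched as follows, using an internal DTD subset so that the example is a single self-contained document. The catalog and item element names are hypothetical; only the xml:lang declarations matter here.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE catalog [
  <!ELEMENT catalog (item+)>
  <!ELEMENT item (#PCDATA)>
  <!-- CDATA, unlike NMTOKEN, accepts the empty string as a value -->
  <!ATTLIST catalog xml:lang CDATA #IMPLIED>
  <!ATTLIST item xml:lang CDATA #IMPLIED>
]>
<catalog xml:lang="en">
  <item>Garden hose</item>
  <item xml:lang="">Aurora 12</item>
</catalog>
```

If the attribute were declared as NMTOKEN instead, a validating parser would be expected to reject the second item, since an NMTOKEN value must contain at least one name character.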

Implications for XHTML/HTML

For XHTML and HTML you should use und if you need to express the undetermined nature of some text embedded in a document, because (as mentioned above) xml:lang="" is not allowed. On the very rare occasion when the whole document is in an undetermined language, it is better simply not to declare the default language of the document.

The same applies for the lang attribute in HTML (for the same reason).
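
As a sketch of the XHTML case, the fragment below tags an embedded string of undetermined language with und on both attributes. The surrounding sentence and the string itself are invented for illustration.

```xml
<p xml:lang="en" lang="en">The description field of the record reads:
  <!-- Language unknown, so "und" overrides the inherited "en" -->
  <span xml:lang="und" lang="und">Aurora 12</span>
</p>
```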

By the way

This is a summary of a discussion in a thread on www-international@w3.org, and a later reprise of those ideas to which several people contributed.

Martin Dürst points out that you can redefine the XHTML/HTML format within the document to create an HTML/XHTML page that validates while using lang="" or xml:lang="". This is not recommended for widespread use, however, because such a document is no longer strictly conforming in the sense of XHTML 1.0.

By: Richard Ishida, W3C.

Content first published 2007-10-30. Last substantive update 2007-10-30 18:42 GMT. This version 2011-05-03 20:44 GMT

For the history of document changes, search for qa-no-language in the i18n blog.