Language information and text direction 

Contents

  1. Specifying the language of content: the lang attribute
    1. Inheritance of language codes
    2. Interpretation of language codes
  2. Specifying the direction of text: the dir attribute
    1. Introduction to the bidirectional algorithm
    2. Inheritance of text direction information
    3. Setting the direction of embedded text
    4. Overriding the bidirectional algorithm: the BDO element
    5. Support for character directionality and joining
    6. The effect of style sheets on bidirectionality
    7. Undisplayable characters

This section of the document discusses two important issues that affect the internationalization of HTML: specifying the language (the lang attribute) and direction (the dir attribute) of text in a document.

Specifying the language of content: the lang attribute 

Attribute definitions
lang = language-code
Specifies the primary language of an element's text content. The value of this attribute is a language code as specified by [RFC1766]. Please consult this document for authoritative information on language codes. Whitespace is not allowed within the language-code. All language-codes are case-insensitive. The default language is "unknown".

Language information can be used to control rendering of a marked up document in a variety of ways. Some situations where this information helps include:

The lang attribute's value is a language code that identifies a natural language spoken, written, or otherwise conveyed by human beings for communication of information to other human beings. Computer languages are explicitly excluded from language codes.

[RFC1766] defines and explains the language codes that must be used in HTML documents.

Briefly, language codes consist of a primary code and a possibly empty series of subcodes:

        language-code  = primary-code *( "-" subcode )

Here are some sample language codes:

Two-letter primary codes are reserved for [ISO639] language abbreviations. Two-letter codes include FR (French), DE (German), IT (Italian), NL (Dutch), EL (Greek), ES (Spanish), PT (Portuguese), AR (Arabic), HE (Hebrew), RU (Russian), ZH (Chinese), JA (Japanese), HI (Hindi), UR (Urdu), and SA (Sanskrit).

Any two-letter subcode is understood to be a [ISO3166] country code.
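
The grammar above can be sketched in Python as follows. This is a minimal illustration, not a normative parser: the function names are hypothetical, and the limit of one to eight ASCII letters per code is an assumption taken from [RFC1766] rather than from this section.

```python
import re

# Assumption: per [RFC1766], the primary code and each subcode
# consist of 1 to 8 ASCII letters.
_LANGUAGE_CODE = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z]{1,8})*")


def parse_language_code(code):
    """Split a language code into (primary_code, [subcodes]).

    Codes are case-insensitive, so the result is lowercased.
    Whitespace is rejected because it is not allowed in a code.
    """
    if not _LANGUAGE_CODE.fullmatch(code):
        raise ValueError(f"not a valid language code: {code!r}")
    primary, *subcodes = code.lower().split("-")
    return primary, subcodes


def country_codes(code):
    """Return the two-letter subcodes, which are read as country codes."""
    _, subcodes = parse_language_code(code)
    return [subcode for subcode in subcodes if len(subcode) == 2]
```

For instance, `parse_language_code("en-US")` separates the primary code "en" from the subcode "us", and `country_codes("en-US")` reports "us" as a country code.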

Inheritance of language codes 

An element inherits language code information according to the following order of precedence (highest to lowest):

In this example, the primary language of the document is French ("fr"). One paragraph is declared to be in Spanish ("es"), after which the primary language returns to French. The following paragraph includes an embedded Japanese ("ja") phrase, after which the primary language returns to French.

<HTML lang="fr">
<BODY>
...Interpreted as French...
<P lang="es">...Interpreted as Spanish...
<P>...Interpreted as French again...
<P>...French text interrupted by<EM lang="ja">some
         Japanese</EM>French begins here again...
</BODY>
</HTML>
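
The element-to-element part of this inheritance can be sketched in Python. The sketch assumes a simplified model in which an element is described only by the lang values of its ancestors; the function name and the list representation are hypothetical, and sources of language information outside the element tree are not modeled.

```python
def effective_lang(ancestor_langs, default="unknown"):
    """Return the language code in effect for an element.

    ancestor_langs holds the lang attribute values from the outermost
    element down to the element itself; None means the element has no
    lang attribute. The nearest value wins.
    """
    for lang in reversed(ancestor_langs):
        if lang is not None:
            return lang.lower()
    # No element in the chain sets lang: fall back to the default.
    return default
```

Applied to the example above, the chain HTML, BODY, P, EM is written `["fr", None, None, "ja"]` and yields "ja", while the plain paragraph `["fr", None, None]` yields "fr".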

Interpretation of language codes 

In the context of HTML, a language code should be interpreted by user agents as a hierarchy of tokens rather than a single token. When a user agent adjusts rendering according to language information (say, by comparing style sheet language codes and lang values), it should always favor an exact match, but should also consider matching primary codes to be sufficient. Thus, if the lang attribute value of "en-US" is set for the HTML element, a user agent should prefer style information that matches "en-US" first, then the more general value "en".
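
One way a user agent might apply this rule is sketched below in Python. The function name and the idea of a list of available codes (for example, the language codes named in a style sheet) are assumptions made for illustration; the sketch simply tries the full code first and then drops trailing subcodes.

```python
def best_match(lang, available):
    """Pick the entry of `available` that best matches `lang`, or None.

    An exact match is preferred; otherwise trailing subcodes are dropped
    one at a time until only the primary code is left.
    """
    # Language codes are case-insensitive, so compare in lowercase
    # but return the entry as it was written.
    by_code = {code.lower(): code for code in available}
    parts = lang.lower().split("-")
    while parts:
        candidate = "-".join(parts)
        if candidate in by_code:
            return by_code[candidate]
        parts.pop()
    return None
```

With `available = ["en", "fr"]`, a lang value of "en-US" falls back to "en"; if "en-US" is also available, it is chosen instead.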

Note: Language code hierarchies do not guarantee that all languages with a common prefix will be understood by those fluent in one or more of those languages. They do allow a user to request this commonality when it is true for that user.

For artificial languages such as Elfish or Klingon, it would make sense to use the lang attribute to indicate the change from the language of the enclosing context. Until the successor to [RFC1766] defines a standard way to do this, one possibility is to use the x- prefix convention, e.g. x-elfish.

Specifying the direction of text: the dir attribute 

Attribute definitions
dir = LTR | RTL
Specifies the default direction for directionally weak or neutral text in the element's content (left-to-right or right-to-left) in this document. Possible values:
  • LTR: Left-to-right text.
  • RTL: Right-to-left text.

In addition to specifying the primary language of a document, authors may need to specify the default direction of pieces of text or the text in the entire document.

The [UNICODE] specification assigns directionality to Unicode characters and defines a (complex) algorithm for determining the proper directionality of text. If a document does not contain a displayable right-to-left character, a conforming user agent is not required to apply the [UNICODE] bidirectional algorithm. If a document contains a right-to-left character, and if the user agent chooses to display that character, the user agent must use the bidirectional algorithm.
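
A user agent could test for right-to-left characters along the following lines. This Python sketch relies on the standard unicodedata module; treating the bidirectional classes "R" and "AL" as "right-to-left characters" is an assumption of the sketch, and the function name is hypothetical.

```python
import unicodedata

# Assumption: characters of class "R" (e.g. Hebrew letters) and
# "AL" (Arabic letters) count as right-to-left characters.
RTL_CLASSES = {"R", "AL"}


def contains_rtl(text):
    """Return True if any character in text is strongly right-to-left."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in text)
```

A document for which `contains_rtl` is false contains only left-to-right, weak, or neutral characters under this assumption.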

Although Unicode specifies special characters that deal with text direction, HTML offers higher-level markup constructs that do the same thing: the dir attribute (do not confuse with the DIR element) and the BDO element. Thus, to express a Hebrew quotation, it is more intuitive to write

<Q lang="he" dir="rtl">...a Hebrew quotation...</Q>

than the equivalent with Unicode references:

&#x202B;&#x05F4;...a Hebrew quotation...&#x05F4;&#x202C;

User agents must not use the lang attribute to determine text directionality.

In the absence of local overrides, the default direction is inherited from enclosing elements.

Introduction to the bidirectional algorithm 

The following example illustrates the expected behavior of the bidirectional algorithm.

Consider the following example text:

  english1 HEBREW2 english3 HEBREW4 english5 HEBREW6

The characters in this example (and in all related examples) are stored in the computer the way they are displayed here: the first character in the file is "e", the second is "n", and the last is "6".

Suppose the predominant language of the document containing this paragraph is English (left-to-right text). The correct presentation of this line would be:

english1 2WERBEH english3 4WERBEH english5 6WERBEH
         -------          -------          -------
            H                H                H
--------------------------------------------------
                       E

The dashed lines indicate the structure of the sentence: English predominates and some Hebrew text is embedded. Achieving the correct presentation requires no additional markup since the Hebrew fragments are reversed correctly by user agents applying the bidirectional algorithm.

If, on the other hand, the predominant language of the document is Hebrew (right-to-left direction), the correct presentation is:

6WERBEH english5 4WERBEH english3 2WERBEH english1
        --------         --------         --------
            E                E                E
--------------------------------------------------
                       H

In this case, the whole sentence has been presented as right-to-left and the embedded English sequences have been properly reversed by the bidirectional algorithm.
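
The two presentations above can be reproduced with a toy model in Python. This is emphatically not the [UNICODE] bidirectional algorithm: it assumes the convention of these examples (upper-case words stand for Hebrew, lower-case words for English), works on whole words, and handles a single level of embedding. All names are hypothetical.

```python
from itertools import groupby


def is_rtl_word(word):
    # Convention of the examples: upper-case words stand for Hebrew.
    return word[0].isupper()


def display_order(text, base="ltr"):
    """Return the left-to-right screen order of `text` in the toy model."""
    runs = []
    # Group consecutive words that share the same direction into runs.
    for rtl, group in groupby(text.split(), key=is_rtl_word):
        words = list(group)
        if rtl:
            # A right-to-left run is read from its right edge, so both
            # the words and their letters appear reversed on screen.
            words = [word[::-1] for word in reversed(words)]
        runs.append(words)
    if base == "rtl":
        # With a right-to-left base direction, the runs themselves
        # are laid out from right to left.
        runs.reverse()
    return " ".join(word for run in runs for word in run)
```

Calling `display_order` on the example text with `base="ltr"` and then with `base="rtl"` gives the two presentations shown above.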

Inheritance of text direction information 

The Unicode bidirectional algorithm requires an initial text direction. To specify the base direction of a block-level element, set the element's dir attribute. The default value of the dir attribute is "ltr" (left-to-right text).

When the dir attribute is set for a block-level element, it remains in effect for the duration of the element and any nested block-level elements. Setting the dir attribute on a nested element overrides the inherited value.

To set the primary text direction for an entire document, set the dir attribute on the HTML element.

For example:

<HTML dir="RTL">
...right-to-left text...
<P dir="ltr">...left-to-right text...</P>
<P>...right-to-left text again...</P>
</HTML>
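
The inheritance shown in this example can be traced with a short Python sketch built on the standard html.parser module. It is a simplification under stated assumptions: every element is closed explicitly, all elements are treated as block-level, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser

EXAMPLE = """<HTML dir="RTL">
...right-to-left text...
<P dir="ltr">...left-to-right text...</P>
<P>...right-to-left text again...</P>
</HTML>"""


class DirectionTracker(HTMLParser):
    """Record the base direction in effect for each piece of text."""

    def __init__(self):
        super().__init__()
        self.stack = ["ltr"]  # the default value of dir
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        own = dict(attrs).get("dir")
        # An element without dir inherits the direction of its parent.
        self.stack.append(own.lower() if own else self.stack[-1])

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append((self.stack[-1], text))


def directions(markup):
    tracker = DirectionTracker()
    tracker.feed(markup)
    tracker.close()
    return tracker.chunks
```

`directions(EXAMPLE)` pairs each text fragment of the example with "rtl" or "ltr", showing that the second paragraph returns to the direction set on the HTML element.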

Inline elements, on the other hand, do not inherit the dir attribute. This means that an inline element without a dir attribute does not open an additional level of embedding with respect to the bidirectional algorithm.

Setting the direction of embedded text 

The [UNICODE] bidirectional algorithm automatically reverses embedded character sequences according to their inherent directionality (as illustrated by the previous examples). However, only one level of embedding can be accounted for. To achieve additional levels of embedded direction changes, you must make use of the dir attribute on an inline element.

Consider the same example text as before:

english1 HEBREW2 english3 HEBREW4 english5 HEBREW6

Suppose the predominant language of the document containing this paragraph is English. The above English sentence contains a Hebrew section extending from HEBREW2 through HEBREW4. The Hebrew section contains an English quotation (english3). The desired presentation of the text is thus:

english1 4WERBEH english3 2WERBEH english5 6WERBEH
                 -------
                    E
         ------------------------
                    H
--------------------------------------------------
                    E

To achieve two embedded direction changes, we must supply additional information, which we do by delimiting the second embedding explicitly. In this example, we use the SPAN element and the dir attribute to mark up the text:

english1 <SPAN dir="RTL">HEBREW2 english3 HEBREW4</SPAN> english5 HEBREW6
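
The effect of the explicit embedding can be illustrated with a toy model in Python. It is not the [UNICODE] algorithm: it assumes the convention of these examples (upper-case words stand for Hebrew), works on whole words, and represents an element such as SPAN with a dir attribute as a hypothetical `(direction, items)` tuple.

```python
def render(items, base="ltr"):
    """Return the left-to-right screen order of `items` in the toy model.

    Each entry of `items` is either a word or a (direction, items)
    tuple standing for an element with an explicit dir attribute.
    """
    runs = []        # each run is a list of strings in screen order
    run_kind = None  # direction of the run being built, if any
    for item in items:
        if isinstance(item, tuple):
            direction, inner = item
            # An explicit embedding is rendered on its own, using its
            # own base direction, and then placed as a single unit.
            runs.append([render(inner, direction)])
            run_kind = None
            continue
        kind = "rtl" if item[0].isupper() else "ltr"
        if kind != run_kind:
            runs.append([])
            run_kind = kind
        if kind == "rtl":
            # Right-to-left words grow the run toward the left.
            runs[-1].insert(0, item[::-1])
        else:
            runs[-1].append(item)
    if base == "rtl":
        runs.reverse()
    return " ".join(word for run in runs for word in run)
```

Rendering `["english1", ("rtl", ["HEBREW2", "english3", "HEBREW4"]), "english5", "HEBREW6"]` gives the desired presentation, whereas the same words without the tuple give the one-level result of the earlier example.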

Authors may also use special Unicode characters to achieve multiply embedded direction changes. To achieve left-to-right embedding, surround embedded text with the characters LEFT-TO-RIGHT EMBEDDING ("LRE", hexadecimal 202A) and POP DIRECTIONAL FORMATTING ("PDF", hexadecimal 202C). To achieve right-to-left embedding, surround embedded text with the characters RIGHT-TO-LEFT EMBEDDING ("RLE", hexadecimal 202B) and PDF.
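
As a sketch, the embedding characters can be looked up by name with Python's standard unicodedata module and wrapped around a piece of text. The function name is hypothetical, and the sketch does nothing to check that embeddings nest properly.

```python
import unicodedata

LEFT_TO_RIGHT_EMBEDDING = unicodedata.lookup("LEFT-TO-RIGHT EMBEDDING")        # U+202A
RIGHT_TO_LEFT_EMBEDDING = unicodedata.lookup("RIGHT-TO-LEFT EMBEDDING")        # U+202B
POP_DIRECTIONAL_FORMATTING = unicodedata.lookup("POP DIRECTIONAL FORMATTING")  # U+202C


def embed(text, direction):
    """Surround text with the embedding characters for `direction`."""
    openers = {"ltr": LEFT_TO_RIGHT_EMBEDDING, "rtl": RIGHT_TO_LEFT_EMBEDDING}
    return openers[direction.lower()] + text + POP_DIRECTIONAL_FORMATTING
```

`embed("HEBREW2 english3 HEBREW4", "rtl")` produces the character-level counterpart of the SPAN markup with dir set to "RTL".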

Using HTML directionality markup with Unicode characters: Authors and designers of authoring software should be aware that conflicts can arise if the dir attribute is used on inline elements (including BDO) concurrently with the corresponding [ISO10646] formatting characters. Preferably one or the other should be used exclusively. The markup method offers a better guarantee of document structural integrity and alleviates some problems when editing bidirectional HTML text with a simple text editor, but some software may be better suited to using the [ISO10646] characters. If both methods are used, great care should be exercised to ensure proper nesting of markup and directional embedding or override; otherwise, rendering results are undefined.

Overriding the bidirectional algorithm: the BDO element 

<!ELEMENT BDO - - (%inline)*      -- I18N BiDi over-ride -->
<!ATTLIST BDO
  lang        NAME       #IMPLIED  -- [RFC1766] language value --
  dir         (ltr|rtl)  #REQUIRED -- directionality --
  >

Start tag: required, End tag: required

Attributes defined elsewhere

The bidirectional algorithm and the dir attribute generally suffice to manage embedded direction changes. However, some situations may arise when the bidirectional algorithm results in incorrect presentation. The BDO element allows authors to turn off the bidirectional algorithm for selected fragments of text.

Consider an English document containing the same text as before:

english1 HEBREW2 english3 HEBREW4 english5 HEBREW6

Suppose this sequence of characters is being read by a user agent from left-to-right (the byte stream begins with "e" and ends with "6"). The "e" in "english1" is to the left of "n", which is how authors tend to input English characters. However, the "H" in "HEBREW2" is to the left of "E", which may not be how authors of Hebrew create their documents. For example, the MIME standard ([RFC2045]) requires right-to-left character sequences in email to be ordered right-to-left in the byte stream. This conflicts with the [UNICODE] bidirectional algorithm, which expects Hebrew characters to be ordered left-to-right.

Thus, if "HEBREW4" in the above example were an excerpt from a Hebrew email message, its structure would actually be "4WERBEH". A user agent applying the bidirectional algorithm would thus display the characters in the wrong order.

The easiest solution in this case is to override the bidirectional algorithm by putting the Hebrew email excerpt in a BDO element, whose dir attribute is set to "LTR":

english1 HEBREW2 english3 <BDO dir="LTR">4WERBEH</BDO> english5 HEBREW6

This tells the bidirectional algorithm "Leave me left-to-right!" and would produce the desired presentation:

english1 2WERBEH english3 4WERBEH english5 6WERBEH

The BDO should be used in scenarios where absolute control over sequence order is required (e.g., multi-language part numbers). The dir attribute is mandatory for this element.

Authors may also use special Unicode characters to override the bidirectional algorithm: LEFT-TO-RIGHT OVERRIDE (hexadecimal 202D) or RIGHT-TO-LEFT OVERRIDE (hexadecimal 202E). The POP DIRECTIONAL FORMATTING (hexadecimal 202C) character ends either bidirectional override.
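
The override characters can also be written into a document as hexadecimal character references. The following Python sketch shows one way to generate them; the function name is hypothetical, and the text passed in is assumed to need no further escaping.

```python
import unicodedata


def override_as_references(text, direction):
    """Wrap text in a directional override, written as character references."""
    names = {
        "ltr": "LEFT-TO-RIGHT OVERRIDE",
        "rtl": "RIGHT-TO-LEFT OVERRIDE",
    }
    opener = unicodedata.lookup(names[direction.lower()])
    closer = unicodedata.lookup("POP DIRECTIONAL FORMATTING")
    # Emit the code points in hexadecimal, e.g. &#x202D; for the opener.
    return f"&#x{ord(opener):X};{text}&#x{ord(closer):X};"
```

`override_as_references("4WERBEH", "ltr")` yields the character-level counterpart of the BDO markup used in the email excerpt example.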

Note: Recall that conflicts can arise if the dir attribute is used on inline elements (including BDO) concurrently with the corresponding [ISO10646] formatting characters.

Bidirectionality and character encoding: According to [RFC1555] and [RFC1556], there are special conventions for the use of "charset" parameter values to indicate bidirectional treatment in MIME mail, in particular to distinguish between visual, implicit, and explicit directionality. The parameter value "iso-8859-8" (for Hebrew) denotes visual encoding, "iso-8859-8-i" denotes implicit bidirectionality, and "iso-8859-8-e" denotes explicit directionality.

Because HTML uses the full Unicode bidirectionality algorithm, conforming documents must be labeled as "iso-8859-8-e". Implicit bidirectionality is part of the full Unicode algorithm, so the value "iso-8859-8-i" may also be accepted, but should not be used.

The value "iso-8859-8" indicates that the document is formatted visually, misusing some markup (such as TABLE with right alignment and no line wrapping) to ensure reasonable display on older user agents that do not handle bidirectionality. Such documents do not conform to the present specification. If necessary, they can be made to conform to the current specification (and at the same time will be displayed correctly on older user agents) by adding BDO markup where necessary. Contrary to what is said in [RFC1555] and [RFC1556], iso-8859-6 (Arabic) is not visual ordering.

Support for character directionality and joining 

Since ambiguities sometimes arise as to the directionality of certain characters (e.g., some situations in Arabic), the [UNICODE] specification includes characters to enable proper resolution. HTML 4.0 includes a set of named character entities that allows partial support of the Unicode bidirectional algorithm, plus some help with languages requiring contextual analysis for rendering.

The following DTD excerpt presents some of the directional entities:

   <!ENTITY zwnj CDATA "&#8204;"--=zero width non-joiner-->
   <!ENTITY zwj  CDATA "&#8205;"--=zero width joiner-->
   <!ENTITY lrm  CDATA "&#8206;"--=left-to-right mark-->
   <!ENTITY rlm  CDATA "&#8207;"--=right-to-left mark-->

The zwnj entity is used to block joining behavior in contexts where joining will occur but shouldn't. The zwj entity does the opposite; it forces joining when it wouldn't occur but should. For example, the Arabic letter "HEH" is used to abbreviate "Hijri", the name of the Islamic calendar system. Since the isolated form of "HEH" looks like the digit five as employed in Arabic script (based on Indic digits), in order to prevent confusing "HEH" with a final digit five in a year, the initial form of "HEH" is used. However, there is no following context (i.e., a joining letter) to which the "HEH" can join. The zwj character provides that context.

Similarly, in Persian texts, there are cases where a letter that normally would join a subsequent letter in a cursive connection should not. The character zwnj is used to block joining in such cases.

The other characters, lrm and rlm, are used to disambiguate directionality of directionally neutral characters. For example, if a double quotation mark comes between an Arabic and a Latin letter, the direction of the quotation mark is not clear (is it quoting the Arabic text or the Latin text?). The lrm and rlm characters have a directional property but no width and no word/line break property. Please consult [UNICODE] for more details.
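
The four entities in the DTD excerpt can be inspected with Python's standard html.entities and unicodedata modules. This sketch only reports what those modules record about each character; the function name is hypothetical.

```python
import unicodedata
from html.entities import name2codepoint


def describe(entity_name):
    """Return (code point, Unicode name, bidirectional class) for an entity."""
    code_point = name2codepoint[entity_name]
    char = chr(code_point)
    return code_point, unicodedata.name(char), unicodedata.bidirectional(char)
```

For example, `describe("lrm")` reports code point 8206 with the strong left-to-right class "L", and `describe("rlm")` reports 8207 with the strong right-to-left class "R", which is what lets these invisible characters settle the direction of a neighboring neutral character.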

Reversed character glyphs: The bidirectional algorithm reverses the presentation of a well-defined set of characters such as parentheses (see [UNICODE], table 4-7). Except for these characters, bidirectionality processing leaves the shape of each glyph unaffected. Thus, if you wanted to display the word "MURDER" as it would be seen in a mirror (right-to-left character order and reversed glyphs), you could use a BDO element with the dir attribute to set the text direction to right-to-left order, e.g.,

<BDO class="mirror" dir="rtl">MURDER</BDO>

and the class value "mirror" with a matching rule in the style sheet to select a special font that displays characters with the reversed glyphs.
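
Which characters belong to the mirrored set can be checked with the `mirrored` function of Python's standard unicodedata module, as in this short sketch (the helper name is hypothetical):

```python
import unicodedata


def mirrored_characters(text):
    """Return the characters of text that Unicode marks as mirrored."""
    return [ch for ch in text if unicodedata.mirrored(ch)]
```

For the string "(MURDER)", only the two parentheses are reported; the letters are left unchanged by bidirectional processing, which is why a special font is needed for the mirror effect.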

The effect of style sheets on bidirectionality 

In general, changing an element from being displayed in block form to inline or vice-versa due to a style sheet is straightforward. However, because the difference between block elements and inline elements is crucial for the bidirectional algorithm, special care must be taken.

When an inline element that does not have a dir attribute is transformed to a block element by a style sheet, it inherits the dir attribute from the enclosing element to define the base direction of the block.

When a block element that does not have a dir attribute is transformed to an inline element by a style sheet, the resulting presentation should be equivalent, in terms of bidirectional formatting, to the formatting obtained by explicitly adding a dir attribute (assigned the inherited value) to the transformed element.

Undisplayable characters 

User agents may not be able to render meaningfully all character values, for instance, because of the lack of an appropriate font, or because a character has a value which is inexpressible with the internal character encoding.

Because there are many different things that can be done in such a case, this document does not prescribe any specific behavior. Depending on the implementation, this may also be handled by the underlying display system and not the application itself. This specification recommends the following behavior for user agents:

  1. Adopt a clearly visible, but unobtrusive mechanism to alert the user of missing resources.
  2. If the user agent provides a numeric representation of missing characters, the hexadecimal (not decimal) form is preferable as this is the form used in character set standards (see [ERCS]).
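
The second recommendation can be sketched in Python as follows. The character reference notation and the `can_display` callback are assumptions made for illustration; a real user agent would consult its fonts and internal encoding instead.

```python
def hex_placeholder(char):
    """Return a hexadecimal numeric representation of one character."""
    return f"&#x{ord(char):X};"


def show_missing(text, can_display):
    """Replace every character that cannot be displayed by its hex form.

    can_display is a callable taking one character and returning a bool.
    """
    return "".join(
        ch if can_display(ch) else hex_placeholder(ch) for ch in text
    )
```

With a display that handles only ASCII, `show_missing("a\u05d0", lambda ch: ord(ch) < 128)` keeps "a" and shows the Hebrew letter as `&#x5D0;`.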