Can we derive base direction from language?

Sometimes people wonder whether it's possible to obtain a definitive list of language tags which indicate a RTL base direction, so that there would be no need for separate direction metadata. This article looks into whether that is really feasible. (Spoiler: The W3C Internationalization Working Group believes it is not a feasible approach.)

For an introduction to language tags and the subtags that can be combined to form them, see Language tags in HTML and XML.

Problems in determining the direction of a string from its language tag

The bounding problem

In order to derive base direction from language metadata we would need to ascertain exhaustively and accurately which language tags represent text that is displayed with a right-to-left (RTL) base direction.

BCP47 currently defines 8,152 language subtags. (Some of those languages don’t have standardised written forms; however, any language can be written as a transcription, phonetic or otherwise.)

The specific number of language subtags associated with text that has a RTL base direction can be hard to determine because many languages have multiple orthographies, some of which are LTR and others RTL. For example, the North Azerbaijani language, azj, has normally been written in the Latin script since the country became independent in 1991 (it was Cyrillic before that), but prior to 1929 it was written in the Arabic script. Another language subtag for Azerbaijani, az, can represent either the South Azerbaijani [azb] or the North Azerbaijani [azj] languages, the former typically written with the Arabic (RTL) script, and the latter with the Latin (LTR) script. Similarly, the language subtag dv (Dhivehi) represents a single language which at the present day has two different orthographies, one Latin-script based (LTR) and the other Thaana-based (RTL). Many languages have, historically or concurrently, been written using either a RTL or a LTR orthography.

BCP47 makes it possible to address the above ambiguities by adding script subtags to a language subtag to indicate the script that the text is written in. (Note that this is not always a simple list: a script can have multiple script subtags, depending on the style of writing, so Syriac script, for example, can be represented using syrc (General Syriac), syre (Estrangela), syrj (Serṭā), syrn (Maḏnḥāyā).) But a basic recommendation when using BCP47 is to keep the language tag as short as possible, and this means omitting script subtags unless you see a clear need for them.

Thus a user may use language tags ambiguously, eg. they may just use uz to label Uzbek, rather than auz, or uzn, or uzs, uz-Arab, etc. Because subtags after the language subtag in BCP47 language tags are intended for contrastive use, they may feel, rightly, that this is adequate for their needs, even though uz could represent either a RTL or LTR orthography. It may not be adequate, however, if that text is scraped up and used in a context that wasn’t originally envisaged.
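The ambiguity described above can be illustrated with a minimal Python sketch. It is not a full BCP47 parser: the function names are hypothetical, the RTL_SCRIPTS set is a deliberately short and non-exhaustive assumption, and region and variant subtags are ignored. It shows only that a tag without a script subtag gives the consumer nothing definite to work with.

```python
from typing import Optional

# Assumption: an illustrative, NOT exhaustive, set of RTL script codes.
RTL_SCRIPTS = {"Arab", "Hebr", "Thaa", "Syrc", "Nkoo"}


def script_subtag(tag: str) -> Optional[str]:
    """Return the script subtag of a language tag, or None if absent."""
    subtags = tag.split("-")
    if len(subtags[0]) == 1:
        # Private-use (x-...) and grandfathered (i-...) tags: give up.
        return None
    for sub in subtags[1:]:
        if len(sub) == 1:
            # A singleton starts an extension or private-use sequence.
            break
        if len(sub) == 4 and sub.isalpha():
            # Four letters in this position can only be a script subtag.
            return sub.title()
    return None


def direction_from_tag(tag: str) -> Optional[str]:
    """Guess 'rtl' or 'ltr' from the script subtag; None when ambiguous."""
    script = script_subtag(tag)
    if script is None:
        return None
    return "rtl" if script in RTL_SCRIPTS else "ltr"


for tag in ("uz", "uz-Arab", "uz-Latn"):
    print(tag, direction_from_tag(tag))
```

With these assumptions, uz-Arab and uz-Latn resolve to a direction, but plain uz yields None, which is exactly the case where the consumer would have to fall back on some other source of information.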

An additional problem that arises where humans are labelling text for language is that, given that the list of language subtags representing strings with a RTL base direction has a long tail, we can’t really expect that users will remember accurately which tags need additional script-related information and which don’t.

It is also possible to combine other types of subtag with language tags in a way that can change the appropriate or likely base direction. For example, a list of language tags derived from CLDR included az-IR and az-IQ, which describe Azeri as used in Iran and Iraq, respectively. There is no explicit script information there, and the assumption that these tags refer to Azeri written in the Arabic script (rather than Latin or Cyrillic) may not actually hold.

Variant subtags can also signal a change in the base direction. For example, ar-alalc97 indicates a Latin-script transliteration of Arabic text, and similarly -fonipa, -fonapa, -fonupa can be associated with languages that are normally written with RTL scripts but that are in this case written with a LTR phonetic transcription.

A particular issue here is that the list of language subtags grows over time.

Given that variant subtags are often created for historical orthographies, and that many languages currently written with a LTR base direction (Latin, Cyrillic, Malayalam, or other) have historically been written using the Arabic script (especially in Central Asia, but also in many other parts of Asia), it would be quite easy to think of examples where the application of a new variant subtag would be associated with a new base direction for a given language. (For example, a possible new variant that indicates the Arabic orthography for Malayalam which was used at the start of the 20th century, or the Syriac orthography for Suriyani Malayalam, used widely by Christians in the 19th century. Another example would be a variant subtag to indicate the Latin-based 'Yekgirtú' Kurdish Unified Alphabet currently being promoted for future use by the Kurdish Academy of Language for all Kurdish dialects, including ckb, which these days is usually written in Arabic script. And so on.)

If applications try to infer the base direction from a language tag, and the list of language tags associated with RTL base direction grows over time, it is necessary for those applications to constantly update the list they use to spot RTL text. Otherwise interoperability will suffer. Updating such lists may not always be timely or easy, especially in contexts such as the Web of Things.

One response to the issue of human fallibility is to have a short list of common language tags that are recognised to be RTL without additional clarification, and then require users to provide script subtags for all others. For example, the list of tags exempt from using script subtags could include all language tags that have Suppress-Script metadata in BCP47. (There are only 6 of them: ar, fa, ps, ur, dv, nqo. Three of those are macrolanguage subtags, and none of the subtags they represent (eg. arb, pbu, etc.) have Suppress-Script metadata.)

This strategy of requiring script tags has two issues. Firstly, users applying language tags may ignore the list, or may not remember which items need script subtags and which don’t. And secondly, we are subverting the rules of BCP47, which say that language tags should be kept as short as possible, and script tags should normally only be used in a contrastive manner. Here we would be requiring them a lot (probably most) of the time, but only to service the extraction of directional information, not to necessarily contrast them with other strings. (This is not to mention the fact that typing script tags everywhere metadata is needed is a nuisance for content authors.)

So, we need to be concerned that (a) it is hard to be sure that we can exhaustively identify a list of language tags that indicate RTL base direction, and (b) we can’t be sure that the user creating the metadata will provide adequate accuracy and completeness when using language tags to allow the correct inferences to be made by the consumer from the language of the string.

Missing/incorrect language data

There are, however, additional factors which complicate the situation further.

Language information may be missing, or misleading. Most formats don’t force the user to provide language tags, and if the user omits this metadata there are serious repercussions for the consumer when it needs to apply base direction.

Even worse, especially if the strings are harvested from form fields in HTML pages, the language data may exist but may be incorrect. Take, for example, a situation where a user types a title of a book in Hebrew into a form field on an English page. It would be possible to derive the necessary base direction for the string (by running first-strong heuristics against the input text, or by obtaining the computed direction of the field if users manually set the direction of the form field). However, the language information (derived from that of the surrounding English language form) will be incorrect. If the language tag is the only thing that is stored, the consumer will get the wrong message about base direction, and all these strings will be incorrectly rendered.
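The form-field scenario can be sketched in Python as follows. This is a simplified illustration rather than a complete implementation of the Unicode Bidirectional Algorithm's paragraph-level rules (it does not skip over isolated runs, for instance), and harvest_field and the record layout are hypothetical names chosen for the example.

```python
import unicodedata
from typing import Optional


def first_strong_direction(text: str) -> Optional[str]:
    """Return 'ltr' or 'rtl' from the first strongly directional character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return None  # no strong character found


def harvest_field(value: str, page_lang: str) -> dict:
    """Store the string with the page language AND a detected direction."""
    return {
        "value": value,
        "lang": page_lang,  # inherited from the page, may be wrong
        "dir": first_strong_direction(value),
    }


# A Hebrew book title typed into a form on an English page.
record = harvest_field("מלחמה ושלום", "en")
print(record["lang"], record["dir"])
```

Here the stored language is en, which says nothing useful about direction, while the separately stored dir value is rtl. A consumer that only had the language tag would render the title with the wrong base direction.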

Even in an HTML page solely in Arabic (or some other language using RTL script), it’s possible that the content author forgot to add language information to the page, or in some cases forgot to change it when they translated the page. Both these scenarios create a problem for encoding metadata about text direction.

Note, furthermore, that it is quite possible that, in the scenario where the language information is missing, the direction information is actually available. However, BCP47 has no way to represent direction alone, and fudging by using an arbitrary script subtag such as "mystring"@und-Arab misrepresents the data (it might be Hebrew, or Thaana, etc).

This is also problematic for strings that represent things such as telephone numbers or part numbers, which need to have the right base direction applied, but which are not in any particular language. In other words, text direction can exist independently of language.

So, using language metadata to derive base direction can run into serious problems where the language metadata is missing or incorrect.

Processing issues

There are also significant differences between the amount of processing needed to derive direction from language, rather than from a separate metadata source.

Even if a language tag contains a script subtag, it is necessary to parse the tag to extract the script, and then compare the script against a list that gives you the base direction. If there is no script subtag, the list to scan may be quite long. For a single string or even a few strings, this may be negligible. However, we only really need to know the base direction (rather than testing the first strong character) in situations where the consumer would otherwise get it wrong, eg. where an Arabic string begins with a Latin letter, or where there are no strongly directional characters (such as in telephone numbers).

On the other hand, if you have a resource full of strings, you should have language metadata for every string. Because metadata trumps heuristics, in such a case you would never resort to heuristics, and would have to take the longer route of looking up and parsing the language tags for each and every string in the resource (rather than just checking for direction metadata), in order to determine the base direction. This may not take long with modern computers, but it may be more of an issue for devices in the Web of Things. It may also make it much more difficult for amateurs writing code, since they need to add a language tag parser in order to get simple information about the base direction of a string.
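By way of contrast, a consumer that receives separate direction metadata needs very little logic. The following Python sketch assumes a hypothetical item format with value and optional dir fields, and a fallback to LTR when nothing else is known. Neither is prescribed by any specification; they are choices made for the illustration.

```python
import unicodedata


def first_strong(text: str):
    """Simplified first-strong check; returns 'ltr', 'rtl' or None."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return None


def resolve_base_direction(item: dict) -> str:
    """Explicit direction metadata wins; heuristics are only a fallback."""
    declared = item.get("dir")
    if declared in ("ltr", "rtl"):
        return declared  # a single field lookup, no tag parsing needed
    # Assumption: default to LTR when no strong character is found.
    return first_strong(item["value"]) or "ltr"


# An Arabic sentence that happens to begin with a Latin-script word.
arabic = "HTML هي لغة ترميز"
print(resolve_base_direction({"value": arabic}))                # heuristic guess
print(resolve_base_direction({"value": arabic, "dir": "rtl"}))  # metadata
```

Without metadata the heuristic is misled by the leading Latin letters and reports ltr; with the dir field present the consumer simply reads rtl, and no language tag has to be parsed at all.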

It may be useful to step back a little at this point. We have been looking at whether it would make life easier to obtain base direction from language tags, but we have seen that there are logistical issues and human issues involved that actually make it more complicated than it initially seemed. Even if we do make it happen, it seems that having separate direction-related metadata (only when needed), would offer a much simpler solution.

Semantic differences

Other issues arise because language and direction are fundamentally and semantically different things, and they require somewhat incompatible models of representation for metadata.

Direction attributes in HTML, CSS, SVG, etc. allow you to explicitly apply dir="auto" to some content. You can’t do that using language tags – they aren’t designed to carry such information.

Apart from all the above, having to provide language tag values rather than dir attribute values when you want to change the base direction is more complicated for content authors, and less amenable to validation checking. It can also seem odd to repeat the language if you only want to apply a directional override (eg. bdo), rather than change the language.

Further concerns arise if we were to attempt to rely on language tags for directional information in technologies (for example HTML, CSS, SVG, and XML applications) that go beyond the detection of overall base direction for strings.

For one thing, direction controls should apply isolation to elements, so as to avoid 'spillover effects' related to the bidirectional algorithm in certain circumstances, such as for lists or adjacent numbers. This isolation comes with the HTML markup, can be specified by CSS, and is recommended when choosing Unicode formatting characters. Without it, many things break in content that is bidirectional. Language attributes in HTML and other markup don’t apply isolation.
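For plain text, the isolation mentioned above is applied with the Unicode isolate formatting characters. The Python sketch below wraps a string in LRI, RLI or FSI followed by PDI. The isolate function and its direction parameter are hypothetical names; the code points themselves are those defined by Unicode.

```python
# Unicode isolate formatting characters.
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def isolate(text: str, direction=None) -> str:
    """Wrap text in an isolate so it cannot affect the surrounding text.

    direction is 'ltr', 'rtl', or None; None falls back to FSI, which
    lets the first strong character of the wrapped text decide.
    """
    opener = {"ltr": LRI, "rtl": RLI}.get(direction, FSI)
    return f"{opener}{text}{PDI}"


# Inserting an RTL title into an LTR sentence, followed by a number.
title = "מלחמה ושלום"
message = f"{isolate(title, 'rtl')} 3 reviews"
print([hex(ord(c)) for c in (message[0], message[len(title) + 1])])
```

Note that the direction argument here comes from direction metadata. A language tag on its own gives this function nothing to act on, and language attributes in markup do not add any such isolation.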

Another illustration of the fact that language and direction metadata are semantically different things can be seen in situations where it is not possible to link direction with language. For example, suppose you want to display Chinese, Japanese, Berber, or Egyptian hieroglyphic text from right to left. To do so in HTML you are likely to apply some styling or use the bdo element to override the Unicode Bidirectional Algorithm and order the characters as needed. The bdo element requires some direction to be specified, but this has nothing to do with the language of the text, and the override factor cannot be derived from the language. Here we are using direction metadata in a way that is completely disconnected from the language involved. That direction may also need to be scriptable, separately from the language.

Legacy usage

Reengineering HTML to derive direction information from lang attributes would require a massive amount of change to HTML and its parser (for little perceived benefit). Apart from adding support for the vital property of bidi isolation to the handling of the lang attribute, inheritance models would need to be carefully coordinated; behaviour would need to be changed for more than just the dir attribute, including bdi, bdo, output, and form fields; CSS would need to be changed to enable partially matched selectors to be applied to tags with language attributes for any language; and even then computed values would need to be maintained and scripted separately anyway (eg. to allow the base direction to be changed for text in Chinese); and so on.

(Note that language metadata in the lang attribute has no bearing on the operation of the Unicode Bidirectional Algorithm.)

In addition to the engineering problem, having worked to help users understand how to work with bidi content for quite some time now, we should expect quite a bit of confusion to arise when there are different approaches available. And there'd be new things to learn, eg. users would need to begin using language tags for ISBN numbers or MAC addresses, where they didn't before.