In HTML and XML we've made significant efforts to allow documents and text ranges/spans within documents to have language (@lang) and direction (@dir) information. This markup appears as general-purpose attributes at the markup language (HTML/XML) level, rather than as elements defined separately by each new document format. In addition, since these optional attributes are available throughout the markup system and are scoped (a value set on an element applies to its descendants unless overridden), it is easy to create documents that declare the base direction or the document language once, rather than having to repeat the metadata for every bit of data.
The Bidi in Plain Text document discusses why the Unicode Bidirectional Algorithm (UBA) needs this help.
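As a rough illustration of why plain text alone is not enough, the sketch below guesses the base direction from the first strongly directional character, which is a common fallback when no direction metadata is available. The function name `first_strong_direction` and the sample sentence are assumptions made for this example, and the sketch ignores details such as isolate control characters.

```python
import unicodedata


def first_strong_direction(text, default="ltr"):
    """Guess base direction from the first strongly directional character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    # No strong character found (for example, digits only).
    return default


# A Hebrew sentence ("HTML is a markup language") that happens to begin
# with a Latin-script word. Its intended base direction is right-to-left.
hebrew_sentence = (
    "HTML \u05d4\u05d9\u05d0 \u05e9\u05e4\u05ea \u05e1\u05d9\u05de\u05d5\u05df"
)

# The heuristic reports "ltr" because "H" is the first strong character.
print(first_strong_direction(hebrew_sentence))
```

The guess is wrong for this string, which is exactly the situation where an explicit base direction supplied alongside the text would help.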
Language information is necessary for a variety of reasons. Some examples include:
- Text rendering benefits greatly if language information is supplied. Without a language tag, languages such as Japanese or Chinese frequently present a "ransom note" appearance due to font fallback issues.
- Text processing benefits when language information is available. For example, text searching can be improved when words can be stemmed, tokenized, or compared using language-aware rules.
- Content sorting and list formation can be made more natural when language-specific noise words (in English, A, An, and The, for example) are ignored.
- Language tags provide the basis for modern locale-based APIs which are often used to parse, format, or present information extracted from data sets.
- String data may sometimes be available in multiple languages. Providing language information allows for runtime selection of the appropriate language version ("localization" or "language negotiation").
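To make the sorting case concrete, here is a minimal sketch of a sort key that drops a leading noise word only when the language of the string is known. The `NOISE_WORDS` table and the `sort_key` function are hypothetical names invented for this illustration; a production system would more likely rely on a locale-aware collation library than on a hand-written table.

```python
# Illustrative table only: leading articles to ignore, keyed by primary
# language subtag. The English entries come from the text above.
NOISE_WORDS = {
    "en": ("a", "an", "the"),
    "fr": ("le", "la", "les"),
}


def sort_key(title, lang):
    """Build a sort key that skips a leading noise word for known languages."""
    primary = lang.split("-")[0].lower()
    words = title.split()
    if len(words) > 1 and words[0].lower() in NOISE_WORDS.get(primary, ()):
        words = words[1:]
    return " ".join(words).casefold()


titles = ["The Hobbit", "A Tale of Two Cities", "Emma", "An Ideal Husband"]

# With a language tag, articles are ignored:
# ['Emma', 'The Hobbit', 'An Ideal Husband', 'A Tale of Two Cities']
print(sorted(titles, key=lambda t: sort_key(t, "en")))

# Without usable language information ("und"), articles drive the order:
# ['A Tale of Two Cities', 'An Ideal Husband', 'Emma', 'The Hobbit']
print(sorted(titles, key=lambda t: sort_key(t, "und")))
```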
In specifications that we've reviewed recently (WebVTT, Mediastream, etc.), data structures are frequently built up using DOMString to store natural language data ("text") as well as for storing data values whose serialization happens to be a string. DOMStrings are sequences of UTF-16 code units and can therefore carry any Unicode text, so representation of international data values is not in question. However, DOMStrings do not provide any additional metadata "slots", either optional or required. Changing DOMString, which is very widely deployed, seems hopeless because of backwards compatibility requirements.
Other specifications are based on JSON-LD, for which there is a string data type, but no inherent support for direction or language. In one issue found in a recent review of ActivityStreams, the specification authors were concerned about requiring the language-metadata-bearing Map format because it was much more work to write/parse, looked unattractive, and was generally unnecessary in monolingual documents.
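The trade-off can be sketched with two ActivityStreams-style objects, written here as Python dictionaries: one uses a plain `name` with a context-level default language, the other uses the `nameMap` language map. The helper names `context_language` and `get_name` are assumptions for this example, and the lookup is a simple exact match rather than full BCP 47 matching. Note that neither form has anywhere to record base direction.

```python
monolingual = {
    "@context": ["https://www.w3.org/ns/activitystreams", {"@language": "en"}],
    "type": "Note",
    "name": "A simple note",
}

multilingual = {
    "@context": "https://www.w3.org/ns/activitystreams",
    "type": "Note",
    "nameMap": {"en": "A simple note", "es": "Una nota sencilla"},
}


def context_language(obj):
    """Return the default language declared in @context, if any."""
    context = obj.get("@context")
    entries = context if isinstance(context, list) else [context]
    for entry in entries:
        if isinstance(entry, dict) and "@language" in entry:
            return entry["@language"]
    return None


def get_name(obj, wanted):
    """Return (text, language tag) for the best available name."""
    name_map = obj.get("nameMap")
    if name_map:
        for tag in wanted:
            if tag in name_map:
                return name_map[tag], tag
        # Nothing matched: fall back to the first entry in the map.
        tag, text = next(iter(name_map.items()))
        return text, tag
    return obj.get("name"), context_language(obj)


print(get_name(monolingual, ["es"]))         # ('A simple note', 'en')
print(get_name(multilingual, ["es", "en"]))  # ('Una nota sencilla', 'es')
```

Consumers have to handle both shapes, which is part of the extra parsing work the specification authors were worried about.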
Fundamentally, there are four (any more?) options for addressing these issues when creating W3C data structures and formats:
- Do nothing.
- Document formats should provide separate metadata values on the document level defining the default language and base direction of the document's content. This is best when the content itself is never of mixed language or mixed base direction and where local overrides (such as Unicode bidi control characters) can be supplied by the content author.
- Document formats should provide separate metadata values on an item-by-item basis for fields that can contain natural language data. On the positive side, this solution allows document authors to tag individual items with a base direction and base language and allows for document formats to be created that solve the localization issue mentioned above. On the negative side, this introduces many new fields into document formats, most of which will be superfluous, and it complicates processing.
- Create a new datatype "LString" that includes (optional) language and direction metadata fields. LStrings would then be used instead of plain strings in data structures that contain natural language text.
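As a rough sketch of what the last option might look like, the following defines a hypothetical `LString` type with optional language and direction fields, plus a helper that falls back to document-level defaults when an individual string carries no metadata. The field names `lang` and `direction` and the `resolved` method are assumptions for this illustration, not a proposed definition.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LString:
    """Hypothetical string type carrying optional language and direction."""

    value: str
    lang: Optional[str] = None       # BCP 47 language tag, e.g. "he"
    direction: Optional[str] = None  # "ltr", "rtl", or None if unknown

    def __post_init__(self):
        if self.direction not in (None, "ltr", "rtl"):
            raise ValueError(f"unsupported direction: {self.direction!r}")

    def resolved(self, doc_lang=None, doc_direction=None):
        """Return a copy with unset fields filled from document defaults."""
        return LString(
            self.value,
            self.lang or doc_lang,
            self.direction or doc_direction,
        )


# An item that overrides the document defaults (the word "shalom").
title = LString("\u05e9\u05dc\u05d5\u05dd", lang="he", direction="rtl")

# An item with no metadata of its own inherits the document-level values.
caption = LString("Hello")

print(title.resolved("en", "ltr"))
print(caption.resolved("en", "ltr"))
```

Because both fields are optional, monolingual documents can keep declaring language and direction once at the document level, while mixed-language documents tag only the items that differ.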