Tibetan script Layout Requirements

W3C Group Draft Note

This document describes or points to requirements for the layout and presentation of text in languages that use the Tibetan script. The target audience is developers of Web standards and technologies, such as HTML, CSS, Mobile Web, Digital Publications, and Unicode, as well as implementers of web browsers, ebook readers, and other applications that need to render Tibetan script text.

This document describes the basic requirements for Tibetan script layout and text support on the Web and in eBooks. These requirements provide information for Web technologies such as CSS, HTML and digital publications about how to support users of Tibetan script languages. Currently the document focuses on the Tibetan script as used for the Tibetan language. The information here is developed in conjunction with a document that summarises gaps in support on the Web for Tibetan script.

The editor's draft of this document is being developed by the Tibetan Layout Task Force, part of the W3C Internationalization Interest Group. It is published by the Internationalization Working Group.

To make it easier to track comments, please raise separate issues or emails for each comment, and point to the section you are commenting on using a URL.

1. Introduction

1.1 Contributors

The initial information in this document was provided by Richard Ishida, drawing on the structure and text in Tibetan Orthography Notes.

Some additional information was based on a talk by Jianxin Yin.

胡春明 (Chunming Hu) prepared an early translation of parts of this document (now removed).

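This document has been developed with contributions from participants of the Chinese Layout Requirement Task Force, with kind help from experts from 信标委中文信息处理分技术委员会及藏文信息处理工作组 (the Chinese Information Processing Subcommittee and the Tibetan Information Processing Working Group of the National Information Technology Standardization Committee).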

See also the GitHub contributors list for the Tibetan Asia Language Enablement project, and the discussions related to Tibetan.

1.2 About this document

The aim of this document is to describe the basic requirements for Tibetan script layout and text support on the Web and in eBooks. These requirements provide information for Web technologies such as CSS, HTML and digital publications, and for application developers, about how to support users of the Tibetan script.

The document focuses on typographic layout issues. For a deeper understanding of the Tibetan script and how it works, see Tibetan Orthography Notes, which includes topics such as: Phonology, Vowels, Consonants, Encoding choices, and Numbers.

This document should contain no reference to a particular technology. For example, it should not say "CSS does/doesn't do such and such", and it should not describe how a technology, such as CSS, should implement the requirements. It is technology agnostic, so that it will be evergreen, and it simply describes how the script works. The gap analysis document is the appropriate place for all kinds of technology-specific information.

1.3 Gap analysis

This document should be used alongside a separate document, Tibetan script Gap Analysis, which describes gaps in support for Tibetan script on the Web, and prioritises and describes the impact of those gaps on the user.

Gap reports are brought to the attention of spec and browser implementers, and are tracked via the Gap Analysis Pipeline. (Filter it for Tibetan.)

To complement any content authored specifically for this document, the sections in the document also point to related, external information, tests, GitHub discussions, etc.

The document Language enablement index points to this document and others, and provides a central location for developers and implementers to find information related to various scripts.

The W3C also has a repository with discussion threads related to the Tibetan script, including requests from developers to the user community for information about how scripts/languages work, and a notification system that tracks issues in W3C working groups related to the Tibetan script. See a list of unresolved questions for Tibetan script experts. Each section below points to related discussions. See also the repository home page.

2. Tibetan script overview

Tibetan can be written using two different styles: དབུ་ཅན dbu can with a head, the block style of the Tibetan script used in print, pronounced u.cen; and དབུ་མེད dbu med headless, the cursive style of the Tibetan script used in shorthand and calligraphy, pronounced u.me. This page concentrates on the former. Pronunciations are based on the central, Lhasa dialect.

Historically, Tibetan text was written on loose-leaf sheets called pechas (དཔེ་ཆ pé.t͡ɕʰá book, scripture). Some of the characters and formatting approaches used differ between books and pechas.

Tibetan text runs left to right in horizontal lines.

Word boundaries are not indicated. However, Tibetan words are made up of one or more units called tsheg-bar, which are basically equivalent to phonological syllables. The tsheg-bar units are separated using U+0F0B TIBETAN MARK INTERSYLLABIC TSHEG.
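As an illustration of the separation described above, the following is a minimal Python sketch that splits a string into tsheg-bar units at U+0F0B. The function name `split_tsheg_bar` is hypothetical, and the sketch assumes that the tsheg is the only delimiter present; real text may also contain other marks, which are not handled here.

```python
TSHEG = "\u0F0B"  # TIBETAN MARK INTERSYLLABIC TSHEG


def split_tsheg_bar(text: str) -> list[str]:
    """Split text into tsheg-bar units at U+0F0B, dropping empty pieces.

    Assumption: no other delimiters (punctuation, spaces) occur in the input.
    """
    return [unit for unit in text.split(TSHEG) if unit]


# The two-syllable word 'grems-ston (exhibition), spelled out with escapes
# so that the code points are unambiguous.
word = "\u0F60\u0F42\u0FB2\u0F7A\u0F58\u0F66" + TSHEG + "\u0F66\u0F9F\u0F7C\u0F53"

for unit in split_tsheg_bar(word):
    # Show each unit with its code points.
    print(unit, " ".join(f"U+{ord(ch):04X}" for ch in unit))
```

Note that the units returned are syllable-like units, not words: finding word boundaries needs additional knowledge, such as a dictionary, because the script does not mark them.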

These tsheg-bar units are composed of structural elements that include vowel signs and consonants used as prefixes, root characters, subscripts, superscripts, suffixes, and secondary suffixes. A common realisation includes a stack and additional consonants to either side of the root consonant. These may indicate syllable-final consonant sounds, but more often than not they qualify or modify the root value, and are not associated with their nominal sound value. The actual pronunciation of Tibetan is usually much simpler than a typical romanisation would suggest. For example, the word བཀོད kǿː to create is transcribed as bkod.

Figure 1 The single-syllable word cy᷈ː string with an initial stack of three consonants plus a vowel sign, followed by a suffix consonant (to the right).

To write the sounds of the standard Lhasa dialect, Tibetan uses 28 consonant letters (plus their subjoined forms). 6 more letters are used to write Sanskrit.

A distinguishing feature of Tibetan is the set of separate code points for subjoined consonants, used to create consonant stacks. Of the 77 combining characters in the Tibetan block, 48 represent subjoined consonant forms. Unlike many other Indic scripts, the modern Tibetan orthography doesn't use a virama to create stacks.
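To make the role of the subjoined code points concrete, here is a small Python sketch that lists the subjoined consonants in a string by checking Unicode character names. The helper name `subjoined_letters` is hypothetical, and matching on the name prefix "TIBETAN SUBJOINED LETTER" is an assumption made for illustration; it does not attempt to cover every combining character in the block.

```python
import unicodedata


def subjoined_letters(text: str) -> list[str]:
    """Return the characters in text whose Unicode name marks them as
    Tibetan subjoined letters (an illustrative name-based check)."""
    return [
        ch for ch in text
        if unicodedata.name(ch, "").startswith("TIBETAN SUBJOINED LETTER")
    ]


# The syllable ston: SA + SUBJOINED TA + VOWEL SIGN O + NA.
# The stack is built from the dedicated subjoined code point, with no virama.
ston = "\u0F66\u0F9F\u0F7C\u0F53"

for ch in ston:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

print([f"U+{ord(ch):04X}" for ch in subjoined_letters(ston)])
```

Because the stack is encoded directly, no conjunct-forming character needs to be inserted or removed when inspecting or editing the sequence.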

Tibetan is an abugida with one inherent vowel. When writing the Lhasa dialect, other post-consonant vowels are represented using 4 vowel signs, all combining marks.

There are no pre-base, circumgraph, or multipart vowels in the Tibetan used to write the Lhasa dialect (though there are when writing in Sanskrit).

Standalone vowels are written by adding vowel signs to either U+0F60 TIBETAN LETTER -A or U+0F68 TIBETAN LETTER A, depending on the tone.

Sanskrit vowels written in Tibetan use additional vowel signs and combining marks, some of which represent diphthongs, and some of which form circumgraphs or multipart characters, depending on the encoding.

Tone is indicated by the choice of root character and/or its associated prefixes and superscripts.

Modern Tibetan writing uses few punctuation marks or symbols, but the Tibetan script block in Unicode contains many of these.

Tibetan has its own set of numbers.

2.1 Tibetan Syllables

The following diagram shows characters in all of the syllabic positions, and lists the characters that can appear in each of the non-root locations. The two-syllable word in the example is འགྲེམས་སྟོན 'grems-ston ɖɹemton exhibition.

Picture of syllable composition.

Figure 2 Syllable composition in Tibetan

See more information about how the various parts of the tsheg-bar work together.

3. All topics

4. Text direction

4.1 Vertical text

5. Glyph shaping & positioning

5.1 Fonts & font styles

5.2 Context-based shaping & positioning

5.3 Letterform slopes, weights, & italics

6. Typographic units

6.1 Characters & encoding

6.2 Grapheme/word segmentation & selection

7. Punctuation & inline features

7.1 Phrase & section boundaries

7.2 Quotations & citations

7.3 Emphasis & highlighting

Modern texts tend to use bold text for emphasis.

However, U+0F35 TIBETAN MARK NGAS BZUNG NYI ZLA may also be used to create a similar effect to underlining or to mark emphasis/honorifics.

7.4 Abbreviation, ellipsis & repetition

7.5 Inline notes & annotations

7.6 Text decoration & other inline features

7.7 Data formats & numbers

8. Line & paragraph layout

8.1 Line breaking & hyphenation

8.2 Text alignment & justification

8.3 Text spacing

8.4 Baselines, line height, etc.

8.5 Lists, counters, etc.

Tibetan numerals can be used for list counters. The Tibetan numbers are used in a simple decimal notation, i.e. in the same way as European numerals; they differ only in shape.

༡ འ་ཞ་མི་རིགས་ཀྱིས་བསྐྲུན་པའི་ཤིང་གི་ཟམ་པ།

༢ ལོ་ངོ་800ཡི་ལོ་རྒྱུས་ལྡན་པའི་དགོན་རྙིང་ཆོས་པོ་དགོ།

༣ ཆི་ཅ་ཞེས་པའི་ཁྱིམ་རྒྱུད་ཀྱི་བང་སོའི་ཚོགས།

Figure 3 Examples of Tibetan counters in a list.

European numerals can also be used for list counters. The European numeral is followed by a period.

1. འ་ཞ་མི་རིགས་ཀྱིས་བསྐྲུན་པའི་ཤིང་གི་ཟམ་པ།

2. ལོ་ངོ་800ཡི་ལོ་རྒྱུས་ལྡན་པའི་དགོན་རྙིང་ཆོས་པོ་དགོ།

3. ཆི་ཅ་ཞེས་པའི་ཁྱིམ་རྒྱུད་ཀྱི་བང་སོའི་ཚོགས།

Figure 4 Examples of European numeral counters in a list.
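The two counter styles shown in this section can be sketched in Python as below. Since Tibetan digits work as a plain decimal system, conversion is a digit-by-digit mapping onto U+0F20 to U+0F29. The function names and the `style` parameter are hypothetical and serve only to illustrate the behaviour described here.

```python
TIBETAN_DIGIT_ZERO = 0x0F20  # TIBETAN DIGIT ZERO; digits run to U+0F29


def to_tibetan_digits(n: int) -> str:
    """Write a non-negative integer in Tibetan digits (simple decimal)."""
    if n < 0:
        raise ValueError("only non-negative integers are handled")
    return "".join(chr(TIBETAN_DIGIT_ZERO + int(d)) for d in str(n))


def format_counter(n: int, style: str = "tibetan") -> str:
    """Return a list counter in one of the two styles described above.

    "tibetan": Tibetan digits with no trailing period.
    "european": European digits followed by a period.
    """
    if style == "tibetan":
        return to_tibetan_digits(n)
    if style == "european":
        return f"{n}."
    raise ValueError(f"unknown style: {style}")


for i in range(1, 4):
    print(format_counter(i, "tibetan"), format_counter(i, "european"))
```

Because the mapping is purely positional, the conversion is reversible: Python's built-in `int()` accepts Tibetan digits directly, since they are Unicode decimal digits.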

8.6 Styling initials

9. Page & book layout

9.1 General page layout & progression

9.2 Grids & tables

9.3 Footnotes, endnotes, etc

9.4 Page headers, footers, etc

9.5 Forms & user interaction

