This document describes requirements for the layout and presentation of text in languages that use the Thai script when they are used by Web standards and technologies, such as HTML, CSS, Mobile Web, Digital Publications, and Unicode.

This document describes the basic requirements for Thai script layout and text support on the Web and in eBooks. These requirements provide information for Web technologies such as CSS, HTML and digital publications about how to support users of the Thai script. Currently the document focuses on Thai as used for the Thai language. The information here is developed in conjunction with a document that summarises gaps in support on the Web for Thai.

The editor's draft of this document is being developed by the Southeast Asian Layout Task Force, part of the W3C Internationalization Interest Group. It is published by the Internationalization Working Group. The end target for this document is a Working Group Note.

Sending comments on this document

If you wish to make comments regarding this document, please raise them as GitHub issues. Only send comments by email if you are unable to raise issues on GitHub (see links below). All comments are welcome.

To make it easier to track comments, please raise separate issues or emails for each comment, and point to the section you are commenting on using a URL.

Introduction

About this document

Some text goes here.

Gap analysis

This document is pointed to by a separate document, Thai Gap Analysis, which describes gaps in support for Thai on the Web, and prioritises and describes the impact of those gaps on the user.

Wherever an unsupported feature is identified through the gap analysis process, the requirements for that feature need to be documented. This document is where those requirements are described.

This document should contain no reference to a particular technology. For example, it should not say "CSS does/doesn't do such and such", and it should not describe how a technology, such as CSS, should implement the requirements. It is technology agnostic, so that it will be evergreen, and it simply describes how the script works. The gap analysis document is the appropriate place for all kinds of technology-specific information.

Other related resources

The document International text layout and typography index (known informally as the text layout index) points to this document and others, and provides a central location for developers and implementers to find information related to various scripts.

The W3C also maintains a tracking system that has links to GitHub issues in W3C repositories. There are separate links for (a) requests from developers to the user community for information about how scripts/languages work, (b) issues raised against a spec, and (c) browser bugs. For example, you can find out what information developers are currently seeking, and the resulting list can also be filtered by script.

Thai Script Overview

Thai is an abugida. Consonant letters have an inherent vowel sound. Vowel-signs are attached to the consonant to produce a different vowel.

Unlike Devanagari, multiple vowel-signs may be used with a single character, and those positioned to the left or right of the consonant(s) are not combining characters. Like other Southeast Asian scripts, Thai is heavily based on syllables, so where there are syllable-initial clusters, a prescript vowel-sign will be displayed at the start of the syllable, rather than immediately before the consonant it follows.

Thai is a tonal language, and the consonant characters chosen for a syllable have implications for the tone of that syllable.

Words are not separated by spaces.

Thai script summary can be read for a high-level overview of characters used for the script, and some basic features. Text from the latter part of that page was used for the initial version of this document.

Text direction

Thai is written horizontally, left to right.

Structural boundaries & markers

Word boundaries

The concept of 'word' is difficult to define in any language (see What is a word?). We will treat it as a vaguely-defined but recognisable semantic unit that is typically smaller than a phrase and may comprise one or more syllables.

Spaces are used in Thai as phrase separators, but Thai doesn't separate words in a phrase using visible spaces.

There is, however, a concept of words in the text. For example, lines are supposed to be broken at word boundaries.

รวม|ทั้ง|วิทยาการ|ด้าน|คอมพิว

Word boundaries occur where the vertical lines appear, though they are not marked by the script.

The main difficulty arises when dealing with compound words. It can often be difficult to decide whether a given string of syllables represents multiple words or a single compound word.

ตัวอย่าง|การเขียน|ภาษาไทย

ตัวอย่าง|การ|เขียน|ภาษา|ไทย

Alternative line break opportunities for Thai text using compound nouns.

The variation may be related to the operation being performed on the text (e.g. line breaking in narrow newsprint columns vs. double-click selection vs. cursor movement), or it may just be down to personal preference.

The difference may also be contextually dependent. Wirote Aroonmanakun describes how คนขับรถ kʰon kʰàp ròt 'driver' should be viewed as a single word in the context คนขับรถนั่งคอยอยู่ในรถ kʰon kʰàp ròt nâŋ kʰɔːj jûː nràjt 'the (man who works as a) driver is waiting in the car', whereas in the phrase คนขับรถผ่านแยกนี้ไม่มากนัก kʰon kʰàp ròt pʰàːn jɛ̀ːk níː mâj màːk nàk 'not many people drive through this intersection' it would be viewed as 3 words, referring to anyone who is driving.

Proper names that are composed of multiple words are also problematic, especially because there are no capital letters to distinguish them from other pieces of text.

Zero-width space (ZWSP) & Word-joiner (WJ)

In order to manually fine-tune word-boundary detection, the invisible character U+200B ZERO WIDTH SPACE (ZWSP) can be used to create breaks.

To prevent a break between syllables, U+2060 WORD JOINER (WJ) can be used.

It is also important to bear in mind that Thai may be used to write various languages, in particular minority languages for which different dictionaries are needed. Since such dictionaries may not be available in a given browser or other application, there is a tendency to use ZWSP in order to compensate.

Large-scale manual entry of ZWSP and WJ has potential downsides because the user cannot see them; this leads to problems with ZWSP being inserted in the wrong position, or multiple times. However, these characters don't set a state, so such mistakes don't create major issues. It would nevertheless be useful if an editor showed the location of these characters.
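As an illustration of how an editing tool might expose these invisible characters, the following is a minimal Python sketch. The function names reveal and doubled_zwsp_positions and the visible placeholders <ZWSP> and <WJ> are hypothetical choices made for this example, not part of any existing tool.

```python
ZWSP = "\u200B"  # U+200B ZERO WIDTH SPACE
WJ = "\u2060"    # U+2060 WORD JOINER

# Hypothetical visible placeholders; an editor could use any glyph or styling.
MARKERS = {ZWSP: "<ZWSP>", WJ: "<WJ>"}


def reveal(text: str) -> str:
    """Return a copy of text in which ZWSP and WJ are shown as visible markers."""
    return "".join(MARKERS.get(ch, ch) for ch in text)


def doubled_zwsp_positions(text: str) -> list[int]:
    """Return the index of each ZWSP that is immediately followed by another ZWSP."""
    return [
        i
        for i in range(len(text) - 1)
        if text[i] == ZWSP and text[i + 1] == ZWSP
    ]
```

A tool could call reveal to render a review view of the text, and use doubled_zwsp_positions to flag places where ZWSP was probably entered more than once by accident.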

Care should also be taken when trying to match text, e.g. for searching in a page. WJ should be ignored. ZWSP may or may not be ignored, depending on whether word boundaries are significant for the search.
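The matching behaviour described above could be sketched as follows in Python. This is only an illustration under stated assumptions: the names normalise_for_search and contains, and the keep_word_boundaries flag, are invented for this example.

```python
ZWSP = "\u200B"  # U+200B ZERO WIDTH SPACE
WJ = "\u2060"    # U+2060 WORD JOINER


def normalise_for_search(text: str, keep_word_boundaries: bool = False) -> str:
    """Always drop WJ; drop ZWSP unless word boundaries matter for the search."""
    text = text.replace(WJ, "")
    if not keep_word_boundaries:
        text = text.replace(ZWSP, "")
    return text


def contains(haystack: str, needle: str, keep_word_boundaries: bool = False) -> bool:
    """Plain substring search after normalising both strings in the same way."""
    return normalise_for_search(needle, keep_word_boundaries) in normalise_for_search(
        haystack, keep_word_boundaries
    )
```

With the default setting, a query typed without any invisible characters still matches text that contains ZWSP or WJ. With keep_word_boundaries=True, a ZWSP in the text continues to separate the characters on either side of it.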

Quotations

The default quote marks for Thai should be [U+201C LEFT DOUBLE QUOTATION MARK] at the start, and [U+201D RIGHT DOUBLE QUOTATION MARK] at the end.

When an additional quote is embedded within the first, the quote marks should be [U+2018 LEFT SINGLE QUOTATION MARK] and [U+2019 RIGHT SINGLE QUOTATION MARK]. This is according to CLDR, and needs to be checked.
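The two levels of quotation marks described above can be illustrated with a small Python sketch. The function name quote and its depth parameter are hypothetical, and the alternation of marks beyond the second level is an assumption of this example rather than something stated in this document.

```python
OUTER = ("\u201C", "\u201D")  # LEFT/RIGHT DOUBLE QUOTATION MARK
INNER = ("\u2018", "\u2019")  # LEFT/RIGHT SINGLE QUOTATION MARK


def quote(text: str, depth: int = 0) -> str:
    """Wrap text in quotation marks for the given nesting depth.

    Depth 0 uses the double marks and depth 1 the single marks.
    Deeper levels alternate between the two (an assumption of this sketch).
    """
    open_mark, close_mark = OUTER if depth % 2 == 0 else INNER
    return f"{open_mark}{text}{close_mark}"
```

For example, quote("a " + quote("b", depth=1)) wraps the inner string in single marks and the whole in double marks.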

Text boundaries & selection

TBD

Inter-character spacing

TBD

Line & paragraph layout

Line breaking

Although Thai doesn't use spaces or dividers between words, the expectation is that line-breaks occur at word boundaries.

Because Thai doesn't separate words, applications typically look up word boundaries in a dictionary. However, such lookup doesn't always produce the needed result, especially when dealing with compound words and proper names (see Word boundaries).
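To make the dictionary dependence concrete, here is a minimal Python sketch of a greedy longest-match segmenter. It is not how any particular application works: real implementations use much larger dictionaries and more sophisticated algorithms. The function name segment and the two tiny word lists are assumptions made for this illustration, using the words from the compound-noun example earlier in this document.

```python
def segment(text: str, dictionary: set[str]) -> list[str]:
    """Greedy longest-match segmentation against a set of known words.

    Characters that start no dictionary word are emitted one at a time.
    """
    max_len = max((len(word) for word in dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest possible candidate first, then shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                break
        else:
            candidate = text[i]  # no dictionary word starts here
        words.append(candidate)
        i += len(candidate)
    return words


# Hypothetical miniature dictionaries built from the earlier example.
PARTS = ["ตัวอย่าง", "การ", "เขียน", "ภาษา", "ไทย"]
TEXT = "".join(PARTS)
SIMPLE = set(PARTS)
WITH_COMPOUNDS = SIMPLE | {PARTS[1] + PARTS[2], PARTS[3] + PARTS[4]}
```

Calling segment(TEXT, SIMPLE) yields five break opportunities, whereas segment(TEXT, WITH_COMPOUNDS) keeps the compounds together and yields three. The same text therefore breaks differently depending only on what the dictionary contains.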

To counteract these deficiencies, authors may use U+200B ZERO WIDTH SPACE and U+2060 WORD JOINER (see Zero-width space (ZWSP) & Word-joiner (WJ)).

If a dictionary fails to keep two or more syllables together as needed, it should be possible to use the Unicode character U+2060 WORD JOINER between the two syllables. This is an invisible character, equivalent to a zero-width no-break space, and used to prevent line-breaks.

Text alignment & justification

Justification in Thai adjusts blank spaces, but also makes certain adjustments to inter-character spacing.

Line height

Thai places vowel and tone marks above base characters, one above the other, and can also add combining characters below the line. The complexity of these marks means that the vertical resolution needed for clearly readable Thai text is higher than for, say, Latin text. In addition, Thai tends to add more interline spacing than Latin text does.

พรุ่งนี้

An example of a word with combining characters above and below base characters.
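The combining characters in the example word can be listed with a short Python sketch using the standard unicodedata module. The function name describe_marks is hypothetical. Note that the Unicode general category only identifies a character as a nonspacing mark; whether it is drawn above or below the base character is determined by the font and is not reported by this code.

```python
import unicodedata

# The example word, written with escapes so the code point sequence is explicit.
WORD = "\u0E1E\u0E23\u0E38\u0E48\u0E07\u0E19\u0E35\u0E49"


def describe_marks(text: str) -> list[tuple[str, str]]:
    """Return (code point, character name) for each nonspacing mark in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch))
        for ch in text
        if unicodedata.category(ch) == "Mn"
    ]


if __name__ == "__main__":
    for code_point, name in describe_marks(WORD):
        print(code_point, name)
```

Of the eight code points in the word, four are nonspacing marks that stack vertically on the base consonants, which is why the word needs more vertical space than its length alone suggests.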

List counters

Counters are used to number lists, chapter headings, etc.

[U+0E4F THAI CHARACTER FONGMAN] is the Thai bullet, which is used to mark items in lists or appears at the beginning of a verse, sentence, paragraph, or other textual segment.

Thai uses two counter styles:

  1. a numeric style, and
  2. an alphabetic style.

The numeric style uses the Thai digits '๐' '๑' '๒' '๓' '๔' '๕' '๖' '๗' '๘' '๙' in a decimal pattern.

1 ⇨ ๑ 2 ⇨ ๒ 3 ⇨ ๓ 4 ⇨ ๔
11 ⇨ ๑๑ 22 ⇨ ๒๒ 33 ⇨ ๓๓ 44 ⇨ ๔๔
111 ⇨ ๑๑๑ 222 ⇨ ๒๒๒ 333 ⇨ ๓๓๓

Examples of counter values using the Thai numeric counter style.
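The decimal pattern described above amounts to replacing each ASCII digit with the Thai digit of the same value. A minimal Python sketch follows; the function name thai_numeric is hypothetical, and the sketch assumes non-negative integers only and does not add any separator after the counter.

```python
# Thai digits occupy U+0E50 (zero) through U+0E59 (nine).
THAI_DIGITS = "".join(chr(0x0E50 + value) for value in range(10))


def thai_numeric(value: int) -> str:
    """Render a non-negative integer using Thai digits in a decimal pattern."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    return "".join(THAI_DIGITS[int(digit)] for digit in str(value))
```

For instance, thai_numeric(11) produces the Thai digit one twice, matching the second row of the examples.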

The alphabetic style uses the following letters, in the order shown: 'ก' 'ข' 'ค' 'ง' 'จ' 'ฉ' 'ช' 'ซ' 'ฌ' 'ญ' 'ฎ' 'ฏ' 'ฐ' 'ฑ' 'ฒ' 'ณ' 'ด' 'ต' 'ถ' 'ท' 'ธ' 'น' 'บ' 'ป' 'ผ' 'ฝ' 'พ' 'ฟ' 'ภ' 'ม' 'ย' 'ร' 'ล' 'ว' 'ศ' 'ษ' 'ส' 'ห' 'ฬ' 'อ' 'ฮ'.

1 ⇨ ก 2 ⇨ ข 3 ⇨ ค 4 ⇨ ง
11 ⇨ ฎ 22 ⇨ น 33 ⇨ ล 44 ⇨ กค
111 ⇨ ขภ 222 ⇨ จด 333 ⇨ ซจ

Examples of counter values using the Thai alphabetic counter style.
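The alphabetic style can be modelled as a numbering system with no zero: values 1 to 41 map to single letters in the order listed above, and larger values use two or more letters. The following Python sketch is one way to express that, under the assumption that values start at 1; the function name thai_alphabetic is hypothetical, and no separator is added.

```python
# The 41 letters of the alphabetic style, in the order given in this document.
THAI_LETTERS = "กขคงจฉชซฌญฎฏฐฑฒณดตถทธนบปผฝพฟภมยรลวศษสหฬอฮ"


def thai_alphabetic(value: int) -> str:
    """Render a positive integer using the Thai alphabetic counter letters."""
    if value < 1:
        raise ValueError("alphabetic counters start at 1")
    base = len(THAI_LETTERS)
    letters = []
    while value > 0:
        # Subtract 1 first because there is no letter that represents zero.
        value, remainder = divmod(value - 1, base)
        letters.append(THAI_LETTERS[remainder])
    return "".join(reversed(letters))
```

With 41 letters, thai_alphabetic(44) yields the first letter followed by the third, since 44 = 1 × 41 + 3, which agrees with the example for 44.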

In both cases, the separator is a comma. Check this - it's the assumption of the CSS spec.

Acknowledgements

Special thanks to the following people who contributed to this document (contributors' names listed in alphabetical order).

Anousak Anthony Souphavanh, Ben Mitchell, James Clarke, John Durdin, Martin Hosken, Norbert Lindenberg.

Please find the latest info of the contributors at the GitHub contributors list.