This document describes requirements for the layout and presentation of text in languages that use the Khmer script when they are used by Web standards and technologies, such as HTML, CSS, Mobile Web, Digital Publications, and Unicode.

This document describes the basic requirements for Khmer script layout and text support on the Web and in eBooks. These requirements provide information for Web technologies such as CSS, HTML and digital publications about how to support users of the Khmer script. Currently the document focuses on Khmer as used for the Khmer language. The information here is developed in conjunction with a document that summarises gaps in support on the Web for Khmer.

The editor's draft of this document is being developed by the Southeast Asian Layout Task Force, part of the W3C Internationalization Interest Group. It is published by the Internationalization Working Group. The end target for this document is a Working Group Note.

Introduction

About this document

Some text goes here.

Gap analysis

This document is pointed to by a separate document, Khmer Gap Analysis, which describes gaps in support for Khmer on the Web, and prioritises and describes the impact of those gaps on the user.

Wherever an unsupported feature is identified through the gap analysis process, the requirements for that feature need to be documented. This document is where those requirements are described.

This document should contain no reference to a particular technology. For example, it should not say "CSS does/doesn't do such and such", and it should not describe how a technology, such as CSS, should implement the requirements. It is technology agnostic, so that it will be evergreen, and it simply describes how the script works. The gap analysis document is the appropriate place for all kinds of technology-specific information.

Other related resources

The document International text layout and typography index (known informally as the text layout index) points to this document and others, and provides a central location for developers and implementers to find information related to various scripts.

The W3C also maintains a tracking system that has links to GitHub issues in W3C repositories. There are separate links for (a) requests from developers to the user community for information about how scripts/languages work, (b) issues raised against a spec, and (c) browser bugs. For example, you can find out what information developers are currently seeking, and the resulting list can also be filtered by script.

Khmer Script Overview

The script is an abugida, ie. like most Brahmi-influenced scripts, each consonant carries with it an inherent vowel. The sound following a consonant can be modified by attaching vowel signs to the consonant when writing.

Glyphs constituting a single syllable can appear on any side of the base character, and multiple diacritics are often needed to create the vowel in a syllable.

A key feature of Khmer is that there are a large number of vowel sounds, and only a few vowel signs; and there are a large number of consonant letters for only a small number of consonant sounds. This led to a system where there are generally two consonant signs for a given sound, each belonging to one of two classes (or registers). So to determine the pronunciation of a vowel sign you start by seeing which class of consonant it follows. For example, using the two symbols for the sound k, ក is kɑː neck, and គ is kɔː mute.

Consonants stack in Khmer, but not always in a predictable way. Multiple consonants at the start and (sometimes) end of a syllable are usually stacked. Consonant clusters in a multisyllabic word also tend to stack. But syllable-final consonants, which can be one of a number of characters, often don't stack with the onset consonant of the next syllable.

The syllable is fundamental in Cambodian.

Many native Cambodian words are monosyllabic. These start with one or more consonants or an independent vowel (or a vowel sign attached to ʔɑː, which is a combination of both). Short vowels in stressed syllables are always followed by a consonant. Long vowels may not be. There are many monosyllabic words that begin with consonant clusters, and some monosyllabic words that end with clusters, although only one consonant is pronounced in syllable final position.

There are also many bisyllabic words. In many cases the first syllable in a bisyllabic word is unstressed, and the vowel is usually rendered in colloquial speech as a schwa. Some bisyllabic words are compounds, however, and this may not apply.

Polysyllabic words are usually of Sanskrit, Pali or French origin. These words tend to alternate stress across their syllables, but may not.

Text direction

Khmer is written horizontally, left to right.

Glyph shaping & positioning

Font styles

There are several distinct styles of font in Modern Khmer.

Most modern typefaces are set in an upright style (called អក្សរឈរ ʔɑːksɑː cʰɔː or អក្សរត្រង់ ʔɑːksɑː trɑŋ). This is the style used for this page.

The text អក្សរខ្មែ in an âksâr chôr font style.

The slanted style (អក្សរជ្រៀង ʔɑːksɑː criəŋ) is used for whole documents or novels. The oblique styling has no effect on the semantics of the text.

The text អក្សរខ្មែ in an âksâr chriĕng font style.

The round style (អក្សរមូល ʔɑːksɑː muːl) includes more ligated forms, and is used for titles and headings in Cambodian documents, books, or currency, as well as on shop signs or banners. It may also be used to emphasise important names or nouns.

The text អក្សរខ្មែ in an âksâr mul font style.

Another style (អក្សរខម ʔɑːksɑː kʰɑːm), characterized by sharper serifs and angles and the retention of some antique characteristics, is used for yantra text in Cambodia as well as in Thailand.

Structural boundaries & markers

Word boundaries

The concept of 'word' is difficult to define in any language (see What is a word?). We will treat it as a vaguely-defined but recognisable semantic unit that is typically smaller than a phrase and may comprise one or more syllables.

Spaces are used in Khmer as phrase separators, but Khmer doesn't separate words in a phrase using visible spaces.

Although Khmer doesn't use spaces or dividers between words, the expectation is that line-breaks occur at word boundaries.

There are three basic types of Khmer word:

  1. Single, indivisible words: eg. ជាតិ c̱āti national, វិទ្យាល័យ v̱iṯ͓ȳāḻăȳ highschool, កម្ម km̱͓m̱ mission.
  2. Words with prefixes and suffixes: eg. អន្តរជាតិ ʔṉ͓tṟc̱āti international, មហវិទ្យាល័យ m̱hv̱iṯ͓ȳāḻăȳ high school and កម្មករ km̱͓m̱kṟ workers.
  3. Compound words (combining 2, 3, or more single words): eg. ជាតិសាសន៍ c̱ātisāsṉ˟ race, កម្មផល km̱͓m̱pʰḻ karma, សកលវិទ្យាល័យ skḻv̱iṯ͓ȳāḻăȳ university.

The first two types cannot be broken, but the third type can. For example, |ជាតិ|សាសន៍|, |កម្ម|ផល|, and |សកល|វិទ្យាល័យ|. (Hong)

Text is not broken at sub-word syllable boundaries. In fact, this is particularly difficult to do algorithmically in Khmer, because syllable-final consonants are indistinguishable from consonants with an inherent vowel that constitute a new syllable. Some kind of morphological analysis is needed.

Zero-width space (ZWSP) & Word-joiner (WJ)

In order to manually fine-tune word-boundary detection, the invisible character U+200B ZERO WIDTH SPACE (ZWSP) can be used to create breaks.

To prevent a break between syllables, U+2060 WORD JOINER (WJ) can be used.
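As a minimal sketch in Python, the two characters might be inserted as shown below. The helper names `allow_break` and `forbid_break` are illustrative assumptions, not part of any standard or library. The sample words are taken from the word lists in the Word boundaries section.

```python
ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE: creates a break opportunity
WJ = "\u2060"    # U+2060 WORD JOINER: prevents a break


def allow_break(parts):
    """Join word parts so that a line break is allowed at each boundary."""
    return ZWSP.join(parts)


def forbid_break(syllables):
    """Join syllables so that no line break is allowed between them."""
    return WJ.join(syllables)


# A compound word, which may be broken between its components.
breakable = allow_break(["ជាតិ", "សាសន៍"])

# A word with a suffix, which should stay on one line.
unbreakable = forbid_break(["កម្ម", "ករ"])
```

Both results look identical to the unmarked words when rendered, because neither character has a visible glyph.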

It is also important to bear in mind that Khmer may be used to write various languages, in particular minority languages for which different dictionaries are needed. Since such dictionaries may not be available in a given browser or other application, there is a tendency to use ZWSP in order to compensate.

Large-scale manual entry of ZWSP and WJ has potential downsides because the user cannot see them; this leads to problems with ZWSP being inserted in the wrong position, or multiple times. However, because these characters don't set a state, such mistakes don't create major issues. It would nevertheless be useful if an editor showed the location of these characters.
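One way an editing tool might expose these characters is sketched below in Python. The placeholder strings `[ZWSP]` and `[WJ]` and both function names are assumptions made for illustration; a real editor would more likely draw a marker glyph than alter the text.

```python
import re

ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE
WJ = "\u2060"    # U+2060 WORD JOINER

# Hypothetical visible stand-ins for the two invisible characters.
VISIBLE_MARKERS = {ZWSP: "[ZWSP]", WJ: "[WJ]"}


def reveal_invisibles(text):
    """Return a copy of text with each ZWSP and WJ replaced by a visible marker."""
    return "".join(VISIBLE_MARKERS.get(ch, ch) for ch in text)


def find_repeated_zwsp(text):
    """Return (start, length) for every run of two or more consecutive ZWSPs."""
    return [(m.start(), len(m.group())) for m in re.finditer("\u200b{2,}", text)]


# A ZWSP that was accidentally typed twice between two word parts.
sample = "ជាតិ" + ZWSP + ZWSP + "សាសន៍"
shown = reveal_invisibles(sample)
duplicates = find_repeated_zwsp(sample)
```

Here `shown` contains two adjacent `[ZWSP]` markers, and `duplicates` reports a single run of length 2 starting right after the first word part.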

Care should also be taken when trying to match text, eg. for searching in a page. WJ should be ignored. ZWSP may or may not be ignored, depending on whether word boundaries are significant for the search.
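The matching behaviour described above could be sketched in Python as follows. The function names and the `keep_word_boundaries` flag are illustrative assumptions; the sketch only shows the idea of removing WJ unconditionally and ZWSP optionally before comparing.

```python
ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE
WJ = "\u2060"    # U+2060 WORD JOINER


def normalize_for_search(text, keep_word_boundaries=False):
    """Drop WJ always; drop ZWSP unless word boundaries matter for the search."""
    text = text.replace(WJ, "")
    if not keep_word_boundaries:
        text = text.replace(ZWSP, "")
    return text


def contains(haystack, needle, keep_word_boundaries=False):
    """Report whether needle occurs in haystack after normalising both."""
    return normalize_for_search(needle, keep_word_boundaries) in normalize_for_search(
        haystack, keep_word_boundaries
    )


joined_page = "កម្ម" + WJ + "ករ"      # page text containing a WJ
split_page = "ជាតិ" + ZWSP + "សាសន៍"  # page text containing a ZWSP

found_joined = contains(joined_page, "កម្មករ")
found_split = contains(split_page, "ជាតិសាសន៍")
found_split_strict = contains(split_page, "ជាតិសាសន៍", keep_word_boundaries=True)
```

With these inputs, the first two searches succeed, while the strict search fails because the ZWSP in the page text is treated as significant.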

Quotations

The default quote marks for Khmer should be [U+201C LEFT DOUBLE QUOTATION MARK] at the start, and [U+201D RIGHT DOUBLE QUOTATION MARK] at the end.

When an additional quote is embedded within the first, the quote marks should be [U+2018 LEFT SINGLE QUOTATION MARK] and [U+2019 RIGHT SINGLE QUOTATION MARK]. This is according to CLDR, and still needs to be checked.
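A minimal Python sketch of the two quotation levels described above follows. The function name `quote` and the `depth` parameter are assumptions for illustration, and alternating the two pairs beyond the second level is also an assumption, since only two levels are described here.

```python
OUTER = ("\u201c", "\u201d")  # LEFT and RIGHT DOUBLE QUOTATION MARK
INNER = ("\u2018", "\u2019")  # LEFT and RIGHT SINGLE QUOTATION MARK


def quote(text, depth=0):
    """Wrap text in quote marks for the given nesting depth (0 = outermost).

    Assumption: depths beyond 1 simply alternate between the two pairs.
    """
    open_mark, close_mark = OUTER if depth % 2 == 0 else INNER
    return f"{open_mark}{text}{close_mark}"


# An embedded quotation inside an outer quotation.
nested = quote("x " + quote("y", depth=1))
```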

Text boundaries & selection

TBD

Inter-character spacing

TBD

Line & paragraph layout

Line breaking

Although Khmer doesn't use spaces or dividers between words, the expectation is that line-breaks occur at word boundaries.

Because Khmer doesn't separate words, applications typically look up word boundaries in a dictionary. However, such lookup doesn't always produce the needed result, especially when dealing with compound words and proper names (see Word boundaries).
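To illustrate the idea of dictionary lookup, here is a small Python sketch using greedy longest-match segmentation. This is only one possible technique and is not claimed to be what any particular application does. The function name and the toy dictionary are assumptions, and the fallback for unknown text (one segment per code point) is deliberately naive: it would split clusters and shows why a missing dictionary entry produces poor results.

```python
def segment(text, dictionary):
    """Greedy longest-match segmentation against a set of known words.

    Text not covered by the dictionary falls back to one segment per code point.
    """
    max_len = max(map(len, dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until something matches.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words


# Toy dictionary (hypothetical); real dictionaries are far larger.
expected_words = ["ជាតិ", "សាសន៍", "កម្មករ"]
TOY_DICTIONARY = set(expected_words) | {"កម្ម"}

result = segment("".join(expected_words), TOY_DICTIONARY)
```

Because the longest match is preferred, កម្មករ is kept whole even though the shorter entry កម្ម is also in the dictionary.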

To counteract these deficiencies, authors may use U+200B ZERO WIDTH SPACE and U+2060 WORD JOINER (see Zero-width space (ZWSP) & Word-joiner (WJ)).

If a dictionary fails to keep two or more syllables together as needed, it should be possible to use the Unicode character U+2060 WORD JOINER between the two syllables. This is an invisible character, equivalent to a zero-width no-break space, and used to prevent line-breaks.

Text is not broken at sub-word syllable boundaries. In fact, this is particularly difficult to do algorithmically in Khmer, because syllable-final consonants are indistinguishable from consonants with an inherent vowel that constitute a new syllable. Some kind of morphological analysis is needed.

Text alignment & justification

Justification in Khmer adjusts blank spaces, but also makes certain adjustments to inter-character spacing.

Letter spacing

Cambodian text doesn't appear to use inter-letter spacing in running text; however, it is sometimes used in signage. (@mcdurdin) See an example.

The rules for where the separations appear are still not clear. However, one might expect that spacing keeps together base + subjoined consonants, and base consonants + vowel signs.

The situation is less clear for spacing vowel-signs such as ◌ា [U+17B6 KHMER VOWEL SIGN AA​], which are shown separated in the example linked to above.
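The expectation above can be sketched in Python as a grouping of code points into units between which extra space could be added. This is a tentative illustration only, since the actual rules are not yet established. The function names, the `detach` parameter, and the simplified character ranges are all assumptions.

```python
COENG = "\u17d2"  # KHMER SIGN COENG: the following consonant is subjoined


def is_khmer_mark(ch):
    """True for Khmer vowel signs and other attached signs (simplified ranges)."""
    return "\u17b6" <= ch <= "\u17d3" or ch == "\u17dd"


def spacing_units(text, detach=frozenset()):
    """Group code points into units that letter-spacing should not split.

    Characters listed in detach (for example U+17B6) start a unit of their own.
    """
    units = []
    previous = ""
    for ch in text:
        attaches = bool(units) and ch not in detach and (
            previous == COENG or is_khmer_mark(ch)
        )
        if attaches:
            units[-1] += ch
        else:
            units.append(ch)
        previous = ch
    return units


kept_together = spacing_units("\u1787\u17b6\u178f\u17b7")                  # ជាតិ
aa_separated = spacing_units("\u1787\u17b6\u178f\u17b7", detach={"\u17b6"})
with_subjoined = spacing_units("\u1781\u17d2\u1798\u17c2\u179a")           # ខ្មែរ
```

With the default settings, ជាតិ yields two units; passing U+17B6 in `detach` yields three, which models the separated vowel sign seen in the signage example.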

List counters

Counters are used to number lists, chapter headings, etc.

Khmer uses two counter styles:

  1. a numeric style, and
  2. an alphabetic style.

The numeric style uses the Khmer digits '០' '១' '២' '៣' '៤' '៥' '៦' '៧' '៨' '៩' in a decimal pattern.

1 ⇨ ១ 2 ⇨ ២ 3 ⇨ ៣ 4 ⇨ ៤
11 ⇨ ១១ 22 ⇨ ២២ 33 ⇨ ៣៣ 44 ⇨ ៤៤
111 ⇨ ១១១ 222 ⇨ ២២២ 333 ⇨ ៣៣៣

Examples of counter values using the Khmer numeric counter style.
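The decimal pattern can be sketched in Python as a digit-by-digit substitution. The function name `khmer_numeric` is an illustrative assumption.

```python
# Khmer digits zero to nine, U+17E0 to U+17E9.
KHMER_DIGITS = "០១២៣៤៥៦៧៨៩"


def khmer_numeric(value):
    """Format a non-negative integer using Khmer digits in a decimal pattern."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    return "".join(KHMER_DIGITS[int(digit)] for digit in str(value))


eleven = khmer_numeric(11)
```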

The alphabetic style uses the following letters, in the order shown: 'ក' 'ខ' 'គ' 'ឃ' 'ង' 'ច' 'ឆ' 'ជ' 'ឈ' 'ញ' 'ដ' 'ឋ' 'ឌ' 'ឍ' 'ណ' 'ត' 'ថ' 'ទ' 'ធ' 'ន' 'ប' 'ផ' 'ព' 'ភ' 'ម' 'យ' 'រ' 'ល' 'វ' 'ស' 'ហ' 'ឡ' 'អ'.

1 ⇨ ក 2 ⇨ ខ 3 ⇨ គ 4 ⇨ ឃ
11 ⇨ ដ 22 ⇨ ផ 33 ⇨ អ 44 ⇨ កដ
111 ⇨ គឋ 222 ⇨ ចភ 333 ⇨ ញគ

Examples of counter values using the Khmer alphabetic counter style.
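The multi-letter values above (44 ⇨ កដ, 111 ⇨ គឋ) are consistent with a bijective base-33 scheme, where the letters act as digits 1 to 33 and there is no zero. A Python sketch under that assumption follows; the function name `khmer_alphabetic` is illustrative.

```python
# The 33 letters of the alphabetic counter style, in order.
KHMER_LETTERS = "កខគឃងចឆជឈញដឋឌឍណតថទធនបផពភមយរលវសហឡអ"


def khmer_alphabetic(value):
    """Format a positive integer as a Khmer alphabetic counter (bijective base 33)."""
    if value < 1:
        raise ValueError("alphabetic counters start at 1")
    base = len(KHMER_LETTERS)
    letters = []
    while value > 0:
        # Subtracting 1 shifts the range so that there is no zero digit.
        value, index = divmod(value - 1, base)
        letters.append(KHMER_LETTERS[index])
    return "".join(reversed(letters))


forty_four = khmer_alphabetic(44)
```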

In both cases, the separator is a comma. Check this - it's the assumption of the CSS spec.

Acknowledgements

Special thanks to the following people who contributed to this document (contributors' names listed in alphabetical order).

Ben Mitchell, Danh Hong, Marc Durdin, Martin Hosken.

Please find the latest info of the contributors at the GitHub contributors list.