White space use cases

Version: 2010-05-29 9:42

Author: Richard Ishida

This page lists use cases for white-space handling around line breaks for East- and South-East Asian scripts that don't use spaces to separate words.

East Asian scripts include Chinese, Japanese (kanji, hiragana and katakana), Korean and Yi. South-East Asian scripts include Thai, Khmer, and Myanmar. These scripts don't use spaces to separate words. In South-East Asian scripts, spaces are used instead to indicate phrase boundaries.

The problem

The initial problem arises when an author or tool wants to break source code into shorter lines without introducing unwanted spaces into the displayed text.

A further issue arises where source text is automatically aligned by an editor or other tool. This typically introduces spaces at the start of a line which should not be produced for display. This may be done as the user types, or as a command.

If source text is reflowed in the document, it should also be possible to avoid introducing inappropriate spaces into the displayed text.

The tables

The tables below show how source code is expected to be rendered according to the white-space processing rules described in CSS3. The specification is not definite about what happens if more than 4 common or inherited characters appear on either side of the line break. For clarity, we will assume here that the implementation fails to recognise the context of the surrounding characters if more than 4 common or inherited characters lie alongside the line break.

We will use spaces for whitespace characters and ASCII digits for common and inherited characters. Note that whitespace can include tabs, and common and inherited characters may include full width digits, punctuation, etc.
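The context check assumed for the tables can be sketched in Python as follows. This is only an illustration of the assumption stated above, not the CSS3 algorithm itself: the names char_class, side_context and break_becomes are hypothetical, the code point ranges are a rough stand-in for a real Unicode Script lookup, white space is not handled, and treating a side with no script-specific characters at all as neutral is one reading of the "common characters only on first line" use case.

```python
EAST_ASIAN, OTHER, COMMON = "east-asian", "other", "common"
NEUTRAL, UNKNOWN = "neutral", "unknown"

def char_class(ch):
    """Rough stand-in for a Unicode Script lookup (assumption: ranges only)."""
    cp = ord(ch)
    # CJK ideographs, hiragana and katakana, hangul syllables, Yi
    if (0x4E00 <= cp <= 0x9FFF or 0x3040 <= cp <= 0x30FF
            or 0xAC00 <= cp <= 0xD7A3 or 0xA000 <= cp <= 0xA4CF):
        return EAST_ASIAN
    if ch.isalpha():
        return OTHER   # Latin, Thai and other letters
    return COMMON      # digits, punctuation and similar characters

def side_context(chars, limit=4):
    """Classify one side of a line break; chars runs outward from the break."""
    common_seen = 0
    for ch in chars:
        cls = char_class(ch)
        if cls == COMMON:
            common_seen += 1
            continue
        # First script-specific character: usable only if it is close enough.
        return cls if common_seen <= limit else UNKNOWN
    return NEUTRAL     # no script-specific character on this side at all

def break_becomes(before, after, limit=4):
    """Return '' if the line break is removed, ' ' if it becomes a space."""
    left = side_context(reversed(before), limit)
    right = side_context(after, limit)
    acceptable = {EAST_ASIAN, NEUTRAL}
    if left in acceptable and right in acceptable and EAST_ASIAN in (left, right):
        return ""
    return " "

print(repr(break_becomes("缔造真正全球通行123", "的万维网")))         # ''
print(repr(break_becomes("缔造真正全球通行1234567890", "的万维网")))  # ' '
print(repr(break_becomes("缔造真正全球通行Latin", "的万维网")))       # ' '
```

Calling break_becomes with a larger limit, for example limit=10, removes the break in the ten-digit case as well, which shows how much the result depends on a value the specification does not currently fix.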

If you click on the brief description in the left-hand column you will open a test page, though it seems that this algorithm is not yet supported in major browsers. They normally convert all line breaks into spaces.

Han characters

Use cases

Each use case below lists, in order: description, source code, expected display, notes.
line break only
     <div>缔造真正全球通行
的万维网</div>
缔造真正全球通行的万维网
The line break should disappear since it is possible to detect East Asian characters on either side.
<4 spaces at end of line
     <div>缔造真正全球通行     
的万维网</div>
缔造真正全球通行 的万维网
The spaces at line end are not removed before line break transformation (LBT). LBT sees East Asian context on both sides, so it removes the line break. The remaining spaces are then normalized to one space.
spaces at beginning of line
     <div>缔造真正全球通行
         的万维网</div>
缔造真正全球通行的万维网
The white space is removed in the first step of the white-space processing rules. Line break transformation then sees the context as East Asian and removes the line break.
spaces at beginning and end of line
     <div>缔造真正全球通行   
         的万维网</div>
缔造真正全球通行 的万维网
The white space on the second line is removed in the first step of the white-space processing rules. The spaces at line end are not removed before line break transformation (LBT). Line break transformation then sees the context as East Asian and removes the line break. Spaces are then normalized to one space.
multiple line breaks
     <div>缔造真正全球通行


的万维网</div>
缔造真正全球通行的万维网
The line breaks should disappear since it is possible to detect East Asian characters on either side. This test checks that the previous rules for line break transformation are applied for multiple line breaks.
multiple line breaks and spaces
     <div>缔造真正全球通行
      
      
      的万维网</div>
缔造真正全球通行的万维网
The space should be removed because it follows a line break, and the line breaks should disappear since it is possible to then detect East Asian characters on either side.
<4 common characters at line end
     <div>缔造真正全球通行123
的万维网</div>
缔造真正全球通行123的万维网
The common characters will be ignored during line break transformation. Since the context is East Asian, the line break will be removed.
>4 common characters at line end
     <div>缔造真正全球通行1234567890
的万维网</div>
缔造真正全球通行1234567890 的万维网
The common characters will not be ignored during line break transformation. Since the context therefore cannot be identified as East Asian, the line break will be changed to a space.
<4 common characters plus space at line end
     <div>缔造真正全球通行123   
的万维网</div>
缔造真正全球通行123 的万维网
This test use case should be predictable from previous use cases.
space plus <4 common characters at line start
     <div>缔造真正全球通行
        123的万维网</div>
缔造真正全球通行123的万维网
This test use case should be predictable from previous use cases.
common characters only on first line
     <div>1234567890
的万维网</div>
1234567890的万维网
During line break transformation, one side is neutral, so the line break is removed.
Latin next to line break
     <div>缔造真正全球通行Latin
的万维网</div>
缔造真正全球通行Latin 的万维网
During line break transformation, the context cannot be established as East Asian, therefore the line break is converted to a space.

Comments

The basic idea seems to work. As long as space only appears at the beginning of a line, text will recombine for display with no intervening spaces.

If you want to separate text on two lines with a space, you add one (or more) at the end of a line.

This provides the author with sufficient manual control to achieve what they need.

Automatic source alignment tools will not disrupt this if they only add or delete whitespace at the beginning of a line. If they trim spaces from the end of a line, however, this will cause problems.

Text reflow can also work as long as it only removes and adds space at the beginning of a line, and doesn't force space at the end of a line to the beginning of a new line.
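The effect of source tools described in these comments can be sketched with a small Python model. It is a simplification under stated assumptions: the function name render is hypothetical, only Han characters in the basic CJK ideograph range are recognised, and digits, Latin text and ZWSP are not handled.

```python
import re

HAN = r"[\u4e00-\u9fff]"  # assumption: basic CJK ideographs only

def render(source):
    """Simplified model of the white-space processing used in the Han table."""
    # Step 1: drop spaces and tabs that follow a line break (source indentation).
    text = re.sub(r"\n[ \t]+", "\n", source)
    # Step 2: line break transformation. Remove a run of line breaks when Han
    # characters sit on both sides, keeping any spaces left at the line end.
    text = re.sub(rf"(?<={HAN})([ \t]*)\n+(?={HAN})", r"\1", text)
    # Any other line break becomes a space.
    text = text.replace("\n", " ")
    # Step 3: collapse remaining runs of spaces and tabs to one space.
    return re.sub(r"[ \t]+", " ", text)

wrapped = "缔造真正全球通行\n的万维网"
indented = wrapped.replace("\n", "\n        ")   # a tool adds indentation
spaced = "缔造真正全球通行 \n的万维网"            # the author wants a space here
trimmed = "\n".join(line.rstrip() for line in spaced.split("\n"))  # tool trims line ends
reflowed = "缔造真正全球通行\n 的万维网"          # reflow pushed the space to the next line

print(render(wrapped))    # 缔造真正全球通行的万维网
print(render(indented))   # 缔造真正全球通行的万维网
print(render(spaced))     # 缔造真正全球通行 的万维网
print(render(trimmed))    # 缔造真正全球通行的万维网 (the intended space is lost)
print(render(reflowed))   # 缔造真正全球通行的万维网 (the intended space is lost)
```

In this model, indentation never changes the displayed text, while trimming line ends or moving a trailing space to the start of the next line silently removes a space the author intended.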

If a run of more than 4 digits, punctuation marks or other common and inherited characters would fall before or after a line break, it would be best to place the break where East Asian characters appear on both sides; otherwise unexpected spaces may appear.

On the other hand, I'm thinking that the spec should probably allow any number of common or inherited characters either side of the line break, to cope with things like large numbers, especially if additional punctuation is involved.

If not, the spec ought at least to be clear about the maximum number of common characters allowed on either side of the line break when identifying the context. Otherwise, behaviour is likely to be unpredictable from implementation to implementation.

Thai characters

Use cases

This table describes scenarios for Thai. Other South-East Asian languages are also written without spaces between words, and they all use spaces to identify phrasal boundaries instead, so it might be expected that a line break is replaced by a space if it falls at the end of a phrase or sentence.

Text using scripts such as Khmer and Myanmar is more likely than Thai to use a zero width space (ZWSP) to indicate word boundaries, but these scripts are also commonly written without word separators. They, too, use spaces to separate phrases or sentences.

These scripts, unlike Chinese, Japanese and Korean, usually break a line between words. (Algorithms with lookup are typically used to find word boundaries.)
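As a rough illustration of the lookup-based approach mentioned above, the following Python sketch uses greedy longest-match segmentation, which is only one simple variant of such algorithms. The function name segment and the four-word toy dictionary are assumptions made for this example; real implementations use much larger dictionaries and more careful matching.

```python
def segment(text, dictionary):
    """Greedy longest-match segmentation (assumes a non-empty dictionary).

    Characters that match no dictionary entry become one-character words.
    """
    words = []
    longest = max(map(len, dictionary))
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(longest, len(text) - i), 0, -1):
            if text[i:i + size] in dictionary:
                break
        else:
            size = 1  # no entry matched: emit a single character
        words.append(text[i:i + size])
        i += size
    return words

# Toy dictionary covering only the sample text used in the table below.
TOY_DICTIONARY = {"ใช้", "เครื่อง", "ระบบ", "สากล"}
sample = "ใช้" + "เครื่อง" + "ระบบ" + "สากล"
print(segment(sample, TOY_DICTIONARY))
```

The boundaries between the returned words are the positions where a line could be broken.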

Each use case below lists, in order: description, source code, expected display, notes.
line break only
     <div>ใช้เครื่องระบบ
สากล</div>
ใช้เครื่องระบบ สากล
The line break will be replaced by a space, because it is only removed between East Asian characters.
line break only with ZWSP
     <div>ใช้เครื่องระบบ&#x200B;
สากล</div>
ใช้เครื่องระบบสากล
During line break transformation the line break will be removed because of the adjacent ZWSP character.
spaces at end of line
     <div>ใช้เครื่องระบบ     
สากล</div>
ใช้เครื่องระบบ สากล
Same as for East Asian scripts.
spaces at beginning of line
     <div>ใช้เครื่องระบบ
         สากล</div>
ใช้เครื่องระบบ สากล
The white space is removed in the first step of the white-space processing rules. Line break transformation then converts the line break to a space.
spaces at beginning of line plus ZWSP
     <div>ใช้เครื่องระบบ&#x200B;
         สากล</div>
ใช้เครื่องระบบสากล
The white space is removed in the first step of the white-space processing rules. Line break transformation then removes the line break due to the ZWSP.
spaces at beginning and end of line
     <div>ใช้เครื่องระบบ   
         สากล</div>
ใช้เครื่องระบบ สากล
The white space on the second line is removed in the first step of the white-space processing rules. The spaces at line end are not removed before line break transformation (LBT). Line break transformation then converts the line break to a space. Spaces are then normalized to one space.
<4 common characters at line end
     <div>ใช้เครื่องระบบ 123
สากล</div>
ใช้เครื่องระบบ 123 สากล
The common characters will be ignored during line break transformation. The line break will be converted to a space.
<4 common characters plus space at line end
     <div>ใช้เครื่องระบบ 123   
สากล</div>
ใช้เครื่องระบบ 123 สากล
This test use case should be predictable from previous use cases.
space plus <4 common characters at line start
     <div>ใช้เครื่องระบบ
        123 สากล</div>
ใช้เครื่องระบบ 123 สากล
This test use case should be predictable from previous use cases.
Latin next to line break
     <div>ใช้เครื่องระบบ Latin
สากล</div>
ใช้เครื่องระบบ Latin สากล
During line break transformation, the context cannot be established as East Asian, therefore the line break is converted to a space.
Latin plus ZWSP next to line break
     <div>ใช้เครื่องระบบ Latin&#x200B;
สากล</div>
ใช้เครื่องระบบ Latinสากล
During line break transformation, the adjacent ZWSP causes the line break to be removed rather than converted to a space.

Comments

Spaces are often used around numbers and embedded Latin text, and a space is an appropriate replacement for a line break that coincides with a phrasal or sentence boundary. For that reason, most of the use cases above work quite easily. In some cases, however, you would want to be able to eliminate the space.

Where you want to prevent a space appearing between characters split by a line break, you have to introduce a ZWSP alongside the line break.
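The ZWSP behaviour used in the Thai table can be sketched in Python as follows. The function name render_sea is hypothetical, and the sketch assumes text with no East Asian characters, so the only way to remove a line break is an adjacent ZWSP. The ZWSP itself stays in the output, where it is invisible.

```python
import re

ZWSP = "\u200b"  # zero width space, written &#x200B; in the HTML source

def render_sea(source):
    """Simplified model for text such as Thai (no East Asian context)."""
    # Step 1: drop spaces and tabs that follow a line break (source indentation).
    text = re.sub(r"\n[ \t]+", "\n", source)
    # Step 2: a line break next to a ZWSP is removed; the ZWSP is kept.
    text = re.sub(rf"{ZWSP}\n+|\n+{ZWSP}", ZWSP, text)
    # Any other line break becomes a space.
    text = re.sub(r"\n+", " ", text)
    # Step 3: collapse remaining runs of spaces and tabs to one space.
    return re.sub(r"[ \t]+", " ", text)

first, second = "ใช้เครื่องระบบ", "สากล"

plain = render_sea(first + "\n" + second)
marked = render_sea(first + ZWSP + "\n        " + second)

print(plain == first + " " + second)   # True: the break became a space
print(marked == first + ZWSP + second) # True: no space, only the invisible ZWSP
```

Because the two source strings look identical in most editors, the sketch also hints at why keeping track of the ZWSP by eye is difficult.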

I'm not convinced that this is the best strategy. For one thing, ZWSP is an invisible character, and keeping track of whether or not it is present could cause difficulties. It is also possible that an author's keyboard may not have a ZWSP key.

Furthermore, if the text is reflowed, the position of the ZWSP will need to be changed or a new one added each time. This is tedious for a human, and at best very complicated for a machine.

The model used for East Asian text seems to be simpler: if you want a space to appear, you add it at the end of the line. It may also be beneficial to use a consistent approach for both East Asian and South-East Asian scripts.