[csswg-drafts] [css-text-3] Segment Break Transformation Rules around CJK Punctuation (#5086)

MurakamiShinyu has just created a new issue for https://github.com/w3c/csswg-drafts:

(There is related discussion in #4992, #5017, and https://github.com/w3c/jlreq/issues/211)

(I wrote about this topic in Japanese at https://lists.w3.org/Archives/Public/public-i18n-japanese/2020AprJun/0232.html and its thread)


When I write Japanese text with manual line breaks, I prefer to insert line breaks after an ideographic or fullwidth full stop or comma [。、.,] rather than between Kanji/Hiragana/Katakana letters, because a full stop or comma marks a break in thought and I can naturally press the Enter key there. So it is very important to be able to put line breaks after CJK punctuation without causing extra space. I suspect this is not just my personal preference but a habit shared by many people. (I believe it is the same for Chinese, and also for Korean when CJK punctuation is used.)

e.g.,

```
日本語のテキストに、
English textを埋め込む。
```

should be transformed to

```
日本語のテキストに、English textを埋め込む。
```

and not to

```
日本語のテキストに、 English textを埋め込む。
```

However, the current draft's [Segment Break Transformation Rules](https://drafts.csswg.org/css-text-3/#line-break-transform) do not meet this requirement. According to these rules, the segment break is discarded only if both the characters before and after the segment break belong to the space-discarding character set, and converted to a space otherwise.
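
To make the problem concrete, here is a minimal Python sketch of the two rules as summarized above. It is not the specification's algorithm: `is_space_discarding` and `transform_current` are hypothetical names, and the space-discarding character set is approximated by East Asian Width F/W/H, which may not match the draft's actual list.

```python
import unicodedata


def is_space_discarding(ch: str) -> bool:
    # ASSUMPTION: East Asian Width Fullwidth/Wide/Halfwidth stands in for
    # the draft's space-discarding character set; the real list may differ.
    return unicodedata.east_asian_width(ch) in ("F", "W", "H")


def transform_current(text: str) -> str:
    """Apply the two rules summarized above to each line feed in text."""
    lines = text.split("\n")
    out = lines[0]
    for nxt in lines[1:]:
        before, after = out[-1:], nxt[:1]
        if (before and after
                and is_space_discarding(before)
                and is_space_discarding(after)):
            out += nxt        # both sides space-discarding: break removed
        else:
            out += " " + nxt  # otherwise: break becomes a space
    return out


print(transform_current("日本語のテキストに、\nEnglish textを埋め込む。"))
```

Under these assumptions the break after `、` becomes a space, because the following `E` is not in the approximated set.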


## Line break treatment in TeX with CJK support

TeX has been used for Japanese typesetting since pTeX, a Japanese TeX, was developed in 1987. pTeX and its derivatives and successors treat line breaks as follows:

- If the character before the line break is a Japanese character, then the line break is removed.
- Otherwise, the line break is converted to a space.

This has been the de facto standard for Japanese TeX users for the last 30 years.

(See the [LuaTeX-ja documentation](http://ftp.jaist.ac.jp/pub/CTAN/macros/luatex/generic/luatexja/doc/luatexja-en.pdf), "13 Linebreak after a Japanese Character", for details.)

With this rule, authors can put a line break after Japanese punctuation without causing extra space when a non-Japanese character follows the line break. So I think this rule has an advantage over the current CSS draft.
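
The two-bullet rule above can be sketched in a few lines of Python. This is only an illustration of the rule as stated here, not pTeX's implementation: `is_japanese` and `join_lines_ptex_style` are hypothetical names, "Japanese character" is approximated by East Asian Width F or W, and TeX details such as blank lines starting a new paragraph are ignored.

```python
import unicodedata


def is_japanese(ch: str) -> bool:
    # ASSUMPTION: East Asian Width Fullwidth/Wide approximates "Japanese
    # character"; pTeX uses its own character classification.
    return unicodedata.east_asian_width(ch) in ("F", "W")


def join_lines_ptex_style(text: str) -> str:
    """Only the character before each line break decides the result."""
    lines = text.split("\n")
    out = lines[0]
    for nxt in lines[1:]:
        if out and is_japanese(out[-1]):
            out += nxt        # break after a Japanese character: removed
        else:
            out += " " + nxt  # otherwise: converted to a space
    return out


print(join_lines_ptex_style("日本語のテキストに、\nEnglish textを埋め込む。"))
```

Because only the preceding character is inspected, the break after `、` disappears even though an ASCII letter follows, while a break after `English text` would still become a space.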

I am not a TeX expert and have only limited knowledge of Japanese TeX. So I [asked TeX experts on Twitter](https://twitter.com/MurakamiShinyu/status/1260131650509262850) and got some useful information.

- https://twitter.com/watayan/status/1260142719562731525
  > (Translation from Japanese) When writing in TeX, I appreciate the rule of "depending on the character at the end of a line". And when writing HTML, I put line breaks only where extra space is tolerable, with a feeling of giving up.

Such Japanese users will be disappointed if the Segment Break Transformation Rules cause extra space between Japanese punctuation and a non-Japanese character.

- https://twitter.com/zr_tex8r/status/1260150913118818304
  > (Translation from Japanese) In the case of "XeLaTeX + xeCJK package" which is a "TeX for Chinese" widely used in China, the rule (simplified) is "Ignore line break if both before and after the line break are CJK" by default. It can be changed by setting.
  - https://twitter.com/zr_tex8r/status/1261663712076685313
    > (Translation from Japanese) I tried to typeset the following sources with the default settings of xeCJK. Unexpectedly, all three outputs are the same: "no extra space occurs".
    ```
    中文。 English。

    中文。English。

    中文。
    English。
    ```

This behavior in "XeLaTeX + xeCJK package" is very interesting to me. I found the following description in the [README of xeCJK](http://ftp.jaist.ac.jp/pub/CTAN/macros/xetex/latex/xecjk/README.md):

> - Spaces automatically ignored between CJK characters.
> - Special effects on full-width CJK punctuation.
> - Automatic adjustment of the space between CJK and other characters.

In XeLaTeX + xeCJK, line breaks in the source are treated as spaces, and spaces are ignored between two CJK characters. In addition, spaces are ignored between a CJK punctuation mark and a non-CJK character, as one of the "Special effects on full-width CJK punctuation". As in Japanese TeX (pTeX etc.), authors can put a line break after CJK punctuation without causing extra space when a non-CJK character follows the line break.
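
The behavior described in this paragraph can be modeled roughly as follows. This Python sketch only mirrors the description above; it is not how xeCJK is implemented. The names `is_cjk`, `is_cjk_punct`, and `collapse_xecjk_style` are hypothetical, and "CJK" is approximated by East Asian Width F or W.

```python
import unicodedata


def is_cjk(ch: str) -> bool:
    # ASSUMPTION: East Asian Width Fullwidth/Wide approximates "CJK".
    return unicodedata.east_asian_width(ch) in ("F", "W")


def is_cjk_punct(ch: str) -> bool:
    # CJK character whose Unicode General Category is punctuation (P*).
    return is_cjk(ch) and unicodedata.category(ch).startswith("P")


def collapse_xecjk_style(text: str) -> str:
    text = text.replace("\n", " ")  # line breaks are treated as spaces
    out = []
    for i, ch in enumerate(text):
        if ch == " " and 0 < i < len(text) - 1:
            before, after = text[i - 1], text[i + 1]
            between_cjk = is_cjk(before) and is_cjk(after)
            next_to_punct = is_cjk_punct(before) or is_cjk_punct(after)
            if between_cjk or next_to_punct:
                continue  # this space is ignored
        out.append(ch)
    return "".join(out)


for source in ("中文。 English。", "中文。English。", "中文。\nEnglish。"):
    print(collapse_xecjk_style(source))
```

Under these assumptions all three sources from the tweet above produce the same string with no space after `。`.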


## Proposal to fix Segment Break Transformation Rules

I propose to add one rule to the [Segment Break Transformation Rules](https://drafts.csswg.org/css-text-3/#line-break-transform) before the last "Otherwise … converted to a space":

- Otherwise, if either the character before or the character after the segment break belongs to the space-discarding character set and has the Unicode General Category Punctuation (P*) or Space Separator (Zs), then the segment break is removed.

(U+3000, ideographic space, is probably the only character that belongs to the space-discarding character set and is a Space Separator (Zs).)

With this rule, no extra space occurs in the following examples:

```
日本語のテキストに、
English textを埋め込む。
```
↓
```
日本語のテキストに、English textを埋め込む。
```

```
日本語のテキストにEnglish text
(英語のテキスト)
を埋め込む。
```
↓
```
日本語のテキストにEnglish text(英語のテキスト)を埋め込む。
```
(In this example, fullwidth parentheses are used)

```
日本語のテキスト! 
English textを埋め込む。
```
↓
```
日本語のテキスト! English textを埋め込む。
```
(In this example, there is an ideographic space U+3000 ` ` after the `!`)
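
A minimal Python sketch of the rules with the proposed addition, checked against the three examples above. As before, this is an illustration under assumptions rather than spec text: the function names are hypothetical, and the space-discarding character set is approximated by East Asian Width F/W/H. The fullwidth parentheses, fullwidth exclamation mark, and ideographic space are written as escapes (`\uff08`, `\uff09`, `\uff01`, `\u3000`) so that they are unambiguous.

```python
import unicodedata


def is_space_discarding(ch: str) -> bool:
    # ASSUMPTION: East Asian Width Fullwidth/Wide/Halfwidth stands in for
    # the draft's space-discarding character set; the real list may differ.
    return unicodedata.east_asian_width(ch) in ("F", "W", "H")


def is_discarding_punct_or_space(ch: str) -> bool:
    # Space-discarding AND General Category P* or Zs (the proposed test).
    cat = unicodedata.category(ch)
    return is_space_discarding(ch) and (cat.startswith("P") or cat == "Zs")


def transform_proposed(text: str) -> str:
    lines = text.split("\n")
    out = lines[0]
    for nxt in lines[1:]:
        before, after = out[-1:], nxt[:1]
        if (before and after
                and is_space_discarding(before)
                and is_space_discarding(after)):
            out += nxt        # existing rule: both sides space-discarding
        elif ((before and is_discarding_punct_or_space(before))
                or (after and is_discarding_punct_or_space(after))):
            out += nxt        # proposed rule: CJK punctuation or U+3000
        else:
            out += " " + nxt  # otherwise: converted to a space
    return out


print(transform_proposed("日本語のテキストに、\nEnglish textを埋め込む。"))
print(transform_proposed(
    "日本語のテキストにEnglish text\n\uff08英語のテキスト\uff09\nを埋め込む。"))
print(transform_proposed("日本語のテキスト\uff01\u3000\nEnglish textを埋め込む。"))
```

A break between two non-CJK lines, such as `English` followed by `text`, still becomes a space in this sketch, so the added rule only affects breaks next to CJK punctuation or the ideographic space.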


Please view or discuss this issue at https://github.com/w3c/csswg-drafts/issues/5086 using your GitHub account

Received on Tuesday, 19 May 2020 17:49:09 UTC