This document contains examples in another language/script.

User feedback: ::first-letter in non-Latin scripts

This page gathers information about the use of the ::first-letter pseudo-element proposed for CSS3. It includes a summary of comments made by numerous people.

We are looking for comments and discussion on this topic. To comment:

Background

The latest working draft of CSS3 Selectors proposes the ::first-letter pseudo-element.

The ::first-letter pseudo-element represents the first letter of the first line of a block, if it is not preceded by any other content (such as images or inline tables) on its line. It allows that first letter to be styled individually, without markup. It may be used for "initial caps" and "drop caps", which are common typographical effects in text in Latin script.

We commented to the CSS Working Group that they need to define 'letter' more carefully, and proposed that they specify that 'letter' equates to 'default grapheme cluster', as described in the Unicode Standard Annex #29.

(A rough and ready explanation of this is that base characters and any following combining characters are styled together. So

0065: e LATIN SMALL LETTER E + 0301: ́ COMBINING ACUTE ACCENT

would be handled as a single letter.)
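As a rough illustration of that explanation, the following Python sketch takes the first character of a string together with any combining marks that follow it. The function name `first_letter_simple` is hypothetical, and the rule used (extend over characters whose general category is a mark) is only an approximation of a default grapheme cluster, not the full algorithm in Unicode Standard Annex #29.

```python
import unicodedata


def first_letter_simple(text: str) -> str:
    """Return the first character plus any combining marks that follow it.

    Simplified approximation for illustration; not the full UAX #29 rules.
    """
    if not text:
        return ""
    end = 1
    # Extend over following marks (general categories Mn, Mc, Me).
    while end < len(text) and unicodedata.category(text[end]).startswith("M"):
        end += 1
    return text[:end]


word = "e\u0301cole"  # 'école' written as e + COMBINING ACUTE ACCENT
letter = first_letter_simple(word)
print([f"{ord(c):04X}" for c in letter])  # ['0065', '0301']
```

Under this rule the base letter and its accent are returned together, so a style applied to the result would cover both.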

We also suggested that implementors should then be encouraged to provide tailored algorithms on a per-language basis to cope with anomalies, particularly those that may occur in non-Latin scripts.

Here are some initial questions for which we are seeking answers:

[1] Are there scripts that would never use this approach?

[2] We mention 'initial caps' and 'drop caps' above. What other types of styling would be commonly applied in other scripts if this feature were available?

[3] What script features would cause difficulties, e.g. syllabic groupings (see the Indic script example below), ligatures, or cursive text (e.g. Arabic, Urdu, etc.), and how would the script normally deal with them?

Indic Scripts

Indic script behavior relates to syllables, rather than individual letter forms. In the Hindi word स्थिति ('sthiti') the sequence of characters in the first syllable is as follows in memory:

0938: स DEVANAGARI LETTER SA
094D: ् DEVANAGARI SIGN VIRAMA
0925: थ DEVANAGARI LETTER THA
093F: ि DEVANAGARI VOWEL SIGN I

The displayed text, however, is as follows:

[Image: The first syllable of the word 'sthiti' in Hindi, showing the positions of the characters after display.]

Note how the vowel sign appears to the left of the first character, not the third.
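The in-memory order of these four characters can be checked with a short Python sketch using the standard `unicodedata` module. The helper name `describe` is hypothetical. The point of interest is that the vowel sign is stored last, even though it is displayed on the left.

```python
import unicodedata

# First syllable of 'sthiti', in memory order.
syllable = "\u0938\u094D\u0925\u093F"


def describe(text):
    """Return (code point, Unicode name) pairs in memory order."""
    return [(f"{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


for code, name in describe(syllable):
    print(code, name)
# 0938 DEVANAGARI LETTER SA
# 094D DEVANAGARI SIGN VIRAMA
# 0925 DEVANAGARI LETTER THA
# 093F DEVANAGARI VOWEL SIGN I
```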

There are two default grapheme clusters here. The first includes SA+VIRAMA. The second is the last two characters, THA+I.

From the feedback we have received it appears that first-letter styling will be needed for Indic scripts. We have examples in the mail archive of such styling in Devanagari, Bengali, and Malayalam, and we have reports that it is needed for other scripts, such as Telugu, Tamil and Kannada.

We see that the styling is done on the basis of the syllable, not the first character. A syllable includes a base consonant and any combination of the following characters in the text stream:

These combinations are NOT equivalent to default grapheme clusters, as defined by Unicode. The default grapheme cluster is only a part of an Indic syllable cluster. This means that, as it stands, user agent developers will need to implement special algorithms to support first-letter styling in Indic text. Such algorithms will need to detect automatically that such rules are applicable to the text being styled. It will be interesting to ascertain whether the rules vary by script only or also by language. If the latter, then it is important to mark up the language of the text correctly.
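To make the difference concrete, here is a minimal Python sketch that compares a mark-based grouping with a syllable-based grouping for the word 'sthiti'. It rests on several assumptions: `mark_clusters` only approximates default grapheme clusters as this page describes them (a base character plus the marks that follow it), `syllable_clusters` uses a hypothetical rule that joins a cluster ending in virama to the next one, and only the Devanagari virama is handled. A real user agent would need a fuller, script-aware algorithm.

```python
import unicodedata

VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA; other Indic scripts have their own


def mark_clusters(text):
    """Group each base character with the combining marks that follow it.

    Rough stand-in for default grapheme clusters; not the full UAX #29 rules.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def syllable_clusters(text):
    """Join a cluster ending in virama to the cluster after it (assumed rule)."""
    syllables = []
    for cluster in mark_clusters(text):
        if syllables and syllables[-1].endswith(VIRAMA):
            syllables[-1] += cluster
        else:
            syllables.append(cluster)
    return syllables


word = "\u0938\u094D\u0925\u093F\u0924\u093F"  # 'sthiti'
print(len(mark_clusters(word)))      # 3
print(len(syllable_clusters(word)))  # 2
```

With these assumptions, styling the first mark-based cluster would cover only SA+VIRAMA, whereas styling the first syllable covers SA+VIRAMA+THA+I, which is the unit the examples we received actually style.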

Note that the order in which these characters are displayed may be different from the order in memory. Note also that there is no one-to-one mapping between the codes and the glyphs used. There are often ligatures, vowel signs that appear on both sides of the consonant base, etc. The styling is applied to all glyphs used to represent the syllable as a whole.

The examples show a predominance of styling similar to what would be called 'drop letter' in English. Where a character is enlarged in a script that has a headstroke, the headstrokes of the large text and the regular text are typically at approximately the same level, but commonly do not join.

The following is an example of a drop letter in Hindi.

[Image: Hindi example of a drop letter]

In some cases there is additional coloring applied to a drop letter. In other cases, the coloring is the distinctive styling.

We also had some examples of increased font size without the drop letter characteristics. This example is in the Malayalam script.

[Image: Malayalam example of increased font size without drop]

Cursive Scripts

Since Arabic and Mongolian letters in a word are normally joined, has first letter styling been used at all in these scripts?

Chinese, Japanese and Korean

Do languages using these scripts do first letter styling?

Cyrillic, Greek, Armenian, etc

Do languages using these scripts do first letter styling?

Other scripts

Do other scripts need first letter styling? In particular, do they have any special requirements?

Further reading

Author: Richard Ishida.

Content first published 14 July, 2006. Last substantive update 2006-07-14 11:05 GMT. This version 2006-07-14 11:05 GMT.

For the history of document changes, search for uf-firstletter in the i18n blog.