[css-text] I18N-ISSUE-313: Definition of grapheme clusters

10 messages.

[css-text] I18N-ISSUE-313: Definition of grapheme clusters
"Phillips, Addison"   Fri, 24 Jan 2014 18:19:21 +0000

Received on Friday, 24 January 2014 18:20:55 UTC

Sent to: www-style@w3.org, www-style@w3.org
Copied to: www-international@w3.org.

State: OPEN WG Comment
Product: CSS3-text
Raised by: Richard Ishida
Opened on: 2013-12-11
Description:

1.3. Terminology
http://www.w3.org/TR/css3-text/#terms

"A grapheme cluster is what a language user considers to be a character or a basic unit of the script."
"The UA may further tailor the definition as required by typographical tradition."
Example 1

I think a grapheme cluster should be defined in the CSS spec as follows: A grapheme cluster is a sequence of characters as defined by the Unicode specification that should be treated as a unit for typographic processing. This generally approximates to what a language user considers to be a letter or basic unit of the script.

I don't think applications should redefine what a grapheme cluster is; that definition is established by the Unicode standard. Rather, we should say that applications sometimes require additional rules beyond the use of 'grapheme clusters' in order to handle the typographic traditions of particular scripts.

An appropriate example for this section of where further rules are needed is that of Devanagari syllables, where the grapheme cluster only includes part of the syllable. For an example, see the last picture on the page at http://rishida.net/docs/unicode-tutorial/part3#graphemes and the text below it. For most operations that rely on grapheme clusters, Devanagari needs additional rules to keep together the whole typographic syllable. This issue is relevant for a large proportion of complex scripts.

I think that the example of the Thai behaviour may be better as a note in the letter-space and justification sections, especially since I believe that the behaviour described is not relevant for line breaking and other operations.

It may be worth mentioning, also, that although the Thai examples show that U+0E33 THAI CHARACTER SARA AM needs to be decomposed first, the desired behaviour still relies on correct application of the standard grapheme cluster rules thereafter to ensure that the small circle resulting from the decomposition stays with the base character and other associated diacritics.

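The decomposition step described above can be sketched in Python with the standard `unicodedata` module. This is only an illustration under stated assumptions: `group_with_marks` is a hypothetical helper that attaches nonspacing and enclosing marks (general category Mn or Me) to the preceding base, which is much simpler than the UAX #29 rules, and the sample word is chosen here for illustration rather than taken from the spec.

```python
import unicodedata

SARA_AM = "\u0e33"  # THAI CHARACTER SARA AM

# NFKD applies the compatibility decomposition recorded in the
# Unicode Character Database for SARA AM.
decomposed = unicodedata.normalize("NFKD", SARA_AM)
for ch in decomposed:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} ({unicodedata.category(ch)})")


def group_with_marks(text: str) -> list[str]:
    """Hypothetical helper: attach Mn/Me marks to the preceding base.

    This is a rough stand-in for grapheme cluster segmentation,
    not an implementation of UAX #29.
    """
    units: list[str] = []
    for ch in text:
        if units and unicodedata.category(ch) in ("Mn", "Me"):
            units[-1] += ch
        else:
            units.append(ch)
    return units


# Base consonant + tone mark + SARA AM (sample word chosen for illustration).
word = "\u0e19\u0e49\u0e33"
units = group_with_marks(unicodedata.normalize("NFKD", word))
print([[f"U+{ord(c):04X}" for c in u] for u in units])
```

After decomposition the small circle (U+0E4D) is a nonspacing mark, so the grouping keeps it with the base consonant and the tone mark, while the remaining vowel part (U+0E32) becomes a separate unit.
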
Re: [css-text] I18N-ISSUE-313: Definition of grapheme clusters
John Cowan   Fri, 24 Jan 2014 16:46:57 -0500

Received on Friday, 24 January 2014 21:47:25 UTC

Sent to: addison@lab126.com
Copied to: www-style@w3.org, www-style@w3.org, www-international@w3.org.

Phillips, Addison scripsit:

> "A grapheme cluster is what a language user considers to be a character or a basic unit of the script."
> "The UA may further tailor the definition as required by typographical tradition."
> Example 1
>
> I think a grapheme cluster should be defined in the CSS spec as follows: A grapheme cluster is a sequence of characters as defined by the Unicode specification that should be treated as a unit for typographic processing. This generally approximates to what a language user considers to be a letter or basic unit of the script.
>
> I don't think applications should redefine what a grapheme cluster is; that definition is established by the Unicode standard. Rather, we should say that applications sometimes require additional rules beyond the use of 'grapheme clusters' in order to handle the typographic traditions of particular scripts.

The definition of "grapheme cluster" in the Unicode Glossary defers to UAX 29, but the current revision (23) of that UAX doesn't actually have a formal definition of "grapheme cluster", except as a cover term for default grapheme clusters, extended grapheme clusters, and tailored grapheme clusters, which *are* defined.

It does, however, introduce the informal term "user-perceived character", and says that grapheme clusters (by implication, of one of the above varieties) are an approximation to user-perceived characters.

This seems to me like good terminology to follow.

-- 
John Cowan    http://www.ccil.org/~cowan    cowan@ccil.org
I could dance with you till the cows come home.  On second thought, I'd
rather dance with the cows when you come home.  --Rufus T. Firefly

RE: [css-text] I18N-ISSUE-313: Definition of grapheme clusters
"Phillips, Addison"   Fri, 24 Jan 2014 22:26:41 +0000

Received on Friday, 24 January 2014 22:27:17 UTC

Sent to: cowan@mercury.ccil.org
Copied to: www-style@w3.org, www-style@w3.org, www-international@w3.org.

> The definition of "grapheme cluster" in the Unicode Glossary defers to UAX 29, but the current revision (23) of that UAX doesn't actually have a formal definition of "grapheme cluster", except as a cover term for default grapheme clusters, extended grapheme clusters, and tailored grapheme clusters, which *are* defined.
>
> It does, however, introduce the informal term "user-perceived character", and says that grapheme clusters (by implication, of one of the above varieties) are an approximation to user-perceived characters.

The specific quote I think you refer to is:

--
It is important to recognize that what the user thinks of as a "character"—a basic unit of a writing system for a language—may not be just a single Unicode code point. Instead, that basic unit may be made up of multiple Unicode code points. To avoid ambiguity with the computer use of the term character, this is called a user-perceived character. For example, “G” + acute-accent is a user-perceived character: users think of it as a single character, yet is actually represented by two Unicode code points. These user-perceived characters are approximated by what is called a grapheme cluster, which can be determined programmatically.
--

> This seems to me like good terminology to follow.

The challenge here is that Unicode (and CSS) both define the term "character" to have a specific meaning equivalent to a Unicode codepoint, i.e. the "computer use" of the term. CSS3 Text, however, attempts to redefine and then use the term "character" to also mean a "user-perceived character". The use of the word "character" after that point is somewhat haphazard, leading to a number of problems in understanding the spec. Our primary comment is that we'd prefer to see a term other than (unadorned) "character" used where "user-perceived character" is intended.

I agree that we could use "user-perceived character" instead of "grapheme cluster". My reservation about that is that a "grapheme cluster" (of various flavors and stripes) can be "determined programmatically", which is a consideration for implementation. If the "user-perceived character" cannot be determined programmatically, it is not possible to do much with it in terms of CSS. Hence, I think using the [whatever] "grapheme cluster" terminology is useful here because that is the unit that CSS will actually operate on in the cases where "user-perceived character" is intended.

The ending part of my comment (which grew out of WG discussion):

> ... Rather, we should say that applications sometimes require additional rules beyond the use of 'grapheme clusters' in order to handle the typographic traditions of particular scripts.

... suggests that some scripts require "tailored grapheme clusters" (we're aware of claims of Indic script or language requirements in this regard) but for which there is no fully-defined tailoring to point to.

HTH,

Addison

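The difference between code points and a programmatically determined cluster can be illustrated with the “G” + acute-accent example from the UAX 29 quote. The following is a minimal sketch in Python, assuming only the standard `unicodedata` module; `naive_clusters` is a hypothetical helper that groups a base with following nonspacing or enclosing marks, and it is a rough approximation rather than an implementation of the default, extended, or tailored grapheme clusters that UAX 29 defines.

```python
import unicodedata


def naive_clusters(text: str) -> list[str]:
    """Hypothetical helper: a base code point plus following Mn/Me marks.

    Only an approximation for illustration; real segmentation follows
    the boundary rules in UAX #29.
    """
    clusters: list[str] = []
    for ch in text:
        if clusters and unicodedata.category(ch) in ("Mn", "Me"):
            clusters[-1] += ch  # the mark stays with its base
        else:
            clusters.append(ch)  # a new base starts a new unit
    return clusters


text = "G\u0301o"  # "G" + COMBINING ACUTE ACCENT + "o"

print(len(text))                  # 3 code points
print(len(naive_clusters(text)))  # 2 units under this approximation
```

The point of the sketch is that the unit count is computed by a fixed rule over code points, which is the property that makes "grapheme cluster" usable for implementation where "user-perceived character" alone is not.
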
Re: [css-text] I18N-ISSUE-313: Definition of grapheme clusters
Peter Moulder   Sat, 25 Jan 2014 11:37:02 +1100

Received on Saturday, 25 January 2014 00:37:22 UTC

Sent to: www-style@w3.org.

On Fri, Jan 24, 2014 at 10:26:41PM +0000, Phillips, Addison wrote:

> The challenge here is that Unicode (and CSS) both define the term "character" to have a specific meaning equivalent to a Unicode codepoint,

Unicode considers the word "character" to have several meanings:

  http://www.unicode.org/glossary/#character

This only further supports your conclusion here:

> Our primary comment is that we'd prefer to see a term other than (unadorned) "character" used where "user-perceived character" is intended.

pjrm.

Re: [css-text] I18N-ISSUE-313: Definition of grapheme clusters
Richard Ishida   Fri, 21 Feb 2014 13:53:33 +0000

Received on Friday, 21 February 2014 13:54:03 UTC

Sent to: www-style@w3.org
Copied to: www-style@w3.org, www-style@w3.org, www-international@w3.org.

On the subject of grapheme clusters, rather than characters, it may help to note the Unicode Standard definitions here:

====
*Grapheme*. (1) A minimally distinctive unit of writing in the context of a particular writing system. For example, ‹b› and ‹d› are distinct graphemes in English writing systems because there exist distinct words like big and dig. Conversely, a lowercase italiform letter a and a lowercase Roman letter a are not distinct graphemes because no word is distinguished on the basis of these two different forms. (2) What a user thinks of as a character.

*Grapheme Cluster*. The text between grapheme cluster boundaries as specified by Unicode Standard Annex #29, "Unicode Text Segmentation." (See definition D60 in Section 3.6, Combination.) A grapheme cluster represents a horizontally segmentable unit of text, consisting of some grapheme base (which may consist of a Korean syllable) together with any number of nonspacing marks applied to it.
======

The text in the spec "A grapheme cluster is what a language user considers to be a character or a basic unit of the script." is incorrect. What a user considers to be a basic unit of the script is a grapheme. A grapheme cluster is a construct with a specific description that tries to approximate to the user perceived graphemes (and signally fails in some contexts).

If you want a vague term to refer to something that includes grapheme clusters and characters in the spec, why not use 'grapheme' rather than 'character'.

RI

Re: [css-text] I18N-ISSUE-313: Definition of grapheme clusters
Richard Ishida   Thu, 22 May 2014 19:09:40 +0100

Received on Thursday, 22 May 2014 18:10:10 UTC

Sent to: www-international@w3.org
Copied to: www-style@w3.org, www-style@w3.org.

Thank you for reworking section 1.3.1 in the latest editor's version (dated 20 March 2014). It's great to move away from the vague definition of character that we had before, but I'm still concerned about the text for a couple of reasons.

One is that, as I mentioned already, it is not correct to say 'the "user-perceived character", also know as the grapheme cluster.' The equivalent term for a user-perceived character is 'grapheme'. The 'grapheme cluster' is a unit derived from rules in Unicode to yield an *approximation* to a user-defined character. Not all user-perceived characters are grapheme clusters.

Another is a worry whether we can really effectively split the world into semantically-perceived and visually-perceived characters - especially given the 'etc' that appears in the definition where we list appropriate operations for each. For example, are we sure that first-letter operations require semantically- rather than visually-perceived characters in all cases? Where does cursor movement fit here? etc.

What about Arabic justification which may involve increasing word-internal 'gaps' that occur due to one glyph not joining with the following glyph. These are relevant units for justification of Arabic text, but they aren't user-perceived characters.

And what about the case where Indic script text units vary according to the font in use. As I understand it, a text unit for wrapping or stretching in Devanagari can encompass a CvCVD (consonant, virama, consonant, vowel sign, diacritic) only if the font has glyphs to show this is a single visual unit (eg. ligatures, half-forms, special glyphs) and hides the virama. If the font is changed, such that the virama becomes visible, we are now dealing with two text units. This font-specific behaviour for the same sequence of code points is a contextual difference that, I think, cuts across both the semantic- and visual- categories currently defined.

I think that actually all we may be trying to say is that the atomic unit of text for a particular operation may not be the same as for another, but that we start from a base of grapheme clusters and require the application to take into account variances and extensions of that as needed. What if we simply talk in terms of vague 'typographic units', or 'text units', or some such, but describe up front how these can be different sequences of code points depending on the operation to be performed (ie. not try to define just two specific scenarios)?

To help with that, I propose the following text for section 1.3.1 to replace the 2nd paragraph and the DL list.

=====================================================
For text layout the appropriate atomic units of text may include more than one Unicode code point. Often these text units correspond to *graphemes*, ie. what a language user (as opposed to a computer programmer) considers to be a character or basic unit of the script. Unfortunately, the appropriate units may be different for the same sequence of Unicode codepoints according to the operation which is being performed, or according to the visual context. (For example, line-breaking and letter-spacing may interpret a sequence of Thai characters that include U+0E33 THAI CHARACTER SARA AM differently; or the behaviour of a conjunct consonant in a script such as Devanagari may depend on the font in use).

The Unicode specification defines various combinations of code points as forming *extended grapheme clusters*. This is an attempt to indicate what users perceive as characters, and the term is described in detail in the Unicode Technical Report: Text Boundaries [UAX29]. Much of the time this produces the necessary text units for layout, however it is only an approximation and in some cases, such as those mentioned above, additional rules need to be applied by the application to tailor the definition of the text unit appropriately for the context.

Applications need to be aware of the typographic rules that must be used to determine units of text for a given operation on a particular script, and apply them to achieve the appropriate segmentation of the text for that operation.
=====================================================

RI

PS: Btw, the definition of semantically-perceived character has a sentence that says that tailoring may be necessary. Doesn't this also apply to visually-perceived characters (eg. in the Thai case)?

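The Devanagari CvCVD case, and the idea of starting from grapheme clusters and then applying additional rules, can be sketched as follows. This is a hedged illustration only: `base_plus_marks` and `merge_conjuncts` are hypothetical helpers, the sample sequence is chosen here rather than taken from the thread, the grouping is a rough stand-in for the cluster rules as discussed in this thread rather than an implementation of any particular revision of UAX #29, and the font-dependent behaviour described above is not modelled at all.

```python
import unicodedata

VIRAMA = "\u094d"  # DEVANAGARI SIGN VIRAMA


def base_plus_marks(text: str) -> list[str]:
    """Hypothetical helper: a base plus following marks (Mn, Mc, Me)."""
    units: list[str] = []
    for ch in text:
        if units and unicodedata.category(ch) in ("Mn", "Mc", "Me"):
            units[-1] += ch
        else:
            units.append(ch)
    return units


def merge_conjuncts(units: list[str]) -> list[str]:
    """Hypothetical tailoring: a unit ending in virama joins the next unit.

    This assumes the font renders the conjunct as one visual unit,
    which is exactly the condition that cannot be known from code
    points alone.
    """
    merged: list[str] = []
    join_next = False
    for unit in units:
        if join_next:
            merged[-1] += unit
        else:
            merged.append(unit)
        join_next = unit.endswith(VIRAMA)
    return merged


# Consonant, virama, consonant, vowel sign, diacritic (CvCVD).
syllable = "\u0915\u094d\u0937\u093f\u0902"

untailored = base_plus_marks(syllable)
tailored = merge_conjuncts(untailored)
print(len(untailored), len(tailored))  # 2 units before tailoring, 1 after
```

Under these assumptions the untailored grouping splits the syllable after the virama, while the tailored pass keeps the whole typographic syllable together. A layout engine would additionally need font information to decide whether that merge is appropriate.
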
Re: [css-text] I18N-ISSUE-313: Definition of grapheme clusters
fantasai   Wed, 25 Jun 2014 08:11:03 -0700

Received on Wednesday, 25 June 2014 15:11:37 UTC

Sent to: ishida@w3.org, www-international@w3.org
Copied to: www-style@w3.org, www-style@w3.org.

On 05/22/2014 11:09 AM, Richard Ishida wrote:

> One is that, as I mentioned already, it is not correct to say 'the "user-perceived character", also know as the grapheme cluster.' The equivalent term for a user-percieved character is 'grapheme'. The 'grapheme cluster' is a unit derived from rules in Unicode to yield an *approximation* to a user-defined character. Not all user-perceived characters are grapheme clusters.

I'm fine to remove that phrase if it's problematic. Is it problematic in UAX29 also? (Does it need a bug filed there?)

> Another is a worry whether we can really effectively split the world into semantically-perceived and visually-perceived characters - especially given the 'etc' that appears in the definition where we list appropriate operations for each. For example, are we sure that first-letter operations require semantically- rather than visually-perceived characters in all cases? Where does cursor movement fit here? etc.

I think I have to conclude that no, we can't.

> What about Arabic justification which may involve increasing word -internal 'gaps' that occur due to one glyph not joining with the following glyph. These are relevant units for justification of Arabic text, but they aren't user-perceived characters.

Is that really a relevant concept? Increasing word-internal 'gaps' is a horrible way to justify Arabic text, look:
  http://dev.w3.org/csswg/css-text/arabic-stretch-unjoined
It results in uneven typographic color and obscures word boundaries. It might exist, but I've never seen it...

> And what about the case where Indic script text units vary according to the font in use. As I understand it, a text unit for wrapping or stretching in Devanagari can encompass a CvCVD (consonant, virama, consonant, vowel sign, diacritic) only if the font has glyphs to show this is a single visual unit (eg. ligatures, half-forms, special glyphs) and hides the virama. If the font is changed, such that the virama becomes visible, we are now dealing with two text units. This font-specific behaviour for the same sequence of code points is a contextual difference that, I think, cuts across both the semantic- and visual- categories currently defined.

Okay.

> I think that actually all we may be trying to say is that the atomic unit of text for a particular operation may not be the same as for another, but that we start from a base of grapheme clusters and require the application to take into account variances and extensions of that as needed. What if we simply talk in terms of vague 'typographic units', or 'text units', or some such, but describe up front how these can be different sequences of code points depending on the operation to be performed (ie. not try to define just two specific scenarios)?

Overall, I agree with the concept, but I want to make sure that the spec is somehow understandable to people who are not either
  a) members of the i18nWG or a similar community
  b) text layout implementation experts

(If Lea Verou cannot make sense of the CSS Text spec well enough to use it as a reference for the properties it defines, then I consider the spec to be a failure.)

I've reworked the Terminology section following your suggestions: will work on the rest of the spec tomorrow and hopefully have it all make sense soon. ^_^

~fantasai

Re: [css-text] I18N-ISSUE-313: Definition of grapheme clusters
James Clark   Thu, 26 Jun 2014 00:04:35 +0700

Received on Wednesday, 25 June 2014 17:05:25 UTC

Sent to: ishida@w3.org
Copied to: www-international@w3.org, www-style@w3.org, www-style@w3.org.

On Fri, May 23, 2014 at 1:09 AM, Richard Ishida <ishida@w3.org> wrote:

> Another is a worry whether we can really effectively split the world into semantically-perceived and visually-perceived characters - especially given the 'etc' that appears in the definition where we list appropriate operations for each. For example, are we sure that first-letter operations require semantically- rather than visually-perceived characters in all cases? Where does cursor movement fit here? etc.
> characters (eg. in the Thai case)?

The fundamental split, in my view, is between characters and glyphs. There are operations that are best understood as working on clusters of characters and there are operations that are best understood as working on clusters of glyphs.

I would argue that cursor movement and line-breaking are character-level operations, whereas first-letter operations and letter-spacing are glyph-level operations. For example, in Thai the boundary following a first-letter or the boundary where letter-space is to be inserted sometimes does not correspond to a boundary between characters.

James

Re: [css-text] I18N-ISSUE-313: Definition of grapheme clusters
Andrew Cunningham   Thu, 26 Jun 2014 09:59:01 +1000

Received on Wednesday, 25 June 2014 23:59:28 UTC

Sent to: jjc@jclark.com
Copied to: ishida@w3.org, www-international@w3.org, www-style@w3.org, www-style@w3.org.

On 26 June 2014 03:04, James Clark <jjc@jclark.com> wrote:

> On Fri, May 23, 2014 at 1:09 AM, Richard Ishida <ishida@w3.org> wrote:
>
>> Another is a worry whether we can really effectively split the world into semantically-perceived and visually-perceived characters - especially given the 'etc' that appears in the definition where we list appropriate operations for each. For example, are we sure that first-letter operations require semantically- rather than visually-perceived characters in all cases? Where does cursor movement fit here? etc.
>> characters (eg. in the Thai case)?
>
> The fundamental split, in my view, is between characters and glyphs. There are operations that are best understood as working on clusters of characters and there are operations that are best understood as working on clusters of glyphs.
>
> I would argue that cursor movement and line-breaking are character-level operations, whereas first-letter operations and letter-spacing are glyph-level operations. For example, in Thai the boundary following a first-letter or the boundary where letter-space is to be inserted sometimes does not correspond to a boundary between characters.

And for some languages the boundary for first-letter may not correspond to the first character or to the first grapheme cluster.

Next week I hope to free enough time to play with JavaScript and see if I can put together a script to detect the first syllable of an element for a couple of languages where it would be a useful alternative.

A.

-- 
Andrew Cunningham
Project Manager, Research and Development
(Social and Digital Inclusion)
Public Libraries and Community Engagement
State Library of Victoria
328 Swanston Street
Melbourne VIC 3000
Australia

Ph: +61-3-8664-7430
Mobile: 0459 806 589
Email: acunningham@slv.vic.gov.au
       lang.support@gmail.com

http://www.openroad.net.au/
http://www.mylanguage.gov.au/
http://www.slv.vic.gov.au/

Re: [css-text] I18N-ISSUE-313: Definition of grapheme clusters
Richard Ishida   Thu, 07 Aug 2014 14:35:47 +0100

Received on Thursday, 7 August 2014 13:36:16 UTC

Sent to: www-international@w3.org
Copied to: www-style@w3.org, www-style@w3.org.

Thank you for your work on this. The i18n WG is now happy to close this issue.

RI