I18N/CanonicalNormalizationIssues
Summary of Problem
Many things that people perceive as "a character" can be represented in multiple ways in Unicode. To take a simple example, a small "a" with an acute accent can be represented either as:
- U+00E1 LATIN SMALL LETTER A WITH ACUTE
or as the sequence:
- U+0061 LATIN SMALL LETTER A
- U+0301 COMBINING ACUTE ACCENT
The tools users use to input text may vary as to which of these forms is produced, depending on the programs used (operating systems, input methods, editors) and perhaps on how the user enters the text.
Normalization affects some languages more than others. English, for example, uses very few combining characters, while other languages use them quite often: Vietnamese relies heavily on combining characters, and Hangul can be written either as precomposed syllable characters or as canonically equivalent sequences of conjoining Jamo.
Unicode normalization is the process of converting text to a form in which these differences are not present. NFC normalization is a set of rules for converting strings containing characters such as those above to the most-combined (composed) form (e.g., U+00E1 above) wherever possible, and NFD normalization is a set of rules for converting everything to the most-separated (decomposed) form (e.g., U+0061 U+0301 above). The focus here is only on canonically equivalent character sequences and not on compatibility-equivalent characters because, particularly for markup, it is easy to continue the longstanding recommendation that authors avoid such compatibility characters (in markup as opposed to content).
Various Web technologies depend on string matching. For example, CSS selectors allow matching of author-chosen classes and IDs, and the document.getElementById() method allows retrieving an element by its ID. When authors use strings in their own language, those strings should match when the author perceives those strings to be the same, whether or not different tools were used to produce, e.g., the markup and the style sheet. This author expectation is not met when the string match fails because of differences in Unicode normalization.
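A minimal illustration of the matching failure, using Python's standard unicodedata module (the strings here are only examples, not taken from any particular specification):

```python
import unicodedata

composed = "\u00E1"        # á as the single precomposed character U+00E1
decomposed = "a\u0301"     # á as U+0061 followed by U+0301 COMBINING ACUTE ACCENT

# The two spellings render identically but are different code point sequences,
# so a naive code-point comparison (as in getElementById or selector matching) fails.
print(composed == decomposed)                                   # False

# Normalizing both sides to the same form (NFC here) makes the match succeed.
print(unicodedata.normalize("NFC", composed) ==
      unicodedata.normalize("NFC", decomposed))                 # True
```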
Canonical Normalization
Need for normalization
- Normalizing the order of combining character sequences (where characters have a non-zero canonical combining class). For some combining characters the order is significant; for others the order is simply happenstance, and different orderings of such marks must not cause otherwise identical character sequences to be treated as unequal. (Note that this normalization issue would exist in Unicode even if precomposed characters had never been admitted.)
- Normalizing between precomposed and decomposed characters
- Normalizing canonical singletons (characters where perceived duplicates are encoded in UCS/Unicode, or where encoded characters are controversially deemed to be duplicates)
An example of a combining character sequence that has multiple canonical representations (the convergence of these forms is demonstrated in the sketch following the list):
- a) Ệ (U+1EC6) [NFC]
- b) Ê ◌̣ (U+00CA-U+0323)
- c) Ẹ ◌̂ (U+1EB8-U+0302)
- d) E ◌̂ ◌̣ (U+0045-U+0302-U+0323)
- e) E ◌̣ ◌̂ (U+0045-U+0323-U+0302) [NFD]
- E: canonical combining class = 0 (base character)
- Combining Dot Below: canonical combining class = 220 (below)
- Combining Circumflex Accent: canonical combining class = 230 (above)
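The convergence of all five spellings can be checked directly with Python's unicodedata module; this sketch simply feeds each form through NFC and NFD:

```python
import unicodedata

# The five canonically equivalent spellings of Ệ listed above.
forms = [
    "\u1EC6",              # a) precomposed
    "\u00CA\u0323",        # b) Ê + combining dot below
    "\u1EB8\u0302",        # c) Ẹ + combining circumflex
    "\u0045\u0302\u0323",  # d) E + circumflex + dot below
    "\u0045\u0323\u0302",  # e) E + dot below + circumflex
]

# All five collapse to a single NFC string (the precomposed U+1EC6) and to a
# single NFD string (E, dot below, circumflex, in canonical order).
print(len({unicodedata.normalize("NFC", f) for f in forms}))    # 1
print(len({unicodedata.normalize("NFD", f) for f in forms}))    # 1
```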
Note: The Ệ (U+1EC6) example above allows a single precomposed character (three octets in NFC UTF-8) to express what would otherwise require five octets for the three-character decomposed form. However, for minority languages where Unicode sometimes allocates reserved code points but rarely assigns precomposed characters for common combining sequences, the analogous difference might be 9 octets rather than 3 octets in UTF-8 (or 6 versus 2 octets in UTF-16). So the Unicode policy of avoiding precomposed representations for newly added (chiefly minority) scripts may also be a significant I18N problem. In fact, Unicode has reserved over 1,000 code points within scripts, many of which are ostensibly for unassigned precomposed characters. If the more compact encoding gained by precomposition in NFC is important for English and other Latin-based languages, then it is even more important for these minority languages, which, though they may have fewer characters than Latin, still cannot be encoded in the one- or two-octet-per-code-point range of UTF-8.
An example of a singleton where the two characters are identified as canonically equivalent even though the equivalence is disputed within the Unicode community:
- a) 慈 (U+2F8A6) [non-normalized]
- b) 慈 (U+6148) [NFC and NFD]
Another such example (cited by Ambrose Li)
- a) 茝 (U+2F999) [non-normalized]
- b) 茝 (U+831D) [NFC and NFD]
Normalization forms
- NFC, where characters with composed canonical equivalents as of Unicode 3.1 remain in (or are converted to) their precomposed form, all other canonically decomposable characters are replaced by their canonical equivalents, and combining characters are reordered according to the integer value of their canonical combining class property (1–255)
- NFD where all characters with canonical decompositions are replaced with their canonical decompositions and combining characters are reordered according to the integer value of their canonical combining class property (1–255)
- Another, possibly new, W3C normalization form (NFW3C?) that excludes singletons (since singleton normalization can be lossy relative to author expectations). Such a new normalization form might also admit precomposed characters added after Unicode 3.1 (including a whole slew of new precomposed characters for minority scripts that are not part of NFC normalization). Once this new normalization form existed, new scripts could also be encoded with precomposed characters whenever a new base character is assigned a code point.
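No such NFW3C form exists in Unicode; the following is only a sketch of one possible reading of the proposal (NFC behaviour, except that canonical singletons pass through untouched). The function names are hypothetical, and the sketch is simplified in that a combining mark immediately following a preserved singleton is not recomposed with it:

```python
import unicodedata

def is_canonical_singleton(ch: str) -> bool:
    """True if ch has an untagged (canonical) decomposition to exactly one character."""
    decomp = unicodedata.decomposition(ch)
    return bool(decomp) and not decomp.startswith("<") and len(decomp.split()) == 1

def nfw3c(text: str) -> str:
    """Hypothetical NFW3C: NFC everywhere, but canonical singletons are preserved."""
    out, run = [], []
    for ch in text:
        if is_canonical_singleton(ch):
            out.append(unicodedata.normalize("NFC", "".join(run)))
            run = []
            out.append(ch)                       # kept, not replaced by its decomposition
        else:
            run.append(ch)
    out.append(unicodedata.normalize("NFC", "".join(run)))
    return "".join(out)

print(nfw3c("a\u0301") == "\u00E1")              # True: ordinary composition still happens
print(nfw3c("\U0002F8A6") == "\U0002F8A6")       # True: the singleton survives (NFC would lose it)
```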
Normalization stage
- In input systems and immediately upon input
- In parsers and other text processing
- Late stage for string matching and other string comparisons
Normalization extent
(in increasing quantity of characters)
- Markup only (element names, attribute names)
- Markup plus attribute values (needed especially for CSS selector matching and DOM matching)
- All text
Other issues
- Issues of parser injection, where a character sequence starting with a grapheme-extending character (a character with the property Grapheme_Extend) is appended to another character sequence, requiring re-normalization of previously normalized strings.
- Note that this is similar to CRLF line-ending normalization, where a parser may need to check where a string is inserted to determine whether the normalization needs to be reapplied. However, whereas CRLF normalization cannot prohibit strings from starting with LF, it would be proper to prohibit authors from authoring content that, in these particular situations, starts with a Grapheme_Extend character (such grapheme extenders rarely make sense at the start of a character sequence anyway, and disallowing them in this particular situation would still allow workarounds that do not break a parser-based normalization solution).
- Performance issues: for example, an XML parser that currently ignores normalization would need to check every character used in markup and attribute values (as opposed to content) against the set of characters whose NFC_Quick_Check property is NO or MAYBE. Even for fully NFC-normalized content, the parser would need to ensure each such character is not a NO, and for each MAYBE check that it is permitted after its combining base character or the immediately preceding character. Also, even for fully normalized content, the parser/processor would still need to confirm that the combining characters are all arranged according to their canonical combining class integer value (1–255). (A sketch of such a verification pass follows.)
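As a rough sketch of that verification cost (Python; unicodedata does not expose the NFC_Quick_Check property directly, so unicodedata.is_normalized, available from Python 3.8, stands in for the real quick check):

```python
import unicodedata

def verify_nfc(text: str) -> bool:
    """Cheap first pass (canonical ordering of combining marks) plus a full check."""
    prev_ccc = 0
    for ch in text:
        ccc = unicodedata.combining(ch)      # canonical combining class; 0 for starters
        if 0 < ccc < prev_ccc:
            return False                     # marks out of canonical order: cannot be NFC
        prev_ccc = ccc
    return unicodedata.is_normalized("NFC", text)

print(verify_nfc("\u00E1"))                  # True  - precomposed á
print(verify_nfc("E\u0302\u0323"))           # False - circumflex (230) before dot below (220)
```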
Problems with canonical singletons
- Normalizing the canonical order of combining marks is essential to the Unicode processing model
- Likewise, normalizing between decomposed character sequences and precomposed characters became essential to Unicode as soon as the first precomposed character was assigned
- However, canonical singletons are fundamentally different from these other two cases
- The semantic equivalence of two canonically equivalent singletons is typically in dispute, which is precisely what led to the singletons being encoded at all rather than excluded entirely
- To reinforce the equivalence of singletons, Unicode fonts would need to treat the glyphs as identical in all cases so authors would not rely on the distinction, but that does not happen
- To reinforce the equivalence of singletons, Unicode input systems would need to always map the un-normalized (discouraged) canonical equivalent to the normalized (preferred) canonical equivalent singleton, but that does not happen (note that singletons always map to the same character whether normalized to NFC or NFD)
- Singletons are easier for authors to control than the use of precomposed or decomposed characters or the canonical ordering of combining marks
- There are dangers of semantic loss in persistently normalizing canonical singletons where no such danger occurs for persistent non-singleton normalization.
So singletons are fundamentally different from the other parts of canonical equivalence. There are certainly reasons for Unicode to avoid encoding such characters and even to discourage their use once encoded. However, Unicode has done little to communicate to end users and authors the need to avoid such singletons. Lumping them together with the other canonical equivalents causes normalization problems such as loss of semantic information. The poor communication regarding canonical singletons has also led to their misuse for grapheme variations (or even the misuse of plain text by including glyph variants in character sequences) instead of the use of registered variation selectors for this purpose. (Note too that using variation selectors rather than canonically decomposing characters means that no lookups are required to normalize text; variation selectors merely need to be ignored while comparing strings.)
It has been over 15 years since the first canonical singletons were added to the Unicode repertoire. Over that period, insufficient effort has been directed toward communicating to authors and authoring-tool implementers the need to treat a canonical singleton as identical to its preferred equivalent, in the same way that a combining mark in non-canonical order is treated as identical to the same mark in canonical order. Therefore W3C may have reason to exclude such singletons from any normalization advice it gives to authors and implementers.
For singletons (as for compatibility decomposable characters other than the phonetically related characters), the best approach might be for W3C to prohibit their use in markup and discourage their use in content (for all relevant W3C recommendations).
Ingredients for an NFC (or NFW3C) canonical normalization algorithm
- Check every character to determine whether it is a member of the set of characters whose NFC_Quick_Check property is NO (1,115 code points) or MAYBE (102 code points) [for either every character used for markup or every character, depending on the extent of normalization].
- For NFC_Quick_Check=MAYBE that are Grapheme_Extend characters, check whether the entire character sequence from the immediately preceding Grapheme_Base character to the last Grapheme_Extend character is permitted in NFC.
- For NFC_Quick_Check=NO characters and those MAYBE characters excluded by the previous step, replace the character or character sequence with the precomposed NFC-normalized character or the singleton decomposition.
- All remaining combining characters must be reordered according to their canonical combining class integer value (preserving their order otherwise).
[NFW3C would need a smaller NFW3C_Quick_Check=NO set of characters (the same as NFC_Quick_Check=NO except with no singleton decompositions), while the MAYBE set would remain the same; this makes for a faster algorithm and one that does not undermine author semantics, allowing lossless normalization. A sketch of the shared reordering step follows.]
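A minimal sketch of just that reordering step (Python), assuming the input has already been composed or decomposed as required; a real implementation applies this as part of the Unicode Canonical Ordering Algorithm:

```python
import unicodedata

def canonical_reorder(text: str) -> str:
    """Stable-sort each run of combining marks by canonical combining class."""
    out, run = [], []
    for ch in text:
        if unicodedata.combining(ch) > 0:
            run.append(ch)                   # part of the current combining-mark run
        else:
            out.extend(sorted(run, key=unicodedata.combining))   # sorted() is stable
            run = []
            out.append(ch)                   # a starter (class 0) flushes the run
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)

# Dot below (class 220) moves before circumflex (class 230), matching the
# canonical order shown in the Ệ example earlier on this page.
print(canonical_reorder("E\u0302\u0323") == "E\u0323\u0302")     # True
```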
Ingredients for an NFD (or NFW3D) canonical normalization algorithm
- Check every character to determine whether it is a member of the set of characters whose NFD_Quick_Check property is NO (13,215 code points) [for either every character used for markup or every character, depending on the extent of normalization].
- For each NFD_Quick_Check=NO character encountered, replace the character with the decomposed NFD-normalized character or character sequence.
- All combining characters must be reordered according to their canonical combining class integer value (preserving their order otherwise).
[NFW3D would need a smaller NFW3D_Quick_Check=NO set of characters (the same as NFD_Quick_Check=NO except with no singleton decompositions), making for a faster algorithm and one that does not undermine author semantics, allowing lossless normalization. A sketch of the decomposition step follows.]
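A minimal sketch of the decomposition step (Python); compatibility decompositions are deliberately skipped, and the algorithmic decomposition of Hangul syllables is omitted, so this is illustrative rather than a complete NFD implementation:

```python
import unicodedata

def canonical_decompose(ch: str) -> str:
    """Recursively expand the canonical (untagged) decomposition of one character."""
    decomp = unicodedata.decomposition(ch)
    if not decomp or decomp.startswith("<"):  # no decomposition, or compatibility-tagged
        return ch
    return "".join(canonical_decompose(chr(int(cp, 16))) for cp in decomp.split())

# Ệ (U+1EC6) expands, via U+1EB8, to E + dot below + circumflex; the reordering
# step shown above would then be applied to the fully decomposed text.
expanded = canonical_decompose("\u1EC6")
print([hex(ord(c)) for c in expanded])                           # ['0x45', '0x323', '0x302']
print(expanded == unicodedata.normalize("NFD", "\u1EC6"))        # True for this character
```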
Non-canonical normalization
A completely separate issue from canonical normalization is the issue of compatibility normalization (though the existence of singletons within canonical equivalence muddies the distinction somewhat). This type of normalization should likely only be performed on author request since the issues are too intricate for machine processing to deal with alone. However, the first category of semantically independent normalizations could be performed without user interaction for systems that are not dependent on any of these legacy text imaging issues (ligatures and other precomposed glyphs, vertical CJK rendering, positional forms). In fact, round-tripping in semantic terms is potentially possible from the first category without using these compatibility characters at all, but by merely using the context of the character and basic styling information regarding ligature inclusion and vertical presentation of CJK text.
As one reads down the lists below, the desirability of normalization decreases, the likelihood of semantic loss increases, and the characters increasingly fit the contemporary Unicode Standard processing model. (A sketch showing how to inspect the decomposition keywords used in this classification follows the lists.)
- Normalizations that are semantically independent but have legacy text imaging issues:
- Arabic and Alphabetic positional forms (keywords: 'initial', 'medial', 'final', 'isolated')
- Vertical CJK Forms (keywords: 'wide', 'small', 'square', 'narrow', 'vertical')
- Vulgar fractions ('fraction'), though decomposition of these should add a space (U+0020) to keep the fraction numerator from melding into the surrounding text or preceding whole number (perhaps the Medium Mathematical Space U+205F might be used instead, upgrading this character to provide this particular semantic as a whole-number/fractional separator, thus removing it from the compatibility characters).
- Isolated diacritics (keyword 'compat' decompositions where the first character is a U+0020 space and the other(s) have a canonical combining class other than 0)
- Ligatures (mixed among the 'compat' keyword decompositions but all decompose to more than one character; where all characters have a canonical combining class of 0; and all the decomposition characters have a general category starting with "L")
- Ligating (Precomposed) Roman Numerals ('compat' keyword non-singleton decompositions where general category "Nl" characters ultimately decompose to Latin letters of general category "Ll" or "Lu"; these would be better decomposed to the single-character Roman numerals instead of Latin letters)
- Normalizations that include potential semantic loss unless compensating rich text or markup is applied during normalization
- 'font' keyword decompositions (except ℹ U+2139 and possibly others that are semantically distinct symbolic graphemes)
- 'super' keyword decompositions (only general category "No" and general category starting with "P" and "S" and some but not all general category starting with "L"; basically those that are the same graphemes as the 'sub' keyword characters but superscripted instead of subscripted)
- 'sub' keyword decompositions
- 'compat' keyword decompositions for markers (e.g., ⒓ U+2493)
- 'compat' keyword decompositions for CJK radical characters
- 'compat' keyword rich text spacing spaces (decomposed to U+0020)
- 'circle' keyword decompositions (these might be better listed in the next sublist (3) or even the final sublist (4), since they might often be used for semantically distinct purposes and a common styling mechanism for encircling characters did not exist before the introduction of CSS3)
- Normalizations that include potential semantic loss (the best solution for handling these decomposable characters is to discourage their use, or even prohibit it in markup and attribute values, though that may be difficult due to disputes with the authoring community over the assignment of these separate decomposable characters). A careful examination of all of these characters (over 1,000 ideographs and about 150 others) is worthwhile given the ambiguities they create; the result should be either to deprecate each character or to remove it from compatibility/canonical status.
- possibly some canonical singleton decompositions (though potentially all canonical singletons could be included in this category) (more than 1,000 Ideographs and about 32 others)
- Singleton 'compat' keyword decompositions
- Suzhou / Hangzhou numerals (about 4)
- Hangul (about 94)
- Symbols that use graphemes already encoded in Unicode (not really meant for encoding in Unicode as it is presently framed: e.g., 'ϐ' U+03D0) (about 25)
- Normalizations that should probably never happen persistently but should be considered for volatile (non-persistent) character folding. Also there's no real reason to discourage the use of these characters for authors. These categories all constitute characters with important properties fundamental to the Unicode processing model (though for non-persistent character folding, such distinctions typically need to be ignored).
- Some 'super' keyword decompositions (those involved with phonetic semantics and likely used elsewhere)
- Unicode characters so far are only assigned to graphemes (other than legacy compatibility and some very few line-break, grapheme-break and word-break formatting characters). Therefore phonetic characters are not assigned specifically to phonemes, but to specialized graphemes that work together with existing graphemes to make complete phonetic writing systems. Some of the phonetic graphemes used in the IPA and other phonetic writing system are characters that are assigned to 'super' keyword compatibility decomposing characters.
- Singleton 'compat' keyword decompositions
- Decomposed Roman Numerals (singleton 'compat' keyword characters with general category 'Nl')
- Decorative overlines and low lines
- Other 'compat' keyword decompositions
- Ligating (Precomposed) Mathematical Symbols ('compat' keyword non-singleton decompositions where general category starting with "S" decomposes to general category starting with "S": e.g., ∯ which decomposes to the sequence that should be displayed identically: ∮∮). While ligatures are most certainly not a part of the core Unicode processing model, these equivalent ligatures represent unique symbolic graphemes that are not really the visual joining of two graphemes that remain distinct (like fi or fl).
- Symbolic graphemes (e.g., "№", ℡, ℅; approximately 13 characters). These might possibly be formed as ligature glyphs, but they too have distinct meaning and it wouldn't be the case that for example, the character sequence tel should always appear as "℡".
- Ideographic symbols (similar to the symbolic graphemes, e.g., ㏵ U+33F5; includes 68 characters: 12 months, 31 days, 25 hours)
- 'nobreak' keyword decompositions (in modern Unicode these should really decompose to, for example, the sequence <Word Joiner U+2060>–<Hyphen-Minus U+002D>–<Word Joiner U+2060>, which could then be considered canonically equivalent to Non-Breaking Hyphen U+2011; though most Unicode processors today do not process Word Joiner correctly, and even if they did, Non-Breaking Hyphen would remain the preferred precomposed form, as in NFC)
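The keywords used throughout the classification above ('fraction', 'super', 'compat', and so on) are the tags recorded in the Unicode Character Database; a small Python sketch shows how to read them (the helper name is illustrative only):

```python
import unicodedata

def decomposition_info(ch: str):
    """Return (keyword, decomposition) where keyword is the compatibility tag,
    'canonical' for an untagged decomposition, or None if there is none."""
    decomp = unicodedata.decomposition(ch)
    if not decomp:
        return None, ""
    if decomp.startswith("<"):
        tag, _, rest = decomp[1:].partition("> ")
        return tag, rest
    return "canonical", decomp

for ch in ["½", "ﬁ", "²", "№", "\u212B"]:
    tag, rest = decomposition_info(ch)
    print(f"U+{ord(ch):04X} {ch!r}: {tag} -> {rest}")

# U+00BD '½': fraction -> 0031 2044 0032
# U+FB01 'ﬁ': compat -> 0066 0069
# U+00B2 '²': super -> 0032
# U+2116 '№': compat -> 004E 006F
# U+212B 'Å': canonical -> 00C5   (the Angstrom sign, a canonical singleton)
```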
Supplemental Decomposition
While a new non-singleton canonical decomposition can be accomplished without the addition of any new native character properties, it might be useful to establish new derived properties and new derived property data files. Moreover, the above analysis of compatibility decompositions suggests that some new decomposition properties should be established for existing compatibility decomposable characters.
- 'fraction' keyword (16 characters) should all decompose to the same characters as their current decomposition except with a leading Space (U+0020) or Medium Mathematical Space (U+205F) to ensure the fractional part of a mixed number does not join with the whole portion (e.g., so 1¼ does not become 11⁄4 but instead 1 1⁄4, with either U+0020 or U+205F as the separator); see the sketch at the end of this section
- 'nobreak' keyword decompositions, which should become canonical decompositions and decompose as they do now except wrapped in a pair of adjacent Word Joiner characters (U+2060); since the 'nobreak' characters such as the non-breaking hyphen and non-breaking space serve common no-break needs, they should remain available precomposed while being canonically equivalent
- 'isolated' keyword word ligatures: these Arabic words (U+FDF0–U+FDFD) could actually become canonical decompositions, where they might serve as common precomposed words that help reduce the bytes needed for interchange
- 'compat' keyword Roman Numerals, where the precomposed Roman Numerals should decompose to the decomposed (single-character) Roman Numerals rather than to Latin letters as they ultimately do today. The singleton Roman Numerals should remain distinct from the Latin letters, just as the other ancient numbers remain undecomposable (when the Roman Numerals were first encoded it was thought that Unicode would not encode such letter-derived ancient numbers, which share graphemes with the letters, as separate abstract characters).
While these changes are minor, they make both canonical and compatibility decompositions much more useful. These adjustments help make these decompositions much more like canonical equivalence than compatibility equivalence. However, their precomposed form can remain as a way of reducing text document file size.
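The 'fraction' adjustment above can be illustrated with a small sketch (Python): today's NFKC runs the whole number into the fraction, while the proposed supplemental decomposition inserts a separating space. The helper function and the choice of U+0020 as the default separator are illustrative only:

```python
import unicodedata

mixed_number = "1\u00BC"                     # "1¼"

# Today's compatibility decomposition drops the boundary: "1¼" becomes "11⁄4".
print(unicodedata.normalize("NFKC", mixed_number))               # 11⁄4

def decompose_fraction(ch: str, separator: str = "\u0020") -> str:
    """Proposed supplemental decomposition: prefix the fraction with a space
    (U+0020 here; U+205F MEDIUM MATHEMATICAL SPACE is the suggested alternative)."""
    decomp = unicodedata.decomposition(ch)
    if decomp.startswith("<fraction>"):
        return separator + "".join(chr(int(cp, 16)) for cp in decomp.split()[1:])
    return ch

print("".join(decompose_fraction(c) for c in mixed_number))      # 1 1⁄4
```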
Character folding
Character folding is useful in search indexing and in interactive graphical-user-interface search matching, where folding is applied in volatile memory but is not subsequently and persistently serialized. In contrast to normalization, where the results may ideally be persisted, character folding involves string matching where false-positive matches are more acceptable and false-negative matches need to be avoided. Therefore string matching might be done in ways that overlook many of the subtle distinctions Unicode characters provide. (A folding sketch follows the list below.)
- Combining-insensitive (where all combining marks are ignored)
- Diacritic-insensitive (where not only are combining marks ignored, but alternate representations match e.g., "ö" and "oe" so the literal "oe" matches "o" and "ö")
- Case-insensitive
- Typographical-insensitive (e.g., matching " and “)
- All canonical and compatibility equivalent differences are ignored
- Differences in spaces, word separators and word joiners are all ignored (all folded to U+0020 or U+000A)
- Grapheme joiners and non-joiners are ignored
- Decimal digit folding (all decimal digit numbers replaced with their decimal digit value so e.g., "٥" folded to "5")
- Typographic characters folded to typewriter versions (e.g., em dash "—" and en dash "–" folded to hyphen and curly quotes folded to straight quotes)
- Hiragana Folding
- Katakana Folding
- LetterLike grapheme symbols folding (e.g., "℡" folded to "tel")
- Suzhou numeral folding
- Other Non-Arabic-Indic numeral folding ("Ⅸ" folded to "9")
- Ignoring variation selectors so any combination Grapheme_Base+VariationSelector is folded with the Grapheme_Base alone
- Ignoring all special purpose characters (joiners, non-joiners, mathematical invisibles, soft hyphen, bidirectional controls, interlinear annotation, etc)
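A sketch of a volatile folding function combining a few of the categories above (compatibility folding, case folding, and combining-insensitive matching); it is deliberately aggressive and is meant only for in-memory matching, never for serialization:

```python
import unicodedata

def fold_for_search(text: str) -> str:
    """Fold text for matching only: compatibility-decompose, drop combining marks, casefold."""
    text = unicodedata.normalize("NFKD", text)                   # canonical + compatibility
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return text.casefold()

print(fold_for_search("Ệ") == fold_for_search("e"))              # True: diacritics ignored
print(fold_for_search("\u2121"))                                 # tel  (℡ folded to its letters)
```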
Though not strictly a character folding issue, interactive searches should also consider matching embedded non-text media when the metadata for that media also matches user search strings and patterns. Likewise CSS generated content should be included in interactive searches, but not necessarily for document indexing.
Solutions
Properties of a new normalized form
- A) skip singleton canonical decompositions: treated instead like compatibility decompositions; less important than the identity equivalence of precomposed, decomposed and differently ordered combining marks (leaving fewer than 90 characters in the composing normalization NO category and still 104 in the MAYBE category).
- B) diminished focus on stability so that new precomposed characters can be added (though perhaps less frequently than other Unicode updates; authors will need to be cautious about using precomposed characters in Unicode processes not yet updated to support them in terms of string matching, string collation, font support, etc.).
- C) admit the post 3.1 precomposed characters into the composed normalized form.
- D) add precomposed characters to scripts where reserved characters have been allocated (to foster better I18N support for precomposed characters throughout the World). Over 1,500 characters have been reserved in various scripts, often as placeholders for commonly occurring precomposed characters. Unicode should make these precomposed character assignments and include them in a new normalization form.
- E) add precomposed characters to newly allocated scripts at the same time the script is allocated to bring simultaneous precomposed support for the most common combining characters.
- F) treat any newly added canonical singletons instead as official variants using variation selectors, which automatically implies a relation with the base character; as default-ignorable characters, the variation selectors will also display appropriately in legacy systems.
- G) in general, leave the responsibility for normalizing inserted and appended text to the process doing the inserting, including managing the appending or inserting of grapheme extenders, which is the only place where the normalization algorithm breaks in terms of closure (including, for example: a) a DOM process which uses document.write, innerHTML or even appendChild when that child starts with a Grapheme_Extend character; or b) CSS generated content, so that CSS generated content prohibits an initial Grapheme_Extend character, which would also break normalization closure). In other words, the normalization algorithm is closed in the sense that inserting or appending one normalized string into another yields a normalized string, as long as the inserted or appended string does not start with a Grapheme_Extend character (i.e., with an incomplete grapheme cluster). The sketch below illustrates how concatenation can break closure.
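Both pieces below are individually normalized, but their concatenation is not, because the appended string starts with a Grapheme_Extend character (Python, unicodedata):

```python
import unicodedata

prefix = "e"             # already NFC-normalized
suffix = "\u0301"        # a lone COMBINING ACUTE ACCENT: starts with a Grapheme_Extend character

print(unicodedata.normalize("NFC", prefix) == prefix)            # True
print(unicodedata.normalize("NFC", suffix) == suffix)            # True

# The concatenation is no longer normalized: the acute now composes with "e".
combined = prefix + suffix
print(unicodedata.normalize("NFC", combined) == combined)        # False: re-normalizes to "é"
```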
Possible W3C Criteria
- Authoring Criteria
- Recommend to authors a new NFW3C normalization for all markup and content.
- Prohibit authors from starting any DOMString with Grapheme_Extend characters if it is to be used with document.write, innerHTML, outerHTML or otherwise involved with the parser (i.e., the author is responsible for re-normalizing concatenated DOMStrings)
- Discourage authors from starting any element or attribute value with a Grapheme_Extend character (even after excluding whitespace)
- to the extent this is allowed, XML and other vocabularies should specifically address the meaning of elements and attribute values starting with Grapheme_Extend characters (and perhaps prohibit them everywhere else). For example, HTML might allow Grapheme_Extend characters to start the content of a 'span' element but not any other element, and not any markup or attribute values. When starting a 'span' element,
- the span element must occur as part of a grapheme cluster where a Grapheme_Base character or a Grapheme_Base character and other Grapheme_Extend characters appear immediately before the 'span' element and
- the 'span' element must only contain Grapheme_Extend characters associated with the Grapheme_Base character immediately preceding the element.
- Authors may use such Grapheme_Extend content in a 'span' element to apply special styling treatment to the Grapheme_Extend characters compared to the Grapheme_Base or other Grapheme_Extend characters before or after the span element.
- Prohibit authors from using canonical singletons in markup
- Discourage authors from using canonical singletons in content
- Prohibit authors from using compatibility decomposing characters in markup
- Discourage authors from using compatibility decomposing characters in content (except for those compatibility decomposing characters in the 4th list above: e.g., symbolic graphemes, Roman numerals, ideographic symbols, ligating math symbols, decorated underlines and overlines)
- Implementation Criteria
- Require that all parsers perform NFW3C normalization for all markup (not required for any subsequent string concatenation, which would remain the author's responsibility)
- STRONGER: Require parsers all perform NFW3C normalization for all content
- WEAKER: Require all canonically equivalent character sequences match except for singleton differences
- For collation purposes, the singletons can be treated as having the same value but always collated after the preferred singleton character
Liaison with Unicode
Possible changes to Unicode for 5.2 or 6.0:
- Change the canonical singleton assignment practice (which today mostly affects Han ideographs) so that every canonical singleton is assigned simultaneously with a newly registered variation selector for its canonical decomposition character, and make the newly assigned canonical singleton decompose to the two-character sequence of the canonical equivalent plus the registered variation selector
- STRONGER:
- stop assigning characters for immediate canonical singleton decomposition and instead ONLY register a new variation selector for each character (for round-tripping from other character sets)
- register variation selectors for existing canonical singletons and deprecate the canonical singleton characters
- perhaps add new specialized compatibility variation selectors if such a distinction is important
- STRONGER:
- Introduce new normalization forms (such as the suggested NFW3C) that separate singleton-style weak canonical equivalence from the stronger canonical equivalence that holds between precomposed characters on one side and the variously ordered decomposed combining-mark sequences on the other.
- Require Unicode conforming processes to only produce canonically combining class ordered characters for interchange.
- STRONGER: Require Unicode conforming processes to produce NFW3C (or alternately require NFW3D) normalized content.
- STRONGER: Require Unicode conforming processes to produce NFC (or alternately require NFD) normalized content.
- Require Unicode input systems (for the stronger NFC/NFD requirements instead of the NFW3C/NFW3D) to map all canonically decomposable singletons to their canonical decomposition character (for example in character palette input systems).
- Require Unicode font vendors to use the same glyphs for canonically equivalent character sequences (non-singletons) regardless of the canonical ordering of the combining characters for the targeted version of Unicode
- STRONGER: Require Unicode font vendors to use the same glyphs for canonically equivalent character sequences (including singletons) regardless of the canonical ordering of the combining characters for the targeted version of Unicode
- Recommend font vendors consider compatibility decompositions when designing glyphs for similar characters and when mapping glyphs to characters (e.g., "1⁄2" should use the same glyphs or glyph parts and comparable positioning as "½").
- Recommend Unicode input systems omit compatibility characters except when including them in an expert or advanced mode or through some escape entry of characters
- Some characters, such as those listed above for legacy text imaging systems, might be best left out of input systems entirely, even in expert/advanced mode, since those glyphs are available with normal Unicode text processing and imaging (for example, using a zero-width non-joiner to invoke various Arabic glyph forms).
Relevant Threads/Messages
- fantasai: Regarding need for narrower lossless normalization form
- Ambrose Li: On semantic and visual distinction of canonical equivalent singletons making NFC a lossy normalization