8456 – [FO 1.1] Behaviour of 'FULLY-NORMALIZED' not well defined in fn:normalize-unicode

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Status:	RESOLVED FIXED

Alias:	None

Product:	XPath / XQuery / XSLT
Classification:	Unclassified
Component:	Functions and Operators 3.0
Version:	Working drafts
Hardware:	PC Windows NT

Importance:	P2 normal
Target Milestone:	---
Assignee:	Michael Kay
QA Contact:	Mailing list for public feedback on specs from XSL and XML Query WGs

Reported:	2009-12-08 12:41 UTC by Oliver Hallam
Modified:	2010-02-09 09:54 UTC

Description Oliver Hallam 2009-12-08 12:41:36 UTC

The specification of normalize-unicode states:

  Returns the value of $arg normalized according to the normalization criteria for a normalization form identified by the value of $normalizationForm

It also refers to:

  See [Character Model for the World Wide Web 1.0: Normalization] for a description of the normalization forms.

However, consider the following query:

  normalize-unicode('&#x302;', 'FULLY-NORMALIZED')

Normalizing this string does not produce a fully normalized result.  I assume the correct way to fully normalize this is to add a leading space character, but I cannot see where this behaviour is specified.
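
The observation can be reproduced outside XQuery. The following is a minimal sketch, assuming Python's standard unicodedata module as a stand-in for an NFC implementation; it is not taken from any XPath processor, and the variable names are illustrative only.

```python
import unicodedata

# U+0302 COMBINING CIRCUMFLEX ACCENT, the string from the query above.
arg = "\u0302"

# Plain NFC leaves the string unchanged, so the result still begins
# with a combining mark.
nfc = unicodedata.normalize("NFC", arg)

# A non-zero canonical combining class marks the first character as a
# combining mark rather than a base character.
starts_with_combining_mark = unicodedata.combining(nfc[0]) != 0

print(repr(nfc), starts_with_combining_mark)
```

NFC alone therefore cannot yield a string that satisfies the fully-normalized property for this input.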

Comment 1 Michael Kay 2009-12-08 15:35:40 UTC

This issue is discussed here:

http://lists.w3.org/Archives/Public/public-qt-comments/2003Oct/0198.html

a discussion which started with my observation

"It's not at all clear to me that supporting "fully-normalized" form
makes any sense at all. Whereas the Unicode normalization forms all describe
an algorithm for normalizing data, the "fully-normalized" form is described
only as a property of a string. There is no algorithm provided for making a
string fully-normalized, and the only algorithms that one might come up with
involve losing information."

The next message in the thread summarizes what we concluded about the algorithm:

"... a check that the first character in the string being normalized is
a base character (e.g. has a combining class of 0). If the last test
fails, a space is inserted at the start of the data to carry the
combining mark."

If my memory serves me right, we were assured that the algorithm would be properly described in a future version of CharMod, and we felt that it needed to be fixed in CharMod rather than in our specs. Perhaps that was wishful thinking (many things related to I18N are).

For my own part, if I remember right I decided not to support this optional feature until it was better specified.

Comment 2 Michael Kay 2010-01-13 11:35:43 UTC

I propose that (a) in the 1.0 spec, we don't fix this; (b) in 1.1, we fix it as follows:

Delete the sentence "See [Character Model for the World Wide Web 1.0: Normalization] for a description of the normalization forms."

Substitute "Normalization forms NFC, NFD, NFKC, and NFKD, and algorithms for converting a string to each of these forms, are defined in [Unicode Normalization]." where this is a new normative reference to http://unicode.org/reports/tr15/. Add the standard wording about which version of Unicode may be used.

Add "The motivation for normalization form FULLY-NORMALIZED is described in [charmod-norm]." {which now becomes a non-normative reference} "However, as that specification did not progress beyond working draft status, the normative specification is as follows.

A string is fully-normalized if (a) it is in normalization form NFC as defined by [Unicode Normalization], and (b) it does not start with a composing character.

A composing character is a character that is one or both of the following: 

(a) the second character in the canonical decomposition mapping of some character that is not listed in the Composition Exclusion Table defined in [UTR #15], or

(b) of non-zero canonical combining class (as defined in [Unicode]).

A string is converted to FULLY-NORMALIZED form as follows:

(a) if the first character in the string is a composing character, prepend a single space;

(b) convert the string to normalization form NFC."
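
The two steps above can be sketched in a few lines. This is only an illustration under stated assumptions, not the wording adopted in the spec or any processor's implementation: the function names are hypothetical; Python's unicodedata module supplies the Unicode data; "not listed in the Composition Exclusion Table" is approximated by checking that a character is unchanged by NFC; and the Hangul vowel and trailing jamo are added by hand on the assumption that the algorithmic Hangul decompositions count as canonical decomposition mappings.

```python
import sys
import unicodedata
from functools import lru_cache


@lru_cache(maxsize=None)
def second_chars_of_composites():
    """Characters that appear second in a two-character canonical decomposition."""
    # Assumption: Hangul syllables decompose algorithmically, so their
    # vowel (U+1161..U+1175) and trailing (U+11A8..U+11C2) jamo are added here.
    seconds = {chr(cp) for cp in range(0x1161, 0x1176)}
    seconds |= {chr(cp) for cp in range(0x11A8, 0x11C3)}
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        decomp = unicodedata.decomposition(ch)
        if not decomp or decomp.startswith("<"):
            continue  # no decomposition, or a compatibility one
        parts = decomp.split()
        if len(parts) != 2:
            continue  # singleton decompositions have no second character
        if unicodedata.normalize("NFC", ch) != ch:
            continue  # assumed proxy for "excluded from composition"
        seconds.add(chr(int(parts[1], 16)))
    return frozenset(seconds)


def is_composing(ch):
    """Tests (a) and (b) of the composing-character definition above."""
    return unicodedata.combining(ch) != 0 or ch in second_chars_of_composites()


def fully_normalize(s):
    """Step (a): prepend a space if needed; step (b): convert to NFC."""
    if s and is_composing(s[0]):
        s = " " + s
    return unicodedata.normalize("NFC", s)


print(repr(fully_normalize("\u0302")))   # the string from the original report
print(repr(fully_normalize("e\u0302")))  # base character followed by the same mark
```

The first call scans every code point once to build the set of second characters; the result is cached, so later calls are cheap.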

Comment 3 Jim Melton 2010-02-05 00:52:59 UTC

In the Joint teleconference of the XSL WG and the XML Query WG on 2010-01-19, minuted at http://lists.w3.org/Archives/Member/w3c-xsl-query/2010Jan/0081.html (member-only link), the proposal in comment 2 was accepted.  As a result, I am marking this bug RESOLVED/FIXED. 

If you agree with the solution adopted, please mark the bug CLOSED.

Comment 4 Michael Kay 2010-02-09 09:54:54 UTC

These changes have now been applied to the baseline spec.