The behaviour of normalize-unicode is defined by the Unicode normalization specification.
Based on my (somewhat woolly) understanding of the Unicode specification, there are 66 codepoints that do not map to characters (the noncharacters), and Unicode normalization is only defined on strings of characters. Although use of these codepoints is not recommended, they are valid XML characters.
xs:string contains a string of codepoints, which can quite happily include noncharacters.
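For readers unfamiliar with where the figure of 66 comes from, here is a minimal Python sketch that enumerates the noncharacter codepoints as Unicode defines them: the 32 codepoints U+FDD0..U+FDEF, plus the last two codepoints of each of the 17 planes. The names `noncharacters` and `NONCHARACTERS` are illustrative only and do not come from any specification.

```python
def noncharacters():
    """Return the Unicode noncharacter codepoints as a list of ints."""
    # U+FDD0..U+FDEF: a contiguous run of 32 noncharacters in the BMP.
    points = list(range(0xFDD0, 0xFDF0))
    # The last two codepoints of each of the 17 planes: U+nFFFE and U+nFFFF.
    for plane in range(17):
        points.append((plane << 16) | 0xFFFE)
        points.append((plane << 16) | 0xFFFF)
    return points


NONCHARACTERS = noncharacters()
print(len(NONCHARACTERS))  # 32 + 2 * 17 = 66
```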
For example, what should happen with the following query?
It is worth noting that in .NET, the following expression throws an exception:
I am somewhat loath to catch this exception and add a workaround when it is clear that these characters are a bad thing.
Perhaps it is worth allowing implementations to raise an error if these characters appear in a string that is to be normalized, as the result is not a valid unicode string.
On a similar note, Constr-cont-document-3 has some of these characters in its expected result, and I believe that canonicalization is not defined on these characters for a similar reason.
I would think that the result of normalizing codepoints that are not assigned to any character should be to leave the codepoint unchanged in the result.
I'm pretty sure this is what the reference implementation from the Unicode Consortium does.
It's always a problem of course when you want to reuse library code that has made a different decision, but I think that as with regexes, we should avoid letting that implementation concern influence our spec.
And I don't think we should be making value judgements that certain codepoints or characters are bad. Some of our users might think it good that there are codepoints they can use as they like. For better or worse we've chosen to make them legal, and that's good enough.
I haven't researched how these Unicode definitions may have evolved since XQuery 1.0/XSLT 2.0/XPath 2.0 were first published, but according to section 3.11 of Unicode 5.2:
The specification of Unicode Normalization Forms applies to all Unicode coded character sequences (D12). For clarity of exposition in the definitions and rules specified here, the terms character and character sequence are used, but coded character sequences refer also to sequences containing noncharacters or reserved code points. Unicode Normalization Forms are specified for all Unicode code points, and not just for ordinary, assigned graphic characters.
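As an illustration of the quoted passage, here is a small sketch using Python's standard `unicodedata.normalize`. This is not the .NET library discussed in this report, and not a claim about any XQuery processor; it is simply one library that accepts noncharacters. The sample string and the helper `normalize_all` are assumptions made up for this example.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def normalize_all(text):
    """Normalize text under each of the four Unicode normalization forms."""
    return {form: unicodedata.normalize(form, text) for form in FORMS}


# "e" + combining acute accent, followed by two noncharacters.
sample = "e\u0301\uFDD0\uFFFF"
results = normalize_all(sample)

for form in FORMS:
    # ascii() escapes non-ASCII codepoints so the noncharacters stay visible.
    print(form, ascii(results[form]))
```

In this sketch the "e" plus accent composes or stays decomposed depending on the form, while the two noncharacters pass through untouched and no error is raised.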
My knowledge of Unicode is sketchy at best. It seems (as has been highlighted by comment #4) that my initial assumption that normalization is not defined on the noncharacters was incorrect.
In light of this I am willing to concede that this is properly specified, and that it is just unfortunate that the .NET library function does not perform Unicode normalization exactly as is required. Whilst that is unfortunate, it is not insufferable.
It is probably worth a test case though to illustrate this potential problem in some implementations.
At telcon 419 the following action was assigned:
ACTION A-419-03 Michael Kay (bug 7935) to propose a non-normative
clarification of normalize-unicode to explain the effect on unassigned
codepoints (viz, that the function is idempotent).
Proposed resolution (using the 2.0 baseline):
1. Update the reference to charmod to refer to the latest draft at http://www.w3.org/TR/charmod-norm/ with a dated reference (because it is a draft), and observe that this specification has not progressed beyond draft status.
2. For normalization forms NFC, NFD, NFKC and NFKD refer directly to http://www.unicode.org/reports/tr15/tr15-25.html rather than to CharMod. Retain the reference to CharMod only for FULLY-NORMALIZED.
3. Add the following statement: It is implementation-defined which version of Unicode (and therefore, of the normalization algorithms and their underlying data) is supported by the implementation. See [TR15] for details of the stability policy regarding changes to the normalization rules in future versions of Unicode. If the input string contains codepoints that are unassigned in the relevant version of Unicode, or for which no normalization rules are defined, the fn:normalize-unicode function leaves such codepoints unchanged. If the implementation supports the requested normalization form then it MUST be able to handle every input string without raising an error.
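The behaviour described in item 3 can be sketched as a test in Python, again using the standard `unicodedata` module as a stand-in for "an implementation". `unicodedata.unidata_version` reports which Unicode version the library's data comes from. The helpers `first_unassigned` and `is_idempotent` are hypothetical names introduced for this example, and the scan relies on the assumption that the library's data contains at least one unassigned (category `Cn`) codepoint.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def first_unassigned():
    """Return the first codepoint at or above U+0080 with category Cn."""
    for cp in range(0x0080, 0x110000):
        if unicodedata.category(chr(cp)) == "Cn":
            return cp
    raise LookupError("no unassigned codepoint found")


def is_idempotent(form, text):
    """True if normalizing an already-normalized string changes nothing."""
    once = unicodedata.normalize(form, text)
    return unicodedata.normalize(form, once) == once


# The supported Unicode version depends on the Python build in use.
print("Unicode data version:", unicodedata.unidata_version)

unassigned = chr(first_unassigned())
# Angstrom sign, unassigned codepoint, "e" + combining acute accent.
text = "\u212B" + unassigned + "e\u0301"

for form in FORMS:
    normalized = unicodedata.normalize(form, text)
    # The unassigned codepoint should survive and no exception should occur.
    print(form, unassigned in normalized, is_idempotent(form, text))
```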
Agreed with caveat: need to check that the "implementation-defined" version of Unicode is the same wording as used elsewhere including an earliest acceptable Unicode version.
The current status is as follows:
(a) For the 3.0 version of the specification, the changes in items 1 and 2 of comment 4 have been applied to the specification, but item 3 has not. It is now listed in the changes.txt file as needing attention.
(b) For the 1.0/2.0 version of the specification, the change does not appear in the published Second Edition; it has now been added to the list of candidate errata.
The change in item 3 of comment 4 has now been (belatedly) added to the 3.0 specification.