3245 – Equality of strings

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Status:	RESOLVED LATER

Alias:	None

Product:	XML Schema
Classification:	Unclassified
Component:	Datatypes: XSD Part 2
Version:	1.1 only
Hardware:	PC Windows XP

Importance:	P1 normal
Target Milestone:	---
Assignee:	C. M. Sperberg-McQueen
QA Contact:	XML Schema comments list

URL:
Whiteboard:	cluster: equality
Keywords:	needsDrafting

Depends on:	3222

Reported:	2006-05-09 10:44 UTC by Michael Kay
Modified:	2008-06-13 23:42 UTC
CC List:	1 user

Description Michael Kay 2006-05-09 10:44:37 UTC

QT approved comment:

In 3.3.1.1, defining equality of strings as identity seems a missed opportunity. Without introducing full collation support, it would be very useful to many users if equality incorporated Unicode normalization. (This assumes, of course, that equality and not identity is used for things such as enumerations.)

Comment 1 David Ezell 2007-04-20 15:21:29 UTC

See also 3224.

Comment 2 C. M. Sperberg-McQueen 2007-04-20 15:36:15 UTC

Thank you; the point about possibly equating strings which have the same
normalized form is an interesting one.

Changing identity constraints to use equality rather than identity
has been discussed, but if all schema tests use equality rather 
than identity, then the distinction between identity and equality
is lost, and we are back where we were in XML Schema 1.0, trying
to square the circle and resolve the conflicting desire to have 
certain things (or pairs of things) be both the same and not the same.
The Working Group has not made any final decision on the issue,
but it is safe to say that discussions have not so far shown consensus
on the idea.

It would be possible, without backward compatibility problems, to
define equality for strings as involving Unicode normalization, but
(as the originators of the comment are aware) since strings are
not ordered by XML Schema, equality of strings is not appealed to at
any point in schema validity assessment.  If it would be useful to
others using XML Schema (e.g. XSL, XQuery, XForms, ...) for XSD
to define string equality in this way, without changing the behavior
of enumerations etc., then please let us know.

Comment 3 David Ezell 2007-08-24 21:10:24 UTC

On the telcon:

The WG notes that in the last note from MSM (dated 2007-04-20), the WG offered to consider adding (Unicode-based) equality of strings if anyone thought it would help.

Having no new information, the WG believes that no further action is necessary.

Please let us know if you agree with this resolution of your issue, by adding a comment to the issue record and changing the Status of the issue to Closed. Or, if you do not agree with this resolution, please add a comment explaining why. If you wish to appeal the WG's decision to the Director, then also change the Status of the record to Reopened. If you wish to record your dissent, but do not wish to appeal the decision to the Director, then change the Status of the record to Closed. If we do not hear from you in the next two weeks, we will assume you agree with the WG decision.

Comment 4 Felix Sasaki 2007-09-19 18:45:09 UTC

This is a personal note. (I have not yet reopened the issue; if necessary, I will do that after the i18n core WG agrees with my comment.)
I think it would be useful to have equality of strings involve Unicode normalization (Michael mentioned that possibility at http://www.w3.org/Bugs/Public/show_bug.cgi?id=3245#c2 ). We will try to come up with a WG response at our next call (25 September).
Felix

Comment 5 Felix Sasaki 2007-09-25 15:38:05 UTC

Unfortunately we were not able to discuss this issue on our call this week. We will have it on our agenda again in two weeks.
Sorry for the delay,
Felix

Comment 6 Felix Sasaki 2007-10-09 15:00:11 UTC

Hello again,
the i18n core Working Group discussed this issue at our call this week. We are re-opening the issue and would like to propose that you add a reference to the section http://www.w3.org/TR/charmod-norm/#sec-IdentityMatching of the "Character Model for the World Wide Web 1.0: Normalization" document, and specify that identity is based on the normalized form. If you think that this would introduce non-conformance, you could say something like "identity SHOULD be based on the normalized form" instead of "identity is (MUST be) based on the normalized form".

I personally would like to apologize that we are so late in re-opening this issue, which is due to the late awareness on our side.
Thank you,
Felix

Comment 7 Dave Peterson 2007-10-09 16:06:18 UTC

(In reply to comment #6)

> the i18n core Working Group discussed this issue at our call this week. We are
> re-opening the issue and would like to propose you to add a reference to the
> section http://www.w3.org/TR/charmod-norm/#sec-IdentityMatching of the
> "Character Model for the World Wide Web 1.0: Normalization" document, and
> specify that identity is based on the normalized form.

If you really mean "identity" (as opposed to "equality"), I suspect it won't happen in this release.  However, if you find "equality" acceptable, it can possibly be done, especially if other issues make it necessary to have another "last call" cycle before going to CR.  Strictly my personal opinion.

Comment 8 Addison Phillips 2007-10-10 02:52:18 UTC

Hi Dave,

The issue here is specifically the one sentence in section 3.3.1.1 of:

 http://www.w3.org/TR/xmlschema11-2/#string

Here we find this little five word sentence:

 "Equality for string is identity"

The problem is that we read this to mean that two strings are equal if they consist of the same sequence of characters. The I18N WG has long held that string identity, in a Unicode context, needs to consider normalization. Otherwise certain languages that typically use combining sequences will produce false negatives for string equality. 

While this would require perhaps a very few additional words in XML Schema and would introduce few, if any, additional requirements for implementations, it *does* have an effect in the many technologies that depend on XML Schema.

If string identity means the same character sequence, XML Schema really should point to:

 http://www.w3.org/TR/charmod-norm/#sec-IdentityMatching
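
As an illustration of the false negative described above, here is a minimal Python sketch using the standard library's unicodedata module. The helper name nfc_equal is hypothetical and is not part of XML Schema or of any proposal in this thread; it only shows what comparing strings by their normalization form C could look like.

```python
import unicodedata

def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings after converting both to Unicode normalization form C."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
combining = "e\u0301"    # 'e' followed by COMBINING ACUTE ACCENT

# Code-point-for-code-point comparison treats the two spellings as different.
codepoint_equal = precomposed == combining

# Comparison of the normalized forms treats them as the same string.
normalized_equal = nfc_equal(precomposed, combining)
```

Both spellings render as the same letter, which is why a plain sequence comparison can surprise users of languages that rely on combining sequences.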

Comment 9 C. M. Sperberg-McQueen 2007-10-14 19:56:31 UTC

The WG discussed this issue both with Query and XSL, and then among
ourselves, at the October 2007 ftf meetings in Redmond.  See also bug
3222, which is closely related in practice.

We discussed several proposals for defining equality conditions for
string which might depend on normalization and/or collation
information.  Eventually, we converged on a proposal to add a
Unicode-normalization facet applicable to xs:string. Its value will be
an identifier denoting a specific Unicode normalization form (e.g. 'c').
To begin with, the only legal values will be the identifier for
normalization form C and ABSENT.  The default value will be ABSENT,
which means the unnormalized form is used.  Once specified, the facet
cannot be changed (it's effectively fixed from the time of first use).
The meaning of the facet is that the lexical form is prepared by
calculating the named normalization form for the 'normalized value' in
the input infoset, and then performing whitespace normalization to
calculate the candidate lexical form.
 
We noted that it does matter that Unicode normalization be done first:
For the string s = x, y, z, space, space, combining umlaut, x, y, z,
it's clear that norm(ws(s)) = x, y, z, non-combining umlaut, x, y, z,
while ws(norm(s)) = x, y, z, space, non-combining umlaut, x, y, z.  We
thought that in this case the double space in the original seems a
clear signal that two tokens are intended, not one.

After the meeting, it occurred to some WG members that it might be
good to have an explicit identifier for no-normalization, so that the
value of the facet can be fixed that way if desired.  (This would
entail reformulating the rule about changing the facet:  the value
might change from no-normalization to some normalization form, but
not from any specified normalization form to any other value.)
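
The processing order described in this comment can be sketched roughly as follows. This is only an illustration in Python, not draft wording: the function names and the use of None to stand for ABSENT are assumptions made for the example, and whitespace handling is simplified to the "collapse" behaviour of the whiteSpace facet.

```python
import re
import unicodedata
from typing import Optional

def collapse_whitespace(s: str) -> str:
    """Simplified whiteSpace=collapse: tab, LF and CR become spaces, runs of
    spaces shrink to one, and leading/trailing spaces are dropped."""
    s = re.sub(r"[\t\n\r]", " ", s)
    return re.sub(r" +", " ", s).strip(" ")

def candidate_lexical_form(value: str, normalization: Optional[str] = None) -> str:
    """Apply the (hypothetical) normalization facet first, then whitespace
    normalization. normalization=None stands for ABSENT; "NFC" for form C."""
    if normalization is not None:
        value = unicodedata.normalize(normalization, value)
    return collapse_whitespace(value)

raw = "  Ame\u0301lie \t Poulain "   # decomposed e-acute plus untidy whitespace
with_facet = candidate_lexical_form(raw, "NFC")
without_facet = candidate_lexical_form(raw)
```

With the facet set to form C the decomposed sequence is composed before whitespace is collapsed; with the facet ABSENT only the whitespace step runs, so the two results differ in their code points.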

Comment 10 Felix Sasaki 2007-10-17 06:00:59 UTC

(In reply to comment #9)
> The WG discussed this issue both with Query and XSL, and then among
> ourselves, at the October 2007 ftf meetings in Redmond.  See also bug
> 3222, which is closely related in practice.
> 
> We discussed several proposals for defining equality conditions for
> string which might depend on normalization and/or collation
> information.  Eventually, we converged on a proposal to add a
> Unicode-normalization facet applicable to xs:string. Its value will be
> an identifier denoting a specific Unicode normalization form (e.g. 'c').
> To begin with, the only legal values will be the identifier for
> normalization form C and ABSENT.  The default value will be ABSENT,
> which means the unnormalized form is used.  Once specified, the facet
> cannot be changed (it's effectively fixed from the time of first use).
> The meaning of the facet is that the lexical form is prepared by
> calculating the named normalization form for the 'normalized value' in
> the input infoset, and then performing whitespace normalization to
> calculate the candidate lexical form.
> 
> We noted that it does matter that Unicode normalization be done first:
> For the string s = x, y, z, space, space, combining umlaut, x, y, z,
> it's clear that norm(ws(s)) = x, y, z, non-combining umlaut, x, y, z,
> while ws(norm(s)) = x, y, z, space, non-combining umlaut, x, y, z.  We
> thought that in this case the double space in the original seems a
> clear signal that two tokens are intended, not one.
> 
> After the meeting, it occurred to some WG members that it might be
> good to have an explicit identifier for no-normalization, so that the
> value of the facet can be fixed that way if desired.  (This would
> entail reformulating the rule about changing the facet:  the value
> might change from no-normalization to some normalization form, but
> not from any specified normalization form to any other value.)
> 

The following is not my comment, but a copy from http://lists.w3.org/Archives/Public/public-i18n-core/2007OctDec/0005.html .

Felix

There is a factual problem in the example.

The normalized form of <space, combining umlaut> is <space, combining umlaut> in all cases; it does not change under normalization. The normalized form of <00a8> remains the same (00a8) under NFC and NFD: it only changes in the compatibility forms to <space, combining umlaut>. So if normalization form C is being discussed, then the example needs to be changed.

If you have any questions about particular normalizations, the icu browser is helpful.

http://demo.icu-project.org/icu-bin/nbrowser

Mark
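
The points made in this message can be checked with Python's standard unicodedata module, as a quick sketch. The variable names are illustrative only, and the results depend on the Unicode data shipped with the Python version in use, although these particular mappings are long-standing.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

space_umlaut = " \u0308"   # SPACE followed by COMBINING DIAERESIS
diaeresis = "\u00a8"       # DIAERESIS (the spacing, "non-combining" umlaut)

# <space, combining umlaut> is left unchanged by every normalization form.
space_umlaut_stable = all(
    unicodedata.normalize(form, space_umlaut) == space_umlaut for form in FORMS
)

# U+00A8 is left unchanged by the canonical forms NFC and NFD ...
diaeresis_canonical = {f: unicodedata.normalize(f, diaeresis) for f in ("NFC", "NFD")}

# ... and only the compatibility forms map it to <space, combining umlaut>.
diaeresis_compat = {f: unicodedata.normalize(f, diaeresis) for f in ("NFKC", "NFKD")}
```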

Comment 11 C. M. Sperberg-McQueen 2008-06-13 23:42:07 UTC

The Working Group discussed this issue at its teleconference of 13 
June 2008.  With regret we noted that we have not been able to produce
a draft wording proposal for this issue and that we do not feel able
to delay the Last Call publication of the spec any longer.  So we
are reluctantly changing the status of this issue to LATER in the hopes
of being able to come back to it.

The chair of the Working Group has been assigned an action to notify
the XML Query and XSL Working Groups, as the originators of the issue,
that we do not expect after all to include a facet of Unicode 
normalization of strings in XSD 1.1.