Re: comments on draft-newman-i18n-comparator-05.txt from Arnt Gulbrandsen on 2005-09-22 (public-ietf-collation@w3.org from September 2005)

From: Arnt Gulbrandsen <arnt@gulbrandsen.priv.no>
Date: Thu, 22 Sep 2005 14:48:43 +0200
To: Martin Duerst <duerst@it.aoyama.ac.jp>
Cc: Philip Guenther <guenther+collation@sendmail.com>, public-ietf-collation@w3.org
Message-Id: <o2iGSXY2Jv9MVsE2YN5zUw.md5@libertango.oryx.com>

(I am amazed. Quick response on a collation question.)

Martin Duerst writes:
> At 17:24 05/09/22, Arnt Gulbrandsen wrote:
> >1. Collators should get octet strings from the protocol.
>
> Sorry, but this assumption isn't generally true. In XQuery, collations 
> are always applied to (Unicode) character strings. This is somewhat 
> due to the fact that XQuery isn't a protocol. But there is no need to 
> restrict the use of collators to protocols.

I agree and disagree.

First, I was using the word protocol in the draft's sense, which is so 
wide that I had protocol for lunch today ;)

Second, while others may indeed use collators, what RFCs define is 
protocols. Anything else that can get a free ride is a (highly 
desirable) added bonus. We can care about XQuery.

> Up to now, my assumption was that all collations operate on character 
> strings,

(mine too)

> and that the 'octet' collation was either a bad name or the exception 
> that proved the rule (until recently, I didn't get much of an answer 
> on that from Chris).

i;octet, ascii-numeric and Cyrus' date collator (for sieve) persuade me 
that this isn't so. The very raison d'etre for a collator is that it is 
NOT strcmp(). The collator draft/RFC defines a small API, and a 
collator is something that implements that API on a given data type. 
That data type may be "Turkish unicode text" or it may be "email 
addresses" or it may be "numbers" or it may be "arbitrary octet 
strings" or it may be "US street addresses" or it may be "Swiss 
telephone directory entries".
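
To illustrate "a small API implemented on a given data type", here is a 
minimal Python sketch. The class and method names (OctetCollator, 
AsciiNumericCollator, is_valid, compare) are hypothetical and are not 
taken from the draft; the point is only that both objects offer the 
same operations while each decides for itself what its legal input is.

```python
from typing import Optional


class OctetCollator:
    """Hypothetical stand-in for i;octet: every octet string is legal."""

    def is_valid(self, data: bytes) -> bool:
        return True

    def compare(self, a: bytes, b: bytes) -> Optional[int]:
        # Plain octet-by-octet ordering: -1, 0 or 1.
        return (a > b) - (a < b)


class AsciiNumericCollator:
    """Hypothetical numeric collator: input must be ASCII digits only."""

    def is_valid(self, data: bytes) -> bool:
        return len(data) > 0 and all(0x30 <= octet <= 0x39 for octet in data)

    def compare(self, a: bytes, b: bytes) -> Optional[int]:
        # None signals "invalid input"; the collator itself decides that.
        if not (self.is_valid(a) and self.is_valid(b)):
            return None
        x, y = int(a.decode("ascii")), int(b.decode("ascii"))
        return (x > y) - (x < y)
```

With these, b"10" sorts before b"9" under the octet collator and after 
it under the numeric one, which is the sense in which neither is 
strcmp() with a different name.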

> No, my understanding would be that they have to parse the *character* 
> string and work on the resulting value. (note that any serious 
> collator has to in one way or another parse the string, e.g. to 
> separate base letter, diacritics, and case, or whatever).

If the collator gets a character string from the protocol, then a) the 
protocol either cannot work on octets, or b) it has to weed out octet 
strings that don't correspond to character strings before using the 
collator.

B is a design violation in my view. It implies that something outside 
the collator performs a duty specified by the collator.

A is useless to IMAP. IMAP clients _can_ work on non-text body parts.

My conclusion is that it's better to define collation in terms of the 
octet and encoding.

[ about date-time collation ]
> Still, the input would be a character string, wouldn't it?

It would be an octet string. Whether it also would be a character string 
is open to discussion.

If we say collators get character strings, then the protocol has to know 
the character encoding used. In the case of unicode/utf-8, the protocol 
has to parse the octet string, make sure there are no illegal 
sequences, make sure the octet string does not end in the middle of a 
utf-8 character, and only then can it give the character string to the 
collator. For some implementations this is not an added burden, for 
others it is.

If we say octet strings, then "illegal UTF-8 sequence" becomes the same 
error as "non-digit in number" and "month > 12 in date-time" and so on. 
I think that's an attractive regularity. In that case, the collator 
defines what its legal input is, and the collator checks that its input 
is legal.
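
A rough Python sketch of that regularity follows. The function names 
and the YYYY-MM-DD date layout are assumptions made up for this 
example, not anything the draft specifies. Each check takes octets and 
returns a plain yes/no, so a truncated UTF-8 sequence, a non-digit in a 
number and month 13 all surface as the same kind of failure.

```python
def valid_utf8(data: bytes) -> bool:
    """True if the octets are well-formed, complete UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def valid_number(data: bytes) -> bool:
    """True if the octets are one or more ASCII digits."""
    return data.isdigit()


def valid_date(data: bytes) -> bool:
    """True for YYYY-MM-DD with a plausible month and day (assumed format)."""
    parts = data.split(b"-")
    if len(parts) != 3 or not all(part.isdigit() for part in parts):
        return False
    month, day = int(parts[1]), int(parts[2])
    return 1 <= month <= 12 and 1 <= day <= 31
```

For instance, valid_utf8(b"\xe2\x82") is false because the octet string 
ends in the middle of a character, and valid_date(b"2005-13-22") is 
false because of the month.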

Of course, implementations are free to optimise by converting/checking 
input anywhere else. All this affects is the definition held by IANA.

> (internal date formats that are not character strings are usually 
> constructed so that sorting is trivial, i.e. no parsing needed).

Cyrus' example wasn't. He specifically mentioned using only the date 
part of a date-time. String-based equality testing won't do that.

> I'm not sure I agree. It looks like an interesting generalization, but 
> I don't think we need to go that far just to solve the i;octet issue. 
> Also, it no longer covers the issue of using a numeric collator for 
> cases such as XQuery, and even simple cases such as a Unix sort 
> command (imagine that it would come with an option to specify a 
> collator for a field).

I don't understand what you're trying to say here.

> If you expand the model, there are a lot of other cases where formats 
> may not match or there is a domain problem.
>
> Thinking about it a bit in the last few days, the i;octet collator's 
> problem isn't the lack of domains, it's that there are two domains 
> for it. As an example, consider a set of strings encoded in 
> UTF-16-BE. Should i;octet be applied to the raw binary form, or 
> should it be applied after converting to UTF-8? The latter results in 
> a simple ordering by Unicode codepoint, the former doesn't.

The former, because in the statement "compare these two strings using 
i;octet" there is no implication that both strings are UTF-16-BE.
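
The difference between the two orderings can be seen with two 
characters chosen for this example (they are not from the thread): 
U+FF5E, which is a single 16-bit unit in UTF-16, and U+10000, which 
needs a surrogate pair. A small Python sketch:

```python
a = "\uff5e"       # U+FF5E: UTF-16-BE octets FF 5E, UTF-8 octets EF BD 9E
b = "\U00010000"   # U+10000: UTF-16-BE octets D8 00 DC 00, UTF-8 F0 90 80 80

# Octet-wise ordering of the raw UTF-16-BE form.
utf16_order = sorted([a, b], key=lambda s: s.encode("utf-16-be"))

# Octet-wise ordering after converting to UTF-8.
utf8_order = sorted([a, b], key=lambda s: s.encode("utf-8"))

# Ordering by Unicode codepoint, for comparison.
codepoint_order = sorted([a, b], key=lambda s: ord(s))
```

Here utf8_order matches codepoint_order, while utf16_order puts U+10000 
first because the surrogate octet D8 is smaller than FF.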

Using an IMAP example, it's not unreasonable to say "find the messages 
which have a bodypart whose first four bytes are 0xFF 0xD8 0xFF 0xE0". 
To do that, we need a collator that does not assume its input to use 
any particular encoding. i;octet is the natural candidate.
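
As a sketch of what such a match looks like when nothing is assumed 
about encoding (Python, with a hypothetical helper name, 
octet_prefix_match, that comes from neither IMAP nor the draft):

```python
# The four octets from the example above.
PREFIX = bytes([0xFF, 0xD8, 0xFF, 0xE0])


def octet_prefix_match(body: bytes, prefix: bytes) -> bool:
    """Compare the leading octets of a body part, with no decoding step."""
    return body[:len(prefix)] == prefix


def is_utf8_text(data: bytes) -> bool:
    """True if the octets happen to be well-formed UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

The prefix itself is not well-formed UTF-8 (0xFF never occurs in 
UTF-8), so a collator that insisted on character strings would have to 
reject the search key before any comparison took place.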

> We definitely need a predefined (and hopefully easy to understand) 
> name for the latter. If there is any protocol/format/language that 
> needs the former, I think they should get it, but they should have to 
> explicitly mention that, and they should be aware of the fact that 
> they are committing a layer violation.

I honestly don't see a layer violation.

> This is not just theory: Many implementations these days read in data 
> and convert it to (their preferred form of) Unicode before doing 
> anything else with it, and having to reconstruct the original octet 
> sequence may be impossible or extremely annoying.

I agree that many implementations convert all text to unicode on input; 
I've written enough myself. (I happen to know that both the MUA and MTA 
I use do that.) I do not think that all _input_ is converted to 
unicode. i;octet is there for when we need to sort or compare data 
without assuming that it's text.

For something like XQuery, it is my understanding that all its input may 
be text. For a "protocol" which never operates on non-text, i;octet 
(and other non-text collators) are out of scope. I suppose some 
unicode-hater will define collators on e.g. GB18030. For an 
implementation that never supports GB18030, such collators are also out 
of scope.

Right now, I believe it's very difficult to escape implementing i;octet. 
I guess that needs changing.

Arnt
Received on Thursday, 22 September 2005 12:53:31 UTC