9903 – Need to define specification and use of collations for string sorting

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Status:	RESOLVED FIXED

Alias:	None

Product:	WebAppsWG
Classification:	Unclassified
Component:	Indexed Database API
Version:	unspecified
Hardware:	All All

Importance:	P1 normal
Target Milestone:	---
Assignee:	Pablo Castro
QA Contact:	public-webapps-bugzilla

Reported:	2010-06-10 19:00 UTC by Pablo Castro
Modified:	2011-06-22 22:13 UTC
CC List:	8 users

Description Pablo Castro 2010-06-10 19:00:11 UTC

Currently the spec does not define what collation algorithm should be used for string sorting. In order to properly support international applications, we need to define both how implementations should use collations and how the API exposes options for the developer to choose a collation for a certain scope (database, index, etc.).

Original WG message for context:

From: public-webapps-request@w3.org [mailto:public-webapps-request@w3.org] On Behalf Of Mikeal Rogers
Sent: Wednesday, June 09, 2010 2:42 PM
To: Webapps WG
Subject: [IndexDB] Collation Algorithm?

One of the things I noticed that seems to be missing from the IndexDB
specification is the collation algorithm used for sorting the index
keys.

There are lots of collation differences between databases. If left
unspecified, I'm afraid this would negatively affect interoperability
between IndexDB implementations.

CouchDB has a good collation specification for rich keys (any JSON
type) and defers to the Unicode Collation Algorithm once it hits
string comparisons. This might be a good starting point.

http://wiki.apache.org/couchdb/View_collation#Collation_Specification

http://www.unicode.org/reports/tr10/

-Mikeal

Comment 1 Nikunj Mehta 2010-06-15 20:00:55 UTC

The spec already does what the bug asks for. Check WebIDL for more help with
access constants defined in interfaces.

Comment 2 Nikunj Mehta 2010-06-15 20:01:46 UTC

Wrong comment and closure

Comment 3 Pablo Castro 2010-11-20 02:58:01 UTC

UPDATE:

We discussed this at TPAC and arrived at a plan for allowing developers to specify the language that should be used for sorting. I'll come up with the actual spec text later, but to put it out there in case folks have feedback:

Specifying language:
- language used for collation is a database-wide setting
- by default databases use binary collation
- developers can set the collation language in a setVersion transaction
- IDBDatabase exposes setLanguage(string) (only callable from setVersion) and a "language" attribute that can be queried at any time

How the database uses the language specification:
- implementations should honor the language specification, but at this time we won't pick a particular algorithm for doing so


As a side note, I asked around to see what the state of international language support in JavaScript was. It looks like there is some work ongoing (I heard "ECMAString i18n" but not a lot of details), but it won't land for some time.
 
Anything missing? I'll come up with spec language next week.

Comment 4 Jeremy Orlow 2010-12-10 12:22:36 UTC

Are you planning on working on this soon?  If not, maybe you should assign it to dave.null@w3.org

Comment 5 Jungshik Shin 2011-02-19 07:36:33 UTC

ECMAScript I18N API proposal (mentioned in comment #3) is at http://wiki.ecmascript.org/doku.php?id=strawman:i18n_api

Locale-dependent collation support is one of the APIs we have consensus on supporting. See the section on LocaleInfo.Collator. The current proposal was significantly scaled down from the original proposal to include only what can be implemented 'quickly' by multiple implementors. We're internally referring to it as v 0.5.

As for comment 3, 'language' may not be the best term to use. BCP 47 locale identifiers ( ftp://ftp.rfc-editor.org/in-notes/bcp/bcp47.txt ) would be better, because language alone is not sufficient to specify the sorting order. For instance, German has two sorting orders (dictionary and phonebook), while Chinese can be sorted in a few different ways (radical-stroke count, total stroke count, pinyin order). When collation in IndexedDB was brought up on public-webapps, I suggested that BCP 47 locale identifiers be used, but I haven't paid attention since.

In addition, has there been any discussion about other 'attributes' that affect collation? For instance, collation strength (whether or not to take into account 'accent', case-sensitivity, etc).

Comment 6 mark 2011-02-22 21:53:02 UTC

To add to what Jungshik said, BCP47 defines standard extensions. The extension defined by the Unicode consortium (http://cldr.unicode.org/index/bcp47-extension) provides for fine-grained specifications of collation behavior.

Examples for German:

de-u-co-phonebk // phonebook order
de-u-kn-true // numeric sorting, e.g. Tom2 comes before Tom12
de-u-ks-level1 // ignore accents, case differences
de-u-ks-level2 // ignore case differences
de-u-ks-level1-kc-true // ignore accents, but not case

These can be combined, such as:

de-u-co-phonebk-kn-true-ks-level1-kc-true
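
For readers who want to try these settings: the ECMAScript Internationalization API that eventually shipped as `Intl.Collator` postdates this thread, so the following is an illustration and not something discussed in the bug. `Intl.Collator` accepts BCP 47 tags with the `co` and `kn` extension keys, while collation strength is set through its `sensitivity` option instead of the `ks` key. A minimal sketch, assuming a runtime with full ICU locale data; the sample strings are made up:

```typescript
// Numeric sorting (the effect of "de-u-kn-true"), requested via an option.
const numeric = new Intl.Collator("de", { numeric: true });
const names = ["Tom12", "Tom2", "Tom1"].sort(numeric.compare);

// Roughly "ks-level1": compare base letters only, ignoring accents and case.
const baseOnly = new Intl.Collator("de", { sensitivity: "base" });
const sameBase = baseOnly.compare("a", "Á") === 0;

// Roughly "ks-level1-kc-true": ignore accents, but keep case differences.
const caseOnly = new Intl.Collator("de", { sensitivity: "case" });
const accentIgnored = caseOnly.compare("a", "á") === 0;
const caseKept = caseOnly.compare("a", "A") !== 0;

// Phonebook order is requested in the tag itself; resolvedOptions() reports
// which collation the runtime actually selected.
const phonebook = new Intl.Collator("de-u-co-phonebk");
const resolvedCollation = phonebook.resolvedOptions().collation;

console.log(names, sameBase, accentIgnored, caseKept, resolvedCollation);
```

Whether `phonebk` is honored depends on the locale data available to the runtime, which is why the sketch prints the resolved value instead of assuming it.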

Comment 7 Pablo Castro 2011-06-20 17:26:21 UTC

It looks like we landed on the following:
- we won't have a way for developers to indicate collation
- we'll use a fixed "binary" collation
- for the specifics of sort order we'll follow what's described in section 11.8.5 of the ECMAScript spec [1].

Given this, there are no actual API changes that need to be done in the spec, but we need to describe the collation that UAs should use as above.

This should go into section 3.1.3 (Keys) where the rest of the comparison concerns are discussed. Will update the spec next.

[1] http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-262.pdf
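
To illustrate what a fixed "binary" collation means in practice: the relational comparison in ECMAScript compares strings one UTF-16 code unit at a time, with no locale rules, and the `<` operator on two strings exposes exactly that. A minimal sketch; the helper name `compareBinary` and the sample strings are assumptions for illustration:

```typescript
// Compare two strings code unit by code unit, as the `<` operator does.
function compareBinary(a: string, b: string): number {
  return a < b ? -1 : a > b ? 1 : 0;
}

// Uppercase letters have smaller code units than lowercase ones,
// so "Zebra" sorts before "apple" under binary collation.
const words = ["apple", "Zebra", "banana"].sort(compareBinary);

// Code units, not code points: U+1F600 is stored as the surrogate pair
// D83D DE00, and 0xD83D is less than 0xFF5E, so the emoji sorts first
// even though its code point is larger than U+FF5E.
const surrogateResult = compareBinary("\uD83D\uDE00", "\uFF5E");

console.log(words, surrogateResult);
```

The second case is the main practical surprise of this choice: ordering follows UTF-16 storage, so characters outside the Basic Multilingual Plane sort before some characters inside it.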

Comment 8 Eliot Graff 2011-06-22 22:13:51 UTC

Section 3.1.3 now contains the following reference to the ECMA algorithm agreed upon in mail [1] and in comment 7. Note the second to last sentence:

[[
For purposes of comparison, all Arrays are greater than all DOMString, Date and float values; all DOMString values are greater than all Date and float values; and all Date values are greater than all float values. Values of type float are compared to other float values numerically. Values of type Date are compared to other Date values chronologically. Values of type DOMString are compared to other values of type DOMString by using the algorithm defined by step 4 of section 11.8.5, The Abstract Relational Comparison Algorithm, of the ECMAScript Language Specification [ECMA-262]. Values of type Array are compared to other values of type Array as follows:
]]

[1] http://lists.w3.org/Archives/Public/public-webapps/2011AprJun/1091.html

Resolving as FIXED
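
The ordering quoted from section 3.1.3 can be sketched as a comparison function. This is an illustration under stated assumptions, not the spec's algorithm text: the names `Key`, `typeRank`, `sign`, and `compareKeys` are hypothetical, floats are modeled as JavaScript numbers, and Array-to-Array comparison is left out because the quoted excerpt stops before defining it.

```typescript
type Key = number | Date | string | Key[];

// Rank of each type in the quoted ordering: float < Date < DOMString < Array.
function typeRank(key: Key): number {
  if (typeof key === "number") return 0;
  if (key instanceof Date) return 1;
  if (typeof key === "string") return 2;
  return 3; // Array
}

// Reduce a pair of comparable values to -1, 0, or 1.
function sign<T extends number | string>(a: T, b: T): number {
  return a < b ? -1 : a > b ? 1 : 0;
}

function compareKeys(a: Key, b: Key): number {
  const rankA = typeRank(a);
  const rankB = typeRank(b);
  // Different types: the type ordering alone decides.
  if (rankA !== rankB) return rankA < rankB ? -1 : 1;
  // Floats compare numerically.
  if (typeof a === "number" && typeof b === "number") return sign(a, b);
  // Dates compare chronologically.
  if (a instanceof Date && b instanceof Date) return sign(a.getTime(), b.getTime());
  // Strings compare by UTF-16 code units, as `<` does in ECMAScript.
  if (typeof a === "string" && typeof b === "string") return sign(a, b);
  // Array-vs-Array rules are not part of the quoted excerpt.
  throw new Error("Array comparison is outside this sketch");
}

console.log(compareKeys(10, new Date(0)), compareKeys("Z", "a"), compareKeys([], "zzz"));
```

Checking the type rank first keeps the cross-type rules separate from the per-type rules, which mirrors how the quoted paragraph is organized.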