This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.
Currently the spec does not define what collation algorithm should be used for string sorting. In order to properly support international applications we need to define both how implementations should use collations and how the API exposes options for the developer to choose a collation for a certain scope (database, index, etc.)

Original WG message for context:

From: public-webapps-request@w3.org [mailto:public-webapps-request@w3.org] On Behalf Of Mikeal Rogers
Sent: Wednesday, June 09, 2010 2:42 PM
To: Webapps WG
Subject: [IndexDB] Collation Algorithm?

One of the things I noticed that seems to be missing from the IndexDB specification is the collation algorithm used for sorting the index keys. There are lots of collation differences between databases, if left unspecified I'm afraid this would negatively affect interoperability between IndexDB implementations. CouchDB has a good collation specification for rich keys (any JSON type) and defers to the Unicode Collation Algorithm once it hits string comparisons. This might be a good starting point.

http://wiki.apache.org/couchdb/View_collation#Collation_Specification
http://www.unicode.org/reports/tr10/

-Mikeal
The spec already does what the bug asks for. Check WebIDL for more help with access constants defined in interfaces.
Wrong comment and closure
UPDATE: We discussed this at TPAC and arrived at a plan for allowing developers to specify the language that should be used for sorting. I'll come up with the actual spec text later, but I'm putting the plan out here in case folks have feedback:

Specifying language:
- the language used for collation is a database-wide setting
- by default databases use binary collation
- developers can set the collation language in a setVersion transaction
- IDBDatabase exposes setLanguage(string) (only callable from setVersion) and a "language" attribute that can be queried at any time

How the database uses the language specification:
- implementations should honor the language specification, but at this time we won't pick a particular algorithm for doing so

As a side note, I asked around to see what the state of international language support in JavaScript was. It looks like there is some work ongoing (I heard "ECMAString i18n" but not a lot of details), but it won't land for some time.

Anything missing? I'll come up with spec language next week.
Are you planning on working on this soon? If not, maybe you should assign it to dave.null@w3.org
The ECMAScript I18N API proposal (mentioned in comment #3) is at http://wiki.ecmascript.org/doku.php?id=strawman:i18n_api

Locale-dependent collation support is one of the APIs we have consensus on supporting. See the section on LocaleInfo.Collator. The current proposal was significantly scaled down from the original proposal to include only what can be implemented 'quickly' by multiple implementors. We're internally referring to it as v 0.5.

As for comment 3, 'language' may not be the best term to use. BCP 47 locale identifiers ( ftp://ftp.rfc-editor.org/in-notes/bcp/bcp47.txt ) would be better, because language alone is not sufficient to specify the sorting order. For instance, German has two sorting orders (dictionary and phonebook), while Chinese can be sorted in a few different ways (radical-stroke count, total stroke count, pinyin order). When collation in IndexedDB was brought up on public-webapps, I suggested that BCP 47 locale identifiers be used, but I haven't paid attention since.

In addition, has there been any discussion about other 'attributes' that affect collation? For instance, collation strength (whether or not to take into account accents, case-sensitivity, etc.).
To add to what Jungshik said, BCP47 defines standard extensions. The extension defined by the Unicode consortium (http://cldr.unicode.org/index/bcp47-extension) provides for fine-grained specifications of collation behavior. Examples for German:

de-u-co-phonebk         // phonebook order
de-u-kn-true            // numeric sorting, e.g. Tom2 comes before Tom12
de-u-ks-level1          // ignore accents, case differences
de-u-ks-level2          // ignore case differences
de-u-ks-level1-kc-true  // ignore accents, but not case

These can be combined, such as:

de-u-co-phonebk-kn-true-ks-level1-kc-true
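For readers who want to try the extension tags above, here is a minimal TypeScript sketch. It assumes the `Intl.Collator` API from the later ECMAScript Internationalization API, which is not the strawman LocaleInfo.Collator discussed in this bug and is not part of IndexedDB. To my knowledge `Intl.Collator` reads only some extension keys (such as `co` and `kn`) from the locale tag, so collation strength is passed through the `sensitivity` option here instead of `-u-ks-`. Results depend on the locale data shipped with the runtime.

```typescript
// Sample data, made up for illustration.
const names: string[] = ["Tom12", "Tom2", "Tom1"];

// Default German collation: digits are compared character by character.
const plain = new Intl.Collator("de");

// "-u-kn-true" asks for numeric sorting, so Tom2 comes before Tom12.
const numeric = new Intl.Collator("de-u-kn-true");

const sortedPlain = names.slice().sort(plain.compare);     // Tom1, Tom12, Tom2
const sortedNumeric = names.slice().sort(numeric.compare); // Tom1, Tom2, Tom12

// Roughly "level1" strength: ignore accent and case differences.
const base = new Intl.Collator("de", { sensitivity: "base" });

// Phonebook order; only differs from the default when the runtime
// ships the German phonebook tailoring.
const phonebook = new Intl.Collator("de-u-co-phonebk");

console.log(sortedPlain.join(", "));
console.log(sortedNumeric.join(", "));
console.log(base.compare("a", "A"), base.compare("a", "ä"));
console.log(phonebook.compare("Ärger", "Af"), plain.compare("Ärger", "Af"));
```

The combined tag from the comment can be passed the same way, but any key the runtime does not recognize is silently ignored, so checking `resolvedOptions()` on the collator is a reasonable precaution.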
It looks like we landed on the following:
- we won't have a way for developers to indicate collation
- we'll use a fixed "binary" collation
- for the specifics of sort order we'll follow what's described in section 11.8.5 of the ECMAScript spec [1].

Given this, there are no actual API changes that need to be done in the spec, but we need to describe the collation that UAs should use as above. This should go into section 3.1.3 (Keys) where the rest of the comparison concerns are discussed. Will update the spec next.

[1] http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-262.pdf
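As an illustration of what the fixed "binary" collation means in practice, the string case of section 11.8.5 compares strings by their 16-bit code unit values, with a proper prefix sorting first. A minimal TypeScript sketch follows; the function name `compareByCodeUnit` is made up for this example and is not taken from either spec.

```typescript
// Hypothetical helper: compare two strings the way the string case of the
// ECMAScript Abstract Relational Comparison Algorithm does, by code unit value.
// Returns -1, 0 or 1.
function compareByCodeUnit(a: string, b: string): number {
  const shared = Math.min(a.length, b.length);
  for (let k = 0; k < shared; k++) {
    const ca = a.charCodeAt(k);
    const cb = b.charCodeAt(k);
    // The first differing code unit decides the order.
    if (ca !== cb) return ca < cb ? -1 : 1;
  }
  // No difference in the shared part: a proper prefix sorts first.
  if (a.length === b.length) return 0;
  return a.length < b.length ? -1 : 1;
}

// Upper-case letters sort before lower-case ones (0x5A < 0x61).
console.log(compareByCodeUnit("Z", "a")); // -1

// U+1F600 is stored as the surrogate pair D83D DE00, so it sorts before
// U+FF5E even though its code point is larger.
console.log(compareByCodeUnit("\uD83D\uDE00", "\uFF5E")); // -1
```

The built-in `<` and `>` operators on two strings give the same ordering, so a separate helper is only needed when a three-way result is wanted.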
Section 3.1.3 now contains the following reference to the ECMA algorithm agreed upon in mail [1] and in comment 7. Note the second to last sentence:

]]
For purposes of comparison, all Arrays are greater than all DOMString, Date and float values; all DOMString values are greater than all Date and float values; and all Date values are greater than all float values. Values of type float are compared to other float values numerically. Values of type Date are compared to other Date values chronologically. Values of type DOMString are compared to other values of type DOMString by using the algorithm defined by step 4 of section 11.8.5, The Abstract Relational Comparison Algorithm, of the ECMAScript Language Specification [ECMA-262]. Values of type Array are compared to other values of type Array as follows:
[[

[1] http://lists.w3.org/Archives/Public/public-webapps/2011AprJun/1091.html

Resolving as FIXED
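The ordering in the quoted text can be sketched roughly as follows in TypeScript. The `Key` type and the `typeRank` and `compareKeys` names are hypothetical and only for illustration. The Array-to-Array rule is cut off in the quote above, so the sketch deliberately stops there rather than guess at it.

```typescript
// Hypothetical key type for illustration; these names are not from the spec.
type Key = number | Date | string | Key[];

// Rank of each key type, following the quoted ordering:
// float < Date < DOMString < Array.
function typeRank(key: Key): number {
  if (typeof key === "number") return 0;
  if (key instanceof Date) return 1;
  if (typeof key === "string") return 2;
  return 3; // Array
}

// Three-way comparison: negative, zero or positive.
function compareKeys(a: Key, b: Key): number {
  const ra = typeRank(a);
  const rb = typeRank(b);
  // Different types: the type ordering alone decides.
  if (ra !== rb) return ra < rb ? -1 : 1;

  if (typeof a === "number" && typeof b === "number") {
    // floats compare numerically
    return a < b ? -1 : a > b ? 1 : 0;
  }
  if (a instanceof Date && b instanceof Date) {
    // Dates compare chronologically
    const ta = a.getTime();
    const tb = b.getTime();
    return ta < tb ? -1 : ta > tb ? 1 : 0;
  }
  if (typeof a === "string" && typeof b === "string") {
    // `<` on two strings applies the Abstract Relational Comparison Algorithm
    return a < b ? -1 : a > b ? 1 : 0;
  }
  // The Array-to-Array rule is not included in the quoted excerpt.
  throw new Error("Array-to-Array comparison is not covered by this sketch");
}

console.log(compareKeys(1e9, new Date(0))); // -1: every Date is greater than every float
console.log(compareKeys([], "zzz"));        // 1: every Array is greater than every DOMString
```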