This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Bug 14227 - [FT3] Full Text language option should address synonymy
Status: RESOLVED FIXED
Alias: None
Product: XPath / XQuery / XSLT
Classification: Unclassified
Component: Full Text 3.0
Version: Working drafts
Hardware: PC All
Importance: P2 normal
Target Milestone: ---
Assignee: Mary Holstege
QA Contact: Mailing list for public feedback on specs from XSL and XML Query WGs
 
Reported: 2011-09-20 17:48 UTC by C. M. Sperberg-McQueen
Modified: 2012-01-24 16:29 UTC

Description C. M. Sperberg-McQueen 2011-09-20 17:48:35 UTC
In the joint call of 20 September I was asked to raise a bug against Full Text's description of the language option.  Specifically, the text of the section on the language option needs to address the question of what to do when there are both two- and three-letter codes for a language (i.e. which should be used?)  The text of any description of the feature names used for language support, as sketched in Mary Holstege's mail at

  http://lists.w3.org/Archives/Member/w3c-xsl-query/2011Sep/0224.html

may also need to address this question -- at the very least it should be consistent with the language option.

The value of the language option is required to be castable to xs:language, which means that its semantics eventually are based on RFC 3066 (in XSD 1.0) or its successor BCP 47 (in XSD 1.1).  
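
As a rough illustration of why castability alone does not settle the matter, here is a minimal Python sketch of the xs:language lexical check. It assumes the pattern [a-zA-Z]{1,8}(-[a-zA-Z0-9]{1,8})* as given for xs:language in XSD 1.0 Second Edition and XSD 1.1; the function name is made up for this example and is not part of any spec or library.

```python
import re

# Lexical pattern assumed for xs:language (XSD 1.0 Second Edition / XSD 1.1).
# It checks shape only and never consults the IANA Language Subtag Registry.
XS_LANGUAGE = re.compile(r"[a-zA-Z]{1,8}(-[a-zA-Z0-9]{1,8})*")


def is_castable_to_xs_language(value: str) -> bool:
    """Hypothetical helper: True if value fits the xs:language lexical form."""
    # xs:language derives from xs:token, so whitespace is collapsed first.
    collapsed = " ".join(value.split())
    return XS_LANGUAGE.fullmatch(collapsed) is not None
```

Under this check both 'de' and 'deu' are acceptable, as is 'en-GB', so the datatype by itself does not say which of two codes for the same language is preferred.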

BCP 47 already addresses the question of preferring the two- or three-letter codes; it describes rules for a Preferred-Value field in the IANA Language Subtag Registry.  So in some sense, if we assume that the recommendations of BCP 47 are binding on the formulation of values for the language option and features, we may infer that FT already addresses the topic and there is not really any bug here.  

Empirically, however, today's call provides some evidence for the claim that the FT spec does not make its position on the matter adequately clear.  So perhaps it would be a good idea if the description of the language option, and the description of the class of feature names based on the language option, were to mention explicitly that where the relevant RFCs define more than one code for a language or language-locale combination, the provisions of BCP 47 regarding preferred values SHOULD be followed.  It would be nice if we could then say "For example, prefer 'deu' to 'de'", or "For example, prefer 'de' to 'deu'" -- that would require that someone actually wade through the details of BCP 47 and come out the other side with an answer to that question.  

It might also be helpful to remind readers (with an example, or in a note) that the values of the language option might include codes like 'en-US', 'en-CA', and 'en-GB' for a hypothetical implementation with three different tokenizers for U.S. English, Canadian English, and British English.  Note:  I think 'en-GB' is the right way to say 'British English' but if it's not, please substitute the correct way to say it.
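
A minimal Python sketch of the hypothetical implementation described above, with one tokenizer per regional variant of English. Everything here is an assumption made for illustration: the Tokenizer class, the registry, and the whitespace-splitting behaviour are placeholders, not anything the FT spec defines.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Tokenizer:
    """Placeholder tokenizer; a real one would apply variant-specific rules."""
    name: str

    def tokenize(self, text: str) -> List[str]:
        # Stand-in behaviour: split on whitespace only.
        return text.split()


# Hypothetical registry, keyed by lower-cased language tag.
TOKENIZERS: Dict[str, Tokenizer] = {
    "en-us": Tokenizer("U.S. English"),
    "en-ca": Tokenizer("Canadian English"),
    "en-gb": Tokenizer("British English"),
}


def select_tokenizer(language: str) -> Tokenizer:
    """Pick the tokenizer for a language option value, ignoring case."""
    try:
        return TOKENIZERS[language.lower()]
    except KeyError:
        raise LookupError(f"unsupported language option: {language!r}") from None
```

With this registry, select_tokenizer("en-GB") and select_tokenizer("EN-gb") return the same tokenizer, while a value such as "fr" is rejected as unsupported.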

Comment 1 Mary Holstege 2012-01-24 16:29:44 UTC
Done.
Added the following text to the language option section:

An implementation MUST treat language identifiers that [BCP 47] defines as equivalent as identifying the same language. For example "mn" and "MN" are equivalent, as language tags are case insensitive, and "de" and "deu" are equivalent, as they are different codes for the same language. However, it is implementation-defined whether an implementation treats a particular language identifier with script, region, or variant portions as equivalent to the language identifier without them. For example, an implementation may treat "en-UK" as equivalent to "en" and "en-US" but "sr-Latn" as different from "sr" and "sr-Cyrl".

This text is also referenced in the section on feature setting for language features.
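
The comparison rule in the added text could be sketched roughly as follows in Python. This is an illustration under stated assumptions, not the spec's algorithm: the three-letter-to-two-letter table is a small hand-picked excerpt of ISO 639-2 codes, a real implementation would more likely consult the IANA Language Subtag Registry, and the function names and the ignore_subtags_for parameter are invented here to model the implementation-defined choice.

```python
from typing import AbstractSet

# Hand-picked excerpt (assumption): ISO 639-2 codes and their two-letter forms.
ALPHA3_TO_ALPHA2 = {"deu": "de", "ger": "de", "eng": "en", "mon": "mn", "srp": "sr"}


def canonical_language(tag: str) -> str:
    """Lower-case the tag and map a three-letter primary subtag to two letters."""
    subtags = tag.lower().split("-")
    subtags[0] = ALPHA3_TO_ALPHA2.get(subtags[0], subtags[0])
    return "-".join(subtags)


def same_language(a: str, b: str,
                  ignore_subtags_for: AbstractSet[str] = frozenset()) -> bool:
    """Compare two language identifiers.

    ignore_subtags_for models the implementation-defined part: primary
    languages for which script, region, and variant subtags are ignored.
    """
    ca, cb = canonical_language(a), canonical_language(b)
    if ca == cb:
        return True
    primary_a, primary_b = ca.split("-")[0], cb.split("-")[0]
    return primary_a == primary_b and primary_a in ignore_subtags_for
```

For instance, same_language("mn", "MN") and same_language("de", "deu") hold regardless of configuration, whereas same_language("en-UK", "en-US") holds only when "en" is passed in ignore_subtags_for, and "sr-Latn" stays distinct from "sr" and "sr-Cyrl" unless "sr" is listed as well.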