W3C Architecture Domain
W3C Internationalization (I18n) Activity: Making the World Wide Web truly world wide!

Internationalization Comments on XQuery 1.0 and XPath 2.0 Full-Text 1.0

Version reviewed: http://www.w3.org/TR/2007/WD-xpath-full-text-10-20070518/
Lead reviewer and date of initial review: Felix Sasaki, Jun 2007

These are comments on behalf of the Internationalization Core WG, unless otherwise stated. The "Owner" column indicates who has been assigned the responsibility of tracking discussions on a given comment.

We recommend that responses to the comments in this table use a separate email for each point. This makes it far easier to track threads.

ID | Location | Subject | Comment | Owner
1 | 3.3.6 | Language matching in XPath/XQuery

Section 3.3.6 defines a language option that is used to select language-specific behaviors, such as the choice of a stop word list. Because the specific language requested may not be available, we believe that implementations should be advised to implement one of the matching schemes defined in BCP 47 (in RFC 4647) when selecting content or behavior.

We suggest the following wording:

The language option specified might not exactly match the available language resources. An implementation MAY use language tag matching (such as one of the algorithms defined in [BCP 47, currently "Part II, RFC 4647"]) to determine the best available match. Matching and defaulting behavior are implementation defined.
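
To illustrate the kind of matching we have in mind, the following is a minimal sketch of the Lookup scheme from RFC 4647 (section 3.4), written in Python. The function name lookup, the list stop_word_lists and the default value are our own illustrative assumptions; they do not come from the Full-Text draft, and the sketch does not handle wildcard ranges.

```python
from typing import Iterable, Optional


def lookup(language_range: str,
           available: Iterable[str],
           default: Optional[str] = None) -> Optional[str]:
    """Return the best available tag for a language range (RFC 4647 Lookup style)."""
    # Language tags compare case-insensitively; keep the original spelling.
    by_lower = {tag.lower(): tag for tag in available}
    subtags = language_range.lower().split("-")
    while subtags:
        candidate = "-".join(subtags)
        if candidate in by_lower:
            return by_lower[candidate]
        # Truncate the range by removing its last subtag.
        subtags.pop()
        # A single-character subtag (such as "x") is removed together
        # with the subtag that followed it.
        while subtags and len(subtags[-1]) == 1:
            subtags.pop()
    # Nothing matched: fall back to the (implementation-defined) default.
    return default


# Hypothetical set of languages for which stop word lists exist.
stop_word_lists = ["en", "fr", "de-CH"]

print(lookup("fr-CA", stop_word_lists))             # falls back to "fr"
print(lookup("de-CH-1996", stop_word_lists))        # falls back to "de-CH"
print(lookup("ja", stop_word_lists, default="en"))  # no match, uses the default
```

With such a scheme, a query that requests "fr-CA" can still use a generic French stop word list instead of failing or silently using no list at all.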

2 | 1.1 | Notion of language

The second numbered item in the first list contains:

There is an expectation that a full-text search will support language-based searches which substring search cannot.

We propose to add at the end of that list item the following sentence:

Note that language is used as a broad term here and throughout this document. Language information can encompass information about the user's language, region, scripts, variants etc., which all influence the search result. Compare, for example, a search for color, which gives a result for American English but not for British English.

Rationale: We think that readers should be made aware of the information beyond language used in search.

3 | 2.2.1 and 3.3.6 | xml:lang vs. language option in a query

XQuery 1.0 and XPath 2.0 Full-Text allows using language information as an input to a query, to trigger e.g. the choice of a language-specific stop word list:

//p ftcontains "salon de the"
with default stop words language "fr"

Language information can be given in the static context as described in sec. 3.3.6, in a query as described in sec. 2.2.1, and via xml:lang in the target document of a query.

We think that the relation between language information given via xml:lang and other language information (given statically or in a query) should be made explicit. An example of a statement where this relation is not explicit is in sec. 2.2.1:

An XQuery 1.0 and XPath 2.0 Full-Text processor SHOULD try to use the information available in xml:lang for processing of collations, as well as the various match options defined in Section 3.3 Match Options.

Questions which arise are: what has higher precedence (xml:lang or the language option)? Do you assume inheritance of the language option and / or xml:lang? Given a query like

//p ftcontains "salon de the"
with default stop words language "fr"

and xml:lang="en" at the root of the document, what language would you assume for salon de the in each of the following?

<p>salon de the</p>

<p>... <phrase xml:lang="en">salon de the</phrase> ...</p>

In 3.3.6 you write that the relation between xml:lang and the language option is implementation-defined. However, we think that you should specify the precedence / inheritance behaviour of the language option and xml:lang.
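
To make the ambiguity concrete, here is a small Python sketch under stated assumptions. It computes the language each element inherits via xml:lang and then applies one of two hypothetical precedence policies. The names effective_languages, resolve and prefer_document, and the doc wrapper element, are ours; the draft defines neither policy.

```python
import xml.etree.ElementTree as ET
from typing import Iterator, Optional, Tuple

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def effective_languages(elem: ET.Element,
                        inherited: Optional[str] = None
                        ) -> Iterator[Tuple[str, Optional[str]]]:
    """Yield (element name, language) pairs, applying xml:lang inheritance."""
    lang = elem.get(XML_LANG, inherited)
    if lang == "":
        # xml:lang="" means that no language information is available.
        lang = None
    yield elem.tag, lang
    for child in elem:
        yield from effective_languages(child, lang)


def resolve(document_lang: Optional[str],
            query_lang: Optional[str],
            prefer_document: bool) -> Optional[str]:
    """Pick a language under one of two hypothetical precedence policies."""
    if prefer_document and document_lang is not None:
        return document_lang
    return query_lang if query_lang is not None else document_lang


# The two paragraphs from the comment, wrapped in an assumed root element.
DOCUMENT = (
    '<doc xml:lang="en">'
    '<p>salon de the</p>'
    '<p>... <phrase xml:lang="en">salon de the</phrase> ...</p>'
    '</doc>'
)

for name, doc_lang in effective_languages(ET.fromstring(DOCUMENT)):
    print(name,
          resolve(doc_lang, "fr", prefer_document=True),
          resolve(doc_lang, "fr", prefer_document=False))
```

Every element here inherits "en" from the document, so the first policy selects English stop words and the second selects French ones for the same query. This is the difference in results that we would like the specification to settle.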

4 | General | Need for the term "word"?

Throughout the document, you use the terms "word" and "token" interchangeably. We propose to drop the term "word", since (as you note in sec. 1.1) it is language-specific. This proposal includes renaming expressions like FTWords to FTTokens or FTTokenValues.

The background of this proposal becomes obvious in sec. 3.2. That section defines the FTWords expression, but actually uses only the notion of tokens.

5 | 3.3.6 | xs:language vs. xml:lang

You write:

The StringLiteral following the keyword language designates one language. It must be castable to "xs:language"

We propose to use the data type definition for xml:lang instead of xs:language, that is: a union type of xs:language and the empty string. The empty string is necessary to express that no language information is associated with the content. This can be useful to trigger e.g. language-independent tokenization.
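
As a rough illustration of the proposed union type, the following Python sketch accepts either the empty string or a value matching the lexical pattern that XML Schema Part 2 gives for xs:language. The function name is_valid_language_option is our own assumption, and the check is purely lexical: it does not verify that the subtags are registered.

```python
import re

# Lexical pattern of xs:language as given in XML Schema Part 2: Datatypes.
XS_LANGUAGE = re.compile(r"[a-zA-Z]{1,8}(-[a-zA-Z0-9]{1,8})*")


def is_valid_language_option(value: str) -> bool:
    """Accept the empty string (no language) or a lexically valid xs:language."""
    return value == "" or XS_LANGUAGE.fullmatch(value) is not None


for candidate in ["fr", "de-CH-1996", "", "fr_CA"]:
    print(repr(candidate), is_valid_language_option(candidate))
```

Under this definition, an empty value is legal and can be read as "no language information", which a processor could use to select language-independent tokenization.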

6 | 3.3.5 | White space in stop words

You describe a stop word as a string literal. We wonder whether a stop word is allowed to contain white space. In the last example in sec. 3.3.5, it looks as if white space is used as a separator between the stop words in "the then". This contradicts the description that stop words are a comma-separated list of string literals. The list really should be strictly comma-separated.


Contact: Richard Ishida (ishida@w3.org).