Bug 11272 - [FT] Tokenization and wildcards

Status: CLOSED WORKSFORME
Product: XPath / XQuery / XSLT
Component: Full Text 1.0
Version: Candidate Recommendation
Platform: PC, Windows NT
Priority: P2
Severity: normal
Reported: 2010-11-09 11:32:19 +0000 by tim
Last modified: 2010-11-23 16:47:18 +0000
Others on the bug: jim.melton, holstege, jmdyck, public-qt-comments

Comment 0 (id 42310), tim, 2010-11-09 11:32:19 +0000

It is also unclear whether query and search context tokenization is necessarily the same function, and how matching and implementation-defined tokenization interact.

Section 4.1 Tokenization seems to address only the requirements of search context tokenization (identification of tokens with position, sentence and paragraph), and suggests a function of the form

    declare function fts:tokenize(
        $searchContext as item(),
        $language as xs:string? )
      as element(fts:tokenInfo)* external;

$language is an argument, because Section 3.4.1 Language Option states that the language options can affect tokenization.

Section 3.2 states: "Otherwise, each of those strings is tokenized into a sequence of tokens as described in Section 4.1 Tokenization."

However, tokenization of the search tokens must use a different process, because it must vary depending on the wildcard option and doesn't attempt to identify sentence and paragraph boundaries, returning fts:queryToken values rather than fts:tokenInfo values. This suggests a function of the form:

    declare function fts:tokenizeQuery(
        $ftWordsValue as xs:string*,
        $language as xs:string?,
        $wildcardOptionEnabled as xs:boolean )
      as element(fts:queryToken)* external;

The $wildcardOptionEnabled argument specifies how the query tokenizer should handle wildcard indicators.

Is my understanding correct?

Comment 1 (id 42334), jmdyck, 2010-11-10 00:14:51 +0000

[Personal response.]

Yes, I believe your understanding is more-or-less correct.

Note that tokenization of query strings is not addressed formally by the section 4 semantics. In section 4.3 (FTContainsExpr), it occurs at step 3.b, but formal treatment is confined to what happens in step 4.
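
The distinction drawn in Comment 0 can be made concrete with a small sketch. It is written in Python rather than XQuery, because the two declarations above are `external` and have no body to run. Everything here is an assumption for illustration: the `TokenInfo` and `QueryToken` classes, the regex-based word splitting, the sentence and paragraph heuristics, and the handling of the wildcard indicators (`.`, `.?`, `.*`, `.+`, `.{n,m}`) are toy stand-ins, not the implementation-defined tokenization of any real processor. The `language` argument is accepted but ignored.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TokenInfo:
    """Toy stand-in for fts:tokenInfo: a word plus its location."""
    word: str
    position: int
    sentence: int
    paragraph: int


@dataclass(frozen=True)
class QueryToken:
    """Toy stand-in for fts:queryToken: a word with no location data."""
    word: str


# Plain word characters only.
_WORD = re.compile(r"\w+")
# Word characters mixed with wildcard indicators: ".", ".?", ".*", ".+", ".{n,m}".
_WILD = re.compile(r"(?:\w|\.(?:[?*+]|\{\d+,\d+\})?)+")


def tokenize(search_context: str, language: Optional[str] = None) -> List[TokenInfo]:
    """Search-context tokenizer: records position, sentence and paragraph."""
    tokens: List[TokenInfo] = []
    position = 0
    sentence = 0
    # Assumption: a blank line separates paragraphs.
    for paragraph, text in enumerate(search_context.split("\n\n"), start=1):
        # Assumption: a sentence ends at ".", "!" or "?" followed by whitespace.
        for chunk in re.split(r"(?<=[.!?])\s+", text.strip()):
            words = _WORD.findall(chunk)
            if not words:
                continue
            sentence += 1
            for word in words:
                position += 1
                tokens.append(TokenInfo(word.lower(), position, sentence, paragraph))
    return tokens


def tokenize_query(
    ft_words_value: List[str],
    language: Optional[str] = None,
    wildcard_option_enabled: bool = False,
) -> List[QueryToken]:
    """Query tokenizer: no sentence or paragraph tracking, wildcard-aware."""
    # With wildcards enabled, indicators stay inside the token;
    # otherwise they are dropped like any other punctuation.
    pattern = _WILD if wildcard_option_enabled else _WORD
    return [
        QueryToken(match.lower())
        for string in ft_words_value
        for match in pattern.findall(string)
    ]
```

Under these assumptions the two functions differ in exactly the ways the comment describes: only `tokenize` tracks sentences and paragraphs, and only `tokenize_query` changes behaviour with the wildcard flag.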

Comment 2 (id 42338), tim, 2010-11-10 09:03:03 +0000

Thanks. From what you've said, it seems a little strange that Section 3.2 (Search Tokens and Phrases) refers the reader back to 4.1 (Tokenization).

Comment 3 (id 42723), holstege, 2010-11-23 16:16:43 +0000

The tokenization section (4.1) does talk about tokenization generally (in particular see the note at the end of the section), and since query tokenization and document tokenization need to be consistent, we feel the forward reference is appropriate. As such, we are resolving this as "WORKSFORME". Please indicate your acceptance of this resolution by closing the bug.

Comment 4 (id 42726), tim, 2010-11-23 16:47:18 +0000

As Comment #1 points me in the right direction, I'll close the bug.

However, I see no reason why the tokens generated by tokenizing the query text need to come from a process consistent with search context tokenization, provided that fts:matchTokenInfos can mediate between them. For example, consider a cross-language IR system in which the search context text is in one language and the query text in another.

BTW, I found the definition of the semantics using XQuery functions and declarations to be very helpful. Perhaps the declarations for tokenization functions could be included in a future version of the text.
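
The cross-language case raised in Comment 4 can be illustrated with a rough sketch in which the matching step, rather than a shared tokenizer, bridges the two token streams. All names and data here are hypothetical: `match_token_infos` is only loosely modelled on the role of fts:matchTokenInfos and does not reproduce its signature or its phrase-matching semantics, and `FR_TO_EN` is an invented two-entry lexicon.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class TokenInfo:
    """Toy search-context token: a word and its position."""
    word: str
    position: int


# Hypothetical French-to-English lexicon; purely illustrative data.
FR_TO_EN: Dict[str, Set[str]] = {
    "chien": {"dog", "hound"},
    "chat": {"cat"},
}


def match_token_infos(
    search_tokens: List[TokenInfo],
    query_tokens: List[str],
    lexicon: Dict[str, Set[str]],
) -> List[TokenInfo]:
    """Return search-context tokens that match any query token.

    The query tokens may come from a different tokenizer and a different
    language; the lexicon lookup is what mediates between the two.
    Query words missing from the lexicon fall back to a literal match.
    """
    targets: Set[str] = set()
    for query_word in query_tokens:
        key = query_word.lower()
        targets |= lexicon.get(key, {key})
    return [token for token in search_tokens if token.word in targets]


# English search context, tokenized independently of the query.
document = [
    TokenInfo(word, position)
    for position, word in enumerate("the dog chased the cat".split(), start=1)
]
```

With this setup, `match_token_infos(document, ["chien"], FR_TO_EN)` selects the token for "dog" at position 2, even though the query and the search context were tokenized separately and in different languages.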