11272 – [FT] Tokenization and wildcards

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Summary: [FT] Tokenization and wildcards

Status:	CLOSED WORKSFORME

Alias:	None

Product:	XPath / XQuery / XSLT
Classification:	Unclassified
Component:	Full Text 1.0
Version:	Candidate Recommendation
Hardware:	PC Windows NT

Importance:	P2 normal
Target Milestone:	---
Assignee:	Jim Melton
QA Contact:	Mailing list for public feedback on specs from XSL and XML Query WGs


Reported:	2010-11-09 11:32 UTC by Tim Mills
Modified:	2010-11-23 16:47 UTC
CC List:	2 users

Description Tim Mills 2010-11-09 11:32:19 UTC

It is also unclear whether query and search context tokenization is necessarily the same function, and how matching and implementation-defined tokenization interact.

Section 4.1 Tokenization seems to address only the requirements of search context tokenization (identification of tokens with position, sentence and paragraph), and suggests a function of the form

declare function fts:tokenize( $searchContext as item(),
                               $language as xs:string? ) 
  as element(fts:tokenInfo)* external;

$language is an argument, because Section 3.4.1 Language Option states that the language options can affect tokenization.

Section 3.2 states:

"Otherwise, each of those strings is tokenized into a sequence of tokens as described in Section 4.1 Tokenization. "

However, tokenization of the search tokens must use a different process: it must vary depending on the wildcard option, it doesn't attempt to identify sentence and paragraph boundaries, and it returns fts:queryToken values rather than fts:tokenInfo values.  This suggests a function of the form:

declare function fts:tokenizeQuery( $ftWordsValue as xs:string*,
                                    $language as xs:string?,
                                    $wildcardOptionEnabled as xs:boolean ) 
  as element(fts:queryToken)* external;

The $wildcardOptionEnabled argument specifies how the query tokenizer should handle wildcard indicators.
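
To make the contrast between the two proposed functions concrete, here is a minimal Python sketch (Python only because the declarations above are external and cannot be run). Everything in it is an assumption for illustration: the TokenInfo and QueryToken classes only loosely mirror fts:tokenInfo and fts:queryToken, the regular expressions stand in for implementation-defined tokenization, $language is accepted but ignored, and only the "." indicator with the "?", "*" and "+" qualifiers is recognised.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenInfo:
    """Hypothetical stand-in for fts:tokenInfo."""
    word: str
    pos: int
    sentence: int
    para: int


@dataclass(frozen=True)
class QueryToken:
    """Hypothetical stand-in for fts:queryToken."""
    word: str
    pos: int
    is_pattern: bool


def tokenize_search_context(text, language=None):
    """Search context tokenizer: records position, sentence and paragraph.

    Assumption: blank lines separate paragraphs and ., ! or ? followed by
    whitespace ends a sentence. `language` is ignored in this sketch.
    """
    tokens = []
    pos = 0
    sentence = 0
    paragraphs = re.split(r"\n\s*\n", text.strip())
    for para_no, para in enumerate(paragraphs, start=1):
        for sent in re.split(r"(?<=[.!?])\s+", para.strip()):
            if not sent:
                continue
            sentence += 1
            for word in re.findall(r"\w+", sent):
                pos += 1
                tokens.append(TokenInfo(word, pos, sentence, para_no))
    return tokens


def tokenize_query(ft_words_value, language=None, wildcards_enabled=False):
    """Query tokenizer: no sentence or paragraph tracking.

    With wildcards enabled, "." and the qualifiers ?, * and + are kept as
    part of the token; otherwise they are treated as punctuation.
    """
    pattern = r"[\w.?*+]+" if wildcards_enabled else r"\w+"
    tokens = []
    pos = 0
    for value in ft_words_value:
        for word in re.findall(pattern, value):
            pos += 1
            is_pattern = wildcards_enabled and "." in word
            tokens.append(QueryToken(word, pos, is_pattern))
    return tokens
```

Under these assumptions, the query string "improv.* search" yields the tokens "improv.*" and "search" when wildcards are enabled, but "improv" and "search" when they are not, which is why the wildcard option has to be visible to the query tokenizer.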

Is my understanding correct?

Comment 1 Michael Dyck 2010-11-10 00:14:51 UTC

[Personal response.]

Yes, I believe your understanding is more-or-less correct.

Note that tokenization of query strings is not addressed formally by the section 4 semantics. In section 4.3 (FTContainsExpr), it occurs at step 3.b, but formal treatment is confined to what happens in step 4.

Comment 2 Tim Mills 2010-11-10 09:03:03 UTC

Thanks.

From what you've said, it seems a little strange that Section 3.2 (Search Tokens and Phrases) refers the reader back to 4.1 (Tokenization).

Comment 3 Mary Holstege 2010-11-23 16:16:43 UTC

The tokenization section (4.1) does talk about tokenization generally (in particular see the note at the end of the section), and since query tokenization and document tokenization need to be consistent, we feel the forward reference is appropriate.

As such, we are resolving this as "WORKSFORME". Please indicate your acceptance of this resolution by closing the bug.

Comment 4 Tim Mills 2010-11-23 16:47:18 UTC

As Comment #1 points me in the right direction, I'll close the bug.

However, I see no reason why tokenization of the query text needs to use a process consistent with that used for the search context, provided that fts:matchTokenInfos can mediate between them.  For example, consider a cross-language IR system in which the search context text is in one language and the query text in another.
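
As a rough illustration of the cross-language case, the Python sketch below uses separate tokenizers for the query and the search context and lets the matching step bridge them. It is only a sketch under stated assumptions: match_token_infos is a hypothetical function loosely modelled on the role of fts:matchTokenInfos, not its actual signature, and LEXICON is an invented French-to-English lookup table.

```python
import re

# Hypothetical bilingual lexicon: French query terms -> English document terms.
LEXICON = {"chat": {"cat"}, "chien": {"dog", "hound"}}


def tokenize_context(text):
    """Search context tokenizer: (lower-cased word, position) pairs."""
    words = re.findall(r"\w+", text)
    return [(w.lower(), i) for i, w in enumerate(words, start=1)]


def tokenize_query(text):
    """Query tokenizer: independent of the search context tokenizer."""
    return [w.lower() for w in re.findall(r"\w+", text)]


def match_token_infos(context_tokens, query_tokens, lexicon):
    """Return the position lists where the query tokens match as a phrase.

    A query token matches a context token if the context word is one of
    its lexicon translations, or equals it when the lexicon has no entry.
    """
    if not query_tokens:
        return []
    n = len(query_tokens)
    hits = []
    for start in range(len(context_tokens) - n + 1):
        window = context_tokens[start:start + n]
        if all(word in lexicon.get(q, {q})
               for q, (word, _) in zip(query_tokens, window)):
            hits.append([pos for _, pos in window])
    return hits
```

Here the two tokenizers never need to agree on anything except the shape of their output, because all of the reconciliation happens in the matching function.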

BTW, I found the definition of the semantics using XQuery functions and declarations to be very helpful.  Perhaps the declarations for tokenization functions could be included in a future version of the text.