<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "https://www.w3.org/Bugs/Public/page.cgi?id=bugzilla.dtd">

<bugzilla version="5.0.4"
          urlbase="https://www.w3.org/Bugs/Public/"
          
          maintainer="sysbot+bugzilla@w3.org"
>

    <bug>
          <bug_id>11272</bug_id>
          
          <creation_ts>2010-11-09 11:32:19 +0000</creation_ts>
          <short_desc>[FT] Tokenization and wildcards</short_desc>
          <delta_ts>2010-11-23 16:47:18 +0000</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>1</classification_id>
          <classification>Unclassified</classification>
          <product>XPath / XQuery / XSLT</product>
          <component>Full Text 1.0</component>
          <version>Candidate Recommendation</version>
          <rep_platform>PC</rep_platform>
          <op_sys>Windows NT</op_sys>
          <bug_status>CLOSED</bug_status>
          <resolution>WORKSFORME</resolution>
          
          
          <bug_file_loc></bug_file_loc>
          <status_whiteboard></status_whiteboard>
          <keywords></keywords>
          <priority>P2</priority>
          <bug_severity>normal</bug_severity>
          <target_milestone>---</target_milestone>
          
          
          <everconfirmed>1</everconfirmed>
          <reporter name="Tim Mills">tim</reporter>
          <assigned_to name="Jim Melton">jim.melton</assigned_to>
          <cc>holstege</cc>
          <cc>jmdyck</cc>
          
          <qa_contact name="Mailing list for public feedback on specs from XSL and XML Query WGs">public-qt-comments</qa_contact>

      

      

      

          <comment_sort_order>oldest_to_newest</comment_sort_order>  
          <long_desc isprivate="0" >
    <commentid>42310</commentid>
    <comment_count>0</comment_count>
    <who name="Tim Mills">tim</who>
    <bug_when>2010-11-09 11:32:19 +0000</bug_when>
    <thetext>It is unclear whether query tokenization and search context tokenization are necessarily the same function, and how matching and implementation-defined tokenization interact.

Section 4.1 Tokenization seems to address only the requirements of search context tokenization (identification of tokens with position, sentence and paragraph), and suggests a function of the form

declare function fts:tokenize( $searchContext as item(),
                               $language as xs:string? ) 
  as element(fts:tokenInfo)* external;

$language is an argument, because Section 3.4.1 Language Option states that the language options can affect tokenization.

Section 3.2 states:

&quot;Otherwise, each of those strings is tokenized into a sequence of tokens as described in Section 4.1 Tokenization.&quot;

However, tokenization of the search tokens must use a different process: it must vary depending on the wildcard option, it doesn&apos;t attempt to identify sentence and paragraph boundaries, and it returns fts:queryToken values rather than fts:tokenInfo values.  This suggests a function of the form:

declare function fts:tokenizeQuery( $ftWordsValue as xs:string*,
                                    $language as xs:string?,
                                    $wildcardOptionEnabled as xs:boolean ) 
  as element(fts:queryToken)* external;

The $wildcardOptionEnabled argument specifies how the query tokenizer should handle wildcard indicators.

Is my understanding correct?</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>42334</commentid>
    <comment_count>1</comment_count>
    <who name="Michael Dyck">jmdyck</who>
    <bug_when>2010-11-10 00:14:51 +0000</bug_when>
    <thetext>[Personal response.]

Yes, I believe your understanding is more-or-less correct.

Note that tokenization of query strings is not addressed formally by the section 4 semantics. In section 4.3 (FTContainsExpr), it occurs at step 3.b, but formal treatment is confined to what happens in step 4.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>42338</commentid>
    <comment_count>2</comment_count>
    <who name="Tim Mills">tim</who>
    <bug_when>2010-11-10 09:03:03 +0000</bug_when>
    <thetext>Thanks.

From what you&apos;ve said, it seems a little strange that Section 3.2 (Search Tokens and Phrases) refers the reader to Section 4.1 (Tokenization).</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>42723</commentid>
    <comment_count>3</comment_count>
    <who name="Mary Holstege">holstege</who>
    <bug_when>2010-11-23 16:16:43 +0000</bug_when>
    <thetext>The tokenization section (4.1) does talk about tokenization generally (in particular see the note at the end of the section), and since query tokenization and document tokenization need to be consistent, we feel the forward reference is appropriate.

As such, we are resolving this as &quot;WORKSFORME&quot;. Please indicate your acceptance of this resolution by closing the bug.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>42726</commentid>
    <comment_count>4</comment_count>
    <who name="Tim Mills">tim</who>
    <bug_when>2010-11-23 16:47:18 +0000</bug_when>
    <thetext>As Comment #1 points me in the right direction, I&apos;ll close the bug.

However, I see no reason why tokenization of the query text needs to use a process consistent with the one used for the search context, provided that fts:matchTokenInfos can mediate between them.  For example, consider a cross-language IR system in which the search context text is in one language and the query text in another.

BTW, I found the definition of the semantics using XQuery functions and declarations to be very helpful.  Perhaps the declarations for tokenization functions could be included in a future version of the text.</thetext>
  </long_desc>
      
      

    </bug>

</bugzilla>