<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "https://www.w3.org/Bugs/Public/page.cgi?id=bugzilla.dtd">

<bugzilla version="5.0.4"
          urlbase="https://www.w3.org/Bugs/Public/"
          
          maintainer="sysbot+bugzilla@w3.org"
>

    <bug>
          <bug_id>11885</bug_id>
          
          <creation_ts>2011-01-27 08:59:00 +0000</creation_ts>
          <short_desc>[XQFTTS] english-stems.txt stemming dictionary</short_desc>
          <delta_ts>2011-06-28 07:28:14 +0000</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>1</classification_id>
          <classification>Unclassified</classification>
          <product>XPath / XQuery / XSLT</product>
          <component>Full Text 1.0</component>
          <version>Proposed Recommendation</version>
          <rep_platform>PC</rep_platform>
          <op_sys>Windows NT</op_sys>
          <bug_status>CLOSED</bug_status>
          <resolution>FIXED</resolution>
          
          
          <bug_file_loc></bug_file_loc>
          <status_whiteboard></status_whiteboard>
          <keywords></keywords>
          <priority>P2</priority>
          <bug_severity>normal</bug_severity>
          <target_milestone>---</target_milestone>
          
          
          <everconfirmed>1</everconfirmed>
          <reporter name="Tim Mills">tim</reporter>
          <assigned_to name="Jim Melton">jim.melton</assigned_to>
          <cc>holstege</cc>
          <cc>paul</cc>
          
          <qa_contact name="Mailing list for public feedback on specs from XSL and XML Query WGs">public-qt-comments</qa_contact>

      

      

      

          <comment_sort_order>oldest_to_newest</comment_sort_order>  
          <long_desc isprivate="0" >
    <commentid>44787</commentid>
    <comment_count>0</comment_count>
    <who name="Tim Mills">tim</who>
    <bug_when>2011-01-27 08:59:00 +0000</bug_when>
    <thetext>The file &quot;english-stems.txt&quot; contains stemming rules only for lower case text.  However, the specification clearly states that the &quot;Stemming Option must be applied before the Case Option and the Diacritics Option&quot;.

So when tokenizing the string &quot;Dogs and Cats&quot; with stemming, the tokens presented to the tokenizer must be &quot;Dogs&quot;, &quot;and&quot;, &quot;Cats&quot;.

The guidelines for running XQFTTS state that the &quot;stemming-dictionary is a plain text file containing lines of whitespace-separated tokens. Each token on the line should stem to the first token on the line.&quot;
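
As a minimal sketch of the dictionary format described in the guidelines, and assuming a plain case-sensitive lookup, the following Python shows why a lowercase-only dictionary leaves Dogs unstemmed. The names load_stems and stem, and the sample entries, are hypothetical and not taken from the test suite.

```python
def load_stems(text):
    """Map every token on a line to the first token on that line."""
    stems = {}
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        for token in tokens:
            stems[token] = tokens[0]
    return stems


def stem(token, stems):
    """Case-sensitive lookup; unknown tokens are returned unchanged."""
    return stems.get(token, token)


# Hypothetical sample entries, lowercase only except for AIDS.
SAMPLE = """\
dog dogs
cat cats
AIDS
aid aids
"""

stems = load_stems(SAMPLE)

# Stemming runs first on the original-case token; case folding runs afterwards.
for token in ["Dogs", "dogs", "AIDS", "aids"]:
    stemmed = stem(token, stems)
    print(token, stemmed, stemmed.lower())
# Dogs Dogs dogs
# dogs dog dog
# AIDS AIDS aids
# aids aid aid
```

With these sample entries, Dogs and dogs end up as different tokens even after case folding, which is the mismatch reported here.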

Note that it is conceivable that the stemming dictionary might stem &quot;AIDS&quot; to &quot;AIDS&quot; but &quot;aids&quot; to &quot;aid&quot;.  This would be a useful test of the order of application of stemming and case options.  Presumably the test suite doesn&apos;t currently test this.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>45935</commentid>
    <comment_count>1</comment_count>
    <who name="Mary Holstege">holstege</who>
    <bug_when>2011-02-22 19:44:09 +0000</bug_when>
    <thetext>Correct, stemming should be case-sensitive.  I have added two additional tests
ft-matchoptions-q5 and ft-matchoptions-q6 to test this case.  Please indicate your satisfaction with this resolution by closing this bug.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>45937</commentid>
    <comment_count>2</comment_count>
    <who name="Paul J. Lucas">paul</who>
    <bug_when>2011-02-22 20:26:56 +0000</bug_when>
    <thetext>So how would one do case-insensitive stemming?</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>45954</commentid>
    <comment_count>3</comment_count>
    <who name="Tim Mills">tim</who>
    <bug_when>2011-02-23 10:18:56 +0000</bug_when>
    <thetext>(In reply to comment #1)
&gt; Correct, stemming should be case-sensitive.  I have added two additional tests
&gt; ft-matchoptions-q5 and ft-matchoptions-q6 to test this case.  Please indicate
&gt; your satisfaction with this resolution by closing this bug.

Thanks.  However, for some tests to pass, english-stems.txt needs the following lines added.


Dog Dogs
Cat Cats
Improve Improves Improving Improved
Test Tests Testing Tested 
Plan Planning
Conduct Conducting</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>45957</commentid>
    <comment_count>4</comment_count>
    <who name="Tim Mills">tim</who>
    <bug_when>2011-02-23 10:37:44 +0000</bug_when>
    <thetext>(In reply to comment #2)
&gt; So how would one do case-insensitive stemming?

Assuming that

lowercase(AB) = ab
lowercase(Ab) = ab
lowercase(aB) = ab
lowercase(ab) = ab

one would ensure that if the implementation&apos;s stemming algorithm was such that

stem(AB) = AB

then 

stem(Ab) = Ab
stem(aB) = aB
stem(ab) = ab
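
The condition above can be checked mechanically. The following Python is only a sketch under assumptions: str.lower stands in for lowercase(), and case_variants, is_case_consistent and the two toy stemmers are hypothetical names, not part of any implementation.

```python
from itertools import product


def case_variants(word):
    """Return every upper/lower case spelling of word, e.g. ab, aB, Ab, AB."""
    choices = [(c.lower(), c.upper()) for c in word]
    return {"".join(chars) for chars in product(*choices)}


def is_case_consistent(stem, word):
    """True if lowercasing the stem of every case variant gives one result."""
    return len({stem(v).lower() for v in case_variants(word)}) == 1


def identity_stem(token):
    # Toy stemmer with stem(AB) = AB, stem(Ab) = Ab, and so on.
    return token


def lowercase_only_stem(token):
    # Toy stemmer that only knows the lowercase spelling.
    return {"aids": "aid"}.get(token, token)


print(is_case_consistent(identity_stem, "ab"))          # True
print(is_case_consistent(lowercase_only_stem, "aids"))  # False
```

The second stemmer fails the check because aids stems to aid while AIDS stays AIDS, so lowercasing the stems yields both aid and aids.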

Thus when the case option is case-insensitive, applying the case option to the stem would always return &apos;ab&apos; for each of AB, Ab, aB and ab.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>48566</commentid>
    <comment_count>5</comment_count>
    <who name="Mary Holstege">holstege</who>
    <bug_when>2011-05-17 15:29:47 +0000</bug_when>
    <thetext>Made the additions to english-stems.txt.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>49244</commentid>
    <comment_count>6</comment_count>
    <who name="Tim Mills">tim</who>
    <bug_when>2011-06-07 07:43:05 +0000</bug_when>
    <thetext>english-stems.txt is still missing

Test Tests Testing Tested 

which is required for the following tests.

ftstaticcontext-q2
ftstaticcontext-q4
ftstaticcontext-q5
stemming-queries-results-q1	  
stemming-queries-results-q1b	  
xquery-xpath-composability-queries-results-q2</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>50385</commentid>
    <comment_count>7</comment_count>
    <who name="Mary Holstege">holstege</who>
    <bug_when>2011-06-28 04:14:41 +0000</bug_when>
    <thetext>Done, per comment 6.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>50388</commentid>
    <comment_count>8</comment_count>
    <who name="Tim Mills">tim</who>
    <bug_when>2011-06-28 07:28:14 +0000</bug_when>
    <thetext>Confirmed fixed.  Thanks.</thetext>
  </long_desc>
      
      

    </bug>

</bugzilla>