This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 11885 - [XQFTTS] english-stems.txt stemming dictionary
Summary: [XQFTTS] english-stems.txt stemming dictionary
Status: CLOSED FIXED
Alias: None
Product: XPath / XQuery / XSLT
Classification: Unclassified
Component: Full Text 1.0
Version: Proposed Recommendation
Hardware: PC Windows NT
Importance: P2 normal
Target Milestone: ---
Assignee: Jim Melton
QA Contact: Mailing list for public feedback on specs from XSL and XML Query WGs
Reported: 2011-01-27 08:59 UTC by Tim Mills
Modified: 2011-06-28 07:28 UTC

Description Tim Mills 2011-01-27 08:59:00 UTC
The file "english-stems.txt" contains stemming rules only for lower case text.  However, the specification clearly states that the "Stemming Option must be applied before the Case Option and the Diacritics Option".

So when tokenizing the string "Dogs and Cats" with stemming, the tokens presented to the tokenizer must be "Dogs", "and", "Cats".

The guidelines for running XQFTTS state that the "stemming-dictionary is a plain text file containing lines of whitespace-separated tokens. Each token on the line should stem to the first token on the line."

Note that it is conceivable that the stemming dictionary might stem "AIDS" to "AIDS" but "aids" to "aid".  This would be a useful test of the order of application of stemming and case options.  Presumably the test suite doesn't currently test this.
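
As a rough illustration of the point above, here is a minimal Python sketch. It assumes only the dictionary format quoted from the guidelines (whitespace-separated tokens, each stemming to the first token on its line). The function names and the sample dictionary content are hypothetical and are not taken from the test suite or from english-stems.txt.

```python
def load_stemming_dictionary(text):
    """Map every token on a line to the first token on that line."""
    stems = {}
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        stem = tokens[0]
        for token in tokens:
            stems[token] = stem
    return stems


def stem_then_apply_case(tokens, stems, case_insensitive=False):
    """Apply stemming first, then the case option (the order the spec requires)."""
    stemmed = [stems.get(token, token) for token in tokens]
    if case_insensitive:
        return [token.lower() for token in stemmed]
    return stemmed


# Hypothetical sample content; NOT the real english-stems.txt.
SAMPLE = """\
dog dogs
cat cats
aid aids
AIDS
"""

stems = load_stemming_dictionary(SAMPLE)

# A lower-case-only dictionary leaves "Dogs" and "Cats" unstemmed.
print(stem_then_apply_case(["Dogs", "and", "Cats"], stems))

# A case-sensitive dictionary can stem "AIDS" and "aids" differently.
print(stem_then_apply_case(["AIDS", "aids"], stems))
```

With the sample above, "Dogs" and "Cats" pass through unchanged because only the lower-case forms are listed, while "AIDS" and "aids" end up with different stems.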
Comment 1 Mary Holstege 2011-02-22 19:44:09 UTC
Correct, stemming should be case-sensitive.  I have added two additional tests
ft-matchoptions-q5 and ft-matchoptions-q6 to test this case.  Please indicate your satisfaction with this resolution by closing this bug.
Comment 2 Paul J. Lucas 2011-02-22 20:26:56 UTC
So how would one do case-insensitive stemming?
Comment 3 Tim Mills 2011-02-23 10:18:56 UTC
(In reply to comment #1)
> Correct, stemming should be case-sensitive.  I have added two additional tests
> ft-matchoptions-q5 and ft-matchoptions-q6 to test this case.  Please indicate
> your satisfaction with this resolution by closing this bug.

Thanks.  However, for some tests to pass, english-stems.txt needs the following lines added.

Dog Dogs
Cat Cats
Improve Improves Improving Improved
Test Tests Testing Tested
Plan Planning
Conduct Conducting
Comment 4 Tim Mills 2011-02-23 10:37:44 UTC
(In reply to comment #2)
> So how would one do case-insensitive stemming?

Assuming that

lowercase(AB) = ab
lowercase(Ab) = ab
lowercase(aB) = ab
lowercase(ab) = ab

one would ensure that if the implementation's stemming algorithm was such that

stem(AB) = AB

then 

stem(Ab) = Ab
stem(aB) = aB
stem(ab) = ab

Thus when the case option is case-insensitive, applying the case option to the stem would always return 'ab' for each of AB, Ab, aB and ab.
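
The condition above can be checked mechanically. The following Python sketch is illustrative only: it assumes a stemmer is any function from token to token, and the two example stemmers and all names are hypothetical.

```python
from itertools import product


def case_variants(word):
    """All upper/lower-case spellings of word, e.g. 'ab' -> AB, Ab, aB, ab."""
    choices = [(ch.lower(), ch.upper()) for ch in word]
    return {"".join(combo) for combo in product(*choices)}


def folds_consistently(stem, word):
    """True if lowercase(stem(v)) is the same for every case variant v of word."""
    return len({stem(v).lower() for v in case_variants(word)}) == 1


def identity_stem(token):
    # Hypothetical stemmer satisfying stem(AB) = AB, stem(Ab) = Ab, and so on.
    return token


def lowercase_only_stem(token):
    # Hypothetical stemmer that only knows the lower-case form.
    return {"aids": "aid"}.get(token, token)


print(folds_consistently(identity_stem, "ab"))          # True
print(folds_consistently(lowercase_only_stem, "aids"))  # False
```

The second stemmer fails the check because "aids" stems to "aid" while "AIDS", "Aids" and the other variants are left alone, so lower-casing after stemming yields both "aid" and "aids".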
Comment 5 Mary Holstege 2011-05-17 15:29:47 UTC
Made the additions to english-stems.txt.
Comment 6 Tim Mills 2011-06-07 07:43:05 UTC
english-stems.txt is still missing

Test Tests Testing Tested

which is required for the following tests.

ftstaticcontext-q2
ftstaticcontext-q4
ftstaticcontext-q5
stemming-queries-results-q1
stemming-queries-results-q1b
xquery-xpath-composability-queries-results-q2
Comment 7 Mary Holstege 2011-06-28 04:14:41 UTC
Done, per comment 6.
Comment 8 Tim Mills 2011-06-28 07:28:14 UTC
Confirmed fixed.  Thanks.