This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 6667 - [FT] Stemming
Summary: [FT] Stemming
Status: CLOSED FIXED
Alias: None
Product: XPath / XQuery / XSLT
Classification: Unclassified
Component: Full Text 1.0
Version: Candidate Recommendation
Hardware: All All
Importance: P2 normal
Target Milestone: ---
Assignee: Jim Melton
QA Contact: Mailing list for public feedback on specs from XSL and XML Query WGs
URL: http://basex.org
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-03-09 14:23 UTC by Christian Gruen
Modified: 2009-03-30 18:51 UTC

See Also:


Description Christian Gruen 2009-03-09 14:23:43 UTC
Hi,

I have another one, concerning the two stemming queries ft-3.4.7-examples-q1.xq and ft-3.4.7-examples-q3.xq:

  ... "propagation of errors" with stemming ...

The term "propagate" is not contained in the specified stemming dictionary, but if a normal Porter Stemmer is applied, the query will yield results.

Thanks,
Christian, BaseX Team 
http://www.basex.org
Comment 1 Pat Case 2009-03-13 11:53:06 UTC
Hi Christian,

In the end, whether "propagate" stems or not in these queries is not important for conveying their meaning. The queries and the expected results are the same. Q1 is true because the word "of" is a stop word. Q3 fails because no stop words are used and "of" is treated as an ordinary word.

But I take your point and have added "propagate", "propagating", and "propagation" to the stemming file. If a stemming operator is in the query, it is nice to have one of the query words in the stemming dictionary.

Please realize, though, that the stemming dictionary is meant to be brief and practical. It does not attempt to enable stemming on every stemmable word in the source documents.
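To make the coverage point concrete, here is a minimal Python sketch of dictionary-based stemming. The dictionary contents and the helper names `stem` and `stems_match` are assumptions for illustration only; they are not the format of the test suite's stemming file or the behavior of any particular implementation.

```python
# Hypothetical stemming dictionary: each listed word form maps to a shared stem.
# The three entries mirror the forms mentioned above; the real file is not shown here.
STEM_DICT = {
    "propagate": "propagate",
    "propagating": "propagate",
    "propagation": "propagate",
}


def stem(word, dictionary):
    """Return the dictionary stem, or the lowercased word itself if it is not listed."""
    w = word.lower()
    return dictionary.get(w, w)


def stems_match(a, b, dictionary):
    """Two words match under stemming when they reduce to the same stem."""
    return stem(a, dictionary) == stem(b, dictionary)
```

With the entries above, `stems_match("propagation", "propagating", STEM_DICT)` is true, while the same call with an empty dictionary is false. A word that is not listed only matches itself, which is why a brief dictionary and an algorithmic stemmer such as Porter's can disagree on the same query.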

If this result is acceptable to you, please close this bug.

Pat Case



Comment 2 Christian Gruen 2009-03-13 14:22:21 UTC
Thank you Pat,

just to be sure about query "ft-3.4.7-examples-q2.xq", which performs the following test:

[1] 'propagating few errors' ftcontains "propagation of errors"
     with stemming with stop words ("a", "the", "of")

If the terms "propagation" and "propagating" are not stemmed, this query should be equivalent to the following one:

[2] 'propagating few errors' ftcontains "propagation of errors"
     with stop words ("a", "the", "of")

The text token "few" will be ignored due to stopword removal, but I would still expect the query to yield "false". Do you agree?

In contrast, the following query should yield true:

[3] 'propagate few errors' ftcontains "propagate of errors"
     with stop words ("a", "the", "of")
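The expectations for [2] and [3] can be checked with a small Python sketch. It assumes one reading of the stop word semantics discussed here: a stop word in the query phrase matches any single document token at that position, and all other tokens must be equal. The whitespace tokenizer and the function name `ft_contains_phrase` are hypothetical simplifications, not part of the specification.

```python
def tokenize(text):
    """Naive tokenizer for illustration: lowercase and split on whitespace."""
    return text.lower().split()


def ft_contains_phrase(document, phrase, stop_words=(), stem=lambda w: w):
    """Phrase match in which query stop words act as one-token placeholders.

    `stem` is an optional normalizer applied to document tokens and to
    non-stop-word query tokens; the default leaves tokens unchanged.
    """
    doc = [stem(t) for t in tokenize(document)]
    query = tokenize(phrase)
    stops = {s.lower() for s in stop_words}
    for start in range(len(doc) - len(query) + 1):
        if all(q in stops or stem(q) == doc[start + i]
               for i, q in enumerate(query)):
            return True
    return False


STOPS = ("a", "the", "of")
result_2 = ft_contains_phrase("propagating few errors", "propagation of errors", STOPS)
result_3 = ft_contains_phrase("propagate few errors", "propagate of errors", STOPS)
```

Under these assumptions `result_2` is false, because "propagation" and "propagating" differ without stemming, and `result_3` is true, because "of" is a stop word and may stand for "few". Passing a `stem` function that maps both forms to one stem would make the comparison in [1] succeed.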


In my opinion, this query should be simplified in the specification, as the stemming option does not really help to explain the asymmetry in the stop word semantics. What do you think about the following version?

_________________________________

3.4.7 Stop Word Option

[...]

The following expression returns true, because the document contains the phrase "propagating few errors":

/books/book[@number="1"]//p ftcontains "propagating of errors"
  with stop words ("a", "the", "of") 

Note the asymmetry in the stop word semantics: the property of being a stop word is only relevant to query terms, not to document terms. Hence, it is irrelevant for the above-mentioned match whether "few" is a stop word or not, and on the other hand we do not want the query above to match "propagating" followed by 2 stop words, or even a sequence of 3 stop words in the document.
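The asymmetry can be illustrated with a short Python sketch. It is only one possible reading of the note: query stop words match any one document token, and the `symmetric` flag switches on a hypothetical variant in which document stop words would also match any query token. The function name `matches` and the flag are assumptions made for this illustration.

```python
def matches(doc_tokens, query_tokens, stops, symmetric=False):
    """Phrase match where query stop words match any one document token.

    With symmetric=True (a hypothetical variant, not the proposed semantics),
    a document token that is a stop word also matches any query token.
    """
    n = len(query_tokens)
    for start in range(len(doc_tokens) - n + 1):
        ok = True
        for i, q in enumerate(query_tokens):
            d = doc_tokens[start + i]
            if q in stops:
                continue  # query stop word: any document token is accepted
            if symmetric and d in stops:
                continue  # only in the hypothetical symmetric variant
            if q != d:
                ok = False
                break
        if ok:
            return True
    return False


STOPS = {"a", "the", "of"}
QUERY = "propagating of errors".split()
```

With the default asymmetric behavior, `matches("propagating few errors".split(), QUERY, STOPS)` is true, while the documents "propagating of the" and "a the of" do not match. With `symmetric=True` both of those documents would match, which is the outcome the note rules out.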

[...]
_________________________________


I hope I didn't get it completely wrong; sorry for wasting your time otherwise.

Christian, BaseX Team 
http://www.basex.org


Comment 3 Pat Case 2009-03-30 18:19:15 UTC
Yes. I think we agree on the intent.

And yes. I agree that "with stemming" is not needed in the stop word examples. I have removed it from ft-3.4.7-examples-q1.xq and ft-3.4.7-examples-q3.xq in the test suite and in the language document.

If this is OK with you, please mark this bug closed.

Pat Case
Comment 4 Christian Gruen 2009-03-30 18:51:54 UTC
Thanks! Christian