This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 6195 - [FT] TestSuite - StopWords & Thesaurus
Summary: [FT] TestSuite - StopWords & Thesaurus
Status: CLOSED FIXED
Alias: None
Product: XPath / XQuery / XSLT
Classification: Unclassified
Component: Full Text 1.0
Version: Candidate Recommendation
Hardware: All All
Importance: P2 normal
Target Milestone: ---
Assignee: Jim Melton
QA Contact: Mailing list for public feedback on specs from XSL and XML Query WGs
URL: http://basex.org/
Reported: 2008-10-31 07:06 UTC by Christian Gruen
Modified: 2009-05-02 12:00 UTC

Description Christian Gruen 2008-10-31 07:06:11 UTC
Hi,

I would be interested in how the StopWords and Thesaurus options are to be handled in the XQFT TestSuite. Currently, I do not see how to resolve the URLs "http://bstore1.example.com/...." in the existing XQuery files.

One solution might be to use a relative/local path here, but I do not know if this would be supported by all implementations.

Christian, BaseX Team 
http://www.basex.org
Comment 1 Mary Holstege 2008-11-24 20:51:17 UTC
The WG discussed this issue and agreed we need to augment the 
testsuite.  Please note that we have not yet completely implemented
the use of this new system throughout the testsuite. If you are satisfied
with this resolution, please mark the bug as closed.

Please note the following addition to the instructions:
<quote>
Special Sources: Stop Word List, Thesaurus, and Stemming Dictionary
The stopwords, thesaurus, and stemming-dictionary sources are not intended  
to be used directly in the form in which they are given, but to provide  
information to those running the test suite about the expectations a  
particular test has about various implementation-specific aspects of the  
execution context. Implementations are expected to provide equivalent  
information to the query, but in whatever form is appropriate in their  
context. A stopwords source is a plain text file containing a list of stop  
words, one per line. When a query references this stop word list, the  
implementation is expected to provide that list of stop words to the  
query. A thesaurus source is an XML document defined against the  
thesaurus.xsd XML Schema. When a query references this thesaurus, the  
implementation is expected to provide equivalent thesaurus information to  
the query. The stemming-dictionary is a plain text file containing lines  
of whitespace-separated tokens. Each token on the line should stem to the  
first token on the line. When the catalog entry for a query references a  
stemming dictionary, the implementation is expected to provide stemming  
equivalent to the rules given in the stemming dictionary.
</quote>

The basic idea is that there are three new kinds of sources:
a stop word list, which is just a text file with one stop word per line;
a thesaurus, which is an XML file as per the schema; and a stemming
dictionary, which is a text file with one stem and its variants per line.

The catalog descriptions for stop word lists and thesauri include a URI
that matches up with the one in the query.  This is similar to the
handling of schemas.  The stemming dictionary has no URI: it is the resource
ID that matters and it is used to define the relevant stem equivalents
when it makes a difference for stemmed search.
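
As a rough illustration of the two plain-text formats, the following Python
sketch reads a stop word list and a stemming dictionary into lookup
structures. The function names are hypothetical and the sample data simply
repeats the example files given in this comment; a real implementation would
supply the equivalent information in whatever form suits its engine.

```python
def parse_stopwords(text):
    """Return the set of stop words, one per non-empty line."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def parse_stemming_dictionary(text):
    """Map every token on a line to the first token on that line (its stem)."""
    stems = {}
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        for token in tokens:
            stems[token] = tokens[0]
    return stems


# Sample data copied from the example files in this comment.
STOPWORDS_TXT = "and\nthe\nthen\nit\nof\nin\n"
STEMS_TXT = (
    "improve improves improving improved\n"
    "dog dogs\n"
    "cat cats\n"
    "train trains training trained\n"
    "error errors\n"
)

stopwords = parse_stopwords(STOPWORDS_TXT)
stems = parse_stemming_dictionary(STEMS_TXT)
```

With these structures, two tokens count as equivalent under stemming when
they map to the same first token, for example "trained" and "trains".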

** Changes to XQFTTSCatalog.xsd/xml:

Add three new kinds of source roles: stopwords, thesaurus, and  
stemming-dictionary, and corresponding elements in the sources part of 
the catalog. Add an aux-URI element to the test-case itself.

Queries that use a URI for a stop words list should have an aux-URI with
role="stopwords"; queries that use a URI for a thesaurus should have an
aux-URI with role="thesaurus".  Queries that rely on particular stemming
behaviour should have an aux-URI with role="stemming-dictionary".
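
A test harness could match each aux-URI against the sources part of the
catalog roughly as sketched below in Python. This is only a sketch under
assumptions: the wrapper element names "catalog" and "sources" and the
function name are hypothetical, the fragment omits any XML namespace the
real catalog may declare, and it assumes the role value equals the name of
the source element, as in the examples in this comment.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified catalog fragment (no namespace, invented wrappers).
CATALOG = """
<catalog>
  <sources>
    <stopwords ID="stopwords1"
               uri="http://bstore1.example.com/StopWordList.xml"
               FileName="stopwords.txt"/>
    <stemming-dictionary ID="english-stems" FileName="english-stems.txt"/>
  </sources>
  <test-case name="stopwords-1">
    <aux-URI role="stopwords">stopwords1</aux-URI>
  </test-case>
</catalog>
"""


def resolve_aux_sources(catalog_xml, test_name):
    """Return (role, uri, file name) for every aux-URI of the named test case."""
    root = ET.fromstring(catalog_xml)
    # Index the sources by (element name, ID); the element name doubles as role.
    sources = {(el.tag, el.get("ID")): el for el in root.find("sources")}
    test = next(tc for tc in root.iter("test-case")
                if tc.get("name") == test_name)
    resolved = []
    for aux in test.findall("aux-URI"):
        role = aux.get("role")
        source = sources[(role, aux.text.strip())]
        # uri is None for stemming dictionaries, which have no URI.
        resolved.append((role, source.get("uri"), source.get("FileName")))
    return resolved


result = resolve_aux_sources(CATALOG, "stopwords-1")
```

The harness would then hand the contents of the resolved file to the
implementation in whatever form it accepts, keyed by the URI that appears in
the query.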

** Examples:

* Stop words:
TestSources/stopwords.txt:
and
the
then
it
of
in

Catalog description:
     <stopwords ID="stopwords1"
        uri="http://bstore1.example.com/StopWordList.xml"
        FileName="stopwords.txt"
        Creator="Full-Text Task Force">
       <description last-mod="2008-11-10">Stop word list for use cases</description>
     </stopwords>

Query description using stopwords
(with stop words at "http://bstore1.example.com/StopWordList.xml"):
     <test-case is-XPath2="true" name="stopwords-1"
        FilePath="Expressions/Operators/CompExpr/FTContainsExpr/FTSelection/MatchOptions/FTStopWord/"
        scenario="standard" Creator="Full-Text Task Force">
       <description>Example using stop words</description>
       <spec-citation spec="XQueryFullText" section-number="3.4.7"
          section-title="Stop Word Option" section-pointer="ftstopwordoption"/>
       <query name="stopword-1" date="2008-11-10"/>
       <aux-URI role="stopwords">stopwords1</aux-URI>
       <input-file role="principal-data" variable="input-context">ftusecases</input-file>
       <output-file role="principal" compare="XML">stopwords-1.xml</output-file>
     </test-case>

* Thesaurus: (Schema is TestSources/thesaurus.xsd)

TestSources/soundex.xml:
<thesaurus xmlns="http://www.w3.org/xqftts/thesarus">
   <entry>
     <term>Marigold</term>
     <synonym>
       <term>Merrygould</term>
       <relationship>sounds like</relationship>
     </synonym>
   </entry>
</thesaurus>
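
To show how such a thesaurus file might be turned into lookup data, here is
a minimal Python sketch. It assumes only the element structure visible in
the sample above (entry, term, synonym, relationship) and copies the
namespace URI verbatim from it; the function name is hypothetical, and the
authoritative structure is whatever thesaurus.xsd defines.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied verbatim from the sample thesaurus document.
NS = {"t": "http://www.w3.org/xqftts/thesarus"}

THESAURUS_XML = """<thesaurus xmlns="http://www.w3.org/xqftts/thesarus">
   <entry>
     <term>Marigold</term>
     <synonym>
       <term>Merrygould</term>
       <relationship>sounds like</relationship>
     </synonym>
   </entry>
</thesaurus>"""


def load_thesaurus(xml_text):
    """Map each entry term to a list of (relationship, related term) pairs."""
    root = ET.fromstring(xml_text)
    table = {}
    for entry in root.findall("t:entry", NS):
        term = entry.findtext("t:term", namespaces=NS)
        for synonym in entry.findall("t:synonym", NS):
            relationship = synonym.findtext("t:relationship", namespaces=NS)
            related = synonym.findtext("t:term", namespaces=NS)
            table.setdefault(term, []).append((relationship, related))
    return table


thesaurus = load_thesaurus(THESAURUS_XML)
```

A query that names the relationship "sounds like" would then expand
"Marigold" to include "Merrygould".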

Catalog description:
     <thesaurus ID="soundex"
        uri="http://bstore1.example.com/UsabilitySoundex.xml"
        FileName="soundex.xml"
        Creator="Full-Text Task Force">
       <description last-mod="2008-11-10">Soundex thesaurus for examples</description>
     </thesaurus>

Query using thesaurus:
(with thesaurus at "http://bstore1.example.com/UsabilitySoundex.xml"):
     <test-case is-XPath2="true" name="thesaurus-1"
        FilePath="Expressions/Operators/CompExpr/FTContainsExpr/FTSelection/MatchOptions/FTThesaurus/"
        scenario="standard" Creator="Full-Text Task Force">
       <description>Example using a thesaurus</description>
       <spec-citation spec="XQueryFullText" section-number="3.4.3"
          section-title="Thesaurus Option" section-pointer="ftthesaurusoption"/>
       <query name="thesaurus-1" date="2008-11-10"/>
       <aux-URI role="thesaurus">soundex</aux-URI>
       <input-file role="principal-data" variable="input-context">ftusecases</input-file>
       <output-file role="principal" compare="XML">thesaurus-1.xml</output-file>
     </test-case>

* Stemming
TestSources/english-stems.txt
improve improves improving improved
dog dogs
cat cats
train trains training trained
error errors

Catalog description:
     <stemming-dictionary ID="english-stems" FileName="english-stems.txt"
        Creator="Full-Text Task Force">
       <description last-mod="2008-11-10">English stems</description>
     </stemming-dictionary>

Query using stemming:
     <test-case is-XPath2="true" name="stemming-1"
        FilePath="Expressions/Operators/CompExpr/FTContainsExpr/FTSelection/MatchOptions/FTStemming/"
        scenario="standard" Creator="Full-Text Task Force">
       <description>Example using stemming</description>
       <spec-citation spec="XQueryFullText" section-number="3.4.4"
          section-title="Stemming Option" section-pointer="ftstemoption"/>
       <query name="stemming-1" date="2008-11-10"/>
       <aux-URI role="stemming-dictionary">english-stems</aux-URI>
       <input-file role="principal-data" variable="input-context">ftusecases</input-file>
       <output-file role="principal" compare="XML">stemming-1.xml</output-file>
     </test-case>


Comment 2 Christian Gruen 2008-11-24 21:38:43 UTC
Mary,

thank you for the detailed presentation and discussion of the proposed test suite extensions. As far as I can judge, the solution for stop words and thesauri should fully meet expectations. I'm wondering, however, whether the stemming options should be defined in the test suite.

In contrast to the stop words and thesaurus options, the currently available version of the XQFT specification does not allow a specific stemming file to be specified:

  'x' ftcontains 'y' with stop words at 'z'
  'x' ftcontains 'y' with thesaurus at "z"
  'x' ftcontains 'y' with stemming ...?

This is of course what you are stating (the resource ID is what we are interested in), but, as the specification states that "it is implementation-defined whether the stemming is based on an algorithm, dictionary, or mixed approach", I would prefer to regard the stemming dictionary as an optional choice.

However, these are details; I'm glad to see some more issues solved.

Christian, BaseX Team 
http://www.basex.org