12109 – [FT] StopWord Option

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Bug 12109 - [FT] StopWord Option

Summary: [FT] StopWord Option

Status:	REOPENED

Alias:	None

Product:	XPath / XQuery / XSLT
Classification:	Unclassified
Component:	Full Text 1.0
Version:	Proposed Recommendation
Hardware:	PC Windows NT

Importance:	P2 normal
Target Milestone:	---
Assignee:	Jim Melton
QA Contact:	Mailing list for public feedback on specs from XSL and XML Query WGs

Reported:	2011-02-17 14:27 UTC by Tim Mills
Modified:	2011-09-22 09:04 UTC
CC List:	1 user

Description Tim Mills 2011-02-17 14:27:01 UTC

There is very little information given regarding how stop words work except as part of phrases or in the context of FTWindow/FTDistance.

In information retrieval systems based upon inverted indices, it is traditional to use stop words to remove high frequency terms from the index to reduce the size of the inverted index.  It is also traditional to ignore stop words during query processing to improve query performance (both speed and precision).
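A minimal sketch of that traditional approach, assuming a naive whitespace tokenizer and a hypothetical stop word list. The names `build_index` and `search` are illustrative only and do not come from the XQuery Full Text specification or any particular product.

```python
from collections import defaultdict

# Hypothetical stop word list, chosen only for illustration.
STOP_WORDS = {"be", "not", "or", "to", "the"}

def tokenize(text):
    """Lower-case the text and split on whitespace (deliberately naive)."""
    return text.lower().split()

def build_index(docs, stop_words=frozenset()):
    """Map each non-stop-word token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            if token not in stop_words:
                index[token].add(doc_id)
    return dict(index)

def search(index, query, stop_words=frozenset()):
    """Return ids of documents containing every non-stop-word query token."""
    terms = [t for t in tokenize(query) if t not in stop_words]
    if not terms:
        return set()  # a query made only of stop words matches nothing
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings)

docs = {1: "to be or not to be", 2: "the question to answer"}
full = build_index(docs)                  # no stop words removed
stripped = build_index(docs, STOP_WORDS)  # stop words removed at index time
```

In this sketch the stripped index holds entries only for "question" and "answer", so document 1, which consists entirely of stop words, leaves no trace in it.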

The text:

"Some implementations may apply stop word lists during indexing and be unable to comply with query-time requests to not apply those stop words."

implies that XQuery Full Text is amenable to the approach of inverted indices with stop words stripped at index time.

Consider the query:

declare ft-option using stop words ("be", "not", "or", "to");

"to be or not to be" contains text "to"

According to the specification

"Stop words are tokens in the query that match any token in the text being searched"

This seems to suggest that the result should be identical to

"to be or not to be" contains text ".+" using wildcards

Since "to be or not to be" is entirely composed of stop words, any application of stop word lists during indexing means that it contains no tokens and thus the result would be "false" rather than "true".

Comment 1 Mary Holstege 2011-03-01 18:18:19 UTC

The WG discussed this at the F2F 2011-02-28.  Since stop word handling in the content is a property of the tokenization and that depends on the implementation, there are reasonable implementations that return "true" or "false" for this query, depending on whether they tokenize by removing stop words entirely or by replacing them with a "wildcard" token of some sort.
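The two tokenization strategies mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the behaviour of any specific implementation: the whitespace tokenizer, the `WILDCARD` placeholder, and all function names are hypothetical.

```python
# Hypothetical placeholder meaning "a stop word occurred at this position".
WILDCARD = object()

def tokenize_removing(text, stop_words):
    """Strategy 1: drop stop words from the content entirely."""
    return [t for t in text.lower().split() if t not in stop_words]

def tokenize_replacing(text, stop_words):
    """Strategy 2: keep a wildcard token where each stop word occurred."""
    return [WILDCARD if t in stop_words else t for t in text.lower().split()]

def stop_word_query_matches(content_tokens):
    """A query that is a single stop word matches any token in the content,
    so it succeeds exactly when the content has at least one token left."""
    return len(content_tokens) > 0

stop_words = {"be", "not", "or", "to"}
text = "to be or not to be"

removed = tokenize_removing(text, stop_words)    # no tokens survive
replaced = tokenize_replacing(text, stop_words)  # six wildcard tokens

result_when_removed = stop_word_query_matches(removed)
result_when_replaced = stop_word_query_matches(replaced)
```

Under these assumptions the same query yields false with the first strategy and true with the second, which is the divergence described in this comment.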

If you are satisfied with this resolution, please mark the bug as CLOSED.

Comment 2 Tim Mills 2011-04-04 12:38:27 UTC

While I'm happy with the response, I still think the text should make this explicit.

There should be at least one example using stopwords not sandwiched between non-stopwords.

Our implementation ignores stopwords at query time (i.e. they are handled in our implementation of matchTokenInfos).  Where S is a stopword and T is a non-stopword, we treat:

S+ as a query matching no results.
S* T+ S* as if the query were T+ (i.e. strip head and tail stopwords)
T+ S+ T+ as if the query were T+ .+ (for each S) T+

This interpretation allows us to pass the XQFTTS tests.
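The three rewriting rules listed above can be sketched as a single function. This is a hedged reconstruction, not the actual implementation referred to in this comment: the name `rewrite_query`, the `ANY` marker, and the handling of arbitrary mixtures of stopwords and non-stopwords are assumptions made for illustration.

```python
ANY = ".+"  # stands for "exactly one arbitrary token", as in the third rule

def rewrite_query(tokens, stop_words):
    """Rewrite a phrase query given as a list of tokens.

    Returns None when the query can match nothing, otherwise the
    rewritten token list.
    """
    is_stop = [t in stop_words for t in tokens]
    if all(is_stop):
        # Rule 1: S+ matches no results (an empty query is treated the same).
        return None
    first = is_stop.index(False)                         # first non-stopword
    last = len(tokens) - 1 - is_stop[::-1].index(False)  # last non-stopword
    # Rule 2: head and tail stopwords are stripped by slicing to first..last.
    # Rule 3: each stopword between non-stopwords becomes one ANY token.
    return [ANY if is_stop[i] else tokens[i] for i in range(first, last + 1)]

stop = {"to", "be", "or", "not", "the", "of"}
only_stops = rewrite_query(["to", "be"], stop)
head_tail = rewrite_query(["the", "king", "of"], stop)
interior = rewrite_query(["king", "of", "the", "hill"], stop)
```

Here `only_stops` is `None`, `head_tail` is `["king"]`, and `interior` keeps one `".+"` per interior stopword so that token positions are preserved for phrase matching.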

Comment 3 Tim Mills 2011-09-22 09:04:46 UTC

By way of further explanation, the following is motivated by the conviction that XQuery Full Text should be able to be implemented using a traditional full text inverted index.

The examples given below have the form

$arg contains text "STOPWORD" using stop words ("STOPWORD") 

but can be interpreted as:

count(fts:tokenize($arg)) >= 1

where fts:tokenize is an implementation-defined function which returns a sequence of element(TokenInfo) resulting from tokenization of its arguments.
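The claimed equivalence can be checked with a small model. This is only a sketch: `tokenize` stands in for the implementation-defined fts:tokenize and is assumed here to split on whitespace, and both function names are hypothetical.

```python
def tokenize(text):
    """Stand-in for the implementation-defined fts:tokenize (whitespace split)."""
    return text.split()

def query_token_matches(content_token):
    """Under the quoted wording, a stop word in the query matches any token."""
    return True

def contains_single_stop_word(arg):
    """Models: $arg contains text "S" using stop words ("S")."""
    return any(query_token_matches(t) for t in tokenize(arg))

def has_at_least_one_token(arg):
    """Models: count(fts:tokenize($arg)) >= 1."""
    return len(tokenize(arg)) >= 1

samples = ["", "   ", "not", "strange", "to be or not to be"]
agreement = [contains_single_stop_word(s) == has_at_least_one_token(s)
             for s in samples]
```

For every sample the two formulations agree, which illustrates that the query says nothing about the stop word itself and only tests whether the content has any tokens.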

This illustrates how the stop word feature can be abused to do things which have little to do with stop word handling.  I'd argue this is a Bad Thing, and it results from the specification saying that "Stop words are tokens in the query that match any token in the text being searched."  I'd argue that this is quite different from how stop words are generally used in Information Retrieval, where stop words are typically discarded (either at query time or index time).


EXAMPLE 1
---------

It is implementation defined whether the following expression will return true or false.

"not" contains text "and" using stop words ("and", "not") 

EXPLANATION
-----------

Tokenization of "not" using the tokenization rules used for examples will result in a single token ("not").

The expression will return true if the implementation matches the stop word 'and' against the single token "not".  (Performing such a match with an inverted index is inefficient.)

The expression will return false if the implementation has discarded stop words during indexing (as permitted by the note in section 3.4.7).


EXAMPLE 2
---------

I can't find an argument in the specification to support returning false for the following query.

"strange" contains text "and" using stop words ("and") 

However, I believe this to be a problem with the specification.  It is quite normal for an IR system to return false for such a query.

EXPLANATION
-----------

Tokenization of "strange" using the tokenization rules used for examples will result in a single token ("strange").

The expression will return true if the implementation matches the stopword 'and' against the single token "strange".