6469 – [FT] TestSuite issues

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Status:	CLOSED FIXED

Alias:	None

Product:	XPath / XQuery / XSLT
Classification:	Unclassified
Component:	Full Text 1.0
Version:	Candidate Recommendation
Hardware:	All All

Importance:	P2 normal
Target Milestone:	---
Assignee:	Jim Melton
QA Contact:	Mailing list for public feedback on specs from XSL and XML Query WGs

URL:	http://basex.org
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2009-01-25 04:06 UTC by Christian Gruen
Modified:	2009-02-26 21:24 UTC
CC List:	2 users

See Also:

Description Christian Gruen 2009-01-25 04:06:41 UTC

Hi Jim, hi all,

In the following you will find some new comments on the XQFT test suite (updated on Jan 24). Once more, I'm sorry that I didn't have time to put each bug in a separate thread, but feel free to split the list. Here we go.


[1] ft-3.2-examples-q5.xq:

As the scoring algorithm is implementation-defined, results will vary. A scoring threshold lower than 0.8 might cover more implementations; alternatively, queries with scoring could be kept to a minimum as long as no reference scoring model is offered (otherwise these test queries won't make much sense).


[2] ft-3.3-examples-q1.xq:

Attribute value should be atomized to conform with the proposed result, e.g. like this:

old: //book[. ftcontains "usability" occurs at least 2 times]/@number
new: data(//book[. ftcontains "usability" occurs at least 2 times]/@number)


[3] ft-3.4-examples-q1.xq:

Query yields a boolean, whereas the result contains elements


[4] ftstaticcontext-results-q1.txt:

Result file is incorrect (superfluous: line 377-end)


[5] ftstaticcontext-results-q2.txt:

Result file is incorrect (superfluous: line 280-end)


[6] ftstaticcontext-results-q6.txt:

All books expected as result ('approved'/'approve' -> 'approv')


[7] FTPrimary-FTWords-any-q4b.xq:

0 results expected (<paragraphs/>), as the matching elements don't contain any of the phrases ("FTAnyallOption weekend" and "voting specifies").


[8] FTPrimary-FTWords-anyword-q4b.xq

9 results expected inside the <paragraphs> element (XQFT specs, 3.2: "[...] the tokens from all the strings are combined into a single set. [...]").


[9] FTPrimary-FTWords-anyword-q2b_result.xml

Space character missing at line 9, col. 206


[10] FTWords/*.xml

Numerous single spaces are missing in the result files (alternatively, the input document should be modified).


[11] FTPrimary-FTWords-phrase-q3a.xq - -q4a.xq

Referring to the specs ("the tokens from all the strings are concatenated in a single sequence, which is considered as a phrase"), I wouldn't expect results for "{ "FTAnyallOption", "containment" } phrase". Instead, "{ "how", "containment" } phrase" would yield a result.


[12] FTOr-badexpr1.xq

I was wondering why "FTOr with empty sequence" is not allowed. Where is this stated in the specification?


[13] FTNot-q1.xq

This one returns all <book> elements in which the <para> element does not contain the word "Ninja". In the result, even the books are listed which have no <para> element; I would only expect 4 instead of 9 titles as result.


[15] FTNot-q2.xq / -q3.xq / -q4.xq / -q5.xq

"<title>No Bad Software</title>" is missing in the result


[16] FTMildNot-or1.xq

"<title>Ninja Coder</title>" is missing as its <para> element contains the word "usability"


[16] FTMildNot-or2.xq

Here I would also expect "<title>Ninja Coder</title>" in the result as ("usability" not in "ninja coder") yields true. If I should be wrong, please tell me why..


[17] FTMildNot-not1.xq

I'd propose "<title>No Bad Software</title>" and "<title>Ninja Coder</title>" as result; the proposed result would be correct without "ftnot" in the query.


[18] ftwildcard-q2.xq

<book number="2"> is missing


[19] ftwildcard-q5.xq

<book number="1"> expected as result (as it contains the word "task")


[20] FTOrder-q3.xq / FTOrder-q4.xq

should be swapped, otherwise results are wrong


[21] FTWindow-words1.xq

I wouldn't expect results here, as "physical" ftand "swift" have two words in between: "swift application of physical". So isn't this a window of 4 (instead of 2) words?


[22] FTWindow-complexwords2.xq

Why does "window 0 words" return results? Same for other queries in this section: can a window of 1 word really contain two words?


[23] FTScope-q2.xq / FTScope-q4.xq

If I got it right, each <p> element will be evaluated separately in this query, yielding 0 results:

  "This is a simple example." ftcontains ("simple" ftand "complex") different sentence -> false
  "It is not complex." ftcontains ("simple" ftand "complex") different sentence -> false

But the following query would in fact return true (assuming that "." is an implementation-defined sentence delimiter):

  <p>This is a simple example. It is not complex.</p>
      ftcontains ("simple" ftand "complex") different sentence


[24] FTContent-q2.xq

...will return all books except for <title>The Blues</title> (otherwise, ftnot must be removed)


[25] FTContent-and1.xq

0 results expected, as the query covers only the first and last tokens, but not all the tokens that occur: "The secret" ftand "nice" entire content


[26] FTNot-unconstrained-q1.xq

Only 4 results expected (see [14])


[27] FTNot-unconstrained-q3.xq / -q5.xq

<title>No Bad Software</title> missing in result.


[28] FTSelection-FTTimes-...

xlink attributes and namespaces missing (<nt xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="simple"...>)


[29] ft-3.4.5-examples-q1.xq

true expected (or lowercase option missing)
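
The scope argument in item [23] can be sketched in Python. This is only a model under stated assumptions: "." is taken as the sentence delimiter (the spec leaves this implementation-defined), matching is case-insensitive, and the helper names `sentences` and `in_different_sentences` are hypothetical, not part of any XQFT implementation.

```python
import re

def sentences(text):
    # Assumption: "." is the only sentence delimiter.
    return [part.strip() for part in text.split(".") if part.strip()]

def in_different_sentences(text, word_a, word_b):
    # Collect the sentence indexes in which each word occurs as a token.
    def hits(word):
        return {i for i, s in enumerate(sentences(text))
                if word in re.findall(r"\w+", s.lower())}
    # True if some occurrence of word_a and some occurrence of word_b
    # fall into two different sentences of the same search context.
    return any(a != b for a in hits(word_a) for b in hits(word_b))

# Each <p> evaluated as its own search context:
print(in_different_sentences("This is a simple example.", "simple", "complex"))
print(in_different_sentences("It is not complex.", "simple", "complex"))
# Both sentences inside a single <p>:
print(in_different_sentences(
    "This is a simple example. It is not complex.", "simple", "complex"))
```

Under these assumptions only the last call can succeed, because the scope filter needs both words inside one search context.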


This time, I've looked at only ~25% of the tests; it seems that approx. one third of them are somehow buggy. IMHO, debugging would be much faster if the queries and results were shorter and focused on the actual problem. In many cases, simple string tests (à la 'A' ftcontains 'A', etc.) could suffice to check potential implementation bugs. Anyway, enough smart suggestions.


Thanks for your efforts!

Christian, BaseX Team 
http://www.basex.org

Comment 1 Pat Case 2009-02-04 21:18:35 UTC

Christian,

I can address a few of these:

[3] ft-3.4-examples-q1.xq:

Query yields a boolean, whereas the result contains elements

--Yes. Fixed, this one and one that follows. The results now contain true.

[4] ftstaticcontext-results-q1.txt:

Result file is incorrect (superfluous: line 377-end)

--Removed. Static context queries will be replaced.

[5] ftstaticcontext-results-q2.txt:

Result file is incorrect (superfluous: line 280-end)

--Removed. Static context queries will be replaced.

[6] ftstaticcontext-results-q6.txt:

All books expected as result ('approved'/'approve' -> 'approv')

--Removed. Static context queries will be replaced.

[18] ftwildcard-q2.xq

<book number="2"> is missing

--This query looks for "site.", that is, "site" followed by a period (not an indicator). There is a "sites." The title ends in "site" but does not end in a period. I don't see "site." in Book 2.

--The language document says: The "without wildcards" option finds tokens without recognizing wildcard indicators and qualifiers. Periods, question marks, asterisks, plus signs, and two numbers separated by a comma, both enclosed by curly braces, are always recognized as ordinary text characters.

--Might you have been thinking the period was an indicator, or am I missing your point?

[19] ftwildcard-q5.xq

<book number="1"> expected as result (as it contains the word "task")

--I was clearly confused about what word I was searching on, so I changed the word in the query to "site.?" to match the description. That word is not in any book for the same reasons described above, unless again I am missing your point.

[29] ft-3.4.5-examples-q1.xq

true expected (or lowercase option missing)

--Yes, indeed the lowercase option was missing. I added it. 

As always thanks to you Christian.

Pat

Comment 2 Christian Gruen 2009-02-04 21:48:08 UTC

Dear Pat,

thanks a lot for the prompt reply! As usual, just a small comment..

[18] ftwildcard-q2.xq
[19] ftwildcard-q5.xq

> --This query looks for "site." site followed a period (not an indicator). 
> There is a sites. The title ends in site but does not end in a period. I 
> don't see "site." in Book 2. [...]

This is an interesting point for discussion. I had another look into the XQFT Tokenization section (4.1). If I get it right, the tokenizer won't care about characters which are not part of tokens; so I would expect the two following queries to return true:

  'a b' ftcontains 'a.b' 
  'a.b' ftcontains 'a b' 

This is why I would expect the dot in "site." to be ignored in the default (without wildcards) mode. - Please tell me if I got something wrong.

Thanks again,

Christian, BaseX Team 
http://www.basex.org

Comment 3 Jim Melton 2009-02-06 18:46:15 UTC

Christian, I think I disagree with your argument in comment 2 (http://www.w3.org/Bugs/Public/show_bug.cgi?id=6469#c2).  There, you said: 

>This is an interesting point for discussion. I had another look into the XQFT
>Tokenization section (4.1). If I get it right, the tokenizer won't care about
>characters which are not part of tokens; so I would expect the two following
>queries to return true:

>  'a b' ftcontains 'a.b' 
>  'a.b' ftcontains 'a b' 

I have just finished re-reading all of section 4.1 very carefully and I didn't find any suggestion that punctuation was never (part of) a token.  In the first of your two examples (quoted above), the first search context 'a b' has, I believe, two tokens, 'a' and 'b'. There, I think we are in agreement.  However, the search pattern 'a.b' has, I think, three tokens, 'a', '.', and 'b'.  (Caution: tokenization is implementation-defined and I'm presuming a tokenizer with certain behaviors that I believe are common in western-world language products.)  Because there is no token corresponding to '.' in the search context, I believe that query must return false. 

Similarly, in the second example, the search context has three tokens and the search pattern has two.  Because the search context does not have two tokens 'a' and 'b' that are adjacent, I believe that this query must also return false.  However, if the query had been written:
   'a.b' ftcontains 'a' ftand 'b'
then it would of course return true. 

If I'm wrong, please correct my analysis!
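
A minimal Python sketch of the tokenizer assumed in this analysis. The function names and the regular expression are hypothetical; XQuery Full Text leaves tokenization implementation-defined, so this models only one possible behavior, in which each punctuation character becomes a token of its own.

```python
import re

def tokenize_with_punctuation(text):
    # Hypothetical tokenizer: each run of word characters is a token,
    # and each punctuation character is a separate token.
    return re.findall(r"\w+|[^\w\s]", text)

def contains_phrase(context, pattern):
    # True if the pattern's tokens occur as one adjacent run in the context.
    hay = tokenize_with_punctuation(context)
    needle = tokenize_with_punctuation(pattern)
    return any(hay[i:i + len(needle)] == needle
               for i in range(len(hay) - len(needle) + 1))

def contains_all(context, patterns):
    # Rough stand-in for ftand: every pattern must match somewhere.
    return all(contains_phrase(context, p) for p in patterns)

print(tokenize_with_punctuation("a.b"))   # ['a', '.', 'b']
print(contains_phrase("a b", "a.b"))      # False: no '.' token in the context
print(contains_phrase("a.b", "a b"))      # False: 'a' and 'b' are not adjacent
print(contains_all("a.b", ["a", "b"]))    # True
```

With a tokenizer that treats "." as a separator instead, the first two queries would come out differently, which is the point under discussion.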

P.S., I'm working today on the remaining issues in your bug report to resolve the ones I can and assign the others.

Comment 4 Christian Gruen 2009-02-06 19:07:23 UTC

Hi Jim,

thank you for commenting on the tokenization issue. If I get it right, tokenization of punctuation eventually is implementation dependent as well, so - as long as it is not specified whether punctuation is to be treated as its own token - the two discussed queries..

  'a b' ftcontains 'a.b' 
  'a.b' ftcontains 'a b'

..can either return true or false. Concerning the examples in 4.1.1, punctuation is ignored in the tokenization process - or, to put it differently, space and punctuation are treated the same way here:

  "Ford Mustang 2000, 65K, excellent [...]"
  -> Ford(1) Mustang(2) 2000(3), 65K(4), excellent(5)

So the following queries..

  "Ford Mustang 2000, 65K" ftcontains "2000 65K"
  "Ford Mustang 2000 65K" ftcontains "2000, 65K"

..should return true for these examples. What do you think?
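
The reading of 4.1.1 used here can be sketched in Python: punctuation and whitespace only separate tokens, and each token gets a 1-based position as in the spec example. The helper names are hypothetical, and the sketch assumes a simple word-character tokenizer rather than any particular implementation.

```python
import re

def tokenize_with_positions(text):
    # Hypothetical tokenizer: punctuation and whitespace are separators only.
    # Positions are 1-based, matching the "Ford(1) Mustang(2) ..." notation.
    return [(tok, pos)
            for pos, tok in enumerate(re.findall(r"\w+", text), start=1)]

def phrase_positions(context, pattern):
    # Start positions at which the pattern tokens occupy consecutive positions.
    hay = [tok for tok, _ in tokenize_with_positions(context)]
    needle = [tok for tok, _ in tokenize_with_positions(pattern)]
    return [i + 1 for i in range(len(hay) - len(needle) + 1)
            if hay[i:i + len(needle)] == needle]

print(tokenize_with_positions("Ford Mustang 2000, 65K, excellent"))
print(phrase_positions("Ford Mustang 2000, 65K", "2000 65K"))   # [3]
print(phrase_positions("Ford Mustang 2000 65K", "2000, 65K"))   # [3]
```

Under this model the comma never reaches the matcher, so both queries find the phrase starting at position 3.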

Christian, BaseX Team 
http://www.basex.org

Comment 5 Christian Gruen 2009-02-06 19:16:20 UTC

..another question (I hope I didn't overlook this point in the specification): can spaces be handled as tokens as well? In this case, we would run into trouble with many position filters in the test suite..

 'A B' ftcontains ('A' ftand 'B') distance at most 0 words
  -> true/false?

Comment 6 Jim Melton 2009-02-06 23:13:02 UTC

Christian, I think that I still disagree. The only way that "a.b" could contain "a b" or vice versa is if the tokenizer recognized "." as a token separator and not as a token. Because tokenization is so completely implementation-defined, anything is possible. Maybe Pat or Mary will have some good ideas about this, 'cause I have too little real-world Full Text experience to be very certain. 

W.R.T. your question about whether spaces can be recognized as tokens: Again, because tokenization is so completely implementation-defined, it's possible. However, I do not believe that a tokenizer that did that would survive in the marketplace, so I don't believe we need to accommodate that possibility. 

Therefore, your query:
   'A B' ftcontains ('A' ftand 'B') distance at most 0 words
would, IMHO, always return true.
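
A small Python sketch of why the answer depends on the tokenizer. Both tokenizers and the `distance` helper are hypothetical; distance is modeled as the number of tokens strictly between the two matches.

```python
import re

def word_tokens(text):
    # Assumed conventional tokenizer: whitespace only separates tokens.
    return re.findall(r"\S+", text)

def space_tokens(text):
    # Hypothetical tokenizer that also emits each whitespace run as a token.
    return re.findall(r"\S+|\s+", text)

def distance(tokens, first, second):
    # Number of tokens strictly between the first occurrences of two tokens.
    return abs(tokens.index(first) - tokens.index(second)) - 1

print(distance(word_tokens("A B"), "A", "B"))    # 0: "at most 0 words" holds
print(distance(space_tokens("A B"), "A", "B"))   # 1: the same filter would fail
```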

Comment 7 Jim Melton 2009-02-06 23:18:52 UTC

In this comment, I'm going to respond to some more of your bug reports:

[1] ft-3.2-examples-q5.xq:

You're right about having only a minimum of queries with scoring, because it's
very difficult to predict what the results might be.  The XML Query WG and XSL
WG have a sort of policy that requires us to place every example in the spec
into the test suite, and this is one of those.  I'm not entirely sure how to
resolve this dilemma other than to either (1) change the comparison specified
in the catalog to "inspect" or (2) add a second possible result of "no file at
all".  The TF will discuss this issue. 


[2] ft-3.3-examples-q1.xq:
You're right. This should be changed in the spec and in the test. 


[7] FTPrimary-FTWords-any-q4b.xq:
You are obviously correct.  Fixed.  (Note that this result had all 9 paragraphs
and should have had none, but the result of the next test you cite had no
paragraphs and should have had all 9.  I suspect the result files got
reversed.) 


[8] FTPrimary-FTWords-anyword-q4b.xq
Again, obviously correct. Fixed.  (Note that this result had no paragraphs and
should have had all 9, but the result of the previous test you cite had all 9
paragraphs and should have had none.  I suspect the result files got reversed.)



[9] FTPrimary-FTWords-anyword-q2b_result.xml
Good catch! Fixed. 


[10] FTWords/*.xml
I appreciate the comment, but it would be extremely helpful if you could
identify the specific locations. 


[11] FTPrimary-FTWords-phrase-q3a.xq - -q4a.xq
I agree.  I've fixed the problems in each of the tests in slightly different
ways. 
I also think that FTPrimary-FTWords-phrase-q4b.xq had the wrong result; I
believe that its result file should contain only an empty <paragraphs> element.
Do you agree?


[12] FTOr-badexpr1.xq
Good question; the TF will discuss this. 


[13] FTNot-q1.xq
I think I agree with you, but the TF should discuss this.  Of course, the
question is 'Does book[para ftcontains ftnot "Ninja"] mean "find all books that
have a para element that doesn't contain 'Ninja'" or "Find all books that do
not have a para element that contains 'Ninja'"?'  You believe that it's the
former, and I think you're probably right. 


[15] FTNot-q2.xq / -q3.xq / -q4.xq / -q5.xq
I agree; I have fixed q2 and q3, but will ask the test author to fix q4 and q5
(because the result will change from .xml to .txt and the comparison from XML
to Fragment). 
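
The difference between "any", "any word", and "phrase" behind items [7], [8], and [11] can be sketched in Python. The sample paragraph is hypothetical (it is not taken from the test suite's input document), matching is assumed to be case-insensitive, and the function names are illustrative only.

```python
import re

def tokens(text):
    # Assumption: case-insensitive matching, punctuation acts as a separator.
    return re.findall(r"\w+", text.lower())

def has_phrase(hay, needle):
    return any(hay[i:i + len(needle)] == needle
               for i in range(len(hay) - len(needle) + 1))

def ft_any(context, strings):
    # "any": at least one of the strings must match as a phrase.
    return any(has_phrase(tokens(context), tokens(s)) for s in strings)

def ft_any_word(context, strings):
    # "any word": tokens of all strings form one set; one hit is enough.
    found = set(tokens(context))
    return any(tok in found for s in strings for tok in tokens(s))

def ft_phrase(context, strings):
    # "phrase": tokens of all strings are concatenated into a single phrase.
    needle = [tok for s in strings for tok in tokens(s)]
    return has_phrase(tokens(context), needle)

para = "The FTAnyallOption specifies how containment is checked."  # hypothetical
queries = ["FTAnyallOption weekend", "voting specifies"]
print(ft_any(para, queries))                                 # False
print(ft_any_word(para, queries))                            # True
print(ft_phrase(para, ["FTAnyallOption", "containment"]))    # False
print(ft_phrase(para, ["how", "containment"]))               # True
```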


And, unfortunately, that's all I have time to address today. With luck, other
TF members will be able to work on the remaining issues soon.

Comment 8 Christian Gruen 2009-02-07 00:09:11 UTC

Hi Jim,

thanks again for the quick reply, and sorry for persisting with the tokenizer issue. Yes, I'll be interested in what the other TF members say about this, as I'm still not sure how to interpret the examples in 4.1.1. As stated before, using a Western tokenizer and assuming that commas and periods are treated the same way, I would expect all of the following queries to yield true (otherwise, the position information, which is shown in parentheses in this section, would not make much sense):

   "Ford Mustang 2000, 65K" ftcontains "2000 65K"
   "Ford Mustang 2000, 65K" ftcontains "2000.65K"
   "Ford Mustang 2000  65K" ftcontains "2000.65K"

Concerning #10 and #11:

> [10] FTWords/*.xml
> I appreciate the comment, but it would be extremely helpful
> if you could identify the specific locations. 

Hm, there might have been around 10 files with missing spaces. You could try search and replace in this directory for the following strings, I hope that the three replacements will cover all cases:

 "checked.<"        ->  "checked. <"
 "ue</nt>is"        ->  "ue</nt> is"
 "on</nt>produces"  ->  "on</nt> produces"

> [11] FTPrimary-FTWords-phrase-q3a.xq - -q4a.xq
> I agree.  I've fixed the problems in each of the tests in
> slightly different ways. 
> I also think that FTPrimary-FTWords-phrase-q4b.xq had the wrong result;
> I believe that its result file should contain only an empty <paragraphs>
> element. Do you agree?

In my version from Jan 24, the result was already an empty <paragraphs> element; but I haven't checked the latest version.

Christian, BaseX Team 
http://www.basex.org

Comment 9 Pat Case 2009-02-16 17:30:50 UTC

Hi Christian.

Re:
[18] ftwildcard-q2.xq
[19] ftwildcard-q5.xq

> --This query looks for "site." site followed a period (not an indicator). 
> There is a sites. The title ends in site but does not end in a period. I 
> don't see "site." in Book 2. [...]

> This is an interesting point for discussion. I had another look into the XQFT
> Tokenization section (4.1). If I get it right, the tokenizer won't care about
> characters which are not part of tokens; so I would expect the two following
> queries to return true:
>
>   'a b' ftcontains 'a.b' 
>   'a.b' ftcontains 'a b' 
>
> This is why I would expect the dot in "site." to be ignored in the default
> (without wildcards) mode. - Please tell me if I got something wrong.

--I see your point. I am an end user and I sometimes project my preferences where they don't belong. I would love to have a search engine that can find special characters, in this case to allow me to find "site." as a token. But you are absolutely correct that, depending on the tokenization, finding "site" would be just as valid, and that would return Books 1 and 2. So I have added a qualifier to the descriptions, changed the comparator values to Inspect, and added second output files to q2 and q5.

Thanks again for your comments. It is very important to get this right.

Pat

Comment 10 Pat Case 2009-02-16 20:06:59 UTC

Christian,

We addressed one more item during the FTTF telecon today:

[13] FTNot-q1.xq

This one returns all <book> elements in which the <para> element does not
contain the word "Ninja". In the result, even the books are listed which have
no <para> element; I would only expect 4 instead of 9 titles as result.

--The Task Force agreed with your comment and added 4 to the expected result.

Pat

Comment 11 Mary Holstege 2009-02-26 19:14:23 UTC

[12] Empty sequence is allowed, and produces no matches.  Added a different test case to test the case where the expression returns something that can't be treated as xs:string*.

[15] Correct.

[16](a,b) Correct. Fixed results.

[17] Correct. Fixed results.

[20] Correct. Swapped queries.

[21]/[22] Correct, confused with distance. Fixed math.

[23] Fixed queries: intended to have single scope.

[24] Fixed result.

[25] Result already fixed.

[26], [27] Result fixed.
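
The window/distance mix-up behind items [21] and [22] can be sketched in Python. The helper names are hypothetical; the sketch assumes that a window counts every token position it spans, both ends included, while distance counts only the tokens in between.

```python
def window_size(positions):
    # Token positions spanned by the matches, both ends included.
    return max(positions) - min(positions) + 1

def tokens_between(pos_a, pos_b):
    # Tokens strictly between two matches.
    return abs(pos_a - pos_b) - 1

words = "swift application of physical".split()
pos = {word: i for i, word in enumerate(words, start=1)}

print(window_size([pos["swift"], pos["physical"]]))     # 4
print(tokens_between(pos["swift"], pos["physical"]))    # 2
# Two distinct tokens always span at least 2 positions, so under this
# model "window 0 words" or "window 1 words" cannot match both of them.
print(window_size([1, 2]))                              # 2
```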

Comment 12 Mary Holstege 2009-02-26 21:16:24 UTC

Christian, I believe this resolves everything you reported in this bug. If you agree, could you please close this bug. Thank you.

Comment 13 Christian Gruen 2009-02-26 21:24:32 UTC

Dear Pat, thanks for your comments. I'll close the bug (issue [28], however, is still open). - Christian