This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 6809 - [FT] Test Suite - Thesaurus Queries
Summary: [FT] Test Suite - Thesaurus Queries
Status: CLOSED FIXED
Alias: None
Product: XPath / XQuery / XSLT
Classification: Unclassified
Component: Full Text 1.0
Version: Candidate Recommendation
Hardware: All All
Importance: P2 normal
Target Milestone: ---
Assignee: Jim Melton
QA Contact: Mailing list for public feedback on specs from XSL and XML Query WGs
Reported: 2009-04-14 01:43 UTC by Christian Gruen
Modified: 2009-05-04 12:43 UTC

Description Christian Gruen 2009-04-14 01:43:32 UTC
Dear task force,

I decided to add a basic Thesaurus implementation to BaseX to support and test the remaining queries. I frankly admit that I'm no Thesaurus expert at all, so I mainly focused on the hints in the specification and the existing tests. As I'm not sure if I completely understood what's going on in the test examples, here are some more questions/bug indications:


[1] ft-3.4.3-examples-q1

The usability.xml thesaurus file returns the synonym "tasks" for the query input "duties" - but the queried document node includes only the word in singular ("task" instead of "tasks"). Is this intended?


[2] ft-3.4.3-examples-q2

The thesaurus offers the terms "navigation", "layout" and "terminology" for the query phrase "web site components", but none of these terms occurs in the tested document node.


[3] ft-3.4.3-examples-q3.xq

In this query, words similar to "Merrygould" are to be found. As "case insensitive" is the default option, the term is converted to "merrygould" in my tests - so the thesaurus doesn't return any result.


[4] Probably a naïve question: do all thesaurus entries work in a "bidirectional" way? I.e., if "A" is a synonym for "B", do I get "A" if I look for "B", and "B" if I look for "A"? Beyond that, are all synonyms bidirectional? One could argue that "Marigold" sounds like "Merrygould", but "Merrygould" doesn't sound like "Marigold". In that case, query [3] above would only return results in the direction opposite to the current one.


[5] ft-3.4.3-expressions-q3

The thesaurus returns "software" for the term "program"; this term seems to be included in two books (number 1 and 3), but the current result contains only book 1.


[6] ft-3.4.3-expressions-q5

..references the missing file "TechnicalThesaurus.xml".


[7] ft-3.4.3-expressions-q6
	
parentheses missing before "default" and after "NT". I guess that the Thesaurus should also accept the original query terms and not only synonyms; is this correct? If "yes", then book number 3 should be added as result, as it contains the term "Computers".


[8] thesaurus-queries-results-q2 / q2b

As the relationship used here is "narrower terms" (instead of "NT" or "narrower term"), do you expect implementations to recognize all such spelling variants?


[9] thesaurus-queries-results-q5 / q5b / q6 / q6b

"spellcheck.xml" and "OurTaxonomy.xml" don't exist yet.


[10] full-text-composability-queries-results-q2b

Parsing issue: "]" missing after "stemming"


[11] full-text-composability-queries-results-q3 / q3b

Parsing issue: some opening and closing parentheses are missing.



I'm currently running the Thesaurus as the last match option, as I saw that the execution order of match options seems to be implementation defined. It may well be that different orders could result in different results - but I haven't really thought this through.
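
To illustrate why the order might matter, here is a minimal Python sketch (not BaseX code). It assumes a hypothetical in-memory thesaurus keyed on the exact spelling used in the thesaurus file, and it compares lower-casing the query term before the lookup with looking the term up as written. The names THESAURUS, expand_case_first and expand_thesaurus_first are made up for this illustration.

```python
# Hypothetical thesaurus: query term -> related terms, keyed as spelled in the file.
THESAURUS = {"Merrygould": ["Marigold"]}


def expand_case_first(term):
    """Normalize the case first, then look the term up in the thesaurus."""
    normalized = term.lower()
    return [normalized] + THESAURUS.get(normalized, [])


def expand_thesaurus_first(term):
    """Look the term up as written, then normalize the case of all results."""
    expanded = [term] + THESAURUS.get(term, [])
    return [t.lower() for t in expanded]


print(expand_case_first("Merrygould"))       # ['merrygould']
print(expand_thesaurus_first("Merrygould"))  # ['merrygould', 'marigold']
```

With the case option applied first, the lower-cased key no longer matches the thesaurus entry, which is the behaviour described for query [3]; with the thesaurus applied first, "marigold" is part of the expansion.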

In conclusion, as I indicated at the beginning, my knowledge of thesauri is very limited. So it may be helpful to talk directly to one of you in the near future to get more insight into some of the open issues.

Christian
Comment 1 Christian Gruen 2009-04-15 04:15:49 UTC
A little update: please forget my question [4] concerning synonyms. All recommended relationships from the specification are now implemented in a bidirectional way (except "TT"); if unknown relationships such as "sounds like" are encountered, they are stored unidirectionally. This is why I guess that most implementations would probably benefit if the thesaurus file "soundex.xml" was rewritten from..

  <entry>
    <term>Marigold</term>
    <synonym>
      <term>Merrygould</term>
      <relationship>sounds like</relationship>
    </synonym>
  </entry>

..to..

  <entry>
    <term>Merrygould</term>
    <synonym>
      <term>Marigold</term>
      <relationship>sounds like</relationship>
    </synonym>
  </entry>

I'm still interested to hear your opinion about the remaining topics!
Christian
Comment 2 Michael Dyck 2009-04-15 18:51:03 UTC
(In reply to comment #0)
>
> [10] full-text-composability-queries-results-q2b
> 
> Parsing issue: "]" missing after "stemming"

Fixed!
Comment 3 Michael Dyck 2009-04-15 20:09:38 UTC
(In reply to comment #0)
> 
> [11] full-text-composability-queries-results-q3 / q3b
> 
> Parsing issue: some opening and closing parentheses are missing.

Fixed!
Comment 4 Pat Case 2009-04-16 17:03:31 UTC
Hi Christian.

Some responses follow:

[1] ft-3.4.3-examples-q1

The usability.xml thesaurus file returns the synonym "tasks" for the query
input "duties" - but the queried document node includes only the word in
singular ("task" instead of "tasks"). Is this intended?

--Yes. I didn't notice that the example in the language document was wrong when I copied it into the test suite. I have changed "duties" to "duty" in the query, added "duty" and "task" as synonyms in the thesaurus, and updated the description in the catalog to fix it in the test suite. I corrected the query and the description in the language document. I did not build the language document.


[2] ft-3.4.3-examples-q2

The thesaurus offers the terms "navigation", "layout" and "terminology" for the
query phrase "web site components", but none of these terms occurs in the
tested document node.

--Sigh. This example does not work against the sample document in the language document. It works against the sample document in the use cases. So I reworked it to work against the sample document in the language document. I searched on "people" and set "users" up as an NT for "people" in the usability thesaurus. Fixed in both the test suite and the language document.


[3] ft-3.4.3-examples-q3.xq

In this query, words similar to "Merrygould" are to be found. As "case
insensitive" is the default option, the term is converted to "merrygould" in
my tests - so the thesaurus doesn't return any result.

--Realizing now that case insensitive does not mean lower case, we search on "Merrygould". "Merrygould" and "Marigold" are in the thesaurus. "Marigold" is found in the sample document, so I am at a loss as to why we are talking about case at all in this query. I don't understand how Mary's comments apply to this one. I have made no changes.


[5] ft-3.4.3-expressions-q3

The thesaurus returns "software" for the term "program"; this term seems to be
included in two books (number 1 and 3), but the current result contains only
book 1.

--So true. I added Bk 3 to the result.


[6] ft-3.4.3-expressions-q5

..references the missing file "TechnicalThesaurus.xml".

--My bad again. I corrected the thesaurus name to UsabilityThesaurus.xml.


[7] ft-3.4.3-expressions-q6

parentheses missing before "default" and after "NT". I guess that the Thesaurus
should also accept the original query terms and not only synonyms; is this
correct? If "yes", then book number 3 should be added as result, as it contains
the term "Computers".

--Yes. Added the parentheses to the query. Added Bk 3 to the results. I also changed the operator in the query from ftor to ftand, otherwise since program is nowhere in the sample document, there would be no result at all.


[8] thesaurus-queries-results-q2 / q2b

As the relationship used here is "narrower terms" (instead of "NT" or "narrower
term"), do you expect implementations to recognize all such spelling variants?

--Ouch. That probably was a bit rude of me. I have duplicated the entry in the thesaurus and made the relationships in the second copy "narrower terms", so that no translation from NT to narrower terms is required. 

[9] thesaurus-queries-results-q5 / q5b / q6 / q6b

"spellcheck.xml" and "OurTaxonomy.xml" don't exist yet.

--I added the 2 thesauri.

Again, many thanks Christian for pointing these out so I could correct them.

Please let me know what you think. If you think these responses, combined with Michael D's, are adequate, please close the bug.

Pat Case
Comment 5 Christian Gruen 2009-04-16 18:33:04 UTC
Hi Pat,

you are welcome. I checked the queries once more. Before closing this bug, it would be great if you could have another look at the following issues:


[3] ft-3.4.3-examples-q3.xq

It's good to hear your opinion on this query, as I surely had quite an implementation-centered approach in mind here. As I feel that this issue is more complicated than I first thought, I'll add an extra "bug" to discuss the relationship between Thesaurus and match options.

Considering the relationship between "Merrygould" and "Marigold", I would indeed expect the "soundex.xml" file to be modified. This was my suggestion..

OLD:
  <term>Marigold</term>
  <synonym>
    <term>Merrygould</term>
    <relationship>sounds like</relationship>
  </synonym>

NEW:
  <term>Merrygould</term>
  <synonym>
    <term>Marigold</term>
    <relationship>sounds like</relationship>
  </synonym>

If I process a thesaurus request, I look up the input word (Merrygould) and return all words that are linked to this term by the "sounds like" relationship. I have no access to the complete ISO 2788 standard, but, as far as I know, the "sounds like" relationship is not defined there. So an XQuery implementation has to "guess" how an "unknown" relationship like this one works. I treat all undefined relationships as unidirectional, i.e. I will currently return "Merrygould" for the input term "Marigold" - but not the other way round. If the XML file is modified as proposed above, the relationship can be answered consistently with the other thesaurus examples.
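
The lookup described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the BaseX implementation: the set SYMMETRIC is a hypothetical choice of relationship names that are stored in both directions, and build_index and lookup are names invented for this example. Any relationship outside that set, such as "sounds like", is stored one way only.

```python
from collections import defaultdict

# Assumption for this sketch: relationship names treated as two-way.
SYMMETRIC = {"synonym", "RT"}


def build_index(entries):
    """Index (term, relationship) -> set of related terms.

    entries: iterable of (entry_term, related_term, relationship) triples.
    """
    index = defaultdict(set)
    for term, related, rel in entries:
        index[(term, rel)].add(related)
        if rel in SYMMETRIC:
            # Known two-way relationship: also store the reverse direction.
            index[(related, rel)].add(term)
    return index


def lookup(index, term, rel):
    """Return the sorted related terms for a query term and relationship."""
    return sorted(index.get((term, rel), set()))


# The entry as it appears in the OLD version of soundex.xml.
index = build_index([("Marigold", "Merrygould", "sounds like")])
print(lookup(index, "Marigold", "sounds like"))    # ['Merrygould']
print(lookup(index, "Merrygould", "sounds like"))  # []
```

With the OLD entry, a query for "Merrygould" finds nothing; swapping the two terms as in the NEW entry, or storing both directions, makes the same lookup return "Marigold".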

If you have a different opinion or think I'm wrong, don't hesitate to tell me.


[5] ft-3.4.3-expressions-q3

Now, result should be defined as "Fragment" in XQFTCatalog.xml..


[9] thesaurus-queries-results-q5 / q5b / q6 / q6b

Different spellings: "misspelling-of" vs "misspelling of"..


Christian

Comment 6 Pat Case 2009-04-16 19:27:28 UTC
Christian,

[3] ft-3.4.3-examples-q3.xq

It's good to hear your opinion on this query, as I surely had quite an
implementation-centered approach in mind here. As I feel that this issue is
more complicated than I first thought, I'll add an extra "bug" to discuss the
relationship between Thesaurus and match options.

Considering the relationship between "Merrygould" and "Marigold", I would
indeed expect the "soundex.xml" file to be modified. This was my suggestion..

OLD:
  <term>Marigold</term>
  <synonym>
    <term>Merrygould</term>
    <relationship>sounds like</relationship>
  </synonym>

NEW:
  <term>Merrygould</term>
  <synonym>
    <term>Marigold</term>
    <relationship>sounds like</relationship>
  </synonym>

If I process a thesaurus request, I look up the input word (Merrygould) and
return all words that are linked to this term by the "sounds like"
relationship. I have no access to the complete ISO 2788 standard, but, as far
as I know, the "sounds like" relationship is not defined there. So an XQuery
implementation has to "guess" how an "unknown" relationship like this one
works. I treat all undefined relationships as unidirectional, i.e. I will
currently return "Merrygould" for the input term "Marigold" - but not the
other way round. If the XML file is modified as proposed above, the
relationship can be answered consistently with the other thesaurus examples.

If you have a different opinion or think I'm wrong, don't hesitate to tell me.

--I see "sounds like" as a two-way equivalence similar to synonym, but I don't claim to know how the thesaurus should be structured either. So to get this solved, I have put both entries in the thesaurus. Hope that is OK.


[5] ft-3.4.3-expressions-q3

Now, result should be defined as "Fragment" in XQFTCatalog.xml..

--Done. 

[9] thesaurus-queries-results-q5 / q5b / q6 / q6b

Different spellings: "misspelling-of" vs "misspelling of"..

--I looked in the queries, the spellcheck thesaurus, and the use cases and don't see any hyphenated versions. Where are you looking?

Pat
Comment 7 Christian Gruen 2009-04-16 20:29:29 UTC
..continued..


[3] ft-3.4.3-examples-q3.xq

--I see "sounds like" as a two-way equivalence similar to synonym, but I don't
claim to know how the thesaurus should be structured either. So to get this
solved, I have put both entries in the thesaurus. Hope that is OK.

Yes, this is fine as well. A minor issue: it should be rewritten from 

    <term>Merrygould</term>
    <synonym>
      <term></term>
      <relationship>Marigold</relationship>
    </synonym>

...to...

    <term>Merrygould</term>
    <synonym>
      <term>Marigold</term>
      <relationship>sounds like</relationship>
    </synonym>


[9] thesaurus-queries-results-q5 / q5b / q6 / q6b

--I looked in the queries, the spellcheck thesaurus, and the use cases and
don't see any hyphenated versions. Where are you looking?

Sorry, I mixed this one up. The hyphenated version appears in "usability.xml", but it isn't used anywhere.

Instead, I would suggest extending the "spellcheck.xml" file in the same way as the "soundex.xml" file; otherwise the logic of this thesaurus is the opposite of the other ones. An example..


a) "users" ftcontains "people" with thesaurus at "usability.xml"
     relationship "NT"

  <term>people</term>
  <synonym>
    <term>users</term>
    <relationship>NT</relationship>
  </synonym>

  -> true


b) "successful" ftcontains "sucessfull" with thesaurus at "spellcheck.xml"
     relationship "misspelling of"

  <term>successful</term>
  <synonym>
    <term>sucessfull</term>
    <relationship>misspelling of</relationship>
  </synonym>

  -> false...
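
The two cases can be reproduced with a small Python sketch. It is a simplified stand-in for ftcontains, not the XQuery Full Text semantics: ft_contains is a hypothetical helper, entries are read in one direction only (entry term to listed term), the searched text in case b) is assumed to contain the correct spelling "successful", and tokenization is a plain whitespace split.

```python
def ft_contains(text, query, entries, relationship):
    """Return True if text contains the query term or a thesaurus expansion.

    entries: (entry_term, related_term, relationship) triples, read one way:
    a query for entry_term is expanded with related_term, not the reverse.
    """
    expansions = {query}  # the original query term is kept as well
    for term, related, rel in entries:
        if term == query and rel == relationship:
            expansions.add(related)
    tokens = set(text.lower().split())
    return any(e.lower() in tokens for e in expansions)


usability = [("people", "users", "NT")]
spellcheck = [("successful", "sucessfull", "misspelling of")]
# Same entry with the two terms swapped, following the soundex.xml change.
reversed_spellcheck = [("sucessfull", "successful", "misspelling of")]

print(ft_contains("users", "people", usability, "NT"))  # True
print(ft_contains("successful", "sucessfull", spellcheck, "misspelling of"))  # False
print(ft_contains("successful", "sucessfull", reversed_spellcheck, "misspelling of"))  # True
```

In this sketch case b) fails only because the entry is keyed on the correct spelling; once an entry keyed on the misspelling exists, the same query succeeds.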


Christian

Comment 8 Pat Case 2009-05-04 12:16:50 UTC
Christian,

[3] ft-3.4.3-examples-q3.xq

Yes, this is fine as well. A minor issue: it should be rewritten from 

    <term>Merrygould</term>
    <synonym>
      <term></term>
      <relationship>Marigold</relationship>
    </synonym>

...to...

    <term>Merrygould</term>
    <synonym>
      <term>Marigold</term>
      <relationship>sounds like</relationship>
    </synonym>

--Done.

[9] thesaurus-queries-results-q5 / q5b / q6 / q6b

Instead, I would suggest extending the "spellcheck.xml" file in the same way as
the "soundex.xml" file; otherwise the logic of this thesaurus is the opposite
of the other ones. An example..

a) "users" ftcontains "people" with thesaurus at "usability.xml"
     relationship "NT"

  <term>people</term>
  <synonym>
    <term>users</term>
    <relationship>NT</relationship>
  </synonym>

  -> true


b) "successful" ftcontains "sucessfull" with thesaurus at "spellcheck.xml"
     relationship "misspelling of"

  <term>successful</term>
  <synonym>
    <term>sucessfull</term>
    <relationship>misspelling of</relationship>
  </synonym>

  -> false...


--Done.

Pat Case
Comment 9 Christian Gruen 2009-05-04 12:43:44 UTC
Thanks; as far as I can see, all todos are fixed, so I closed this one.