This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Bug 6304 - [FT] attribute value tokenization
Summary: [FT] attribute value tokenization
Status: CLOSED FIXED
Alias: None
Product: XPath / XQuery / XSLT
Classification: Unclassified
Component: Full Text 1.0
Version: Candidate Recommendation
Hardware: PC Linux
Importance: P2 normal
Target Milestone: ---
Assignee: Jim Melton
QA Contact: Mailing list for public feedback on specs from XSL and XML Query WGs
URL: http://www.w3.org/TR/xpath-full-text-10
 
Reported: 2008-12-12 08:38 UTC by Petr Pleshachkov
Modified: 2011-01-06 15:44 UTC

Description Petr Pleshachkov 2008-12-12 08:38:13 UTC
Dear authors,

It looks like the example from 3.6.2 Window Selection is not correct.


"The following expression returns true, because the title element contains "Web Site Usability". A similar query on the p element would not return true, because its occurrences of "web site" and "usability" are not within a window of 3:

/books/book//title ftcontains "web site" ftand
"usability" window 3 words
"

But actually, "Web Site Usability" is the value of an attribute: shortTitle="Improving Web Site Usability"

But in the spec you say that
"but it occurs in an attribute value, and so is not subject to tokenization."

And in section 4.1.1 Examples you have an example:

<p kind='secret'>Sensitive material <!-- secret --></p> ftcontains 'secret'

"the following example must return false, because the 'secret' only occurs within an attribute and a comment, neither of which contributes characters to the string value of the 'p' element node:"

Kind regards,
Peter Pleshachkov
Comment 1 Jim Melton 2008-12-22 02:24:21 UTC
[personal response]

Peter, thanks for your comment.  It's gratifying to know that you're reviewing the spec so thoroughly. 

We have discovered (because of other comments, for example) that our explanation of tokenization is somewhat confusing.  We have adopted changes to that explanation that will be visible when we publish the next revision of the CR document. 

In the first example, you observe that:

   But, actually the "Web Site Usability" is the value of an attribute:
   shortTitle="Improving Web Site Usability"

   But in the spec you say that
   "but it occurs in an attribute value, and so is not subject to tokenization."

The way that tokenization is intended to work is basically this: If a query explicitly requests to search within an attribute value, then the value of the attribute is tokenized and its tokens are the subject of the search.  However, if a query requests to search within element content only, the values of any attributes are not included in the search context and are therefore not subjects of the search. 

This is now clearly stated in Section 4.1, Tokenization, in list item 2, which reads:

   2. Tokenization of an item MUST include only tokens derived from the string value of that item. The string value is defined in [XQuery 1.0 and XPath 2.0 Data Model] in Section 2.6.5 String Values; for element nodes it does not include the contents of attributes, but for attribute nodes it does.
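
As a rough illustration of list item 2, the sketch below uses Python's xml.etree.ElementTree to approximate the string value of the 'p' element from the Section 4.1.1 example quoted in the description. The helper name element_string_value is hypothetical. ElementTree's default parser discards comments, which happens to agree with the rule that comments do not contribute to an element's string value; this is an approximation for illustration, not an implementation of the data model.

```python
import xml.etree.ElementTree as ET


def element_string_value(elem):
    """Approximate the string value of an element node: the concatenated
    text of the element and its descendants. Attribute values are not
    included (hypothetical helper, not part of any spec or library)."""
    return "".join(elem.itertext())


# The 'p' element from the Section 4.1.1 example.
p = ET.fromstring("<p kind='secret'>Sensitive material <!-- secret --></p>")

# Search context when the query targets the element itself.
element_text = element_string_value(p)

# Search context only when the query explicitly targets the attribute.
attribute_text = p.get("kind")

print(repr(element_text))
print(repr(attribute_text))
```

Under these assumptions, 'secret' is absent from element_text and present only in attribute_text, which is why a search of the element returns false while a search of the attribute node itself could match.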


Therefore, I believe that the example in 3.6.2 Window Selection is correct as written and as explained.  Because this response has not been reviewed by the groups, I am not marking the comment INVALID, but I expect that will be the decision of the groups. 
Comment 2 Michael Dyck 2008-12-22 06:52:57 UTC
I disagree; I believe it's a valid bug.

The example query:
      /books/book//title ftcontains "web site" ftand
      "usability" window 3 words

The only 'title' in the sample doc is:
      <title shortTitle="Improving Web Site Usability">Improving  
        the Usability of a Web Site Through Expert Reviews and
        Usability Testing</title>

While it's true that the value of the shortTitle attribute contains "web site" and "usability" within a 3-word window, that value does not contribute to the string value of the title element, and so is not in the search context for this FTContainsExpr.

And while it's true that the string value of the title element does contain "web site" and "usability", the smallest window that contains both phrases is a 5-word window, which does not satisfy the FTWindow filter.

Thus, the prose is incorrect when it says:
    [The] expression returns true, because the title element
    contains "Web Site Usability".
Rather, the expression returns false, and the title element does not contain "Web Site Usability" (at least, not in a way that's pertinent to the example query).
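
The window arithmetic in this comment can be checked with a small Python sketch. It assumes a deliberately naive tokenizer (lowercased runs of word characters), whereas the Full Text spec leaves tokenization implementation-defined. The helper names tokenize, phrase_spans, and smallest_window are hypothetical, and the code models only the "smallest span covering one match of each phrase" idea, not the full FTWindow semantics.

```python
import re
import xml.etree.ElementTree as ET
from itertools import product

TITLE_XML = (
    '<title shortTitle="Improving Web Site Usability">Improving  \n'
    '  the Usability of a Web Site Through Expert Reviews and\n'
    '  Usability Testing</title>'
)


def tokenize(text):
    # Assumption: tokens are lowercased runs of word characters.
    return re.findall(r"\w+", text.lower())


def phrase_spans(tokens, phrase):
    # (start, end) token positions of every occurrence of the phrase.
    words = tokenize(phrase)
    n = len(words)
    return [(i, i + n - 1)
            for i in range(len(tokens) - n + 1)
            if tokens[i:i + n] == words]


def smallest_window(tokens, phrases):
    # Size in words of the smallest span covering one match of each
    # phrase, or None if some phrase does not occur at all.
    spans = [phrase_spans(tokens, p) for p in phrases]
    if any(not s for s in spans):
        return None
    return min(
        max(end for _, end in combo) - min(start for start, _ in combo) + 1
        for combo in product(*spans)
    )


title = ET.fromstring(TITLE_XML)
phrases = ["web site", "usability"]

# String value of the element: attribute values are excluded.
element_window = smallest_window(tokenize("".join(title.itertext())), phrases)

# Value of the shortTitle attribute, searched only if queried explicitly.
attribute_window = smallest_window(tokenize(title.get("shortTitle")), phrases)

print(element_window, attribute_window)
```

With this tokenizer, the element content needs a window of 5 words ("Usability of a Web Site"), while the attribute value fits in 3, matching the analysis above: "window 3 words" fails against the title element.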
Comment 3 Jim Melton 2008-12-22 19:07:33 UTC
Sigh...my bad...I somehow overlooked the fact that the window of "truth" was actually 5 words, not 3.  Michael is, of course, correct that the prose explaining the example is wrong.
Comment 4 Mary Holstege 2009-02-19 21:54:43 UTC
Modified the description of the example and the comparison example (in the test suite also) to clarify that a window of 5 is required and the phrase "Web Site Usability" in the attribute is not germane.

If you are satisfied with this resolution of the bug, please mark it as CLOSED.