This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Bug 12057 - [FT] Sentence breaks
Summary: [FT] Sentence breaks
Status: CLOSED FIXED
Alias: None
Product: XPath / XQuery / XSLT
Classification: Unclassified
Component: Full Text 1.0
Version: Proposed Recommendation
Hardware: PC Windows NT
Importance: P2 normal
Target Milestone: ---
Assignee: Jim Melton
QA Contact: Mailing list for public feedback on specs from XSL and XML Query WGs
Reported: 2011-02-14 09:50 UTC by Tim Mills
Modified: 2011-04-04 12:16 UTC

Description Tim Mills 2011-02-14 09:50:35 UTC
In Section 3. Full-Text, the text

"This sample tokenization uses white space, punctuation and XML tags as word-breakers and <p> for paragraph boundaries. The results may be different for other tokenizations."

fails to state what rule has been used to identify sentence boundaries.  The guidelines for running the test suite give the rule as:

"sentences are separated by a period (a/k/a "full stop") followed immediately by white space,"

The example in 3 Full-Text Selections uses the following XML.

<books>
  <book number="1">
    <title shortTitle="Improving Web Site Usability">Improving  
        the Usability of a Web Site Through Expert Reviews and
        Usability Testing</title>
    <author>Millicent Marigold</author>
    <author>Montana Marigold</author>
    <editor>Véra Tudor-Medina</editor>
    <content>
      <p>The usability of a Web site is how well the  
          site supports the users in achieving specified  
          goals. A Web site should facilitate learning,  
          and enable efficient and effective task  
          completion, while propagating few errors.
      </p>
      <note>This book has been approved by the Web Site  
          Users Association.
      </note>
    </content>
  </book>
</books>

Following the rule for sentence breaking from the test suite guidelines, test examples-364-2, derived from section 3.6.4 of the specification, appears to be incorrect. The specification says:

The following expression returns true, because the tokens "usability" and "Marigold" are contained within different sentences:

//book contains text "usability" ftand "Marigold" different sentence

However, the first sentence break appears after the word "goals", so the two words only ever appear in the same sentence.
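
To make the point concrete, here is a rough Python sketch (not part of the specification or the test suite) that applies only the "period followed immediately by white space" rule to the text content of the sample book. It assumes that tags act purely as word breakers and that matching is case-insensitive; the helper names are hypothetical.

```python
import re

# Text content of the sample <book> in document order, markup removed.
# Assumption: tags only break words here; they do not break sentences.
TEXT = (
    "Improving the Usability of a Web Site Through Expert Reviews and "
    "Usability Testing Millicent Marigold Montana Marigold "
    "Véra Tudor-Medina "
    "The usability of a Web site is how well the site supports the users "
    "in achieving specified goals. A Web site should facilitate learning, "
    "and enable efficient and effective task completion, while propagating "
    "few errors. "
    "This book has been approved by the Web Site Users Association."
)


def split_sentences(text):
    # A sentence ends at a period that is immediately followed by white space.
    return [s for s in re.split(r"(?<=\.)\s+", text) if s]


def sentences_containing(word, sentences):
    # Indexes of the sentences that contain the word (case-insensitive).
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return {i for i, s in enumerate(sentences) if pattern.search(s)}


def different_sentence(word_a, word_b, sentences):
    # True if some match of word_a and some match of word_b fall in
    # different sentences.
    a = sentences_containing(word_a, sentences)
    b = sentences_containing(word_b, sentences)
    return any(i != j for i in a for j in b)


sentences = split_sentences(TEXT)
result = different_sentence("usability", "Marigold", sentences)
```

Under these assumptions every occurrence of both words lands in the first sentence, so `result` is `False`, which contradicts the value given in the specification.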

There is no suggestion in the text that the beginning (end) of a paragraph necessarily starts (ends) a sentence.

It is also unclear how paragraph boundaries are identified.  Consider the following input:

<root>
  A <p>B</p> C
</root>

I can see three possibilities:

1.  There are three paragraphs: one containing A, one containing B and one containing C.
2.  There are two paragraphs: one containing A, one containing B C.
3.  There are two paragraphs: one containing A B, one containing C.

It is not clear from the specification which interpretation is correct.
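
The three readings can be sketched in Python as follows. This is only an illustration: the event stream and the two boolean switches are hypothetical, and nothing here is taken from the specification.

```python
# Simplified event stream for: <root> A <p>B</p> C </root>
EVENTS = [
    ("text", "A"),
    ("open_p", None),
    ("text", "B"),
    ("close_p", None),
    ("text", "C"),
]


def paragraphs(events, break_on_open, break_on_close):
    # Group text tokens into paragraphs, starting a new paragraph at
    # <p> and/or </p> depending on the chosen interpretation.
    result, current = [], []
    for kind, value in events:
        if kind == "text":
            current.append(value)
        elif (kind == "open_p" and break_on_open) or (
            kind == "close_p" and break_on_close
        ):
            if current:
                result.append(current)
            current = []
    if current:
        result.append(current)
    return result


interpretation_1 = paragraphs(EVENTS, True, True)    # break at <p> and </p>
interpretation_2 = paragraphs(EVENTS, True, False)   # break at <p> only
interpretation_3 = paragraphs(EVENTS, False, True)   # break at </p> only
```

The three calls yield A | B | C, A | B C, and A B | C respectively, matching the possibilities listed above.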
Comment 1 Michael Dyck 2011-02-21 21:18:08 UTC
(personal response:)

(In reply to comment #0)
> In Section 3. Full-Text, the text
> 
> "This sample tokenization uses white space, punctuation and XML tags as
> word-breakers and <p> for paragraph boundaries. The results may be different
> for other tokenizations."
> 
> fails to state what rule has been used to identify sentence boundaries.

Hm, right. Since we have some examples involving sentences (in section 3.6.4), we should probably copy the text you quoted from the test suite's guidelines.

> There is no suggestion in the text that the beginning (end) of a paragraph
> necessarily starts (ends) a sentence.

We should probably add that to the description of the sample tokenization.

> It is also unclear how paragraph boundaries are identified.  Consider the
> following input:
> 
> <root>
>   A <p>B</p> C
> </root>
> 
> I can see three possibilities:
> 
> 1.  There are three paragraphs: one containing A, one containing B and one
> containing C.
> 2.  There are two paragraphs: one containing A, one containing B C.
> 3.  There are two paragraphs: one containing A B, one containing C.
> 
> It is not clear from the specification which interpretation is correct.

I think they're all conformant (and there are perhaps other possibilities). It's up to each implementation to indicate how it identifies paragraph boundaries (if it supports paragraphs).

In the sample tokenization, I'd say it's clear that A and B are not in the same paragraph (because <p> is a "paragraph boundary"), so #3 is out. I'm not sure it's necessary to describe the sample tokenization precisely enough to distinguish between #1 and #2 -- do you know of any examples or tests where it makes a difference to the result?
Comment 2 Liam R E Quin 2011-02-21 21:28:27 UTC
Note, in some fields of discourse a sentence can indeed span multiple paragraph-like objects - e.g. poetry, or Biblical verses.

I don't think it necessary to forbid this.
Comment 3 Michael Dyck 2011-02-21 22:31:10 UTC
> I don't think it necessary to forbid this.

The spec doesn't forbid implementations from supporting such definitions of paragraph and sentence, and the comments above are not suggesting it should.

This issue is about the sample tokenization, which does not affect what the spec allows or disallows of implementations.
Comment 4 Tim Mills 2011-02-22 08:45:02 UTC
> Do you know of any examples or tests where it makes a difference to the result?

I'm afraid I can't name one, but I'm positive there is one as I had to fiddle around with our tokenizer to match the behaviour expected by the test suite.
Comment 5 Mary Holstege 2011-02-22 15:24:06 UTC
Our spec also says that words, sentences, and paragraphs form a hierarchy, so a paragraph break also implies a sentence break. Since the instructions say that <p> is a paragraph break, there is a sentence boundary at the start of the first <p> element.
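
A minimal Python sketch of this reading, under stated assumptions: the sample text is pre-split by hand at the <p> and </p> tags (treating </p> as a boundary is an assumption that does not affect this query), sentences never cross those boundaries, and matching is case-insensitive. The helper names are hypothetical.

```python
import re

# Text of the sample book, pre-split at the paragraph boundaries.
PARAGRAPHS = [
    # Everything before the first <p>: title, authors, editor.
    "Improving the Usability of a Web Site Through Expert Reviews and "
    "Usability Testing Millicent Marigold Montana Marigold "
    "Véra Tudor-Medina",
    # Content of the <p> element.
    "The usability of a Web site is how well the site supports the users "
    "in achieving specified goals. A Web site should facilitate learning, "
    "and enable efficient and effective task completion, while propagating "
    "few errors.",
    # Text after </p> (the <note> element).
    "This book has been approved by the Web Site Users Association.",
]


def split_sentences(text):
    # A sentence ends at a period that is immediately followed by white space.
    return [s for s in re.split(r"(?<=\.)\s+", text) if s]


def sentences_containing(word, sentences):
    # Indexes of the sentences that contain the word (case-insensitive).
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return {i for i, s in enumerate(sentences) if pattern.search(s)}


# Splitting each paragraph separately keeps sentences inside paragraphs,
# so every paragraph boundary is also a sentence boundary.
sentences = [s for p in PARAGRAPHS for s in split_sentences(p)]

usability = sentences_containing("usability", sentences)
marigold = sentences_containing("Marigold", sentences)
result = any(i != j for i in usability for j in marigold)
```

Here "usability" occurs both in the text before <p> and in the first sentence of the <p> element, while "Marigold" occurs only before <p>, so `result` is `True`, in line with the specification's example.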
Comment 6 Michael Dyck 2011-02-22 16:13:40 UTC
(In reply to comment #5)
> Our spec also says that words, sentences, and paragraphs form a hierarchy,

My mistake. The relevant text is in 4.1 Tokenization, point 6 ("The tokenizer MUST preserve the containment hierarchy"). So the spec *does* forbid what Liam suggests in comment #2.
Comment 7 Mary Holstege 2011-03-01 18:06:43 UTC
The WG agrees to resolve this bug by adding clarifying text to section 3 of the specification. We believe the instructions for the test suite already say this.

Please indicate your satisfaction with this resolution by marking the bug as CLOSED.