<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "https://www.w3.org/Bugs/Public/page.cgi?id=bugzilla.dtd">

<bugzilla version="5.0.4"
          urlbase="https://www.w3.org/Bugs/Public/"
          
          maintainer="sysbot+bugzilla@w3.org"
>

    <bug>
          <bug_id>12057</bug_id>
          
          <creation_ts>2011-02-14 09:50:35 +0000</creation_ts>
          <short_desc>[FT] Sentence breaks</short_desc>
          <delta_ts>2011-04-04 12:16:21 +0000</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>1</classification_id>
          <classification>Unclassified</classification>
          <product>XPath / XQuery / XSLT</product>
          <component>Full Text 1.0</component>
          <version>Proposed Recommendation</version>
          <rep_platform>PC</rep_platform>
          <op_sys>Windows NT</op_sys>
          <bug_status>CLOSED</bug_status>
          <resolution>FIXED</resolution>
          
          
          <bug_file_loc></bug_file_loc>
          <status_whiteboard></status_whiteboard>
          <keywords></keywords>
          <priority>P2</priority>
          <bug_severity>normal</bug_severity>
          <target_milestone>---</target_milestone>
          
          
          <everconfirmed>1</everconfirmed>
          <reporter name="Tim Mills">tim</reporter>
          <assigned_to name="Jim Melton">jim.melton</assigned_to>
          <cc>holstege</cc>
    
    <cc>jmdyck</cc>
    
    <cc>liam</cc>
          
          <qa_contact name="Mailing list for public feedback on specs from XSL and XML Query WGs">public-qt-comments</qa_contact>

      

      

      

          <comment_sort_order>oldest_to_newest</comment_sort_order>  
          <long_desc isprivate="0" >
    <commentid>45464</commentid>
    <comment_count>0</comment_count>
    <who name="Tim Mills">tim</who>
    <bug_when>2011-02-14 09:50:35 +0000</bug_when>
    <thetext>In Section 3. Full-Text, the text

&quot;This sample tokenization uses white space, punctuation and XML tags as word-breakers and &lt;p&gt; for paragraph boundaries. The results may be different for other tokenizations.&quot;

fails to state what rule has been used to identify sentence boundaries.  The guidelines for running the test suite give the rule as:

&quot;sentences are separated by a period (a/k/a &quot;full stop&quot;) followed immediately by white space,&quot;

The example in 3 Full-Text Selections uses the following XML.

&lt;books&gt;
  &lt;book number=&quot;1&quot;&gt;
    &lt;title shortTitle=&quot;Improving Web Site Usability&quot;&gt;Improving  
        the Usability of a Web Site Through Expert Reviews and
        Usability Testing&lt;/title&gt;
    &lt;author&gt;Millicent Marigold&lt;/author&gt;
    &lt;author&gt;Montana Marigold&lt;/author&gt;
    &lt;editor&gt;Véra Tudor-Medina&lt;/editor&gt;
    &lt;content&gt;
      &lt;p&gt;The usability of a Web site is how well the  
          site supports the users in achieving specified  
          goals. A Web site should facilitate learning,  
          and enable efficient and effective task  
          completion, while propagating few errors.
      &lt;/p&gt;
      &lt;note&gt;This book has been approved by the Web Site  
          Users Association.
      &lt;/note&gt;
    &lt;/content&gt;
  &lt;/book&gt;
&lt;/books&gt;

Following the rule for sentence breaking from the test suite guidelines, test examples-364-2, derived from section 3.6.4 of the specification, appears to be incorrect.  The specification says:

The following expression returns true, because the tokens &quot;usability&quot; and &quot;Marigold&quot; are contained within different sentences:

//book contains text &quot;usability&quot; ftand &quot;Marigold&quot; different sentence

However, the first sentence break appears after the word &quot;goals&quot;, so the two words only ever appear in the same sentence.

There is no suggestion in the text that the beginning (end) of a paragraph necessarily starts (ends) a sentence.
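
As a rough illustration, here is a minimal Python sketch of the two readings. It is not the test suite's tokenizer: the function names, the hand-flattened string value of the book element, and the placeholder character that marks the start of the p element are all assumptions made for this example.

```python
import re

# Hypothetical placeholder marking where the p element starts; the end of
# the p element coincides with a period break, so it is not marked separately.
PARAGRAPH_MARK = "\u00b6"

# String value of the sample book element, with markup removed and white
# space normalized by hand.
TEXT = (
    "Improving the Usability of a Web Site Through Expert Reviews and "
    "Usability Testing Millicent Marigold Montana Marigold "
    "V\u00e9ra Tudor-Medina " + PARAGRAPH_MARK +
    "The usability of a Web site is how well the site supports the users "
    "in achieving specified goals. A Web site should facilitate learning, "
    "and enable efficient and effective task completion, while propagating "
    "few errors. This book has been approved by the Web Site Users "
    "Association. "
)


def split_sentences(text, paragraph_breaks_sentence):
    """Split on a period followed by white space; optionally also on the mark."""
    if paragraph_breaks_sentence:
        pattern = r"\.\s+|" + PARAGRAPH_MARK
    else:
        text = text.replace(PARAGRAPH_MARK, "")
        pattern = r"\.\s+"
    return re.split(pattern, text)


def sentence_ids(word, sentences):
    """Indexes of the sentences that contain the word (case-insensitive)."""
    return {
        index
        for index, sentence in enumerate(sentences)
        if word in re.findall(r"\w+", sentence.lower())
    }


def different_sentence(first, second, sentences):
    """True if some occurrence of each word lies in a different sentence."""
    return any(
        i != j
        for i in sentence_ids(first, sentences)
        for j in sentence_ids(second, sentences)
    )


period_only = split_sentences(TEXT, False)
with_paragraphs = split_sentences(TEXT, True)
print(different_sentence("usability", "marigold", period_only))
print(different_sentence("usability", "marigold", with_paragraphs))
```

Under these assumptions the sketch prints False for the period-only rule and True once the start of the paragraph is also treated as a sentence break, which is the result the specification states.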

It is also unclear how paragraph boundaries are identified.  Consider the following input:

&lt;root&gt;
  A &lt;p&gt;B&lt;/p&gt; C
&lt;/root&gt;

I can see three possibilities:

1.  There are three paragraphs: one containing A, one containing B and one containing C.
2.  There are two paragraphs: one containing A, one containing B C.
3.  There are two paragraphs: one containing A B, one containing C.

It is not clear from the specification which interpretation is correct.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>45905</commentid>
    <comment_count>1</comment_count>
    <who name="Michael Dyck">jmdyck</who>
    <bug_when>2011-02-21 21:18:08 +0000</bug_when>
    <thetext>(personal response:)

(In reply to comment #0)
&gt; In Section 3. Full-Text, the text
&gt; 
&gt; &quot;This sample tokenization uses white space, punctuation and XML tags as
&gt; word-breakers and &lt;p&gt; for paragraph boundaries. The results may be different
&gt; for other tokenizations.&quot;
&gt; 
&gt; fails to state what rule has been used to identify sentence boundaries.

Hm, right. Since we have some examples involving sentences (in section 3.6.4), we should probably copy the text you quoted from the test suite&apos;s guidelines.

&gt; There is no suggestion in the text that the beginning (end) of a paragraph
&gt; necessarily starts (ends) a sentence.

We should probably add that to the description of the sample tokenization.

&gt; It is also unclear how paragraph boundaries are identified.  Consider the
&gt; following input:
&gt; 
&gt; &lt;root&gt;
&gt;   A &lt;p&gt;B&lt;/p&gt; C
&gt; &lt;/root&gt;
&gt; 
&gt; I can see three possibilities:
&gt; 
&gt; 1.  There are three paragraphs: one containing A, one containing B and one
&gt; containing C.
&gt; 2.  There are two paragraphs: one containing A, one containing B C.
&gt; 3.  There are two paragraphs: one containing A B, one containing C.
&gt; 
&gt; It is not clear from the specification which interpretation is correct.

I think they&apos;re all conformant (and there are perhaps other possibilities). It&apos;s up to each implementation to indicate how it identifies paragraph boundaries (if it supports paragraphs).

In the sample tokenization, I&apos;d say it&apos;s clear that A and B are not in the same paragraph (because &lt;p&gt; is a &quot;paragraph boundary&quot;), so #3 is out. I&apos;m not sure it&apos;s necessary to describe the sample tokenization precisely enough to distinguish between #1 and #2 -- do you know of any examples or tests where it makes a difference to the result?</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>45907</commentid>
    <comment_count>2</comment_count>
    <who name="Liam R E Quin">liam</who>
    <bug_when>2011-02-21 21:28:27 +0000</bug_when>
    <thetext>Note, in some fields of discourse a sentence can indeed span multiple paragraph-like objects - e.g. poetry, or Biblical verses.

I don&apos;t think it necessary to forbid this.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>45909</commentid>
    <comment_count>3</comment_count>
    <who name="Michael Dyck">jmdyck</who>
    <bug_when>2011-02-21 22:31:10 +0000</bug_when>
    <thetext>&gt; I don&apos;t think it necessary to forbid this.

The spec doesn&apos;t forbid implementations from supporting such definitions of paragraph and sentence, and the comments above are not suggesting it should.

This issue is about the sample tokenization, which does not affect what the spec allows or disallows of implementations.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>45917</commentid>
    <comment_count>4</comment_count>
    <who name="Tim Mills">tim</who>
    <bug_when>2011-02-22 08:45:02 +0000</bug_when>
    <thetext>&gt; Do you know of any examples or tests where it makes a difference to the result?

I&apos;m afraid I can&apos;t name one, but I&apos;m positive there is one as I had to fiddle around with our tokenizer to match the behaviour expected by the test suite.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>45919</commentid>
    <comment_count>5</comment_count>
    <who name="Mary Holstege">holstege</who>
    <bug_when>2011-02-22 15:24:06 +0000</bug_when>
    <thetext>Our spec also says that words, sentences, and paragraphs form a hierarchy, so a paragraph break also implies a sentence break.  Since the instructions say that &lt;p&gt; is a paragraph break, there is a sentence boundary at the start of the first &lt;p&gt; element.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>45922</commentid>
    <comment_count>6</comment_count>
    <who name="Michael Dyck">jmdyck</who>
    <bug_when>2011-02-22 16:13:40 +0000</bug_when>
    <thetext>(In reply to comment #5)
&gt; Our spec also says that words, sentences, and paragraphs form a hierarchy,

My mistake. The relevant text is in 4.1 Tokenization, point 6 (&quot;The tokenizer MUST preserve the containment hierarchy&quot;). So the spec *does* forbid what Liam suggests in comment #2.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>46147</commentid>
    <comment_count>7</comment_count>
    <who name="Mary Holstege">holstege</who>
    <bug_when>2011-03-01 18:06:43 +0000</bug_when>
    <thetext>The WG agrees to resolve this bug by adding clarifying text to section 3 of the specification. We believe the instructions for the test suite already say this.

Please indicate your satisfaction with this resolution by marking the bug as CLOSED.</thetext>
  </long_desc>
      
      

    </bug>

</bugzilla>