<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "https://www.w3.org/Bugs/Public/page.cgi?id=bugzilla.dtd">

<bugzilla version="5.0.4"
          urlbase="https://www.w3.org/Bugs/Public/"
          
          maintainer="sysbot+bugzilla@w3.org"
>

    <bug>
          <bug_id>5251</bug_id>
          
          <creation_ts>2007-11-06 00:24:24 +0000</creation_ts>
          <short_desc>[FO] Minimal match in starts-with(), ends-with()</short_desc>
          <delta_ts>2008-03-11 13:56:12 +0000</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>1</classification_id>
          <classification>Unclassified</classification>
          <product>XPath / XQuery / XSLT</product>
          <component>Functions and Operators 1.0</component>
          <version>Recommendation</version>
          <rep_platform>PC</rep_platform>
          <op_sys>Windows XP</op_sys>
          <bug_status>CLOSED</bug_status>
          <resolution>FIXED</resolution>
          
          
          <bug_file_loc></bug_file_loc>
          <status_whiteboard></status_whiteboard>
          <keywords></keywords>
          <priority>P2</priority>
          <bug_severity>normal</bug_severity>
          <target_milestone>---</target_milestone>
          
          
          <everconfirmed>1</everconfirmed>
          <reporter name="Michael Kay">mike</reporter>
          <assigned_to name="Michael Kay">mike</assigned_to>
          
          
          <qa_contact name="Mailing list for public feedback on specs from XSL and XML Query WGs">public-qt-comments</qa_contact>

      

      

      

          <comment_sort_order>oldest_to_newest</comment_sort_order>  
          <long_desc isprivate="0" >
    <commentid>17601</commentid>
    <comment_count>0</comment_count>
    <who name="Michael Kay">mike</who>
    <bug_when>2007-11-06 00:24:24 +0000</bug_when>
    <thetext>The rules for starts-with say that the value of $arg1 must start with a sequence of collation units that provides a minimal match to the collation units of $arg2.

Consider starts-with(&quot; banana&quot;, &quot; b&quot;) where space is an ignorable collation unit. There is a match at the start of the string, but it is not a minimal match. The first minimal match starts at the second character, &quot;b&quot;. It therefore seems that starts-with() should return false. This is counter-intuitive, and surely not what the working group intended.

The same applies to ends-with(). I don&apos;t think that the other substring-matching functions (contains, substring-before, substring-after) have similar problems.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>18294</commentid>
    <comment_count>1</comment_count>
    <who name="Michael Kay">mike</who>
    <bug_when>2008-01-15 00:03:39 +0000</bug_when>
    <thetext>There are a couple of further complications here.

Firstly, the definition of &quot;minimal match&quot; in Unicode UTS #10 appears to be incorrect. It states &quot;The match is minimal if for all positive i and j, there is no match at Q[s+i,e-j]. In such a case, we also say that P minimal matchs at Q[s,e].&quot; which would imply that a match at Q[2,4] is minimal even if there is a match at Q[3,4] or at Q[2,3] - it should allow either i or j (but not both) to be zero. I have raised this as a Unicode error report. I&apos;ll get around this by not referring normatively to the Unicode definition.

Secondly, the definition of minimal match is parameterized by a &quot;boundary condition&quot; B, which we have not specified. An example of such a boundary condition is a requirement that the match occur at a grapheme boundary, word boundary, or sentence boundary. I believe it&apos;s our intention that there should be no boundary constraints on the match.

Thirdly, F+O section 7.5 states: &quot;In the definitions below, we say that $arg1 contains $arg2 at positions m through n if the collation units corresponding to characters in positions m to n of $arg1 are the same as the collation units corresponding to all the characters of $arg2 modulo ignorable collation units. In the simple case of the Unicode code point collation, the collation units are the same as the characters of the string. See [Unicode Collation Algorithm] for a detailed discussion of substring matching.&quot; However, the text below never relies on this definition; instead it refers directly to the concept of &quot;minimal match&quot;.

PROPOSAL

1. Replace the cited definition in section 7.5 by: 

&quot;[Definition: following [Unicode Collation Algorithm], we say that $arg1 *matches* $arg2 at positions M through M+L under collation C if fn:compare(fn:substring($arg1, M, L), $arg2, C) = 0, where M&gt;=0 and M+L&lt;=fn:string-length($arg1)] [Definition: we say that $arg1 *minimally matches* $arg2 at positions M through N if $arg1 *matches* $arg2 at positions M through N and if there are no non-negative integers i and j, not both equal to zero, such that $arg1 *matches* $arg2 at positions M+i through N-j.]

2. In all five functions, delete the note referring to &quot;minimal match&quot;.

3. In contains(), replace the Summary with: 

&quot;Summary: Returns an xs:boolean indicating whether or not the value of $arg1 contains the value of $arg2 (at the beginning, at the end, or anywhere within).

The result is true if and only if there is some pair of values M, N such that $arg1 matches $arg2 at positions M through N under the requested collation.&quot;

4. In starts-with(), replace the Summary with:

Summary: Returns an xs:boolean indicating whether or not the value of $arg1 starts with the value of $arg2.

The result is true if and only if there is some value N such that $arg1 matches $arg2 at positions 1 through N under the requested collation.&quot;

5. In ends-with(), replace the Summary with:

Summary: Returns an xs:boolean indicating whether or not the value of $arg1 ends with the value of $arg2.

The result is true if and only if there is some value M such that $arg1 matches $arg2 at positions M through N under the requested collation, where N is equal to fn:string-length($arg2).&quot;

6. In substring-before, replace the Summary with:

Summary: Returns the substring of the value of $arg1 that precedes the first occurrence of $arg2.

If fn:contains($arg1, $arg2) is true then the function returns  fn:substring($arg1, 1, M), where M is the smallest integer such that $arg1 minimally matches $arg2 at positions M through N for some N. Otherwise the function returns the zero-length string.

Delete the now-redundant sentence &quot;If the value of $arg1 does not contain a string that is equal to the value of $arg2, then the function returns the zero-length string.&quot;

7. In substring-after, replace the Summary with:

Summary: Returns the substring of the value of $arg1 that follows the first occurrence of $arg2.

If fn:contains($arg1, $arg2) is true then the function returns substring($arg1, N+1) where N is the smallest integer such that $arg1 minimally matches $arg2 at positions M through N for some M. Otherwise the function returns the zero-length string.

Delete the now-redundant sentence &quot;If the value of $arg1 does not contain a string that is equal to the value of $arg2, then the function returns the zero-length string.&quot;

8. Add examples to starts-with() and ends-with() indicating the effect if $arg1 starts/ends with an ignorable.
</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>18311</commentid>
    <comment_count>2</comment_count>
    <who name="Michael Kay">mike</who>
    <bug_when>2008-01-15 11:06:42 +0000</bug_when>
    <thetext>After writing the proposal in comment #1 and sleeping on it, I realize that the approach of defining these functions in terms of simpler functions such as substring() and concat() could be taken further. Without relying on any definition of &quot;match&quot; or &quot;minimal match&quot;, we can define the functions contains(), starts-with(), and ends-with() as follows:

contains(): return the result of the XPath expression:

some $M in 0 to string-length($arg1), $N in 0 to string-length($arg1) satisfies compare(substring($arg1, $M, $N), $arg2, $coll) eq 0

starts-with(): return the result of the XPath expression

some $N in 0 to string-length($arg1) satisfies compare(substring($arg1, 1, $N), $arg2, $coll) eq 0

ends-with(): return the result of the XPath expression

some $N in 0 to string-length($arg1) satisfies compare(substring($arg1, $N), $arg2, $coll) eq 0

substring-before(): return the result of the XQuery expression:

  if ((contains($arg1, $arg2, $coll))
  then
    let $Z := string-length($arg1)
    let $M := min((1 to $Z)[some $L in 0 to $Z satisfies compare(substring($arg1, ., $L), $arg2, $coll) eq 0])
        (: M is the position of the start of the first match :)
    let $L := max((1 to $Z)[compare(substring($arg1, $M, .), $arg2, $coll) eq 0])
        (: L is the length of the longest match at this position :) 
    let $I := max((0 to $L)[compare(substring($arg1, $M + $I, $L - $I), $arg2, $coll) eq 0]
        (: I is the number of ignorable characters at the start of this match,
           so M+I is the position of the start of the minimal match :)
    return substring($arg1, 1, $M + $I - 1).
        (: return the substring containing all characters before the start of the minimal match :)
  else &quot;&quot;

substring-after(): : return the result of the XQuery expression:

  if ((contains($arg1, $arg2, $coll))
  then
    let $Z := string-length($arg1)
    let $M := min((1 to $Z)[some $L in 0 to $Z satisfies compare(substring($arg1, ., $L), $arg2, $coll) eq 0])
        (: M is the position of the start of the first match :)
    let $L := max((1 to $Z)[compare(substring($arg1, $M, .), $arg2, $coll) eq 0])
        (: L is the length of the longest match at this position :)  
    let $I := max((0 to $L)[compare(substring($arg1, $M, $L - $I), $arg2, $coll) eq 0]
        (: I is the number of ignorable characters at the endof this match,
           so M+L-I is the position of the end of the minimal match :)
    return substring($arg1, $M + $L - $I)
        (: return all characters after the end of the first minimal match :)
  else &quot;&quot;

</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>18351</commentid>
    <comment_count>3</comment_count>
    <who name="Colin Adams">colin</who>
    <bug_when>2008-01-16 07:43:30 +0000</bug_when>
    <thetext>By defining XPath functions in terms of XQuery expressions, you introduce a dependency of the F&amp;O document on the XQuery document.

I find this unfortunate, as that would make it necessary to learn XQuery in order to understand what to expect from these functions, or how to implement them.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>18414</commentid>
    <comment_count>4</comment_count>
    <who name="Michael Kay">mike</who>
    <bug_when>2008-01-18 09:56:18 +0000</bug_when>
    <thetext>OK, I&apos;ll try again.

The original error is in starts-with() and ends-with(), so let&apos;s leave substring-before() and substring-after() alone. The proposed change then becomes:

1. In 7.5, delete the paragraph starting &quot;In the definitions below, we say that $arg1 contains $arg2 at positions m through n ...&quot; because the paragraph is defining terminology that is never used.

2. In contains():

2a) replace the Summary with: 

&quot;Summary: Returns an xs:boolean indicating whether or not the value of $arg1
contains the value of $arg2 (at the beginning, at the end, or anywhere within).

Let $coll be $collation if specified, or the default collation otherwise. The result of the function is the value of the following expression:

some $M in 1 to string-length($arg1), $N in 0 to string-length($arg1) satisfies
(compare(substring($arg1, $M, $N), $arg2, $coll) eq 0)

2b) delete the note referring to &quot;Minimal match&quot;

3. In starts-with():

3a) replace the Summary with: 

&quot;Summary: Returns an xs:boolean indicating whether or not the value of $arg1
has the value of $arg2 as a leading substring.

Let $coll be $collation if specified, or the default collation otherwise. The result of the function is the value of the following expression:

some $N in 0 to string-length($arg1) satisfies
(compare(substring($arg1, 1, $N), $arg2, $coll) eq 0)

3b) delete the note referring to &quot;Minimal match&quot;

3c) add the example

fn:starts-with ( &quot;-abcdefghi&quot;, &quot;-abc&quot;, &quot;CollationA&quot;) returns true.

[Under the current rules requiring a minimal match, this one returns false)

4. In ends-with():

4a) replace the Summary with: 

&quot;Summary: Returns an xs:boolean indicating whether or not the value of $arg1
has the value of $arg2 as a trailing substring.

Let $coll be $collation if specified, or the default collation otherwise. The result of the function is is the value of the following expression:

some $N in 1 to string-length($arg1)+1 satisfies
(compare(substring($arg1, $N), $arg2, $coll) eq 0)

4b) delete the note referring to &quot;Minimal match&quot;

4c) add the example

fn:ends-with ( &quot;abcdefghi-&quot;, &quot;ghi-&quot;, &quot;CollationA&quot;) returns true.

[Under the current rules requiring a minimal match, this one returns false)</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>19005</commentid>
    <comment_count>5</comment_count>
    <who name="Michael Kay">mike</who>
    <bug_when>2008-02-13 16:23:51 +0000</bug_when>
    <thetext>The proposal was discussed at the telcon on 2008-02-12. There were two main reservations expressed:

(a) the use of compare(substring(xxx)) was not equivalent to the current behaviour because the substring function operates without knowledge of a collation; thus this might find a match where the current spec would not. (See minutes for example)

(b) Jim Melton felt unease about whether this was really a bug: were we sure the status quo wasn&apos;t what the WG intended?

I would like to revise the proposal to make it a very simple change. In starts-with() and ends-with(), (a) change &quot;minimal match&quot; to &quot;match&quot;, and (b) add the examples proposed in comment #4.

The relevant definitions from Unicode are:

DS2. There is a match according to C for P within Q[s,e] if and only if C generates the same sort key for P as for Q[s,e], and the offsets s and e meet the condition B.

DS4. The match is minimal if for all positive i and j, there is no match at Q[s+i,e-j]. In such a case, we also say that P minimal matchs at Q[s,e].

Here C is the collation, B in our case is true so long as we are on a boundary between collation units (we don&apos;t say this very clearly, but it&apos;s the best interpretation), P is our second argument, and Q is the first argument. s and e are character positions.

Note that DS4 is incorrect, is should only require one of i and j to be positive, the other can be zero. This bug has been reported and accepted.

The current rule (for starts-with) requires a minimal match at the start of the arg1 string. This means that if &quot;-&quot; is ignorable (as it often is), then starts-with(&apos;-1&apos;, &apos;-1&apos;) is false: the possible matches are on &apos;-1&apos; and &apos;1&apos;; the first match doesn&apos;t count because it is not minimal, and the second doesn&apos;t count because it is not at the start of the string. I find it impossible to believe that the WG intended this: it&apos;s surely a reasonable expectation that every string starts with itself and ends with itself.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>19008</commentid>
    <comment_count>6</comment_count>
    <who name="Mary Holstege">holstege</who>
    <bug_when>2008-02-13 17:57:12 +0000</bug_when>
    <thetext>
I think this is a good solution.  Are you proposing that we add something
explicit about being on a collation unit boundary?</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>19009</commentid>
    <comment_count>7</comment_count>
    <who name="Michael Kay">mike</who>
    <bug_when>2008-02-13 18:46:30 +0000</bug_when>
    <thetext>&gt;Are you proposing that we add something explicit about being on a collation unit boundary?

I would certainly like to tie the F+O reference to &quot;[minimal] match&quot; more closely to the terminology of the UCA rules.

The UCA rules are introduced by &quot;Suppose there is a collation C, a pattern string P and a target string Q, and a boundary condition B.&quot; So I think we should probably say something like the following (taking contains() as our example): 

Returns true if and only if there is a minimal match, as defined in [UCA], under the following conditions:

* The pattern string P is $arg2

* The target string Q is $arg1

* The collation C is $collation if specified, or the default collation otherwise

* The boundary condition B is true between two characters that belong to different collation units, and is false between two characters that belong to the same collation unit.

This is essentially what the current F+O spec is trying to say when it talks about a &quot;sequence of collation units&quot; matching: it just doesn&apos;t use the UCA terminology.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>19406</commentid>
    <comment_count>8</comment_count>
    <who name="Michael Kay">mike</who>
    <bug_when>2008-03-11 13:55:47 +0000</bug_when>
    <thetext>At the joint meeting on 2008-03-04 it was agreed to issue an erratum incorporating the changes described in proposals #5 and #7. 

Erratum E17 has been drafted to reflect this resolution.</thetext>
  </long_desc>
      
      

    </bug>

</bugzilla>