This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 5251 - [FO] Minimal match in starts-with(), ends-with()
Summary: [FO] Minimal match in starts-with(), ends-with()
Status: CLOSED FIXED
Alias: None
Product: XPath / XQuery / XSLT
Classification: Unclassified
Component: Functions and Operators 1.0 (show other bugs)
Version: Recommendation
Hardware: PC Windows XP
: P2 normal
Target Milestone: ---
Assignee: Michael Kay
QA Contact: Mailing list for public feedback on specs from XSL and XML Query WGs
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2007-11-06 00:24 UTC by Michael Kay
Modified: 2008-03-11 13:56 UTC (History)
0 users

See Also:


Attachments

Description Michael Kay 2007-11-06 00:24:24 UTC
The rules for starts-with say that the value of $arg1 must start with a sequence of collation units that provides a minimal match to the collation units of $arg2.

Consider starts-with(" banana", " b") where space is an ignorable collation unit. There is a match at the start of the string, but it is not a minimal match. The first minimal match starts at the second character, "b". It therefore seems that starts-with() should return false. This is counter-intuitive, and surely not what the working group intended.

The same applies to ends-with(). I don't think that the other substring-matching functions (contains, substring-before, substring-after) have similar problems.
Comment 1 Michael Kay 2008-01-15 00:03:39 UTC
There are a couple of further complications here.

Firstly, the definition of "minimal match" in Unicode UTS #10 appears to be incorrect. It states "The match is minimal if for all positive i and j, there is no match at Q[s+i,e-j]. In such a case, we also say that P minimal matchs at Q[s,e]." which would imply that a match at Q[2,4] is minimal even if there is a match at Q[3,4] or at Q[2,3] - it should allow either i or j (but not both) to be zero. I have raised this as a Unicode error report. I'll get around this by not referring normatively to the Unicode definition.

Secondly, the definition of minimal match is parameterized by a "boundary condition" B, which we have not specified. An example of such a boundary condition is a requirement that the match occur at a grapheme boundary, word boundary, or sentence boundary. I believe it's our intention that there should be no boundary constraints on the match.

Thirdly, F+O section 7.5 states: "In the definitions below, we say that $arg1 contains $arg2 at positions m through n if the collation units corresponding to characters in positions m to n of $arg1 are the same as the collation units corresponding to all the characters of $arg2 modulo ignorable collation units. In the simple case of the Unicode code point collation, the collation units are the same as the characters of the string. See [Unicode Collation Algorithm] for a detailed discussion of substring matching." However, the text below never relies on this definition; instead it refers directly to the concept of "minimal match".

PROPOSAL

1. Replace the cited definition in section 7.5 by: 

"[Definition: following [Unicode Collation Algorithm], we say that $arg1 *matches* $arg2 at positions M through M+L under collation C if fn:compare(fn:substring($arg1, M, L), $arg2, C) = 0, where M>=0 and M+L<=fn:string-length($arg1)] [Definition: we say that $arg1 *minimally matches* $arg2 at positions M through N if $arg1 *matches* $arg2 at positions M through N and if there are no non-negative integers i and j, not both equal to zero, such that $arg1 *matches* $arg2 at positions M+i through N-j.]

2. In all five functions, delete the note referring to "minimal match".

3. In contains(), replace the Summary with: 

"Summary: Returns an xs:boolean indicating whether or not the value of $arg1 contains the value of $arg2 (at the beginning, at the end, or anywhere within).

The result is true if and only if there is some pair of values M, N such that $arg1 matches $arg2 at positions M through N under the requested collation."

4. In starts-with(), replace the Summary with:

Summary: Returns an xs:boolean indicating whether or not the value of $arg1 starts with the value of $arg2.

The result is true if and only if there is some value N such that $arg1 matches $arg2 at positions 1 through N under the requested collation."

5. In ends-with(), replace the Summary with:

Summary: Returns an xs:boolean indicating whether or not the value of $arg1 ends with the value of $arg2.

The result is true if and only if there is some value M such that $arg1 matches $arg2 at positions M through N under the requested collation, where N is equal to fn:string-length($arg2)."

6. In substring-before, replace the Summary with:

Summary: Returns the substring of the value of $arg1 that precedes the first occurrence of $arg2.

If fn:contains($arg1, $arg2) is true then the function returns  fn:substring($arg1, 1, M), where M is the smallest integer such that $arg1 minimally matches $arg2 at positions M through N for some N. Otherwise the function returns the zero-length string.

Delete the now-redundant sentence "If the value of $arg1 does not contain a string that is equal to the value of $arg2, then the function returns the zero-length string."

7. In substring-after, replace the Summary with:

Summary: Returns the substring of the value of $arg1 that follows the first occurrence of $arg2.

If fn:contains($arg1, $arg2) is true then the function returns substring($arg1, N+1) where N is the smallest integer such that $arg1 minimally matches $arg2 at positions M through N for some M. Otherwise the function returns the zero-length string.

Delete the now-redundant sentence "If the value of $arg1 does not contain a string that is equal to the value of $arg2, then the function returns the zero-length string."

8. Add examples to starts-with() and ends-with() indicating the effect if $arg1 starts/ends with an ignorable.
Comment 2 Michael Kay 2008-01-15 11:06:42 UTC
After writing the proposal in comment #1 and sleeping on it, I realize that the approach of defining these functions in terms of simpler functions such as substring() and concat() could be taken further. Without relying on any definition of "match" or "minimal match", we can define the functions contains(), starts-with(), and ends-with() as follows:

contains(): return the result of the XPath expression:

some $M in 0 to string-length($arg1), $N in 0 to string-length($arg1) satisfies compare(substring($arg1, $M, $N), $arg2, $coll) eq 0

starts-with(): return the result of the XPath expression

some $N in 0 to string-length($arg1) satisfies compare(substring($arg1, 1, $N), $arg2, $coll) eq 0

ends-with(): return the result of the XPath expression

some $N in 0 to string-length($arg1) satisfies compare(substring($arg1, $N), $arg2, $coll) eq 0

substring-before(): return the result of the XQuery expression:

  if ((contains($arg1, $arg2, $coll))
  then
    let $Z := string-length($arg1)
    let $M := min((1 to $Z)[some $L in 0 to $Z satisfies compare(substring($arg1, ., $L), $arg2, $coll) eq 0])
        (: M is the position of the start of the first match :)
    let $L := max((1 to $Z)[compare(substring($arg1, $M, .), $arg2, $coll) eq 0])
        (: L is the length of the longest match at this position :) 
    let $I := max((0 to $L)[compare(substring($arg1, $M + $I, $L - $I), $arg2, $coll) eq 0]
        (: I is the number of ignorable characters at the start of this match,
           so M+I is the position of the start of the minimal match :)
    return substring($arg1, 1, $M + $I - 1).
        (: return the substring containing all characters before the start of the minimal match :)
  else ""

substring-after(): : return the result of the XQuery expression:

  if ((contains($arg1, $arg2, $coll))
  then
    let $Z := string-length($arg1)
    let $M := min((1 to $Z)[some $L in 0 to $Z satisfies compare(substring($arg1, ., $L), $arg2, $coll) eq 0])
        (: M is the position of the start of the first match :)
    let $L := max((1 to $Z)[compare(substring($arg1, $M, .), $arg2, $coll) eq 0])
        (: L is the length of the longest match at this position :)  
    let $I := max((0 to $L)[compare(substring($arg1, $M, $L - $I), $arg2, $coll) eq 0]
        (: I is the number of ignorable characters at the endof this match,
           so M+L-I is the position of the end of the minimal match :)
    return substring($arg1, $M + $L - $I)
        (: return all characters after the end of the first minimal match :)
  else ""

Comment 3 Colin Adams 2008-01-16 07:43:30 UTC
By defining XPath functions in terms of XQuery expressions, you introduce a dependency of the F&O document on the XQuery document.

I find this unfortunate, as that would make it necessary to learn XQuery in order to understand what to expect from these functions, or how to implement them.
Comment 4 Michael Kay 2008-01-18 09:56:18 UTC
OK, I'll try again.

The original error is in starts-with() and ends-with(), so let's leave substring-before() and substring-after() alone. The proposed change then becomes:

1. In 7.5, delete the paragraph starting "In the definitions below, we say that $arg1 contains $arg2 at positions m through n ..." because the paragraph is defining terminology that is never used.

2. In contains():

2a) replace the Summary with: 

"Summary: Returns an xs:boolean indicating whether or not the value of $arg1
contains the value of $arg2 (at the beginning, at the end, or anywhere within).

Let $coll be $collation if specified, or the default collation otherwise. The result of the function is the value of the following expression:

some $M in 1 to string-length($arg1), $N in 0 to string-length($arg1) satisfies
(compare(substring($arg1, $M, $N), $arg2, $coll) eq 0)

2b) delete the note referring to "Minimal match"

3. In starts-with():

3a) replace the Summary with: 

"Summary: Returns an xs:boolean indicating whether or not the value of $arg1
has the value of $arg2 as a leading substring.

Let $coll be $collation if specified, or the default collation otherwise. The result of the function is the value of the following expression:

some $N in 0 to string-length($arg1) satisfies
(compare(substring($arg1, 1, $N), $arg2, $coll) eq 0)

3b) delete the note referring to "Minimal match"

3c) add the example

fn:starts-with ( "-abcdefghi", "-abc", "CollationA") returns true.

[Under the current rules requiring a minimal match, this one returns false)

4. In ends-with():

4a) replace the Summary with: 

"Summary: Returns an xs:boolean indicating whether or not the value of $arg1
has the value of $arg2 as a trailing substring.

Let $coll be $collation if specified, or the default collation otherwise. The result of the function is is the value of the following expression:

some $N in 1 to string-length($arg1)+1 satisfies
(compare(substring($arg1, $N), $arg2, $coll) eq 0)

4b) delete the note referring to "Minimal match"

4c) add the example

fn:ends-with ( "abcdefghi-", "ghi-", "CollationA") returns true.

[Under the current rules requiring a minimal match, this one returns false)
Comment 5 Michael Kay 2008-02-13 16:23:51 UTC
The proposal was discussed at the telcon on 2008-02-12. There were two main reservations expressed:

(a) the use of compare(substring(xxx)) was not equivalent to the current behaviour because the substring function operates without knowledge of a collation; thus this might find a match where the current spec would not. (See minutes for example)

(b) Jim Melton felt unease about whether this was really a bug: were we sure the status quo wasn't what the WG intended?

I would like to revise the proposal to make it a very simple change. In starts-with() and ends-with(), (a) change "minimal match" to "match", and (b) add the examples proposed in comment #4.

The relevant definitions from Unicode are:

DS2. There is a match according to C for P within Q[s,e] if and only if C generates the same sort key for P as for Q[s,e], and the offsets s and e meet the condition B.

DS4. The match is minimal if for all positive i and j, there is no match at Q[s+i,e-j]. In such a case, we also say that P minimal matchs at Q[s,e].

Here C is the collation, B in our case is true so long as we are on a boundary between collation units (we don't say this very clearly, but it's the best interpretation), P is our second argument, and Q is the first argument. s and e are character positions.

Note that DS4 is incorrect, is should only require one of i and j to be positive, the other can be zero. This bug has been reported and accepted.

The current rule (for starts-with) requires a minimal match at the start of the arg1 string. This means that if "-" is ignorable (as it often is), then starts-with('-1', '-1') is false: the possible matches are on '-1' and '1'; the first match doesn't count because it is not minimal, and the second doesn't count because it is not at the start of the string. I find it impossible to believe that the WG intended this: it's surely a reasonable expectation that every string starts with itself and ends with itself.
Comment 6 Mary Holstege 2008-02-13 17:57:12 UTC
I think this is a good solution.  Are you proposing that we add something
explicit about being on a collation unit boundary?
Comment 7 Michael Kay 2008-02-13 18:46:30 UTC
>Are you proposing that we add something explicit about being on a collation unit boundary?

I would certainly like to tie the F+O reference to "[minimal] match" more closely to the terminology of the UCA rules.

The UCA rules are introduced by "Suppose there is a collation C, a pattern string P and a target string Q, and a boundary condition B." So I think we should probably say something like the following (taking contains() as our example): 

Returns true if and only if there is a minimal match, as defined in [UCA], under the following conditions:

* The pattern string P is $arg2

* The target string Q is $arg1

* The collation C is $collation if specified, or the default collation otherwise

* The boundary condition B is true between two characters that belong to different collation units, and is false between two characters that belong to the same collation unit.

This is essentially what the current F+O spec is trying to say when it talks about a "sequence of collation units" matching: it just doesn't use the UCA terminology.
Comment 8 Michael Kay 2008-03-11 13:55:47 UTC
At the joint meeting on 2008-03-04 it was agreed to issue an erratum incorporating the changes described in proposals #5 and #7. 

Erratum E17 has been drafted to reflect this resolution.