This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 1850 - [F&O] how do ranges work in case-insensitive mode?
Summary: [F&O] how do ranges work in case-insensitive mode?
Status: CLOSED FIXED
Alias: None
Product: XPath / XQuery / XSLT
Classification: Unclassified
Component: Functions and Operators 1.0
Version: Last Call drafts
Hardware: PC Windows 2000
Importance: P2 normal
Target Milestone: ---
Assignee: Ashok Malhotra
QA Contact: Mailing list for public feedback on specs from XSL and XML Query WGs
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-08-16 19:31 UTC by Fred Zemke
Modified: 2005-09-29 12:52 UTC

Description Fred Zemke 2005-08-16 19:31:08 UTC
Section 7.6.1.1 Flags describes a flag "i" which places F&O pattern searches
into case-insensitive mode.  Patterns are mostly described in XML Schema, and
include a capability to express a range in a bracket expression.  Since XML
Schema does not have case-insensitive matches, it does not define how a
case-insensitive range works.  This needs to be specified here.  

I can think of at least three possible definitions.  

1. The first algorithm is to form the case-insensitive range of 
the first and second operands, then add in anything that is a case-insensitive
version of something in this range.

2. In  the second algorithm, let f be the lowercase version of the first
operand, F be the uppercase equivalent of f, s be the lowercase version of
the second operand, and S the uppercase equivalent of s.  
Let m be the minimum of f and F, and let M be the maximum of s and S.  
The range is m-M, everything between m and M inclusive.

3. The third algorithm is the case-sensitive range f-s union with the 
case-sensitive range F-S.
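
The three candidate definitions do not produce the same sets. The following is a minimal Python sketch, assuming single-character operands and using Python's str.lower() and str.upper() as a stand-in for Unicode case mapping. The names span, algo1, algo2 and algo3 are hypothetical labels for illustration and come from no specification.

```python
def span(lo, hi):
    """Case-sensitive range lo-hi as a set of characters (empty if lo > hi)."""
    return {chr(cp) for cp in range(ord(lo), ord(hi) + 1)}

def algo1(first, second):
    # Plain range, plus every single-character case mapping of its members.
    base = span(first, second)
    extra = {v for ch in base for v in (ch.lower(), ch.upper()) if len(v) == 1}
    return base | extra

def algo2(first, second):
    # One contiguous range running from min(f, F) to max(s, S).
    f, s = first.lower(), second.lower()
    F, S = f.upper(), s.upper()
    return span(min(f, F), max(s, S))

def algo3(first, second):
    # Union of the lowercase range f-s and the uppercase range F-S.
    f, s = first.lower(), second.lower()
    F, S = f.upper(), s.upper()
    return span(f, s) | span(F, S)

for name, fn in (("1", algo1), ("2", algo2), ("3", algo3)):
    print(name, "".join(sorted(fn("a", "h"))))
```

For [a-h], algorithms 1 and 3 agree, while algorithm 2 also takes in everything between "A" and "h", including punctuation such as "_". For a range such as [X-c] that straddles the two cases, algorithm 3 yields the empty set in this sketch, whereas algorithm 1 does not.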
Comment 1 Michael Kay 2005-08-17 13:11:32 UTC
Excellent point.

I think the rule that works best is to expand the range, e.g. [a-h] becomes
[abcdefgh], and then match this with the "i" flag, applying the existing rule in
the spec "a character in the input string matches a character specified by the
pattern if there is a default case mapping between the two characters as defined
in section 3.13 of [The Unicode Standard]." (Is this the same as your first
suggestion?)

As far as I can tell by experiment, this seems to be the way it works in Java
(which is modelled on Perl). 

I'm having a bit more trouble divining the semantics for subtractions and
negative groups: at present in Saxon

  matches('G','[A-Z-[f-h]]','i')
and
  matches('G','[A-Z-[F-H]]','i')

both return true, which is a little surprising, while

  matches('G','[A-Z-[F-Hf-h]]','i')

returns false. And

  matches('G','[^G]','i')  = false
while
  matches('G','[^F-H]','i') = true

I need to do a bit more investigation to see whether it's Java that's behaving
this way, or whether it's a consequence of the way I translate XPath regex to
Java regex syntax (I use James Clark's code for this, modified to handle the
XPath extensions to Schema regex syntax).

If anyone can do some experiments with Perl, that would be useful...

Michael Kay
Comment 2 Michael Kay 2005-08-17 13:21:27 UTC
A cheap temporary fix to some of these problems would be to say that it's an
error to use the "i" flag with a regex that contains a negative character group,
a character class subtraction, or a complemented category escape (such as
"\P{Lu}"). That would keep our options open to get it right in the future.
Comment 3 Michael Kay 2005-08-31 14:43:56 UTC
Here are some observations from Java:

(a) it appears that a character matches a range if any case-variant of the
character matches the range:

   matches("D", "[A-Z]", "i")  = true
   matches("d", "[A-Z]", "i")  = true

(b) this rule also works for subtractions:

   matches("D", "[A-Z-[D]]", "i")  = true
   matches("d", "[A-Z-[D]]", "i")  = true

(c) the rule doesn't work for negative character groups. Here it appears that ^d
removes both "d" and "D" from the group (whereas the rule above would suggest
that it removes neither)

   matches("D", "[^d]", "i")  = false
   matches("d", "[^d]", "i")  = false

(d) it appears that the "i" flag has no effect on character blocks. 

   matches("D", "\p{Lu}", "i") = true
   matches("d", "\p{Lu}", "i") = false
   matches("D", "\P{Lu}", "i") = false
   matches("d", "\P{Lu}", "i") = true

This is a terribly empirical way of approaching a specification!
Comment 4 Mary Holstege 2005-08-31 15:08:36 UTC
I suspect empiricism here is telling us less about the language spec and more
about how carefully the implementors thought all the weird cases through.
It would be interesting to see if different JVMs are consistent here.

I think we can go back to first principles a bit: we say
"In case-insensitive mode, a character in the input string matches a character 
specified by the pattern if there is a default case mapping between the two 
characters as defined in section 3.13 of [The Unicode Standard]."

In the case of a character range, I would take "a character specified by the 
pattern" to be every character in that character range, so if there is a default 
case mapping between the input string character and any of them, it's a match. 
Likewise for negative character ranges and so on.

That is, you don't mess with the pattern, you check the input string with case 
folding against the pattern as written. So I think (* = different from Java
reported results):
   matches("D", "[A-Z]", "i")  = true
   matches("d", "[A-Z]", "i")  = true
 * matches("D", "[A-Z-[D]]", "i")  = false
 * matches("d", "[A-Z-[D]]", "i")  = false

   matches("D", "[^d]", "i")  = false
   matches("d", "[^d]", "i")  = false

   matches("D", "\p{Lu}", "i") = true
 * matches("d", "\p{Lu}", "i") = true
   matches("D", "\P{Lu}", "i") = false
 * matches("d", "\P{Lu}", "i") = false
Comment 5 Liam R E Quin 2005-08-31 20:58:20 UTC
The examples from Mike Kay's comment,
    matches('G','[A-Z-[f-h]]','i')
and matches('G','[A-Z-[F-H]]','i')
are not well-formed in Perl: the operands of "-" must
be a character, not a range.  Perl does not support
range subtraction directly (see below)...

So, [A-Z-[f-h]] ends up matching the literal [f-h]
and nothing else as far as I can tell.

The example
    matches('G','[A-Z-[F-Hf-h]]','i')
is the same, matching the literal string [F-Hf-h]
(I don't think it's specified that it works this
way, so it's a bug that Perl doesn't trap this case
I think)

The example
    matches('G','[^F-H]','i')
does not match in Perl, neither with nor without the /i

Note that the pattern [A-Z] might or might not match both
a and z: a common collation order on Linux at least for case
insensitive matching is aAbBcCdD...zZ, so A-Z excludes the "a".
This doesn't affect Perl by default, as it uses unicode codepoints
unless you put
    use locale;
in your Perl script (see man pages for perlre and perllocale,
or run "perldoc perlre" to see them...)
"G" does not match /[^G]/i in Perl

Perl's nearest equivalent for range subtraction is the
zero-width negative lookahead assertion, (?!e), which matches
only if it is not immediately followed by something that
matches the contained expression e.  Hence,
/(?![f-h])[A-Z]/i
matches b and w but not g or G.
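
The same lookahead idiom can be tried outside Perl. The sketch below uses Python's re module purely as a stand-in (an assumption, since regex engines differ in their case-insensitive handling); the helper name matches is hypothetical.

```python
import re

# Negative lookahead emulating the subtraction [A-Z-[f-h]]
# under case-insensitive matching.
pattern = re.compile(r"(?![f-h])[A-Z]", re.IGNORECASE)

def matches(ch):
    """True if the single character ch matches the whole pattern."""
    return pattern.fullmatch(ch) is not None

for ch in "bwgG":
    print(ch, matches(ch))
```

In Python this accepts "b" and "w" and rejects "g" and "G", in line with the Perl behaviour described above.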
 
I think the real question here is whether a range can introduce
or exclude unexpected characters when case insensitive.  I experimented,
but the version of Perl I'm using doesn't like ranges in character classes
if they are above codepoint 127 decimal for some reason, although it's
otherwise 8-bit clean, and can match explicit characters in classes.
Comment 6 Michael Kay 2005-08-31 21:34:01 UTC
Mary, I'm having trouble understanding exactly what you mean by:

Likewise for negative character ranges and so on.

That is, you don't mess with the pattern, you check the input string with case 
folding against the pattern as written.

I was originally going to propose a spec which might be what you're suggesting:
Under the "i" flag, a string S matches a regex R if there is some case-variant
S' of S such that S' matches R in the absence of the "i" flag. A string S' is a
case-variant of S if the two strings are the same length and there is a default
case mapping between each pair of corresponding characters in the two strings,
as defined in section 3.13 of [The Unicode Standard].

This rule seems nice and simple, but it doesn't appear to be the same as Java or
Perl, and one must ask whether it is (a) usable, and (b) implementable. It
certainly has some surprises, for example "D" matches "[^D]" (because "d"
matches "[^D]").
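
The "some case-variant S' matches" rule can be sketched by brute force. This is an illustration only: it assumes Python's re syntax in place of XPath regex syntax, approximates the Unicode default case mappings with str.lower() and str.upper(), and the names char_variants and matches_i are hypothetical. It is exponential in the length of the string.

```python
import itertools
import re

def char_variants(ch):
    # The character itself plus its single-character lower/upper mappings.
    return {ch} | {v for v in (ch.lower(), ch.upper()) if len(v) == 1}

def matches_i(s, regex):
    """True if some case-variant S' of s matches regex with no 'i' flag."""
    per_char = [sorted(char_variants(c)) for c in s]
    for candidate in itertools.product(*per_char):
        if re.search(regex, "".join(candidate)):
            return True
    return False

print(matches_i("D", "[^D]"))   # the surprising case mentioned above
print(matches_i("d", "[A-Z]"))
```

Both calls report a match under this rule: "D" matches "[^D]" because its variant "d" does.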

I think I will go back to proposing that the tricky cases should be errors. The
rule I propose is: when the "i" flag is used, the regex must not include any of
the following:

* a negative character group
* a character class subtraction
* a category escape (catEsc, complEsc, or charProp)
* any of the multi-character escapes \c, \i, \C, \I
* a back-reference

If any of these is present when the "i" flag is used, error FORGNNNN is raised.

The semantics of the "i" flag is then: A string S matches the regex R under the
"i" flag if there exists a string S' that is a case-variant of S such that S'
matches R in the absence of the "i" flag; with "case-variant" defined as above.

In cases where it is necessary to know which characters matched (for example
when $n appears in the replacement string of fn:replace()), the characters that
matched are those from the original string S, not from S'.

The definition of fn:replace() contains the rule: "If two alternatives within
the pattern both match at the same position in the $input, then the match that
is chosen is the one matched by the first alternative". I think it would be
prudent to relax this rule so that when the "i" flag is used, it is
implementation-dependent which match is chosen. That is, if the input string is
"a" and the regex is "A|a", it's undefined whether the "A" or the "a" is matched.

Michael Kay
Comment 7 Mary Holstege 2005-08-31 22:33:25 UTC
Michael, I think I meant what your "I was going to propose..." text says.
(As usual, of course, you said it better.)

While the apparent difference with Perl and Java might be troubling, I think
what we are discovering here is that they are both woefully underspecified.

I do have at least one data point that the semantics you outline is
implementable in XQuery regular expressions, FWIW.

I am not happy with making these be an error, because we see plenty of scenarios 
where regular expressions are not literals, but constructed, and I think the 
semantics you suggest makes sense, even if it leads to results that may seem 
surprising until you think about it.  I also think that your set of cases is too 
broad: By your rules patterns with such innocuous items as "[^\s]" or "\p{Zs}" 
cause errors in case-insensitive mode. I also don't see why patterns with back
references should be errors. 

It gets to the point where you either have some pretty complex rules about when 
these constructs are "non-confusing" from a case-sensitivity point of view, and 
therefore OK, or you are limiting regular expressions in case-insensitive mode 
almost to the point of uselessness, where the workaround is, with a fair amount 
of pain, to reconstruct essentially the semantics proposed. 

Comment 8 Michael Kay 2005-09-14 12:56:57 UTC
(This is a short proposal, but it's the result of a lot of work - the waste bin
is full of my failed attempts. It's packed with meaning and needs to be read
very carefully, with a close eye on the syntax in Schema Part 2.)


PROPOSAL

The detailed rules for the effect of the "i" flag are as follows. In these
rules, one character is considered to be a *case-variant* of another character
if there is a default case mapping between the two characters as defined in
section 3.13 of [The Unicode Standard]. Note that the case-variants of a
character under this definition are always single characters.

1. When a normal character (Char) is used as an atom, it represents the set
containing that character and all its case-variants. For example, the regular
expression "z" expands to "[zZ]".

2. A character range (charRange) represents the set containing all the
characters that it would match in the absence of the "i" flag, together with
their case-variants. For example, "[A-Z]" expands to "[A-Za-z]". This rule
applies also to a character range used in a character class subtraction
(charClassSub): thus [A-Z-[IO]] expands to [A-Za-z-[IOio]]. It also applies to a
character range used as part of a negative character group: thus [^Q] expands to
[^Qq].

3. A back-reference is compared using case-blind comparison: that is, each
character must either be the same as the corresponding character of the
previously matched string, or must be a case-variant of that character. For
example, the strings "Mum", "mom", "Dad", and "DUD" all match the regular
expression "([md])[aeiou]\1" when the "i" flag is used.

4. All other constructs are unaffected by the "i" flag. For example, "\p{Lu}"
continues to match upper-case letters only.

  
Michael Kay
Comment 9 Mary Holstege 2005-09-14 16:25:48 UTC
First, I'd like to thank Michael for this proposal. It is certainly clear, and
while there are behaviours that are perhaps unexpected, I think that is
inevitable in this area. 

Acknowledging Michael's comments about the overflowing trashbin (and
contributing a few crumpled sheets there myself), I nevertheless find myself
unhappy with talking about "expanding" the regular expression and 
would prefer to shift to speaking about case-folding as applying to how the
input string is matched. 

From an implementation point of view, expanding regular expressions has
to be done on a case-by-case basis (no pun intended!). While it doesn't make it
impossible to cache regular expressions (i.e. pre-analyze and parse them), 
it does make it trickier and less useful to do so, as the regular expression
itself is no longer a sufficient key to what the analyzed regular expression
is. 

A consequence of this shift would be that case-folding would apply uniformly,
so that, for example: 

    fn:matches( "d", "\p{Lu}", "i" ) = fn:matches( "d", "[A-Z]", "i" )

which is not the case under Michael's proposal. I would go on to argue that
it would be good if both of these were true.  One reason for making this so
is that Datatypes says that "\P{Lu}" == [^\p{Lu}] and therefore you get some
odd inconsistencies if you don't apply the case-folding to the category
escapes as well. 

All of which sums up to putting an obligation on me to come up with a 
counter-proposal. 

My general tack on this is to tweak two statements in XML Schema Datatypes
that define what set of strings a character denotes and what set of strings a
character class denotes.  But I think Michael's case by case exposition is
most excellent and clear, and so I continue with that, tweaking the verbiage to
avoid the "expands" phrasing, treating it as clarification because those two 
rules are sufficient, and adding the additional cases that Michael's proposal 
doesn't touch.

COUNTER-PROPOSAL:

The detailed rules for the effect of the "i" flag are as follows. In these
rules, one character is considered to be a *case-variant* of another character
if there is a default case mapping between the two characters as defined in
section 3.13 of [The Unicode Standard]. Note that the case-variants of a
character under this definition are always single characters.

The rules for regular expressions in [XML Schema Part 2: Datatypes Second
Edition] are modified under the influence of the "i" flag in the following way:

1. A normal character c denotes a set of strings that contains one
   single-character string "x" for each character x that is either c or a
   case-variant of c.  

2. A character class C denotes a set of strings that contains one
   single-character string "x" for each character x that is either in the class
   or is a case-variant of some character in the class. 

Specifically, the application of these rules means:
* When a normal character (Char) is used as an atom, it represents the set
  containing that character and all its case-variants. For example, the regular
  expression "z" matches the same set of characters as "[zZ]".

* A character range (charRange) is a character class, and therefore represents
  the set containing all the characters that it would match in the absence of
  the "i" flag, together with their case-variants. For example, "[A-Z]"
  matches the same set of characters as "[A-Za-z]". 

* A character range used in character class subtraction (charClassSub)
  also represents the set containing all the characters that it would match in
  the absence of the "i" flag, together with their case-variants. For example,
  "[A-Z-[IO]]" matches the same set of characters as "[A-Za-z-[IOio]]".

* A negative character group (negCharGroup) is also a character class and
  the same rule applies. For example, "[^Q]" matches the same set of
  characters as "[^Qq]". 

* A category escape (catEsc) is also a character class and the same rule 
  applies.  For example, "\p{Lu}" matches all the upper case letters and their
  case-variants, and thus the string "d" would match "\p{Lu}".

* A complement category escape (complEsc) is also a character class and the
  same rule applies.  For example, "\P{Lu}" matches all letters that are
  neither upper case nor one of those characters' case-variants. Therefore
  "d" would not match "\P{Lu}".

* The same rule applies to single-character (SingleCharEsc) and multi-character
  (MultiCharEsc) escapes, although in practice this will have no effect.

* A back-reference is compared using case-blind comparison: that is, each
  character must either be the same as the corresponding character of the
  previously matched string, or must be a case-variant of that character. For
  example, the strings "Mum", "mom", "Dad", and "DUD" all match the regular
  expression "([md])[aeiou]\1" when the "i" flag is used.

Comment 10 Michael Kay 2005-09-14 19:12:06 UTC
Use of the word "expand" was perhaps a bit careless. I only used it in examples,
and by saying "A expands to B" I was merely trying to find a shorter way of
saying "A with the i flag set matches the same set of strings as B without the i
flag set". It wasn't intended to describe an algorithm, let alone an
implementation (though I probably had one at the back of my mind).

I appreciate what you're trying to achieve, which I think I can paraphrase as
"if matches(S, P, "") is true, then matches(V(S), P, "i") is true if and only if
V(S) is a case-variant of S." However, I don't think your proposal achieves
this, and in fact I don't think it's a good idea anyway.

I think there are some problems with your proposal. It's not true that a
character range (charRange) is a character class (charClass), and it's not true
that a negative character group is a character class. It is true that "[^Q]" is
a charClass, but if we accept your rule 2, then I think the consequence is that
[^Q] matches every character: in the absence of the "i" flag it matches "q",
therefore in the presence of the "i" flag it also matches "Q". I think the
meaning [^qQ] is more intuitive, and that's why I decided to move the rule down
to the level of a charRange. 
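
The difference between applying the rule to the whole class and applying it to the range inside the group can be seen on a toy alphabet. In this sketch the universe of ASCII letters and the helper widen are assumptions made only for illustration.

```python
import string

UNIVERSE = set(string.ascii_letters)  # toy alphabet, not all of Unicode

def widen(chars):
    """Add the upper- and lower-case counterparts of every character."""
    return set(chars) | {c.lower() for c in chars} | {c.upper() for c in chars}

# Rule applied to the class [^Q] as a whole: complement first, then widen.
class_level = widen(UNIVERSE - {"Q"})

# Rule applied to the range inside the group: widen Q first, then complement.
range_level = UNIVERSE - widen({"Q"})

print("Q" in class_level, "Q" in range_level)
```

Widening the whole class brings "Q" back in (because "q" was in the complement), so the class matches every letter; widening at the range level excludes both "Q" and "q".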

It would be possible to define that a charClassEsc (such as \p{Lu}) matches
case-variants of its "normal" set of strings. The reason I didn't do this was
again to do with complements and subtraction. If you widen \p{Lu} to include
case-variants of its usual characters, do you retain the meaning that \P{Lu} is
the complement of \p{Lu} (in which case it matches a smaller set of characters
than it did before), or do you retain the meaning that it matches all the
characters it would normally match plus their case-variants (a larger set than
before)? I felt it was best to cop out here and say its meaning is unchanged. In
practice, I don't think this is a big problem, because most of the character
blocks already include case-variants of characters, and those that don't, like
Lu and Ll, exclude them very deliberately. 

Michael Kay
Comment 11 Mary Holstege 2005-09-14 19:41:05 UTC
If we rephrase "expands" I'm happier with your proposal, even if we touch 
nothing else, although I'd still prefer to state some general rule rather than
take it by cases, but I could live without doing so.
 
> I think there are some problems with your proposal. It's not true that a
> character range (charRange) is a character class (charClass), and it's not
> true that a negative character group is a character class. 

Uh, yes it is. It do say in XML Schema part 2:
[11]   	charClass	   ::=   	charClassEsc | charClassExpr | WildcardEsc
[12]   	charClassExpr	   ::=   	'[' charGroup ']'
[13]   	charGroup	   ::=   	posCharGroup | negCharGroup | charClassSub
[23]   	charClassEsc	   ::=   	( SingleCharEsc | MultiCharEsc | catEsc | complEsc )

I can fill in the posCharGroup and negCharGroup and so on, but I think you
get the idea. Everything is a charClass.

I see your point with \p{Lu} and \P{Lu}; let's think about that a bit out loud
to see where we get:

Let's just say for abbreviation's sake that normally \p{Lu} denotes the set 
{"A","B"}.  \P{Lu} = [^\p{Lu}] so sayeth Datatypes, so this includes a set
of lots and lots of single-character strings, including "a" and "b".
If instead of using the handy abbreviation \p{Lu} we had spelled it out:
[AB], denoting the set {"A","B"} and the complement would be [^AB], denoting a
set containing lots and lots of single-character strings, including "a" and "b", 
so this is all consistent.

Under the rules of the "i" flag, if we say \p{Lu} means what it means with
other character classes, it denotes the set {"A", "B", "a", "b"}. Following
the equation from Datatypes we get that \P{Lu} denotes a set with lots and
lots of characters but not "a" or "b".  If we had written out \p{Lu} as [AB]
that would also have denoted the set {"A","B","a","b"} and the complement
[^AB] would have also denoted the set with lots and lots of characters but not 
"a" or "b".  So again, this is entirely consistent.

Suppose, however, that under the rules of the "i" flag, we leave \p{Lu} and 
\P{Lu} alone. Then \p{Lu} denotes the set {"A","B"}, and \P{Lu} denotes the
set with lots and lots of single character strings including "a" and "b".
If, not knowing this handy abbreviation, I had written out \p{Lu} as [AB], 
I will denote a different set under the "i" flag: {"A","B","a","b"}. Likewise
[^AB] will denote a set that does not include "a" and "b".  

I find this inconsistency pretty baffling to explain, and having to special
case here makes implementation harder.  So I think we should apply the rule
consistently across all character classes.
Comment 12 David Carlisle 2005-09-14 21:28:24 UTC
Both of the recent proposals have had the example

  For example, "[A-Z]" expands to "[A-Za-z]". 

But I think that they would (both) imply

[A-Za-zſK]

If my understanding of the proposals (and
http://www.unicode.org/Public/UNIDATA/CaseFolding.txt)
is correct.



Both of these are listed as Common case mappings
017F; C; 0073; # LATIN SMALL LETTER LONG S
212A; C; 006B; # KELVIN SIGN

Actually I'm fairly sure that the proposals imply that
[a-z] expands to [A-Za-zſK]
(as toLowercase() maps KELVIN SIGN to k)

However 

in the case of  the actual example [A-Z] it depends on the intended meaning of:

   one character is considered to be a *case-variant* of another character
   if there is a default case mapping between the two characters as defined in
   section 3.13 of [The Unicode Standard]. 

There is no case mapping of KELVIN sign into the range A-Z, only into the range
a-z. However it would be pretty strange if [a-z] and [A-Z] did not denote the
same set if i is set, so perhaps a "case variant" needs to be defined such that
two characters are case variants if there are default unicode case mappings that
map the characters to the same character, so K and KELVIN SIGN would be case
variants as they both lower case to k.
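
The asymmetry can be checked with a small sketch of the "direct mapping into the range" reading. It assumes Python's str.lower() and str.upper() reflect the Unicode default case mappings; the helper names in_range and matches_direct are hypothetical.

```python
KELVIN = "\u212A"   # KELVIN SIGN
LONG_S = "\u017F"   # LATIN SMALL LETTER LONG S

def in_range(ch, lo, hi):
    """Codepoint comparison for single characters."""
    return lo <= ch <= hi

def matches_direct(ch, lo, hi):
    """ch matches lo-hi if ch, or a direct case mapping of ch, is in the range."""
    candidates = (ch, ch.lower(), ch.upper())
    return any(in_range(v, lo, hi) for v in candidates if len(v) == 1)

for name, ch in (("KELVIN", KELVIN), ("LONG S", LONG_S)):
    print(name, matches_direct(ch, "a", "z"), matches_direct(ch, "A", "Z"))
```

Under this reading KELVIN SIGN falls into [a-z] but not [A-Z], and LONG S falls into [A-Z] but not [a-z], which is the inconsistency that a "map to the same character" definition of case-variant would avoid.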
Comment 13 Michael Kay 2005-09-14 22:07:25 UTC
Response to Mary:

I said:

* it's not true that a negative character group is a character class. 

You said:
Uh, yes it is. It do say in XML Schema part 2:
[11]   	charClass	   ::=   	charClassEsc | charClassExpr | WildcardEsc
[12]   	charClassExpr	   ::=   	'[' charGroup ']'
[13]   	charGroup	   ::=   	posCharGroup | negCharGroup | charClassSub
[23]   	charClassEsc	   ::=   	( SingleCharEsc | MultiCharEsc | catEsc | complEsc )

I can fill in the posCharGroup and negCharGroup and so on, but I think you
get the idea. Everything is a charClass.

I say: oh no it isn't!

A negative character group is a charGroup, and a charGroup *enclosed in square
brackets* is a charClass. But a negative character group on its own, without the
square brackets, is not a charClass.

As regards \P{Lu}, you can maintain either one of two invariants

(a) \P{Lu} == [^\p{Lu}]

(b) if matches("X", P, "") then matches("x", P, "i") for any regex P

but you can't maintain both.

I think your logic is flawed here:

"If we had written out \p{Lu} as [AB]
that would also have denoted the set {"A","B","a","b"} and the complement
[^AB] would have also denoted the set with lots and lots of characters but not 
"a" or "b".  So again, this is entirely consistent."

You're relying here on [^AB] meaning [^ABab]. But under your proposal that's not
what it means. Under your proposal [^AB] matches every character. [^AB] is a
charClass, therefore rule 2 applies, which says

A character class C denotes a set of strings that contains one
   single-character string "x" for each character x that is either in the class
   or is a case-variant of some character in the class. 

If I'm reading that correctly (perhaps I'm not?) you're saying "a" is in the
class [^AB], therefore "A" is also in the class [^AB].

In my proposal I'm breaking invariant (b): I'm saying that [^AB] is a *smaller*
set of characters under the "i" flag than in the absence of the "i" flag. I
think that's the right thing to do. Having already broken that invariant, I'm
then retaining invariant (a) with my proposed treatment of charClassEsc.

Michael Kay
 
Comment 14 Michael Kay 2005-09-14 22:16:17 UTC
David:

Thanks for that comment, which is somewhat orthogonal to the rest of the thread.
I did have slight worries that the definition based on Unicode default case
mappings might be a little problematic in cases where it's non-symmetric (or
non-transitive, etc). Let's get the other stuff sorted and come back to that.
Comment 15 Mary Holstege 2005-09-14 23:41:14 UTC
Michael:

OK, I see now where my logic is flawed, thank you. That makes sense.
Comment 16 Michael Kay 2005-09-15 09:00:12 UTC
Let's now address David's concern about how we define case-variants. I suggest
that rather than appealing directly to Unicode, we instead define it in terms of
our own lower-case() and upper-case() functions (which are themselves defined in
terms of Unicode). This seems to give a better chance of getting them consistent.

The rule that seems to work is:

For characters C1 and C2, considered as strings of length one, C1 is a
case-variant of C2 if (fn:lower-case(C1) eq fn:lower-case(C2) or
fn:upper-case(C1) eq fn:upper-case(C2)) when compared using the Unicode
codepoint collation.

Under this rule, x212A (Kelvin sign) is a case-variant of "k" and also of "K".
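
As a quick check of the rule, here is a minimal Python sketch. It assumes str.lower() and str.upper() behave like fn:lower-case() and fn:upper-case() for the characters tested (the two may differ in edge cases), and the function name is_case_variant is hypothetical.

```python
def is_case_variant(c1, c2):
    """Sketch of the rule: equal after lower-casing, or equal after upper-casing."""
    return c1.lower() == c2.lower() or c1.upper() == c2.upper()

KELVIN = "\u212A"   # KELVIN SIGN
LONG_S = "\u017F"   # LATIN SMALL LETTER LONG S

print(is_case_variant(KELVIN, "k"), is_case_variant(KELVIN, "K"))
print(is_case_variant(LONG_S, "s"), is_case_variant(LONG_S, "S"))
```

All four comparisons come out true in Python, and because the test is symmetric in its arguments, [a-z] and [A-Z] pick up the same extra characters.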

So this leads to the revised proposal as follows:

PROPOSAL v2

The detailed rules for the effect of the "i" flag are as follows. In these
rules, one character C2 is considered to be a *case-variant* of another
character C1 if the following XPath expression returns true, when the two
characters are considered as strings of length one, and the Unicode codepoint
collation is used:

fn:lower-case(C1) eq fn:lower-case(C2) 
  or 
fn:upper-case(C1) eq fn:upper-case(C2)

Note that the case-variants of a character under this definition are always
single characters.

1. When a normal character (Char) is used as an atom, it represents the set
containing that character and all its case-variants. For example, the regular
expression "z" will match both "z" and "Z".

2. A character range (charRange) represents the set containing all the
characters that it would match in the absence of the "i" flag, together with
their case-variants. For example, the regular expression "[A-Z]" will match all
the letters A-Z and all the letters a-z. It will also match certain other
characters such as x212A (KELVIN SIGN), since fn:lower-case("&#x212A;") is "k". 

This rule applies also to a character range used in a character class
subtraction (charClassSub): thus [A-Z-[IO]] will match characters such as "A",
"B", "a", and "b", but will not match "I", "O", "i", or "o". 

The rule also applies to a character range used as part of a negative character
group: thus [^Q] will match every character except "Q" and "q" (these being the
only case-variants of "Q" in Unicode).

3. A back-reference is compared using case-blind comparison: that is, each
character must either be the same as the corresponding character of the
previously matched string, or must be a case-variant of that character. For
example, the strings "Mum", "mom", "Dad", and "DUD" all match the regular
expression "([md])[aeiou]\1" when the "i" flag is used.

4. All other constructs are unaffected by the "i" flag. For example, "\p{Lu}"
continues to match upper-case letters only.

  
Michael Kay
Comment 17 David Carlisle 2005-09-15 09:50:57 UTC
> So this leads to the revised proposal as follows:

This works for me.

Only comment is that every character is a case variant of itself, so your rules
1 and 3 can be compressed to

1. When a normal character (Char) is used as an atom, it represents the set
of case-variants of that character. For example, the regular
expression "z" expands to "[zZ]".


3. A back-reference is compared using case-blind comparison: that is, each
character must be a case-variant of the corresponding character of the
previously matched string. For
example, the strings "Mum", "mom", "Dad", and "DUD" all match the regular
expression "([md])[aeiou]\1" when the "i" flag is used.
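
The back-reference example can be tried in an existing engine. The sketch below uses Python's re module with re.IGNORECASE as a stand-in for the "i" flag (an assumption: it is not an XPath implementation), and whole_match is a hypothetical helper.

```python
import re

pattern = re.compile(r"([md])[aeiou]\1", re.IGNORECASE)

def whole_match(s):
    """True if the entire string s matches the pattern."""
    return pattern.fullmatch(s) is not None

for word in ("Mum", "mom", "Dad", "DUD", "Mud"):
    print(word, whole_match(word))
```

The four strings from the rule all match; "Mud" does not, since "d" is not a case-variant of the captured "M".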



I started to write this comment thinking that the re-write would make things
clearer, highlighting that the characters are treated uniformly and there aren't
really two cases here. However having done it perhaps it relies too much on the
definition and the bit of extra redundancy in your wording is clearer, leave it
to the editors to judge...

 
Comment 18 Liam R E Quin 2005-09-15 15:35:55 UTC
Further information on Perl -- case insensitivity (both in ranges and
elsewhere) only affects a-z and A-Z, and not, for example, e-acute.

This clearly wouldn't work for us!

Liam
Comment 19 Ashok Malhotra 2005-09-27 15:48:53 UTC
The WGs decided on 9/27 to accept Michael Kay's proposal in comment #16.  See below.

The detailed rules for the effect of the "i" flag are as follows. In these
rules, one character C2 is considered to be a *case-variant* of another
character C1 if the following XPath expression returns true, when the two
characters are considered as strings of length one, and the Unicode codepoint
collation is used:

fn:lower-case(C1) eq fn:lower-case(C2) 
  or 
fn:upper-case(C1) eq fn:upper-case(C2)

Note that the case-variants of a character under this definition are always
single characters.

1. When a normal character (Char) is used as an atom, it represents the set
containing that character and all its case-variants. For example, the regular
expression "z" will match both "z" and "Z".

2. A character range (charRange) represents the set containing all the
characters that it would match in the absence of the "i" flag, together with
their case-variants. For example, the regular expression "[A-Z]" will match all
the letters A-Z and all the letters a-z. It will also match certain other
characters such as x212A (KELVIN SIGN), since fn:lower-case("&#x212A;") is "k". 

This rule applies also to a character range used in a character class
subtraction (charClassSub): thus [A-Z-[IO]] will match characters such as "A",
"B", "a", and "b", but will not match "I", "O", "i", or "o". 

The rule also applies to a character range used as part of a negative character
group: thus [^Q] will match every character except "Q" and "q" (these being the
only case-variants of "Q" in Unicode).

3. A back-reference is compared using case-blind comparison: that is, each
character must either be the same as the corresponding character of the
previously matched string, or must be a case-variant of that character. For
example, the strings "Mum", "mom", "Dad", and "DUD" all match the regular
expression "([md])[aeiou]\1" when the "i" flag is used.

4. All other constructs are unaffected by the "i" flag. For example, "\p{Lu}"
continues to match upper-case letters only.
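
Rules 1 and 2 of the accepted text can be sketched as set operations. This is only an illustration: it assumes Python's str.lower() and str.upper() in place of fn:lower-case() and fn:upper-case(), scans only the Basic Multilingual Plane to stay fast (the rule itself covers all of Unicode), and the names expand, span, a_to_z, sub, q_set and not_q are hypothetical.

```python
def span(lo, hi):
    """Case-sensitive range lo-hi as a set of characters."""
    return {chr(cp) for cp in range(ord(lo), ord(hi) + 1)}

def expand(chars):
    """The characters in chars plus all of their case-variants (BMP only)."""
    lowers = {c.lower() for c in chars}
    uppers = {c.upper() for c in chars}
    return {
        ch
        for ch in map(chr, range(0x10000))
        if ch.lower() in lowers or ch.upper() in uppers
    }

a_to_z = expand(span("A", "Z"))        # [A-Z] under the "i" flag
sub = a_to_z - expand({"I", "O"})      # [A-Z-[IO]] under the "i" flag
q_set = expand({"Q"})

def not_q(ch):
    """[^Q] under the "i" flag."""
    return ch not in q_set

print("\u212A" in a_to_z, "i" in sub, not_q("q"))
```

In this sketch the expanded [A-Z] contains KELVIN SIGN, the subtraction drops "I", "O", "i" and "o", and [^Q] rejects both "Q" and "q".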