This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 1922 - 'x' regex flag not entirely clear
Summary: 'x' regex flag not entirely clear
Status: CLOSED FIXED
Alias: None
Product: XPath / XQuery / XSLT
Classification: Unclassified
Component: Functions and Operators 1.0
Version: Last Call drafts
Hardware: PC Windows 2000
Importance: P2 normal
Target Milestone: ---
Assignee: Ashok Malhotra
QA Contact: Mailing list for public feedback on specs from XSL and XML Query WGs
Reported: 2005-08-31 15:26 UTC by Mary Holstege
Modified: 2005-09-29 12:55 UTC

Description Mary Holstege 2005-08-31 15:26:24 UTC
Section 7.6.1.1 of F&O says only this about the 'x' flag:
"x: If present, whitespace characters within the regular expression are ignored. 
By default, whitespace characters match themselves. This allows, for example, 
regular expressions to be broken up into lines for readability."

Our implementors ask for clarification of what 'ignored' means. Here are some
cases:

fn:matches("helloworld", "hello[ ]world", "x")
   Error? (because [] is not a valid character set?) Or true()?
fn:matches("hello world", "hello\ sworld", "x")
   True or false? That is, is '\ s' == '\s'?
And so forth for spaces in other odd places:
"(a|b)(a|b)(a|b)(a|b)(a|b)(a|b)(a|b)(a|b)(a|b)(a|b)\1 0" 
     \1 followed by '0' or \10?
"\p{ Lu}" 
"\p{L u}" 
"[a- ]"
"[a- z]" 
"hello\ "
"[ ^a]"
"[^ ]"

We assume the appropriate semantics is to pre-strip all whitespace and then parse
the resulting regex; this is certainly simpler from an implementation
standpoint, but "ignore" isn't entirely clear and could mean ignoring whitespace
during matching rather than during parsing.
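
The pre-strip reading described above can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and treating "whitespace" as the four XML whitespace characters (space, tab, newline, carriage return) is an assumption, since the quoted spec text does not define the term.

```python
# Assumption: "whitespace" means the XML whitespace characters.
XML_WHITESPACE = " \t\n\r"


def strip_all_whitespace(pattern: str) -> str:
    """Remove every whitespace character from the regex, regardless of context."""
    return "".join(ch for ch in pattern if ch not in XML_WHITESPACE)


# What some of the cases listed above become after the pre-strip pass.
print(strip_all_whitespace("hello[ ]world"))   # hello[]world
print(strip_all_whitespace(r"hello\ sworld"))  # hello\sworld
print(strip_all_whitespace(r"\p{ Lu}"))        # \p{Lu}
print(strip_all_whitespace("[a- z]"))          # [a-z]
```

Under this reading "hello[ ]world" turns into a regex with an empty character set, which is why the first case raises the question of an error.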
Comment 1 Michael Kay 2005-08-31 21:36:15 UTC
I agree with Mary's suggestion: the "x" flag should cause all whitespace to be
stripped from the regex in an initial pass, and the semantics are then those of
the resulting regex after whitespace-removal.

Michael Kay
Comment 2 Liam R E Quin 2005-09-01 00:43:26 UTC
The Perl documentation is useful here (it had to happen once).

The "perldoc perlre" page says,
[[
The "/x" modifier itself needs a little more explanation.  It tells the
regular expression parser to ignore whitespace that is neither backslashed
nor within a character class.  You can use this to break up
your regular expression into (slightly) more readable parts.  The "#"
character is also treated as a metacharacter introducing a comment,
just as in ordinary Perl code.  This also means that if you want real
whitespace or "#" characters in the pattern (outside a character class,
where they are unaffected by "/x"), that you'll either have to escape
them or encode them using octal or hex escapes.  Taken together, these
features go a long way towards making Perl's regular expressions more
readable. 
]]

I believe this is a sensible and appropriate definition, and
means that [ ] matches a single space (and also, for Perl,
that you can't put comments inside character classes).

It's not clear to me how to allow host-language comments
inside a regular expression, and I think that should be up
to the host language to specify, rather than using Perl's
# comments.  So just take the whitespace part of this.

Liam
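
For readers who want to try the Perl-style behaviour, Python's re.VERBOSE flag follows similar rules: whitespace is ignored except inside a character class or after a backslash. The sketch below uses Python's re module purely as a stand-in, not the XPath regex dialect; the helper name is hypothetical. Note that re.VERBOSE also treats "#" as a comment introducer, which is the part of the Perl definition suggested above to leave out.

```python
import re


def matches_verbose(text: str, pattern: str) -> bool:
    """Unanchored match with Perl-style extended syntax (Python's re.VERBOSE)."""
    return re.search(pattern, text, re.VERBOSE) is not None


# Whitespace outside a character class is ignored.
print(matches_verbose("helloworld", "hello world"))      # True
# [ ] still matches a single space, so "helloworld" does not match.
print(matches_verbose("helloworld", "hello[ ]world"))    # False
print(matches_verbose("hello world", "hello[ ]world"))   # True
# A backslashed space is a literal space; the "s" that follows is a literal "s".
print(matches_verbose("hello world", r"hello\ sworld"))  # False
print(matches_verbose("hello sworld", r"hello\ sworld")) # True
```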
Comment 3 Ashok Malhotra 2005-09-13 21:32:35 UTC
On the joint 9/13 telcon the WGs agreed to change the explanation of the 'x'
flag based on the Perl semantics as suggested by Liam.  Suggested replacement
text is below.  Please comment.

x: If present, whitespace characters in the regex are removed prior to matching
with two exceptions:  whitespace characters preceded by a backslash are not
removed and whitespace characters within character classes are not removed. 
This can be used, for example, to break up long regexes into readable lines.

Examples:
fn:matches("helloworld", "hello world", "x") returns true
fn:matches("helloworld", "hello[ ]world", "x") returns false
fn:matches("hello world", "hello\ sworld", "x") returns false
fn:matches("hello world", "hello\sworld", "x") returns true
Comment 4 Michael Kay 2005-09-13 21:58:41 UTC
Actually, I don't believe our syntax allows backslash to be followed by a
whitespace character. There's little point in preserving the whitespace
character if it's illegal, so I suggest we strip it.

A character class (charClass) is either a charClassExpr or a charClassEsc. Since
charClassEsc embraces things like \P{IsCombiningDiacriticalMarks} I think that
it's only within a Character Class Expression (charClassExpr) that you wanted
whitespace to be preserved.

Michael Kay
Comment 5 Ashok Malhotra 2005-09-14 13:26:48 UTC
Amended proposal based on Michael Kay's observation that whitespace characters
are not allowed after the backslash, so we should strip them out if they do occur.

REVISED PROPOSAL

x: If present, whitespace characters in the regex are removed prior to matching
with one exception:  whitespace characters within character class expressions
(charClassExpr) are not removed. This can be used, for example, to break up long
regexes into readable lines.

Examples:
fn:matches("helloworld", "hello world", "x") returns true

fn:matches("helloworld", "hello[ ]world", "x") returns false

fn:matches("hello world", "hello\ sworld", "x") returns true
Comment 6 Michael Kay 2005-09-14 13:50:22 UTC
Another useful example might be:

fn:matches("hello world", "hello world", "x") returns false

Michael Kay
Comment 7 Ashok Malhotra 2005-09-27 15:29:53 UTC
The WGs decided on 9/27 to accept the proposal below to fix this bug.

x: If present, whitespace characters in the regex are removed prior to matching
with one exception:  whitespace characters within character class expressions
(charClassExpr) are not removed. This can be used, for example, to break up long
regexes into readable lines.

Examples:
fn:matches("helloworld", "hello world", "x") returns true

fn:matches("helloworld", "hello[ ]world", "x") returns false

fn:matches("hello world", "hello\ sworld", "x") returns true

fn:matches("hello world", "hello world", "x") returns false
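
The accepted rule can be sketched as a single pre-processing pass. The Python below is a rough illustration under stated assumptions, not an implementation of fn:matches: the function names are hypothetical, "whitespace" is assumed to mean the XML whitespace characters, bracket nesting is tracked only to cope with character class subtraction, and Python's re module stands in for the XPath regex dialect once the whitespace is gone.

```python
import re

# Assumption: "whitespace" means the XML whitespace characters.
XML_WHITESPACE = " \t\n\r"


def strip_x_flag_whitespace(pattern: str) -> str:
    """Remove whitespace except inside character class expressions ([...])."""
    out = []
    depth = 0  # nesting depth of [...]; subtraction such as [a-z-[aeiou]] nests
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        nxt = pattern[i + 1] if i + 1 < len(pattern) else ""
        if ch == "\\" and nxt and nxt not in XML_WHITESPACE:
            # Copy an escape pair whole so that \[ and \] are not counted as brackets.
            out.append(ch + nxt)
            i += 2
            continue
        if ch == "[":
            depth += 1
        elif ch == "]" and depth > 0:
            depth -= 1
        if ch in XML_WHITESPACE and depth == 0:
            i += 1  # outside a character class expression: drop it
            continue
        out.append(ch)
        i += 1
    return "".join(out)


def matches_x(text: str, pattern: str) -> bool:
    """Unanchored match after the 'x' flag whitespace pass (Python re as stand-in)."""
    return re.search(strip_x_flag_whitespace(pattern), text) is not None


print(matches_x("helloworld", "hello world"))      # True
print(matches_x("helloworld", "hello[ ]world"))    # False
print(matches_x("hello world", r"hello\ sworld"))  # True
print(matches_x("hello world", "hello world"))     # False
print(strip_x_flag_whitespace(r"\p{ Lu}"))         # \p{Lu}
```

Because only bracketed expressions are protected, whitespace inside a category escape such as \p{ Lu} is stripped, in line with the charClassExpr versus charClassEsc distinction drawn in Comment 4.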