1922 – 'x' regex flag not entirely clear

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Status:	CLOSED FIXED

Alias:	None

Product:	XPath / XQuery / XSLT
Classification:	Unclassified
Component:	Functions and Operators 1.0
Version:	Last Call drafts
Hardware:	PC Windows 2000

Importance:	P2 normal
Target Milestone:	---
Assignee:	Ashok Malhotra
QA Contact:	Mailing list for public feedback on specs from XSL and XML Query WGs

Reported:	2005-08-31 15:26 UTC by Mary Holstege
Modified:	2005-09-29 12:55 UTC
CC List:	0 users

Description Mary Holstege 2005-08-31 15:26:24 UTC

Section 7.6.1.1 of F&O says only this about the 'x' flag:
"x: If present, whitespace characters within the regular expression are ignored. 
By default, whitespace characters match themselves. This allows, for example, 
regular expressions to be broken up into lines for readability."

Our implementors ask for clarification of what 'ignored' means. Here are some
cases:

fn:matches("helloworld", "hello[ ]world", "x")
   Error? (because [] is not a valid character set?) Or true()?
fn:matches("hello world", "hello\ sworld", "x")
   True or false? That is, is '\ s' == '\s'?
And so forth for spaces in other odd places:
"(a|b)(a|b)(a|b)(a|b)(a|b)(a|b)(a|b)(a|b)(a|b)(a|b)\1 0" 
     \1 followed by '0' or \10?
"\p{ Lu}" 
"\p{L u}" 
"[a- ]"
"[a- z]" 
"hello\ "
"[ ^a]"
"[^ ]"

We assume the appropriate semantic is to pre-strip all whitespace and then parse
the resulting regex; this is certainly simpler from an implementation
standpoint, but "ignore" isn't entirely clear and could mean to ignore in
matching, not parsing.
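
For illustration, the pre-strip reading described above can be sketched in a few lines of Python. This is only a sketch: the function name prestrip is hypothetical, and the whitespace set (#x20, #x9, #xA, #xD) is an assumption, since the text quoted from F&O does not say which characters count as whitespace. It shows only what each of the questioned regexes would look like after stripping; it does not parse or evaluate them.

```python
import re

def prestrip(regex: str) -> str:
    """Remove every whitespace character before the regex is parsed.

    Assumption: whitespace means space, tab, LF and CR.
    """
    return re.sub(r"[ \t\n\r]", "", regex)

# Some of the cases from the report, shown before and after stripping.
cases = [
    "hello[ ]world",   # becomes an empty character class
    r"hello\ sworld",  # becomes the \s escape
    r"(a)\1 0",        # shortened form of the back-reference case
    r"\p{ Lu}",
    "[a- z]",
    "[^ ]",
]

for regex in cases:
    print(repr(regex), "->", repr(prestrip(regex)))
```

Under this reading the answers to the questions above depend entirely on how the stripped string parses, for example whether "hello[]world" is an error.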

Comment 1 Michael Kay 2005-08-31 21:36:15 UTC

I agree with Mary's suggestion: the "x" flag should cause all whitespace to be
stripped from the regex in an initial pass, and the semantics are then those of
the resulting regex after whitespace-removal.

Michael Kay

Comment 2 Liam R E Quin 2005-09-01 00:43:26 UTC

The Perl documentation is useful here (it had to happen once).

The "perldoc perlre" page says,
[[
The "/x" modifier itself needs a little more explanation.  It tells the
regular expression parser to ignore whitespace that is neither backslashed
nor within a character class.  You can use this to break up
your regular expression into (slightly) more readable parts.  The "#"
character is also treated as a metacharacter introducing a comment,
just as in ordinary Perl code.  This also means that if you want real
whitespace or "#" characters in the pattern (outside a character class,
where they are unaffected by "/x"), that you'll either have to escape
them or encode them using octal or hex escapes.  Taken together, these
features go a long way towards making Perl's regular expressions more
readable. 
]]

I believe this is a sensible and appropriate definition, and
means that [ ] matches a single space (and also, for Perl,
that you can't put comments inside character classes).

It's not clear to me how to allow host-language comments
inside a regular expression, and I think that should be up
to the host language to specify, rather than using Perl's
# comments.  So just take the whitespace part of this.

Liam
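
As a rough illustration of the behaviour quoted from perlre, Python's re.VERBOSE flag documents similar whitespace rules (ignored except inside a character class or after a backslash), so it can serve as a stand-in here. This is an analogy only: Python is neither Perl nor the XML Schema regex dialect that F&O uses, re.VERBOSE also treats "#" as a comment introducer, and the helper name verbose_matches is hypothetical.

```python
import re

def verbose_matches(text: str, pattern: str) -> bool:
    """Unanchored match with Python's verbose flag, used as a stand-in for /x."""
    return re.search(pattern, text, re.VERBOSE) is not None

results = {
    # A bare space in the pattern is ignored, so the pattern is "helloworld".
    "bare space ignored": verbose_matches("helloworld", "hello world"),
    # A space inside a character class is kept and matches a space.
    "space in class kept": verbose_matches("hello world", "hello[ ]world"),
    # A backslashed space is kept and matches a space.
    "backslashed space kept": verbose_matches("hello world", r"hello\ world"),
    # The ignored space cannot match the space in the input.
    "bare space matches nothing": verbose_matches("hello world", "hello world"),
}

for label, value in results.items():
    print(label, "->", value)
```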

Comment 3 Ashok Malhotra 2005-09-13 21:32:35 UTC

On the joint 9/13 telcon the WGs agreed to change the explanation of the 'x'
flag based on the Perl semantics as suggested by Liam.  Suggested replacement
text is below.  Please comment.

x: If present, whitespace characters in the regex are removed prior to matching
with two exceptions:  whitespace characters preceded by a backslash are not
removed and whitespace characters within character classes are not removed. 
This can be used, for example, to break up long regexes into readable lines.

Examples:
fn:matches("helloworld", "hello world", "x") returns true
fn:matches("helloworld", "hello[ ]world", "x") returns false
fn:matches("hello world", "hello\ sworld", "x") returns false
fn:matches("hello world", "hello\sworld", "x") returns true
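
The proposed wording above, with its two exceptions, could be sketched as follows. The names strip_x_draft and x_matches_draft are hypothetical, the whitespace set is an assumption, and Python's re module is used only as a stand-in matcher: its dialect differs from the XML Schema regex syntax, although it should agree on patterns as simple as these four.

```python
import re

WHITESPACE = " \t\n\r"  # assumption: #x20, #x9, #xA, #xD

def strip_x_draft(regex: str) -> str:
    """Strip whitespace unless it follows a backslash or sits inside [...]."""
    out = []
    depth = 0  # nesting depth of [...]; nested brackets are counted too
    i = 0
    while i < len(regex):
        ch = regex[i]
        if ch == "\\" and i + 1 < len(regex):
            out.append(regex[i:i + 2])  # keep the escape pair, even "\ "
            i += 2
            continue
        if ch == "[":
            depth += 1
        elif ch == "]" and depth > 0:
            depth -= 1
        if ch in WHITESPACE and depth == 0:
            i += 1  # whitespace outside a class and not escaped: drop it
            continue
        out.append(ch)
        i += 1
    return "".join(out)

def x_matches_draft(text: str, regex: str) -> bool:
    """Unanchored match on the stripped regex (Python re as a stand-in)."""
    return re.search(strip_x_draft(regex), text) is not None

print(x_matches_draft("helloworld", "hello world"))
print(x_matches_draft("helloworld", "hello[ ]world"))
print(x_matches_draft("hello world", r"hello\ sworld"))
print(x_matches_draft("hello world", r"hello\sworld"))
```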

Comment 4 Michael Kay 2005-09-13 21:58:41 UTC

Actually, I don't believe our syntax allows backslash to be followed by a
whitespace character. There's little point in preserving the whitespace
character if it's illegal, so I suggest we strip it.

A character class (charClass) is either a charClassExpr or a charClassEsc. Since
charClassEsc embraces things like \P{IsCombiningDiacriticalMarks} I think that
it's only within a Character Class Expression (charClassExpr) that you wanted
whitespace to be preserved.

Michael Kay

Comment 5 Ashok Malhotra 2005-09-14 13:26:48 UTC

Amended proposal based on Michael Kay's observation that whitespace characters
are not allowed after the backslash and so we should strip them out if they do
occur.

REVISED PROPOSAL

x: If present, whitespace characters in the regex are removed prior to matching
with one exception:  whitespace characters within character class expressions
(charClassExpr) are not removed. This can be used, for example, to break up long
regexes into readable lines.

Examples:
fn:matches("helloworld", "hello world", "x") returns true

fn:matches("helloworld", "hello[ ]world", "x") returns false

fn:matches("hello world", "hello\ sworld", "x") returns true

Comment 6 Michael Kay 2005-09-14 13:50:22 UTC

Another useful example might be:

fn:matches("hello world", "hello world", "x") returns false

Michael Kay

Comment 7 Ashok Malhotra 2005-09-27 15:29:53 UTC

The WGs decided on 9/27 to accept the proposal below to fix this bug.

x: If present, whitespace characters in the regex are removed prior to matching
with one exception:  whitespace characters within character class expressions
(charClassExpr) are not removed. This can be used, for example, to break up long
regexes into readable lines.

Examples:
fn:matches("helloworld", "hello world", "x") returns true

fn:matches("helloworld", "hello[ ]world", "x") returns false

fn:matches("hello world", "hello\ sworld", "x") returns true

fn:matches("hello world", "hello world", "x") returns false
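
A minimal sketch of the accepted rule, assuming that "within character class expressions" can be approximated by tracking unescaped square brackets. The names strip_x and x_matches are hypothetical, the whitespace set is an assumption, and Python's re module stands in for an XML Schema regex engine, which it only approximates. Whitespace after a backslash is dropped outside a class, so "\ s" collapses to "\s", and whitespace inside "\p{...}" is dropped as well because that construct is a charClassEsc and not a charClassExpr.

```python
import re

WHITESPACE = " \t\n\r"  # assumption: #x20, #x9, #xA, #xD

def strip_x(regex: str) -> str:
    """Strip whitespace everywhere except inside a [...] expression."""
    out = []
    depth = 0        # nesting depth of [...]
    escaped = False  # True when the last kept character was an open backslash
    for ch in regex:
        if depth == 0 and ch in WHITESPACE:
            continue  # dropped; a pending backslash applies to the next kept char
        if escaped:
            escaped = False  # ch is the escape target, so brackets are literal
        elif ch == "\\":
            escaped = True
        elif ch == "[":
            depth += 1
        elif ch == "]" and depth > 0:
            depth -= 1
        out.append(ch)
    return "".join(out)

def x_matches(text: str, regex: str) -> bool:
    """Unanchored match on the stripped regex (Python re as a stand-in)."""
    return re.search(strip_x(regex), text) is not None

# The four examples from the accepted text.
print(x_matches("helloworld", "hello world"))
print(x_matches("helloworld", "hello[ ]world"))
print(x_matches("hello world", r"hello\ sworld"))
print(x_matches("hello world", "hello world"))
```

The stripping step is kept separate from matching so that it could be placed in front of any regex engine; only x_matches depends on Python's dialect.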