Bug 1922: 'x' regex flag not entirely clear

Opened: 2005-08-31 15:26:24 +0000
Last modified: 2005-09-29 12:55:59 +0000
Classification: Unclassified
Product: XPath / XQuery / XSLT
Component: Functions and Operators 1.0
Version: Last Call drafts
Platform: PC, Windows 2000
Status: CLOSED FIXED
Priority: P2
Severity: normal
Reporter: holstege
Assigned to: ashok.malhotra
QA contact: public-qt-comments

Comment 0 (id 5607), holstege, 2005-08-31 15:26:24 +0000

Section 7.6.1.1 of F&O says only this about the 'x' flag:

"x: If present, whitespace characters within the regular expression are ignored. By default, whitespace characters match themselves. This allows, for example, regular expressions to be broken up into lines for readability."

Our implementors ask for clarification of what 'ignored' means. Here are some cases:

  fn:matches("helloworld", "hello[ ]world", "x")
    Error? (because [] is not a valid character set?) Or true()?

  fn:matches("hello world", "hello\ sworld", "x")
    True or false? That is, is '\ s' == '\s'?

And so forth for spaces in other odd places:

  "(a|b)(a|b)(a|b)(a|b)(a|b)(a|b)(a|b)(a|b)(a|b)(a|b)\1 0"
    \1 followed by '0' or \10?
  "\p{ Lu}"
  "\p{L u}"
  "[a- ]"
  "[a- z]"
  "hello\ "
  "[ ^a]"
  "[^ ]"

We assume the appropriate semantic is to pre-strip all whitespace and then parse the resulting regex; this is certainly simpler from an implementation standpoint, but "ignore" isn't entirely clear and could mean to ignore in matching, not parsing.

Comment 1 (id 5622), mike, 2005-08-31 21:36:15 +0000

I agree with Mary's suggestion: the "x" flag should cause all whitespace to be stripped from the regex in an initial pass, and the semantics are then those of the resulting regex after whitespace removal.

Michael Kay

Comment 2 (id 5635), liam, 2005-09-01 00:43:26 +0000

The Perl documentation is useful here (it had to happen once). The "perldoc perlre" page says,

[[
The "/x" modifier itself needs a little more explanation. It tells the regular expression parser to ignore whitespace that is neither backslashed nor within a character class. You can use this to break up your regular expression into (slightly) more readable parts.
The "#" character is also treated as a metacharacter introducing a comment, just as in ordinary Perl code. This also means that if you want real whitespace or "#" characters in the pattern (outside a character class, where they are unaffected by "/x"), that you'll either have to escape them or encode them using octal or hex escapes. Taken together, these features go a long way towards making Perl's regular expressions more readable.
]]

I believe this is a sensible and appropriate definition, and it means that [ ] matches a single space (and also, for Perl, that you can't put comments inside character classes).

It's not clear to me how to allow host-language comments inside a regular expression, and I think that should be up to the host language to specify, rather than using Perl's # comments. So just take the whitespace part of this.

Liam

Comment 3 (id 6121), ashok.malhotra, 2005-09-13 21:32:35 +0000

On the joint 9/13 telcon the WGs agreed to change the explanation of the 'x' flag based on the Perl semantics as suggested by Liam. Suggested replacement text is below. Please comment.

x: If present, whitespace characters in the regex are removed prior to matching with two exceptions: whitespace characters preceded by a backslash are not removed and whitespace characters within character classes are not removed. This can be used, for example, to break up long regexes into readable lines.

Examples:

  fn:matches("helloworld", "hello world", "x") returns true
  fn:matches("helloworld", "hello[ ]world", "x") returns false
  fn:matches("hello world", "hello\ sworld", "x") returns false
  fn:matches("hello world", "hello\sworld", "x") returns true

Comment 4 (id 6122), mike, 2005-09-13 21:58:41 +0000

Actually, I don't believe our syntax allows backslash to be followed by a whitespace character. There's little point in preserving the whitespace character if it's illegal, so I suggest we strip it.

A character class (charClass) is either a charClassExpr or a charClassEsc.
Since charClassEsc embraces things like \P{IsCombiningDiacriticalMarks}, I think that it's only within a Character Class Expression (charClassExpr) that you wanted whitespace to be preserved.

Michael Kay

Comment 5 (id 6124), ashok.malhotra, 2005-09-14 13:26:48 +0000

Amended proposal based on Michael Kay's observation that whitespace characters are not allowed after the backslash, and so we should strip them out if they do occur.

REVISED PROPOSAL

x: If present, whitespace characters in the regex are removed prior to matching with one exception: whitespace characters within character class expressions (charClassExpr) are not removed. This can be used, for example, to break up long regexes into readable lines.

Examples:

  fn:matches("helloworld", "hello world", "x") returns true
  fn:matches("helloworld", "hello[ ]world", "x") returns false
  fn:matches("hello world", "hello\ sworld", "x") returns true

Comment 6 (id 6125), mike, 2005-09-14 13:50:22 +0000

Another useful example might be:

  fn:matches("hello world", "hello world", "x") returns false

Michael Kay

Comment 7 (id 6497), ashok.malhotra, 2005-09-27 15:29:53 +0000

The WGs decided on 9/27 to accept the proposal below to fix this bug.

x: If present, whitespace characters in the regex are removed prior to matching with one exception: whitespace characters within character class expressions (charClassExpr) are not removed. This can be used, for example, to break up long regexes into readable lines.

Examples:

  fn:matches("helloworld", "hello world", "x") returns true
  fn:matches("helloworld", "hello[ ]world", "x") returns false
  fn:matches("hello world", "hello\ sworld", "x") returns true
  fn:matches("hello world", "hello world", "x") returns false
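
For implementors, the accepted rule can be sketched as a single pre-pass over the pattern. This is only an illustration under stated assumptions, not text from the specification: the names strip_x_whitespace and matches_x are hypothetical, the whitespace set is assumed to be #x9, #xA, #xD and #x20, and Python's re module stands in for an XPath regex engine (it does not support \p{...} or character class subtraction, so it is adequate only for simple patterns like the four examples above).

```python
import re

# Assumption: "whitespace" means the XML whitespace characters.
XML_WHITESPACE = "\t\n\r "


def strip_x_whitespace(pattern: str) -> str:
    """Remove whitespace from a regex except inside [...] expressions."""
    out = []
    depth = 0        # nesting depth of [...] (nested by subtraction, e.g. [a-z-[aeiou]])
    escaped = False  # True when the last character kept was an unescaped backslash
    for ch in pattern:
        if depth == 0 and ch in XML_WHITESPACE:
            # Outside a character class expression whitespace is always
            # dropped, even directly after a backslash ("\ s" becomes "\s").
            continue
        out.append(ch)
        if escaped:
            escaped = False      # this character was escaped, e.g. "\[" or "\]"
        elif ch == "\\":
            escaped = True
        elif ch == "[":
            depth += 1
        elif ch == "]" and depth > 0:
            depth -= 1
    return "".join(out)


def matches_x(text: str, pattern: str) -> bool:
    """Rough stand-in for fn:matches(text, pattern, "x")."""
    # fn:matches succeeds if any substring matches, hence re.search.
    return re.search(strip_x_whitespace(pattern), text) is not None


if __name__ == "__main__":
    # The four examples from the accepted proposal.
    print(matches_x("helloworld", "hello world"))
    print(matches_x("helloworld", "hello[ ]world"))
    print(matches_x("hello world", "hello\\ sworld"))
    print(matches_x("hello world", "hello world"))
```

Stripping before parsing, rather than ignoring whitespace during matching, follows the reading the reporter assumed in comment 0. It also settles the odd cases listed there without special handling: "\p{ Lu}" becomes "\p{Lu}", while "[a- z]" is left untouched.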