RE: [ACTION-385] Common regular expression syntax

Hi Shaun,

Many thanks for this.

I've added it to our ITS processing to check the its:allowedCharacters value, and noticed that some of the test files have the expression "[^*+]", which seems to be invalid according to this checking expression. (I still have to make sure my validation code is right.)

Is that the case? If yes, how would we express "any chars but '*' and '+'"?

cheers,
-yves

-----Original Message-----
From: Shaun McCance [mailto:shaunm@gnome.org] 
Sent: Monday, February 04, 2013 11:53 AM
To: public-multilingualweb-lt@w3.org
Subject: Re: [ACTION-385] Common regular expression syntax

On Sun, 2013-01-27 at 12:30 -0500, Shaun McCance wrote:
> So what I think this leaves us with is character classes [abc], ranges 
> [a-c], and negations [^abc], where "^" and "]" must never appear 
> unless backslash-escaped, "-" may be backslash-escaped or put at the 
> beginning or end, the escape sequences "\n", "\r", "\t", "\d", and 
> "\D" may be used, and literal "\" is escaped as "\\".
> 
> Importantly, you must never have an unescaped backslash, because some 
> dialects may treat it as the beginning of an escape sequence that 
> means something special.
> 
> This is a very limited subset, but I think it's what we have to use. 
> I'm now going to try to make a portable RE that matches these portable 
> RE character classes.

Upon further investigation, it seems some engines let \d match Unicode digits outside 0-9, so that's out too. There's also an open question of which characters can be referred to. I decided to use the definition of Char in XML 1.0:

http://www.w3.org/TR/REC-xml/#charsets

It's hard to reference these, because many of the range boundary characters are unassigned, so effectively unprintable. I don't think we want to embed the literal character U+D7FF in the spec.
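
For anyone who wants the Char ranges in executable form, here is a minimal Python sketch of the production at that URL. The function name is_xml_char is made up for illustration and is not part of the proposal.

```python
def is_xml_char(cp: int) -> bool:
    """Return True if code point cp falls in the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)           # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF         # up to the last code point before surrogates
        or 0xE000 <= cp <= 0xFFFD       # skips surrogates, U+FFFE and U+FFFF
        or 0x10000 <= cp <= 0x10FFFF    # supplementary planes
    )
```

A check such as is_xml_char(ord("A")) can be used to test a single character without typing any of the boundary characters literally.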

Here is the proposed regular expression escaped with XML numeric character entities, as if it were put into an XML document:

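^(\.|
\[^?-?(([&#x9;&#xA;&#xD;&#x20;-,&#x2E;-[_-&#xD7FF;&#xE000;-&#xFFFD;&#x10000;-&#x10FFFF;]|\\n|\\r|\\t|\\]|\\^|\\-|\\\\)(-([&#x9;&#xA;&#xD;&#x20;-,&#x2E;-[_-&#xD7FF;&#xE000;-&#xFFFD;&#x10000;-&#x10FFFF;]|\\n|\\r|\\t|\\]|\\^|\\-|\\\\))?)+-?\])?$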

(Email will almost certainly add line breaks. Ignore them.)

There are two ways I know of to escape characters (not bytes) in different engines: \x{2234} and \u2234. The \u syntax can only reference Plane 0 (Basic Multilingual Plane) characters, and works in everything except XSD and Perl/PCRE. The \x{} syntax is only Perl/PCRE, but can specify any character.
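
To see how one particular engine treats the two forms, a small probe like the following can help. It is only a sketch for Python's re module, which is not one of the engines compared above; the helper name compiles is hypothetical.

```python
import re

def compiles(pattern: str) -> bool:
    """Report whether Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# \u followed by exactly four hex digits is accepted and matches U+2234.
u_form_ok = compiles(r"\u2234") and re.fullmatch(r"\u2234", "\u2234") is not None

# The braced \x{...} form is rejected: re expects two hex digits after \x.
x_form_ok = compiles(r"\x{2234}")
```

Running the same kind of probe against each target engine is one way to confirm which escape form it needs.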

Here it is with \x{}, for Perl/PCRE only:

^(\.|
\[^?-?(([\x{09}\x{0A}\x{0D}\x{20}-\x{2C}\x{2E}-\x{5B}\x{5F}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]|\\n|\\r|\\t|\\]|\\^|\\-|\\\\)(-([\x{09}\x{0A}\x{0D}\x{20}-\x{2C}\x{2E}-\x{5B}\x{5F}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]|\\n|\\r|\\t|\\]|\\^|\\-|\\\\))?)+-?\])?$

And here is a regular expression that matches a subset of our subset, limited to Plane 0, with the \u escape:

^(\.|\[^?-?(([\u0009\u000A\u000D\u0020-\u002C\u002E-\u005B\u005F-\uD7FF
\uE000-\uFFFD]|\\n|\\r|\\t|\\]|\\^|\\-|\\\\)(-([\u0009\u000A\u000D
\u0020-\u002C\u002E-\u005B\u005F-\uD7FF\uE000-\uFFFD]|\\n|\\r|\\t|\\]|
\\^|\\-|\\\\))?)+-?\])?$
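
As a rough illustration of how the \u form could be used to validate a character class, here is a Python sketch. It is not the expression above verbatim. It assumes the negation caret and the escaped "]" and "^" are written as \^, \] and \^ respectively, because an unescaped ^ is normally read as an anchor rather than a literal caret. The names LITERAL, ESCAPE, ATOM, PORTABLE_CLASS and is_portable_class are hypothetical.

```python
import re

# Literal characters allowed inside a class: the \u ranges above, i.e.
# XML 1.0 Char up to U+FFFD minus "-", "\", "]" and "^".
LITERAL = (r"[\u0009\u000A\u000D\u0020-\u002C\u002E-\u005B"
           r"\u005F-\uD7FF\uE000-\uFFFD]")

# Permitted escape sequences. Assumption: "]" and "^" are written \] and \^
# so the engine reads them as literal characters.
ESCAPE = r"\\n|\\r|\\t|\\\]|\\\^|\\-|\\\\"

ATOM = "(?:" + LITERAL + "|" + ESCAPE + ")"

# Assumption: the optional negation caret is written \^ instead of a bare ^.
PORTABLE_CLASS = re.compile(
    r"^(\.|\[\^?-?(" + ATOM + "(-" + ATOM + r")?)+-?\])?$"
)

def is_portable_class(expr: str) -> bool:
    """Hypothetical helper: True if expr fits the limited subset."""
    return PORTABLE_CLASS.fullmatch(expr) is not None
```

Under those assumptions, is_portable_class("[^*+]") accepts the expression, while a class containing an unlisted escape such as \d is rejected.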

And remember, the backslashes and escaped backslashes are significant to the regular expression engine. If you're putting that into a string in a language like Java or C#, you need to escape the escapes:

re = new Regex("^(\\.|\\[^?-?(([\\u0009\\u000A\\u000D\\u0020-\\u002C\
\u002E-\\u005B\\u005F-\\uD7FF\\uE000-\\uFFFD]|\\\\n|\\\\r|\\\\t|\\\\]|\\
\\^|\\\\-|\\\\\\\\)(-([\\u0009\\u000A\\u000D\\u0020-\\u002C\\u002E-\
\u005B\\u005F-\\uD7FF\\uE000-\\uFFFD]|\\\\n|\\\\r|\\\\t|\\\\]|\\\\^|\\\
\-|\\\\\\\\))?)+-?\\])?$");
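
The same doubling applies in any language whose string literals use backslash escapes. A short Python sketch of the effect, using one fragment of the expression; the variable names are illustrative only, and Python is used here purely as an example language:

```python
import re

# The regex fragment \\n matches a literal backslash followed by "n".
plain = "\\\\n"   # ordinary literal: every backslash doubled once more
raw = r"\\n"      # raw literal: written exactly as the engine sees it

same_pattern = plain == raw

# The text "\\n" is two characters, a backslash and "n".
matches_escape = re.fullmatch(raw, "\\n") is not None

# A real newline character is not what the fragment describes.
matches_newline = re.fullmatch(raw, "\n") is not None
```

Where a language offers raw or verbatim string literals, using them avoids the second layer of escaping entirely.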

--
Shaun

Received on Monday, 4 February 2013 20:47:10 UTC