Re: [ACTION-385] Common regular expression syntax from Phil Ritchie on 2013-01-28 (public-multilingualweb-lt@w3.org from January 2013)

From: Phil Ritchie <philr@vistatec.ie>
Date: Mon, 28 Jan 2013 07:56:39 +0000
To: "Yves Savourel" <ysavourel@enlaso.com>
Cc: "public-multilingualweb-lt@w3.org" <public-multilingualweb-lt@w3.org>
Message-ID: <07F66133-20CC-4AC6-9ED5-FF294AD2EBF5@vistatec.ie>
Indeed, a thorough assessment! This looks satisfactory to me too.

Phil



On 27 Jan 2013, at 22:31, "Yves Savourel" <ysavourel@enlaso.com> wrote:

> Hi Shaun,
>
> Thanks for the thorough analysis.
> That should be enough for the goals of the data category.
>
> cheers,
> -yves
>
>
> -----Original Message-----
> From: Shaun McCance [mailto:shaunm@gnome.org]
> Sent: Sunday, January 27, 2013 10:30 AM
> To: public-multilingualweb-lt@w3.org
> Subject: [ACTION-385] Common regular expression syntax
>
> I've investigated features in six different regular expression dialects
to try to find a safe common subset for the allowed characters data
category. I tested Java, .Net, XSD, JavaScript, Perl, and Python. I still
want to test POSIX EREs, and PHP may be good to test as well, given the
focus on CMSs in 2.0. But I think the subset from the six I tested is going
to be safe in general.
>
> I only tested RE features found inside character classes, i.e.
> stuff between '[' and ']'. Everything was tested on Fedora 14.
> Java with OpenJDK 1.6.0. .Net with Mono 2.6.7. JavaScript with Firefox
3.6.12. XSD with libxml2 2.7.7. Perl 5.12.2. Python 2.7.
>
> Notes: For Perl I had to use "use utf8;" to get anything beyond ASCII to
work right. For Python, I had to use Unicode objects with the u'' notation
to get non-ASCII right. Python 3 probably does this better, but I didn't
test it.
>
> On with it:
>
>
> locale-dependent
>
> Unicode Classes (\p{L}, etc) are supported by everything but JavaScript.
Perl and Java recognize some shorthand classes, but I believe there's a
base set that's compatible, except for JavaScript.
>
> POSIX character classes ([:digit:], etc) are only supported by Perl and
POSIX tools like grep. They're out.
>
> Escaping "^": In all dialects, you can escape "^" with "\".
> In everything but XSD, you can just put "^" somewhere other than the
beginning of the class. This sounds insane to me.
>
> Escaping "]": In all dialects, you can escape "]" with "\".
> In everything but XSD and JavaScript, you can use it as the first
character.
>
> \Q...\E expressions are only supported in Java and Perl.
>
> In all dialects, "-" can be escaped with a "\" or by putting it at the
beginning or end of the character class.
>
> Character class subtractions (e.g. [a-z-[aeiou]]) are only supported in
> XSD. I read that .Net supports them, but that didn't pan out in my tests.
> It could be a newer addition to the standard, or it could be that Mono is
> buggy (rare).
>
> Octal escapes (e.g. \135 for "]") are supported in Python, Perl, and
> JavaScript. Hex escapes (e.g. \x5D for "]") are supported in all but XSD.
> Unicode escapes (e.g. \u2234 for "∴") are supported in all but XSD and
> Perl. This one makes me sad. I really wish we could use Unicode escapes
> safely.
>
> Everything supports "\n", "\r", and "\t".
>
> XSD and JavaScript don't support "\a" for U+0007 BELL. XSD, JavaScript,
> and Python don't support "\e" for U+001B ESCAPE. Not big losses.
> XSD is the only dialect that doesn't support "\f" for U+000C FORM FEED.
>
> Java, .Net, JavaScript, and Python support "\v" to match
> U+000B LINE TABULATION only. In Perl, "\v" matches anything
> it calls vertical whitespace, U+000A through U+000D. True to fashion, XSD
doesn't support "\v" at all.
>
> Control code escapes: \cA through \cZ means U+0001 through
> U+001A in Java, .Net, JavaScript, and Perl. In XSD, the \c
> escape means something entirely different. Python doesn't support \c.
>
> Every dialect seems to support \d and \D and agree on what they actually
mean.
>
> In XSD and .Net, \w matches lots of Unicode word characters.
> In the others, it matches [A-Za-z0-9_], although at least for Python the
> documentation says it's locale-dependent; see my note on that below.
> They all support \W, with the same compatibility problem.
>
> Every dialect supports \s for whitespace, but they all have different
> definitions of whitespace. In XSD, \s matches space, tab, carriage return,
> and line feed. In Java, Perl, and Python, \s matches those plus vertical
> tab and form feed.
> In JavaScript, \s matches all sorts of Unicode whitespace characters,
like non-breaking spaces and zero-width spaces.
> They all support \S, with the same compatibility problem.
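
The drift in \w and \s described above can be seen even inside a single dialect. The following is a minimal Python 3 sketch. Python 3 was not tested in the quoted message, so this only illustrates the general hazard; the helper name `matches` and the two sample characters are choices made for this example.

```python
import re

NBSP = '\u00a0'     # NO-BREAK SPACE
E_ACUTE = '\u00e9'  # LATIN SMALL LETTER E WITH ACUTE


def matches(pattern, char, flags=0):
    """Return True if the whole string `char` matches `pattern`."""
    return re.fullmatch(pattern, char, flags) is not None


# Default behaviour for str patterns in Python 3: Unicode-aware shorthands.
unicode_mode = (matches(r'[\s]', NBSP), matches(r'[\w]', E_ACUTE))

# With re.ASCII the same shorthands shrink to their ASCII-only meaning.
ascii_mode = (matches(r'[\s]', NBSP, re.ASCII),
              matches(r'[\w]', E_ACUTE, re.ASCII))

print(unicode_mode, ascii_mode)
```

The same pattern text accepts or rejects the same character depending on a flag, which is the kind of ambiguity that argues for leaving \w and \s out of a common subset.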
>
> In some dialects, some behavior changes based on locale.
> This is dangerous, and I believe we should avoid all such behavior. To
the extent that it's useful, it should not be based on the locale the
program is running in, but on the locale you're translating to or
(possibly) from. And for the latter, we'd have to define an interaction
with langPointer.
>
> So what I think this leaves us with is character classes [abc], ranges
> [a-c], and negations [^abc], where "^" (other than the leading negation
> marker) and "]" must never appear unless backslash-escaped, "-" may be
> backslash-escaped or put at the beginning or end, the escape sequences
> "\n", "\r", "\t", "\d", and "\D" may be used, and literal "\" is escaped
> as "\\".
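
One way to stay inside this subset is to generate the class from a plain list of characters rather than write it by hand. The following is a minimal Python sketch under that assumption. The names `ESCAPES` and `portable_class` are hypothetical, and rejecting a literal "[" is a conservative choice of this sketch, since the message does not cover that character.

```python
# Characters that need a backslash form in the subset described above.
ESCAPES = {
    '\\': r'\\',
    ']': r'\]',
    '^': r'\^',
    '-': r'\-',
    '\n': r'\n',
    '\r': r'\r',
    '\t': r'\t',
}


def portable_class(chars, negate=False):
    """Build a character class from literal characters.

    Every "-" is backslash-escaped, so no ranges are produced and the
    position of "-" inside the class does not matter.
    """
    if not chars or '[' in chars:
        # Assumption of this sketch: empty classes and a literal "["
        # are refused instead of guessed at.
        raise ValueError('empty input and a literal "[" are not handled')
    body = ''.join(ESCAPES.get(ch, ch) for ch in chars)
    return '[' + ('^' if negate else '') + body + ']'


print(portable_class('a]^-\\'))
print(portable_class('\u2234\t', negate=True))
```

Non-ASCII characters such as "∴" are written literally, which sidesteps the missing \u escape in XSD and Perl at the cost of requiring a Unicode-safe file encoding.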
>
> Importantly, you must never have an unescaped backslash, because some
dialects may treat it as the beginning of an escape sequence that means
something special.
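
The rules above can also be checked mechanically. What follows is a rough Python sketch of such a check, not the portable RE mentioned later in the message: the pattern uses ordinary Python `re` syntax, and the names `ATOM`, `ITEM`, `PORTABLE_CLASS`, and `is_portable_class` are made up for this example. It assumes that \d and \D cannot be range endpoints and that a bare "[" is rejected; neither point is stated in the message.

```python
import re

# One literal character, or one escape from the safe list.
# A bare "[" is excluded as a conservative assumption of this sketch.
ATOM = r'(?:[^\\\[\]^-]|\\[nrt\\\]^-])'

# A range (atom-atom), a single atom, or the \d / \D shorthands.
ITEM = r'(?:' + ATOM + r'-' + ATOM + r'|' + ATOM + r'|\\[dD])'

# Optional negation, then either a lone "-" or one or more items with an
# optional unescaped "-" at the very beginning and/or end.
PORTABLE_CLASS = re.compile(r'\[\^?(?:-|-?' + ITEM + r'+-?)\]')


def is_portable_class(text):
    """Return True if `text` is a character class within the subset."""
    return PORTABLE_CLASS.fullmatch(text) is not None


for candidate in ['[abc]', '[^a-c]', r'[\d\n-]', '[a^b]', r'[\w]', r'[\x5D]']:
    print(candidate, is_portable_class(candidate))
```

The check is purely syntactic. It does not verify that a range such as [c-a] is in ascending order.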
>
> This is a very limited subset, but I think it's what we have to use. I'm
now going to try to make a portable RE that matches these portable RE
character classes.
>
> Comments?
>
> --
> Shaun
>


Received on Monday, 28 January 2013 07:57:10 UTC