RE: [Issue-67] [Action-385] Work on regex for validating regex subset proposal from Pablo Nieto Caride on 2013-04-08 (public-multilingualweb-lt@w3.org from April 2013)

From: Pablo Nieto Caride <pablo.nieto@linguaserve.com>
Date: Mon, 8 Apr 2013 12:59:11 +0200
To: "'Felix Sasaki'" <fsasaki@w3.org>, "'Jirka Kosek'" <jirka@kosek.cz>
Cc: <public-multilingualweb-lt@w3.org>
Message-ID: <06fb01ce3448$19fd6690$4df833b0$@linguaserve.com>
Sorry! I forgot to add support for \n, \r, \t, \s, etc. Here is the corrected regex:
((\.((\*|\+)|(\*\?|\+\?))?)|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)+\])*|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)+\])*)*|(\\w|\\n|\\r|\\t|\\s)*
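
A minimal Python sketch for trying the corrected regex on values mentioned in this thread. It is a transcription rather than the schema itself: the XML character references are rewritten as Python escapes, the pattern is assembled from named pieces (CHAR, ATOM, PATTERN and the helper is_accepted are names invented here), and re.fullmatch stands in for XSD's whole-value matching, so a real XSD or RELAX NG validator may behave differently.

```python
import re

# Characters allowed unescaped inside a class: the BMP minus - [ \ ] ^
CHAR = r"[\t\n\r\x20-\x2C\x2E-\x5A\x5F-\uD7FF\uE000-\uFFFD]"
# One class member: a plain character, an escaped special, or a lone backslash.
ATOM = "(" + CHAR + r"|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))"
WILDCARD = r"(\.((\*|\+)|(\*\?|\+\?))?)"
PLAIN_CLASS = r"(\[\^?(" + ATOM + r"+)+\])*"
RANGE_CLASS = r"(\[\^?(" + ATOM + "+(-)" + ATOM + r"+)+\])*"
SHORTHAND = r"(\\w|\\n|\\r|\\t|\\s)*"
PATTERN = re.compile(
    "(" + WILDCARD + "|" + PLAIN_CLASS + "|" + RANGE_CLASS + ")*|" + SHORTHAND
)


def is_accepted(value: str) -> bool:
    """Return True if the whole value matches the transcribed pattern."""
    return PATTERN.fullmatch(value) is not None


# Values taken from the thread; the last two are meant to be rejected.
for value in [r"[a-zA-Z_\-]", "[^*+]", ".*?", r"\s", "[f-", "[a-d[^c]]"]:
    print(repr(value), is_accepted(value))
```

With this transcription I would expect the first four values to be accepted and "[f-" and "[a-d[^c]]" to be rejected.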

And here is the corrected complex one:
((\.((\*|\+)|(\*\?|\+\?))?)|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\)|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\)|)+)+\])|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)+\]))+)+\])*|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\)|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\)|)+)+\])|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)+\]))+(-)([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\)|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\)|)+)+\])|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)+\]))+)+\])*)*|(\\w|\\n|\\r|\\t|\\s)*

Cheers,
Pablo.
__________________________________

-----Original Message-----
From: Pablo Nieto Caride [mailto:pablo.nieto@linguaserve.com]
Sent: Monday, 8 April 2013 12:45
To: 'Felix Sasaki'; 'Jirka Kosek'
CC: public-multilingualweb-lt@w3.org
Subject: RE: [Issue-67] [Action-385] Work on regex for validating regex subset proposal

Hi Felix, all,

The ABNF does not seem to be a bad approach. In any case, I have reworked the regex (the markers ^ and $ at the beginning and the end do not seem to work with XSD) and now it's OK. I did what Felix suggested: I changed my its20-types.rng, ran ant validate-xml, and it worked. Here are the changes and the new regex.
  <define name="its-allowedCharacters.type">
    <data type="string">
      <param name="pattern">((\.((\*|\+)|(\*\?|\+\?))?)|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)+\])*|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)+\])*)*|(\\w)*</param>
    </data>
  </define>
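
On the ^ and $ markers: as far as I understand XML Schema regular expressions, a pattern always has to match the whole value, and ^ and $ outside a character class are ordinary characters rather than anchors. The Python sketch below imitates that behaviour with re.fullmatch; SIMPLE is a small hypothetical stand-in for the real pattern and xsd_style_match is a name made up for the illustration.

```python
import re


def xsd_style_match(pattern: str, value: str) -> bool:
    """Approximate XSD semantics: the pattern must cover the whole value."""
    return re.fullmatch(pattern, value) is not None


# Hypothetical stand-in for the real pattern: one simple character class.
SIMPLE = r"\[\^?[a-z*+]+\]"

# What an XSD engine would make of "^...$": literal ^ and $ characters.
LITERAL_ANCHORS = r"\^" + SIMPLE + r"\$"

print(xsd_style_match(SIMPLE, "[^*+]"))             # whole value matches
print(xsd_style_match(SIMPLE, "x[^*+]"))            # leading junk is rejected
print(xsd_style_match(LITERAL_ANCHORS, "[^*+]"))    # now requires real ^ and $
print(xsd_style_match(LITERAL_ANCHORS, "^[^*+]$"))
```

Under that reading, dropping the markers loses nothing, because the implicit whole-value match already does the anchoring.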

The new regex covers everything except nested character classes such as [a-d[^c]], which are not widely used and which most engines do not support. If we go on with the previous regex, I imagine we would have to drop the following examples from the specification:
"[&#x20;-&#x1ffff;-[&lt;>:&quot;\\/|\?*]]" : allows only the characters valid for Windows file names.
and
"[a-&#x00ff;-[\s]]" : allows all characters between U+0061 and U+00FF except the characters SPACE (U+0020), TABULATION (U+0009), CARRIAGE RETURN (U+000D) and LINE FEED (U+000A).
Otherwise, here is a regex that covers everything, but it's huge:
((\.((\*|\+)|(\*\?|\+\?))?)|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\)|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\)|)+)+\])|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)+\]))+)+\])*|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\)|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\)|)+)+\])|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)+\]))+(-)([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\)|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\)|)+)+\])|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)+\]))+)+\])*)*|(\\w)*
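
For engines without character class subtraction, the Windows file name example quoted above can be approximated with a negative lookahead. The Python sketch below is only an illustration of what that example is meant to allow; WINDOWS_NAME and only_allowed_characters are names invented here, and the sample file names are made up.

```python
import re

# Emulates [&#x20;-&#x1ffff;-[<>:"\\/|\?*]]: the base range U+0020..U+1FFFF,
# with the subtracted characters excluded through a negative lookahead.
WINDOWS_NAME_CHAR = r'(?![<>:"\\/|?*])[\x20-\U0001ffff]'
WINDOWS_NAME = re.compile("(?:" + WINDOWS_NAME_CHAR + ")*")


def only_allowed_characters(text: str) -> bool:
    """Return True if every character of text is in the emulated class."""
    return WINDOWS_NAME.fullmatch(text) is not None


for name in ["report 2013.txt", "a/b", "what?", "C:\\temp"]:
    print(repr(name), only_allowed_characters(name))
```

Python's re module has no subtraction syntax of its own, which is why the lookahead is used; engines that do support subtraction would not need it.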

And one last thing: ranges such as [a-f-[z]] do not seem to be very valid, since [a-z] is the same and better.

By the way, Jirka, when I tried to validate the files, the jing.jar from the Test Suite repository didn't work for me; I had to copy the one from your html5-its-tools repository. Can anyone confirm this?

Cheers,
Pablo.
__________________________________

Am 06.04.13 22:45, schrieb Felix Sasaki:
> Hi Pablo, all,
>
> I had a look at the test suite again and found these kinds of regexes:
>
> [a-zA-Z_\-]
> [^*+]
> [ &#xFF01;–&#xFF5E;]
> [&#x0020;-&#x00FE;]
> [^*+]
>
> Maybe it would help to do the ABNF approach that Pablo mentioned

Oops, sorry, I meant "that Jirka mentioned".

- Felix

> and restrict us with that. See an ABNF below.
>
> ========
> allowedCharacters = start 1*range end ["+"]
>
> start = "["
>
> end = "]"
>
> range = char / char "-" char
>
> char = [neg] BMP+escapes
>
> neg = "^"
>
> ========
>
> This means: the regex must always start with "[" and end with "]". In 
> the brackets there must be at least one range. The range can be just 
> one or more characters or a range in the form of character "-" character.
> The character is "char" which optionally can be forbidden via "^". 
> BMP+escapes then is the Unicode BMP, including the escapes of
> characters like "[", "]", "-" etc.
>
> This is more restricted than what Shaun proposed at 
> http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2013Jan/0180.html
>
> but looking at the test suite and the use case of allowed characters 
> it seems to cover everything.
>
> Using the ABNF would not mean dropping the regex. I started working on 
> an XML Schema / RELAX NG regex implementing the above ABNF, and it looks 
> pretty straightforward.
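
A rough Python sketch of how the ABNF above might translate into a regex. It is not the XML Schema / RELAX NG regex mentioned in the quoted message; the reading of "BMP+escapes" (any BMP character except \ [ ] ^ -, or one of those five preceded by a backslash) is an assumption made for the illustration, and the names BMP_OR_ESCAPE, ALLOWED_CHARACTERS and follows_abnf are invented here.

```python
import re

# BMP+escapes (assumed reading): any BMP character except \ [ ] ^ -,
# or one of those five characters written with a leading backslash.
BMP_OR_ESCAPE = r"(?:[^\\\[\]\^\-\U00010000-\U0010ffff]|\\[\\\[\]\^\-])"
CHAR = r"\^?" + BMP_OR_ESCAPE            # char  = [neg] BMP+escapes
RANGE = CHAR + "(?:-" + CHAR + ")?"      # range = char / char "-" char
# allowedCharacters = start 1*range end ["+"]
ALLOWED_CHARACTERS = re.compile(r"\[(?:" + RANGE + r")+\]\+?")


def follows_abnf(value: str) -> bool:
    """Return True if the whole value fits the ABNF as transcribed above."""
    return ALLOWED_CHARACTERS.fullmatch(value) is not None


for value in [r"[a-zA-Z_\-]", "[^*+]", "[a-z]+", "[]", "[f-", "[a-d[^c]]"]:
    print(repr(value), follows_abnf(value))
```

Each ABNF rule maps to one named fragment, which keeps the resulting regex much shorter than the hand-written one at the cost of dropping the wildcard forms.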
>
> Thoughts?
>
> Best,
>
> Felix
>
>
> Am 06.04.13 19:18, schrieb Pablo Nieto Caride:
>> Hi Felix, all,
>>
>>
>> On Apr 6, 2013, at 1:25 PM, Felix Sasaki <fsasaki@w3.org> wrote:
>>
>>> Hi Pablo,
>>>
>>> sorry for the effort, but to move this forward, we need to make 
>>> sure that at least the test suite regex examples work.
>>>
>>> I checked
>>> https://github.com/finnle/ITS-2.0-Testsuite/tree/master/its2.0/inputdata/allowedcharacters/xml
>>>
>>> by replacing in my local copy of the test suite 
>>> https://github.com/finnle/ITS-2.0-Testsuite/blob/master/its2.0/schema/its20-types.rng
>>>
>>> this part
>>>   <define name="its-allowedCharacters.type">
>>>     <data type="string"></data>
>>>   </define>
>>> with this, that is, with your regex for XML validation inside the
>>> "pattern" parameter:
>>>     <data type="string">
>>>       <param
>>> name="pattern">^((\.((\*|\+)|(\*\?|\+\?))?)|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)?([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)*\]))*$</param>
>>>     </data>
>>>
>>> That gave me validation errors like this one:
>>>
>>> [jing]
>>> /its2.0/inputdata/allowedcharacters/xml/allowedcharacters7xmlrules.xml:3:100: 
>>> error: Bad value “[^*+]” for attribute “allowedCharacters” on 
>>> element “allowedCharactersRule” from namespace 
>>> “http://www.w3.org/2005/11/its”.
>>>
>>> Could you change in your local copy of the test suite the "param" 
>>> element with your regex so that the validation for all test suite 
>>> files for allowed characters 
>>> https://github.com/finnle/ITS-2.0-Testsuite/tree/master/its2.0/inputdata/allowedcharacters/
>>>
>>> works?
>>>
>>> FYI, the content is an XML Schema regular expression, so your XML 
>>> version for validation should work finally, I think.
>>>
>>> Again, sorry for the effort, but it would be great to have this done 
>>> before the next publication, that is by Thursday next week. Would 
>>> that work for you?
>>>
>> I am doing some testing with the files you sent me to see how XSD 
>> works with regex, and I'm seeing weird things, like problems with ^ 
>> and $ to set the beginning and end of the regex. I'm still working on it.
>>
>> I will do as you say and change my local copy of the schema to 
>> validate the Test Suite files.
>>
>> Yes, I think it'll work for me; there is time and I think I'm close to 
>> the solution. Sorry, but it got more complicated than I initially 
>> expected.
>>
>> Cheers,
>> Pablo.
>>
>>> Best,
>>>
>>> Felix
>>>
>>> Am 05.04.13 15:39, schrieb Pablo Nieto Caride:
>>>> Hi Felix,
>>>>
>>>> Yes, I tried the Allowed Characters test suite examples beforehand to 
>>>> make sure that the regex worked, and [a-zA-Z_\-] works for me on my 
>>>> system. Anyway, I'll try what you suggest and get back to you as 
>>>> soon as I have the results.
>>>>
>>>> Cheers,
>>>> Pablo.
>>>> __________________________________
>>>>
>>>> Hi Pablo, all,
>>>>
>>>> Am 05.04.13 11:24, schrieb Pablo Nieto Caride:
>>>>> Hi all,
>>>>>
>>>>> I have completed the regex. In the end I decided to restrict it to 
>>>>> Plane 0 (Basic Multilingual Plane, 0000-FFFF) because I think that 
>>>>> is sufficient and the regex would otherwise be very complex. Besides, 
>>>>> Shaun didn't actually limit it to Plane 1 (Supplementary 
>>>>> Multilingual Plane, 10000–1FFFF) but to Planes 15-16 (10FFFF), 
>>>>> which is too much. As I understand it, the regex covers the basics 
>>>>> (now including escapes of [, ], ^ and -), does not match incorrect 
>>>>> regexes such as "[f-", supports the greedy and lazy wildcard (this 
>>>>> is not really necessary), and does not support nested character 
>>>>> classes (do we need them? They are rarely used in general). Please test it:
>>>>> 1) Here is the proposed regular expression escaped with XML 
>>>>> numeric character entities, as if it were put into an XML document:
>>>>> ^((\.((\*|\+)|(\*\?|\+\?))?)|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)?([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)*\]))*$
>>>> I tried that with an [a-zA-Z_\-]
>>>> but got a validation error. Could you check a few examples from 
>>>> https://github.com/finnle/ITS-2.0-Testsuite/blob/master/its2.0/inpu
>>>> tdata/allowedcharacters/html/ to make sure that the regex works? 
>>>> E.g. by creating a schema like the attached one and check with the 
>>>> regex?
>>>>
>>>>
>>>> Best,
>>>>
>>>> Felix
>>>>> 2) Here it is with \x{}, for Perl/PCRE only:
>>>>> ^((\.((\*|\+)|(\*\?|\+\?))?)|(\[\^?(([\x{09}\x{0A}\x{0D}\x{20}-\x{2C}\x{2E}-\x{5A}\x{5F}-\x{D7FF}\x{E000}-\x{FFFD}]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)?([\x{09}\x{0A}\x{0D}\x{20}-\x{2C}\x{2E}-\x{5A}\x{5F}-\x{D7FF}\x{E000}-\x{FFFD}]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)*\]))*$
>>>>>
>>>>>
>>>>> 3) And here is a regular expression that matches a subset of our 
>>>>> subset, limited to Plane 0, with the \u escape:
>>>>> ^((\.((\*|\+)|(\*\?|\+\?))?)|(\[\^?(([\u0009\u000A\u000D\u0020-\u002C\u002E-\u005A\u005F-\uD7FF\uE000-\uFFFD]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)?([\u0009\u000A\u000D\u0020-\u002C\u002E-\u005A\u005F-\uD7FF\uE000-\uFFFD]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)*\]))*$
>>>>>
>>>>>
>>>>> 4) And remember, the backslashes and escaped backslashes are 
>>>>> significant to the regular expression engine. If you're putting 
>>>>> that into a string in a language like Java or C#, you need to 
>>>>> escape the escapes:
>>>>> re = new Regex("^((\\.((\\*|\\+)|(\\*\\?|\\+\\?))?)|(\\[\\^?(([\\u0009\\u000A\\u000D\\u0020-\\u002C\\u002E-\\u005A\\u005F-\\uD7FF\\uE000-\\uFFFD]|(\\\\\\[)|(\\\\\\])|(\\\\\\^)|(\\\\\\-)|(\\\\))+(-)?([\\u0009\\u000A\\u000D\\u0020-\\u002C\\u002E-\\u005A\\u005F-\\uD7FF\\uE000-\\uFFFD]|(\\\\\\[)|(\\\\\\])|(\\\\\\^)|(\\\\\\-)|(\\\\))+)*\\]))*$");
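
The "escape the escapes" point can be illustrated in Python, where a raw string literal avoids the doubling that an ordinary literal needs. This is just a sketch using one fragment of the regex, (\\\[), which is meant to match a backslash followed by "["; the variable names are made up for the example.

```python
import re

# The fragment as the regex engine should see it: backslash, then "[".
raw_form = r"(\\\[)"
# The same fragment in an ordinary string literal: every backslash doubled.
escaped_form = "(\\\\\\[)"

# Both literals produce the identical pattern text.
print(raw_form == escaped_form)

# The pattern matches the two-character text: backslash followed by "[".
print(re.fullmatch(raw_form, "\\[") is not None)
```

Languages without raw or verbatim string literals need the doubled form shown in the quoted example.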
>>>>>
>>>>> I'll now proceed to draft text explaining the importance of Unicode 
>>>>> normalization and best practices; that's Action-430.
>>>>>
>>>>> Cheers,
>>>>> Pablo.
>>>>> __________________________________
>>>>>
>>>>> Hi Jirka,
>>>>>
>>>>> It should not match invalid expressions since it only supports 
>>>>> character classes, ranges and negations, but it still needs a bit of 
>>>>> polishing regarding escapes. I don't think we need a BNF grammar, 
>>>>> but it's not for me to decide; I'm just doing what I'm supposed to.
>>>>>
>>>>> Cheers,
>>>>> Pablo.
>>>>> __________________________________
>>>>>
>>>>> On 4.4.2013 17:12, Pablo Nieto Caride wrote:
>>>>>> Implementers and anyone else who is interested, please give 
>>>>>> feedback if necessary so I can move forward and evolve the regex.
>>>>> Hi,
>>>>>
>>>>> since such complex regular expressions are mostly write-only (it's 
>>>>> very hard to understand what they are trying to match), I'm not 
>>>>> sure what the point is of having this complex regular expression 
>>>>> for checking our regular expression syntax subset. I haven't tried 
>>>>> to get a deep understanding of this expression, but I bet it will 
>>>>> match even invalid expressions. If we want to have a rigorous 
>>>>> definition of our RE syntax, we should provide its definition as 
>>>>> a grammar written in BNF.
>>>>>
>>>>>                     Jirka
>>>>>
>>>>> --
>>>>> ------------------------------------------------------------------
>>>>>     Jirka Kosek      e-mail: jirka@kosek.cz http://xmlguru.cz
>>>>> ------------------------------------------------------------------
>>>>>          Professional XML consulting and training services
>>>>>     DocBook customization, custom XSLT/XSL-FO document processing
>>>>> ------------------------------------------------------------------
>>>>>    OASIS DocBook TC member, W3C Invited Expert, ISO JTC1/SC34 rep.
>>>>> ------------------------------------------------------------------
>>>>>       Bringing you XML Prague conference http://xmlprague.cz
>>>>> ------------------------------------------------------------------
>>>>>
>>>>>
>>>>>
>>>>>
>>>
>>
>
>
Received on Monday, 8 April 2013 10:59:45 UTC