Re: [Issue-67] [Action-385] Work on regex for validating regex subset proposal

Hi Pablo,

Sorry for the effort, but to move this forward we need to make sure that 
at least the regex examples in the test suite work.

I checked
https://github.com/finnle/ITS-2.0-Testsuite/tree/master/its2.0/inputdata/allowedcharacters/xml
by replacing in my local copy of the test suite
https://github.com/finnle/ITS-2.0-Testsuite/blob/master/its2.0/schema/its20-types.rng
this part
   <define name="its-allowedCharacters.type">
     <data type="string"></data>
   </define>
with this, i.e. with your regex for XML validation inside the "pattern" 
param:
     <data type="string">
       <param name="pattern">^((\.((\*|\+)|(\*\?|\+\?))?)|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)?([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)*\]))*$</param>
     </data>

That gave me validation errors like this one:

[jing] 
/its2.0/inputdata/allowedcharacters/xml/allowedcharacters7xmlrules.xml:3:100: 
error: Bad value "[^*+]" for attribute "allowedCharacters" on element 
"allowedCharactersRule" from namespace "http://www.w3.org/2005/11/its".
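
For a quick check outside the schema, the same expression can be run over a 
few values from this thread. The following is only a sketch in Python, using 
the \u variant of the regex from the original proposal; the names CLASS, 
ATOM, PATTERN, SAMPLES and accepts are illustrative assumptions, and the 
pattern is assembled from two repeated pieces purely for readability.

```python
import re

# Characters allowed unescaped inside a character class (Plane 0 only).
CLASS = r"[\u0009\u000A\u000D\u0020-\u002C\u002E-\u005A\u005F-\uD7FF\uE000-\uFFFD]"
# One class member: a plain character, an escaped [ ] ^ -, or a backslash.
ATOM = "(" + CLASS + r"|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))"
# Same structure as the proposed regex, with ATOM substituted twice.
PATTERN = (
    r"^((\.((\*|\+)|(\*\?|\+\?))?)|(\[\^?("
    + ATOM + "+(-)?" + ATOM + "+"
    + r")*\]))*$"
)

# Values mentioned in this thread (assumed sample set, not the full test suite).
SAMPLES = ["[a-zA-Z_\\-]", "[^*+]", ".*", "[f-"]


def accepts(value: str) -> bool:
    """Return True if Python's regex engine matches the whole value."""
    return re.match(PATTERN, value) is not None


for value in SAMPLES:
    print(value, "->", "accepted" if accepts(value) else "rejected")
```

With Python's semantics the first three values are accepted and "[f-" is 
rejected. Python treats ^ and $ as anchors, though, so a match here does not 
guarantee that Jing accepts the same value.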

Could you adjust the "param" element with your regex in your local copy of 
the test suite so that all the allowed characters test files validate? 
They are in
https://github.com/finnle/ITS-2.0-Testsuite/tree/master/its2.0/inputdata/allowedcharacters/

FYI, the content of "param" is an XML Schema regular expression, so I think 
your XML version for validation should work in the end.
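
One difference that may matter here: XML Schema regular expressions are 
implicitly anchored to the whole value, and ^ and $ outside a character 
class are ordinary characters there rather than anchors. Whether that is the 
cause of the error above is an assumption that has not been verified against 
Jing. The sketch below only imitates that behaviour in Python with 
hand-translated patterns; xsd_like_match and the two small example patterns 
are hypothetical and are not taken from the test suite.

```python
import re


def xsd_like_match(translated_pattern: str, value: str) -> bool:
    """Whole-value match, imitating the implicit anchoring of XML Schema patterns."""
    return re.fullmatch(translated_pattern, value) is not None


# XML Schema pattern "^[a-z]+$": caret and dollar are literal characters there,
# so the hand-made Python equivalent has to escape them.
with_anchors = r"\^[a-z]+\$"
# XML Schema pattern "[a-z]+": no anchors written, the whole value must match anyway.
without_anchors = r"[a-z]+"

print(xsd_like_match(with_anchors, "abc"))       # False: literal ^ and $ are missing
print(xsd_like_match(with_anchors, "^abc$"))     # True
print(xsd_like_match(without_anchors, "abc"))    # True
print(xsd_like_match(without_anchors, "abc1"))   # False: the whole value must match
```

If this is what happens in the schema, dropping the leading ^ and trailing $ 
from the XML version would be the first thing to try.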

Again, sorry for the effort, but it would be great to have this done 
before the next publication, that is by Thursday next week. Would that 
work for you?

Best,

Felix

On 05.04.13 15:39, Pablo Nieto Caride wrote:
> Hi Felix,
>
> Yes, I tried the example from the Allowed Characters test suite beforehand to make sure that the regex worked, and [a-zA-Z_\-] works for me on my system. Anyway, I'll try what you suggest and get back to you as soon as I have the results.
>
> Cheers,
> Pablo.
> __________________________________
>
> Hi Pablo, all,
>
> On 05.04.13 11:24, Pablo Nieto Caride wrote:
>> Hi all,
>>
>> I have completed the regex. In the end I decided to restrict it to Plane 0 (Basic Multilingual Plane, 0000-FFFF) because I think that is sufficient and otherwise the regex would be very complex. Besides, Shaun didn't actually limit it to Plane 1 (Supplementary Multilingual Plane, 10000-1FFFF) but to Planes 15-16 (up to 10FFFF), which is too much. As I understand it, the regex covers the basics (now including escapes of [, ], ^ and -), does not match incorrect regexes such as "[f-", supports the greedy and lazy wildcards (this is not really necessary), and does not support nested character classes (do we need them? They are rarely used in general). Please test it:
>> 1) Here is the proposed regular expression escaped with XML numeric character entities, as if it were put into an XML document:
>> ^((\.((\*|\+)|(\*\?|\+\?))?)|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)?([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)*\]))*$
> I tried that with [a-zA-Z_\-]
> but got a validation error. Could you check a few examples from https://github.com/finnle/ITS-2.0-Testsuite/blob/master/its2.0/inputdata/allowedcharacters/html/
> to make sure that the regex works? E.g. by creating a schema like the attached one and checking the examples against the regex?
>
>
> Best,
>
> Felix
>> 2) Here it is with \x{}, for Perl/PCRE only:
>> ^((\.((\*|\+)|(\*\?|\+\?))?)|(\[\^?(([\x{09}\x{0A}\x{0D}\x{20}-\x{2C}\x{2E}-\x{5A}\x{5F}-\x{D7FF}\x{E000}-\x{FFFD}]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)?([\x{09}\x{0A}\x{0D}\x{20}-\x{2C}\x{2E}-\x{5A}\x{5F}-\x{D7FF}\x{E000}-\x{FFFD}]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)*\]))*$
>>
>> 3) And here is a regular expression that matches a subset of our subset, limited to Plane 0, with the \u escape:
>> ^((\.((\*|\+)|(\*\?|\+\?))?)|(\[\^?(([\u0009\u000A\u000D\u0020-\u002C\u002E-\u005A\u005F-\uD7FF\uE000-\uFFFD]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)?([\u0009\u000A\u000D\u0020-\u002C\u002E-\u005A\u005F-\uD7FF\uE000-\uFFFD]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)*\]))*$
>>
>> 4) And remember, the backslashes and escaped backslashes are significant to the regular expression engine. If you're putting that into a string in a language like Java or C#, you need to escape the escapes:
>> re = new Regex("^((\\.((\\*|\\+)|(\\*\\?|\\+\\?))?)|(\\[\\^?(([\\u0009\\u000A\\u000D\\u0020-\\u002C\\u002E-\\u005A\\u005F-\\uD7FF\\uE000-\\uFFFD]|(\\\\\\[)|(\\\\\\])|(\\\\\\^)|(\\\\\\-)|(\\\\))+(-)?([\\u0009\\u000A\\u000D\\u0020-\\u002C\\u002E-\\u005A\\u005F-\\uD7FF\\uE000-\\uFFFD]|(\\\\\\[)|(\\\\\\])|(\\\\\\^)|(\\\\\\-)|(\\\\))+)*\\]))*$");
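
The "escape the escapes" rule in 4) can be checked on a single fragment of 
the regex. This is only a sketch in Python rather than Java or C#; the names 
raw_form, escaped_form and fragment are illustrative assumptions. It compares 
the fragment that matches an escaped "[" written as a raw string with the 
same fragment written with doubled backslashes.

```python
import re

# Fragment of the proposed regex that matches a backslash followed by "[".
raw_form = r"(\\\[)"        # raw string: backslashes reach the regex engine unchanged
escaped_form = "(\\\\\\[)"  # ordinary string: every backslash doubled, as in the Java/C# line

# Both spellings produce the same six-character pattern text.
print(raw_form == escaped_form)

# The compiled fragment matches the two-character input: backslash, then "[".
fragment = re.compile(escaped_form)
matches_escaped_bracket = fragment.fullmatch("\\[") is not None
print(matches_escaped_bracket)
```

Both lines print True. The three backslashes the regex engine needs become 
six in an ordinary string literal.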
>>
>> I'll now proceed to draft text explaining the importance of Unicode normalization and best practices; that's Action-430.
>>
>> Cheers,
>> Pablo.
>> __________________________________
>>
>> Hi Jirka,
>>
>> It should not match invalid expressions since it only supports character classes, ranges and negations, but it still needs a bit of polishing regarding escapes. I don't think we need a BNF grammar, but it's not mine to decide; I'm just doing what I'm supposed to.
>>
>> Cheers,
>> Pablo.
>> __________________________________
>>
>> On 4.4.2013 17:12, Pablo Nieto Caride wrote:
>>> Implementers and anyone else who is interested, please give feedback if
>>> necessary so I can move forward and evolve the regex.
>> Hi,
>>
>> since such complex regular expressions are mostly write-only (it's very hard to understand what they are trying to match), I'm not sure what the point is of having this complex regular expression for checking our regular expression syntax subset. I haven't tried to get a deep understanding of this expression, but I bet it will match even invalid expressions. If we want to have a rigorous definition of our RE syntax, we should provide its definition as a grammar written in BNF.
>>
>>      Jirka
>>
>> --
>> ------------------------------------------------------------------
>>     Jirka Kosek      e-mail: jirka@kosek.cz      http://xmlguru.cz
>> ------------------------------------------------------------------
>>          Professional XML consulting and training services
>>     DocBook customization, custom XSLT/XSL-FO document processing
>> ------------------------------------------------------------------
>>    OASIS DocBook TC member, W3C Invited Expert, ISO JTC1/SC34 rep.
>> ------------------------------------------------------------------
>>       Bringing you XML Prague conference    http://xmlprague.cz
>> ------------------------------------------------------------------
>>
>>
>>
>>
>

Received on Saturday, 6 April 2013 11:25:52 UTC