RE: forbiddenCharacters data category - related to [ACTION-189] from Yves Savourel on 2012-08-28 (public-multilingualweb-lt@w3.org from August 2012)

From: Yves Savourel <ysavourel@enlaso.com>
Date: Tue, 28 Aug 2012 06:03:54 -0600
To: <public-multilingualweb-lt@w3.org>
Message-ID: <assp.058700cda4.assp.058742e1e2.005b01cd8515$3298e220$97caa660$@com>

Hi Felix,

> 1) Users of forbidden characters will read in the spec 
> "you can use regular expressions. Btw., you should restrict 
> yourself to the following subset: [...]
> 2) Users will be happy saying "great, I can use my own 
> engine and don't have to care about regex details, 
> that is [...] above
> 3) At the end there will be a lot of regex that exceed 
> the subset - and no producer or consumer will have a 
> means to check that.

I'm afraid I'm not following the reasoning: the scenario you describe can occur regardless of whether the regex syntax we allow is a subset or the full set of XML Schema's regex.

The only way to enforce the syntax (whatever it is) is to have the ITS schema and checker tools validate it. And that can be done using XML Schema's pattern facet.


> Why is 3)? In your initial proposal you had e.g. "<>",
> and I found out that this is not conform to XML Schema
> just by checking the regex in an XML Schema editor.

This is not a regex issue: users will forget to escape <, " and & with any regex syntax we decide on, and that is caught by any XML parser (or I'm still missing something).
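
To illustrate, here is a minimal Python sketch using the standard library's XML parser. The `rule` element and its `forbiddenCharacters` attribute are hypothetical names used only for this example, not the actual ITS markup. It shows that the unescaped form is rejected at the well-formedness level, before any regex engine gets involved:

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical markup: element and attribute names are illustrative only.
unescaped = '<rule forbiddenCharacters="[<>&]"/>'
escaped = '<rule forbiddenCharacters="[&lt;&gt;&amp;]"/>'

print(is_well_formed(unescaped))  # False: raw < and & in an attribute value
print(is_well_formed(escaped))    # True

# After parsing, the application sees the unescaped expression.
print(ET.fromstring(escaped).get("forbiddenCharacters"))  # [<>&]
```

The check is the same whichever regex syntax is chosen, since the parser fails before the attribute value is ever interpreted as an expression.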


> If we go for the "you can use any engine ..." approach,
> my prediction is that there will be even less interop of 
> people using forbidden characters than with SRX. People will 
> use forbiddencharacters as "forbidden strings" and do 
> what they want.
> With XML Schema regex, we easily can build test cases 
> containing e.g. "<>". The XML Schema regex engine will 
> tell people that they are doing something wrong, as 
> Jirka pointed out. With our hand written "subset regex", 
> we would need to build our own reg ex conformance test 
> suite. I don't want to go that path.

I think we can have the ITS schema use the pattern facet to validate the subset.

I also think that if we go with the full set we must do the same: users can't use test suites to validate their expressions.

All those tests can (and should) be done regardless of the set we choose. 
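
As a rough sketch of what "validating the subset with a pattern" could look like, the Python below checks an expression against a meta-pattern, that is, a regex describing which regexes are accepted. The subset used here (a single, optionally negated character class of literals, escapes and ranges) is invented purely for illustration and is not the subset discussed in this thread. Python's `re` stands in for the XML Schema pattern facet, and `fullmatch` is used because XML Schema patterns match the whole value.

```python
import re

# One item inside the class: a plain character, or a backslash escape of
# one of the characters that are special inside a class.
ATOM = r"(?:[^\\\[\]\-^]|\\[\\\[\]\-^])"

# Hypothetical subset: "[" optional "^", one or more atoms or atom-atom
# ranges, then "]". Nothing is allowed before or after the class.
SUBSET = re.compile(r"\[\^?(?:" + ATOM + r"(?:-" + ATOM + r")?)+\]")


def in_subset(expression: str) -> bool:
    """Return True if the expression fits the illustrative subset.

    This is a syntax check only: it does not verify that ranges are
    in ascending order.
    """
    return SUBSET.fullmatch(expression) is not None


print(in_subset("[a-z]"))    # True
print(in_subset("[^*+]"))    # True
print(in_subset(r"[\[\]]"))  # True
print(in_subset("[a-z]+"))   # False: quantifier is outside the subset
print(in_subset("a|b"))      # False: not a character class
print(in_subset("[]"))       # False: empty class
```

The same kind of check applies to the full set as well: only the meta-pattern (or the engine used to compile the expression) changes.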


> I understand your argument about adding dependencies. For JavaScript,
> I'm aware of saxon CE, so you can use XML Schema regex here. 
> For ruby, python: I don't know. But the XML Schema we are talking 
> about is around since 2001 ... so I think it is reasonable to say:
> if you want to use ITS 2.0, you need to take the effort to resolve 
> the dependencies. The effort is not zero, but I think it's the only 
> way to assure long-term interop between users of forbidden characters.

I disagree: why make it complicated for implementers when there is a simpler solution that solves the same problem and can be validated just like the more costly solution?

The only thing the full-set option brings us is a little more flexibility to write more complex patterns. In my experience that is less important than better interoperability.

But I've made my case. Time to move on: Jirka's suggestion is fair enough.

Cheers,
-yves

Received on Tuesday, 28 August 2012 12:04:28 UTC