Edit comment LC-2108 for Efficient Extensible Interchange Working Group


Comment LC-2108
Commenter: <pub@upokecenter.com>

Comment:
I want to make a suggestion on the section 'Deriving Character Sets from XML Schema Regular Expressions':

I want to propose that datatypes with a regular expression containing a "charClassSub" should have no restricted character set. The reason is that all the remaining parts of the regular expression derivation expect only a union of characters, which makes it very efficient to determine whether the expression yields a restricted character set or not. Having a 'charClassSub' as part of the derivation process may complicate this, as the program now has to subtract portions of the character set as well as add to it, which may be a problem if the character set contains a large number of characters, like this:

[ -&#xFF00;-[`-&#xFF00;]]

The regular expression above yields a restricted character set of 64 characters; however, an implementation (admittedly a naive one) may have to store thousands of characters before it excludes them in the 'charClassSub' portion of the regular expression. Another problem is nested 'charClassSub' sets. For example, the following regular expression is allowed:

[A-Z-[B-Z-[C-Z-[D-Z-[E-Z-[...]]]]]]

Both problems make 'charClassSub' problematic in restricted character set derivation. I thank you for your time.

Resolution: We excluded regular expressions that contain either a wildcard (".") or negative character groups, since those expressions tend to result in a large set of characters. Even in the rare cases where they do not, there are often better ways to achieve the same effect; for example, [^0-&#x10FFFF;] can be expressed more simply as [-/]. On the other hand, character class subtraction is retained because, unlike wildcards or negative character groups, the operation never yields more characters than its first operand contains. We also found that character class subtraction does not add much computational burden if it is properly implemented. Please also note that we expect schema authors to help by being aware of the general cost of each operation and by specifying patterns in ways that are friendlier to EXI processing.
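To illustrate the point about implementation cost, here is a minimal Python sketch of one way subtraction could be carried out on code point ranges rather than on individual characters. It is not the Working Group's algorithm or any EXI processor's actual code. The names `subtract_ranges`, `evaluate` and `size`, and the nested-tuple representation of a character class, are assumptions made for this example. The nested pattern from the comment is cut off at E-Z here, because the original elides its innermost part.

```python
def subtract_ranges(base, removed):
    """Subtract inclusive code point ranges from `base` without enumerating characters.

    Both arguments are lists of (low, high) tuples. Ranges within `base` are
    assumed not to overlap one another (an assumption of this sketch).
    """
    removed = sorted(removed)
    result = []
    for lo, hi in sorted(base):
        cur = lo  # first code point of this base range not yet accounted for
        for rlo, rhi in removed:
            if rhi < cur or rlo > hi:
                continue  # this removed range does not touch what is left
            if rlo > cur:
                result.append((cur, rlo - 1))  # keep the gap before the removed range
            cur = max(cur, rhi + 1)
            if cur > hi:
                break
        if cur <= hi:
            result.append((cur, hi))
    return result


def evaluate(char_class):
    """Evaluate a (ranges, subtrahend) pair, where subtrahend is another pair or None."""
    ranges, subtrahend = char_class
    if subtrahend is None:
        return sorted(ranges)
    # Nested subtraction: resolve the inner class first, then subtract it.
    return subtract_ranges(ranges, evaluate(subtrahend))


def size(ranges):
    """Number of code points covered by a list of inclusive ranges."""
    return sum(hi - lo + 1 for lo, hi in ranges)


# First pattern from the comment: space..U+FF00 minus `..U+FF00.
first = ([(0x20, 0xFF00)], ([(0x60, 0xFF00)], None))

# Nested pattern from the comment, truncated at E-Z for this example.
A, B, C, D, E, Z = (ord(c) for c in "ABCDEZ")
nested = ([(A, Z)], ([(B, Z)], ([(C, Z)], ([(D, Z)], ([(E, Z)], None)))))

first_result = evaluate(first)
nested_result = evaluate(nested)
```

In this sketch the first pattern is resolved by comparing two range endpoints, so the thousands of characters in the first operand are never stored, and each nesting level of the second pattern costs one pass over a short list of ranges. Whether the same holds for a given processor depends on how that processor represents character sets.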