24780 – Request to clarify proper use of explicit UCS code point numbers in regular expression.

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Status:	NEW

Alias:	None

Product:	XML Schema
Classification:	Unclassified
Component:	Datatypes: XSD Part 2
Version:	1.0/1.1 both
Hardware:	PC Windows NT

Importance:	P2 minor
Target Milestone:	---
Assignee:	David Ezell
QA Contact:	XML Schema comments list

Reported:	2014-02-23 14:48 UTC by Tom
Modified:	2014-02-24 10:07 UTC
CC List:	2 users

Description Tom 2014-02-23 14:48:25 UTC

The XSD 1.0 and 1.1 drafts do not currently offer clarification on how one can specify characters within regular expression character class expressions by using UCS code point numbers explicitly. The drafts do currently use EBNF notation to indicate certain characters by using #xNN where NN is the hex value of the character, but that doesn't help clarify the former. The following offers information on my experience, which, if I'm not blindly missing something, may be worthwhile feedback.

For example, consider the following from the 1.1 XSD draft (also, I believe, in the 1.0 draft):

NormalChar ::= [^.\?*+{}()|#x5B#x5D]

Obviously, this notation is not meant to be typed by the XSD author; it is a documentation convention used to convey the grammar to the author. This is evident from the fact that #x5B within an actual character class expression would be read as the characters #, x, 5, and B (4 characters), not as the single character #x5B.
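The point about #x5B can be checked with a small sketch. Python's `re` module is used here only as a stand-in: it is not an XSD regex engine, and the assumption is merely that a character class built from plain literal characters behaves the same way in both. The helper name `matches` is made up for illustration.

```python
# Illustrative only: Python's `re` is not an XSD regex engine, but a character
# class made of plain literal characters is assumed to behave the same way.
import re

ebnf_token = "#x5B"            # the EBNF spelling used in the spec's grammar
assert len(ebnf_token) == 4    # four characters: '#', 'x', '5', 'B'
assert chr(0x5B) == "["        # the single character the EBNF token denotes

# What an author gets by copying the EBNF token into a real character class.
literal_class = re.compile(r"[#x5B]")

def matches(ch: str) -> bool:
    """Return True if the single character `ch` belongs to the copied class."""
    return literal_class.fullmatch(ch) is not None

print([c for c in "#x5B[" if matches(c)])   # ['#', 'x', '5', 'B']
```

The left square bracket that the EBNF token stands for is not matched, while each of the four literal characters is.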

Since the XSD specs don't clarify this, the implication seems to be that it falls back into the XML spec's clarification of character references... that a character reference would be the way to specify any additional characters within a regular expression.

It might be nice to offer some form of clarification about this in the XSD drafts/specs.

More info...

To try to clarify this for myself, I searched within the XSD and XML drafts and found the following:

... Note: The notation #xA used here (and elsewhere in this specification) represents the Universal Character Set (UCS) code point hexadecimal A (line feed), which is denoted by U+000A. This notation is to be distinguished from &#xA;, which is the XML character reference to that same UCS code point. ...

After much consideration, I believe the two are being distinguished solely because the #xNN format is purely EBNF for documentation purposes. This "Note" also seems to imply that character references are the only other form of escape one can use within a regular expression beyond those defined by the XSD regex docs.

In the same drafts, I see the following:

... This specification makes use of the EBNF notation used in the [XML] specification. Note that some constructs of the EBNF notation used here resemble the regular-expression syntax defined in this specification (Regular Expressions (§G)), but that they are not identical: there are differences. For a fuller description of the EBNF notation, see Section 6. Notation of the [XML] specification. ...

This clarification is confusing for someone trying to understand XSD regular expression specifics because the specification referred to above defines those regular expressions using EBNF. The regular expression syntax is called out using EBNF, yet EBNF itself is not exactly what the author of an XSD regular expression should use. Perhaps this is not generally bad, but it's confusing for someone like me who is trying to clarify what UCS code point escaping options are available for use within an XSD regex... once again, per above, the heavy implication is that, if it isn't in the XSD draft, the XML spec is the fallback... whatever it allows is what one can use (and expect tools to support).

This issue arose because I was receiving an error from software which did not like an XML configuration file I had created. It was well-formed, so I wanted to verify its format against an XSD. An xmllint tool detected a failure. The failure was due to the use of \xNN hex characters within a regex pattern in the XSD. However, other XSD/XML validation software accepted the \xNN. That led me to want to know who was "right". (I realize these are drafts and implementations can differ, but I was wondering if I was missing some addendum/errata or something right in the drafts themselves.)
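Since validators apparently disagree on \xNN, one option is to flag such escapes before a schema is shipped. The following is a minimal sketch, not part of any validator: `find_hex_escapes` is a hypothetical helper, and it assumes (as the reply in this thread states) that \x and \u are the escapes to avoid in an XSD pattern.

```python
import re
from typing import List

# Any backslash followed by one character. Scanning left to right means an
# escaped backslash (two backslashes) is consumed as a unit, so the 'x' that
# follows it is not misread as the start of a hex escape.
_ESCAPE = re.compile(r"\\(.)", re.DOTALL)

def find_hex_escapes(pattern: str) -> List[int]:
    """Return the offsets of backslash-x or backslash-u escapes in `pattern`.

    Hypothetical helper for illustration; `pattern` is the text of an XSD
    pattern facet as the regex engine would receive it.
    """
    return [m.start() for m in _ESCAPE.finditer(pattern) if m.group(1) in "xu"]

print(find_hex_escapes(r"[\x41-\x5A]+"))   # [1, 6]
print(find_hex_escapes(r"\\x41"))          # []
print(find_hex_escapes(r"\d{3}\u20AC"))    # [5]
```

An empty result does not prove that a pattern is valid XSD regex syntax; the check only looks for these two escapes.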

Some clarification here may be nice to have... if I missed something pls excuse in advance. Thanks.

Comment 1 Michael Kay 2014-02-23 23:58:01 UTC

If you want to refer to characters by codepoint in an XSD regex, use XML numeric character references, e.g. value="&#x20AC;".

\x and \u are not permitted in an XSD regex. There are some XSD processors, however, that are non-conformant in this regard, and that accept these even though they are not allowed by the spec.
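A short sketch of why numeric character references work: the XML parser expands them while reading the schema document, so the regex engine only ever receives the resulting characters. The schema fragment and the type name `euroOrDollar` below are made up for illustration, and Python's standard `xml.etree.ElementTree` stands in for whatever parser a given XSD processor uses.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical schema fragment; the type name "euroOrDollar" is invented.
SCHEMA = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:simpleType name="euroOrDollar">
    <xs:restriction base="xs:string">
      <xs:pattern value="[&#x20AC;&#x24;]"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
"""

root = ET.fromstring(SCHEMA)
pattern = root.find(f".//{{{XS}}}pattern").get("value")

# The XML parser has already expanded both character references, so the
# pattern text is an ordinary character class of two literal characters.
assert pattern == "[\u20ac$]"
print(len(pattern))   # 4
```

By contrast, a value such as `[\x24]` would reach the regex engine unchanged, backslash included, which is where a conformant and a lenient processor can part ways.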

It is indeed confusing that the EBNF notation used to specify what is allowed in regular expressions is similar to, but not the same as, the regular expression syntax that it is specifying. Hence the note. As far as I can see, you have read the note and it has helped to resolve the confusion.

Comment 2 Tom 2014-02-24 09:56:43 UTC

(In reply to Michael Kay from comment #1)
> If you want to refer to characters by codepoint in an XSD regex, use XML
> numeric character references, e.g. value="&#x20AC;".
> 
> \x and \u are not permitted in an XSD regex. There are some XSD processors,
> however, that are non-conformant in this regard, and that accept these even
> though they are not allowed by the spec.
> 
> It is indeed confusing that the EBNF notation used to specify what is
> allowed in regular expressions is similar to, but not the same as, the
> regular expression syntax that it is specifying. Hence the note. As far as I
> can see, you have read the note and it has helped to resolve the confusion.

You are correct. I was seeing the backslash '\' character and expecting an escaped backslash '\\' if a backslash had been intended. Instead, I saw an escaped '#' character, as in '\#' ... my bad, since this is EBNF outlining how to construct a regex, not showing examples of one.

Flipping back and forth between considering EBNF and actual examples is a skill in itself for these particular sections. :)

I opened this 24779 up while my head was wrapped around the issue of bug 24780 which I subsequently opened (https://www.w3.org/Bugs/Public/show_bug.cgi?id=24780). In hindsight on both of these issues (non-issues), my only suggestion is that a "Note:" be placed at the introduction to the regular expression section which clearly outlines this difference. I would even recommend showing a brief example which itself contrasts the difference between a regex definition and a related actual example.

The existing "Note:" of this kind occurs way out of context, at the start of the document, and it's not completely clear what the concern is at that point. Additionally, somebody wanting to jump to the regex section may benefit from a "Note:" they'd otherwise miss. When I had read the currently placed "Note:", I had lost track of its importance by the time I got to the regex section, because it was easy to fall into the trap of thinking that some of the EBNF productions were actual regex examples being used to show sets of characters. Completely my fault in a strict sense, but this seems like a case where a small tweak to the spec may offer a decent payoff. I'm not certain what everyone has to deal with in deciding these sorts of things, so no worries if this doesn't meet any bars... all just a suggestion at this point.

Comment 3 Tom 2014-02-24 10:01:44 UTC

READ ME FIRST; Disregard the prior comment. It was meant for Bug 24779.

Comment 4 Tom 2014-02-24 10:07:27 UTC

(In reply to Michael Kay from comment #1)
> If you want to refer to characters by codepoint in an XSD regex, use XML
> numeric character references, e.g. value="&#x20AC;".
> 
> \x and \u are not permitted in an XSD regex. There are some XSD processors,
> however, that are non-conformant in this regard, and that accept these even
> though they are not allowed by the spec.
> 
> It is indeed confusing that the EBNF notation used to specify what is
> allowed in regular expressions is similar to, but not the same as, the
> regular expression syntax that it is specifying. Hence the note. As far as I
> can see, you have read the note and it has helped to resolve the confusion.

Thanks... I would recommend the regex section have a "Note:" that clarifies using something similar to your first sentence in the prior comment (see above). I would also add to the same note that \x and \u are not permitted. Because EBNF uses constructs similar to actual regular expressions, and because escaping can be a tricky topic, a small note seems worthy for the regex section. Only a suggestion. Thanks again for the clarifications.