This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Bug 3659 - Bugs in date/time regexes
Summary: Bugs in date/time regexes
Status: RESOLVED FIXED
Alias: None
Product: XML Schema
Classification: Unclassified
Component: Datatypes: XSD Part 2
Version: 1.1 only
Hardware: Macintosh All
Importance: P1 normal
Target Milestone: ---
Assignee: C. M. Sperberg-McQueen
QA Contact: XML Schema comments list
URL:
Whiteboard: cluster: regex
Keywords: resolved
Depends on:
Blocks:
 
Reported: 2006-09-06 02:15 UTC by C. M. Sperberg-McQueen
Modified: 2008-06-07 01:12 UTC

Description C. M. Sperberg-McQueen 2006-09-06 02:15:29 UTC
In email to the public comments list, Laurens Holst (lholst@students.cs.uu.nl)
writes as follows.  I am copying this to the Bugzilla system for better
tracking.

The regular expressions for dates and times in the XML Schema 1.1 Datatypes working draft are not correct; they do not match the grammar. Below you can find fixed regular expressions.

Basically, I made eight modifications to the originally provided regular expressions, to make the date/time regular expressions match the grammar:

1. Fix parentheses; --(0[1-9])|(1[0-2])- means that it will match e.g. --01 or 12-. Instead, it should be --(0[1-9]|1[0-2])-. Also, the time match had a lot of needless parentheses.
2. Use (0[1-9]|[12]\d|3[01]) for days everywhere instead of ([0-2][0-9])|(3[01]). The latter would allow 00.
3. Except for time, all were missing the Z in the time zone.
4. Decimal did not accept values with a positive sign.
5. Replaced [0-9] with \d (just like digit is used in the grammar, and it's shorter).
6. Removed the backslash before the - where not needed.
7. Added a backslash before every + where needed (the browser complains if + is used unescaped).
8. float has a nit where I changed (-|\+) into (\+|-) to match both the production and the other regular expressions.
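
For anyone who wants to see items 1 and 2 concretely, here is a minimal sketch in Python. It assumes Python's re module is a fair stand-in for the XSD regex dialect on these small fragments, and the names BROKEN_MONTH, FIXED_MONTH, BROKEN_DAY, FIXED_DAY and accepts are made up for illustration. fullmatch is used because XSD patterns are implicitly anchored at both ends.

```python
import re

# Month fragment as originally written: the top-level alternation splits the
# pattern into "--(0[1-9])" OR "(1[0-2])-".
BROKEN_MONTH = re.compile(r"--(0[1-9])|(1[0-2])-")
# Month fragment with the alternation confined to a single group.
FIXED_MONTH = re.compile(r"--(0[1-9]|1[0-2])-")

# Day fragment as originally written versus the corrected form.
BROKEN_DAY = re.compile(r"([0-2][0-9])|(3[01])")
FIXED_DAY = re.compile(r"(0[1-9]|[12]\d|3[01])")


def accepts(pattern: re.Pattern, text: str) -> bool:
    """Return True if the whole string matches the pattern."""
    return pattern.fullmatch(text) is not None


if __name__ == "__main__":
    for text in ("--01", "12-", "--12-"):
        print(text, accepts(BROKEN_MONTH, text), accepts(FIXED_MONTH, text))
    print("00", accepts(BROKEN_DAY, "00"), accepts(FIXED_DAY, "00"))
```

The broken month fragment accepts --01 and 12- but rejects --12-, while the fixed one does the opposite; likewise only the broken day fragment accepts 00.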

Here are the new regular expressions:

decimal: (\+|-)?((\d+(.\d*)?)|(.\d+))
float: (\+|-)?((\d+(.\d*)?)|(.\d+))((e|E)(\+|-)?\d+)?|-?INF|NaN
dateTime: -?([1-9]\d\d\d+|0\d\d\d)-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])T(([01]\d|2[0-3]):[0-5]\d:[0-5]\d(\.\d+)?|24:00:00(\.0+)?)(Z|(\+|-)(0\d|1[0-4]):[0-5]\d)?
time: (([01]\d|2[0-3]):[0-5]\d:[0-5]\d(\.\d+)?|24:00:00(\.0+)?)(Z|(\+|-)(0\d|1[0-4]):[0-5]\d)?
date: -?([1-9]\d\d\d+|0\d\d\d)-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])(Z|(\+|-)(0\d|1[0-4]):[0-5]\d)?
gYearMonth: -?([1-9]\d\d\d+|0\d\d\d)-(0[1-9]|1[0-2])(Z|(\+|-)(0\d|1[0-4]):[0-5]\d)?
gYear: -?([1-9]\d\d\d+|0\d\d\d)(Z|(\+|-)(0\d|1[0-4]):[0-5]\d)?
gMonthDay: --(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])(Z|(\+|-)(0\d|1[0-4]):[0-5]\d)?
gDay: ---(0[1-9]|[12]\d|3[01])(Z|(\+|-)(0\d|1[0-4]):[0-5]\d)?
gMonth: --(0[1-9]|1[0-2])(Z|(\+|-)(0\d|1[0-4]):[0-5]\d)?
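
A quick way to exercise the patterns above is sketched below in Python, again on the assumption that Python's re module behaves like the XSD dialect for ASCII input. Only date and gMonthDay are shown; the constant names TZ, YEAR, DATE and G_MONTH_DAY are illustrative, and the pattern text itself is copied from the list above.

```python
import re

# Shared fragments, copied from the patterns listed above.
TZ = r"(Z|(\+|-)(0\d|1[0-4]):[0-5]\d)?"
YEAR = r"-?([1-9]\d\d\d+|0\d\d\d)"

DATE = re.compile(YEAR + r"-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])" + TZ)
G_MONTH_DAY = re.compile(r"--(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])" + TZ)


def is_date(text: str) -> bool:
    """Lexical check only: whole-string match against the date pattern."""
    return DATE.fullmatch(text) is not None


def is_g_month_day(text: str) -> bool:
    """Lexical check only: whole-string match against the gMonthDay pattern."""
    return G_MONTH_DAY.fullmatch(text) is not None


if __name__ == "__main__":
    for text in ("2006-09-06", "2006-09-06Z", "-0044-03-15", "12006-09-06",
                 "2006-13-06", "2006-09-00", "02006-09-06", "2006-02-31"):
        print(text, is_date(text))
    for text in ("--12-25", "--12-25Z", "--00-25"):
        print(text, is_g_month_day(text))
```

Note that 2006-02-31 is accepted: these regexes only describe the lexical shape, so day-in-month constraints have to be enforced elsewhere.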

Also, I think I found an error in the grammar; in section 3.3.5.2 it says:

The ·lexical space· of float is the set of all decimal numerals with or without a decimal point, numerals in scientific (exponential) notation, and the ·literals· 'INF', '-INF', and 'NaN'

However, the grammar doesn't contain INF, -INF, and NaN:

floatRep ::= noDecimalPtNumeral | decimalPtNumeral | scientificNotationNumeral | minimalNumericalSpecialRep

That should be:

floatRep ::= noDecimalPtNumeral | decimalPtNumeral | scientificNotationNumeral | minimalNumericalSpecialRep | 'INF' | '-INF' | 'NaN'

The same applies to double.

Finally, I created a regular expression for base64Binary:

((([A-Za-z0-9+/] ?){4})*(([A-Za-z0-9+/] ?){3}[A-Za-z0-9+/]|([A-Za-z0-9+/] ?){2}[AEIMQUYcgkosw048] ?=|[A-Za-z0-9+/] ?[AQgw] ?= ?=))?

(note: spaces are significant)
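
As a rough check of the base64Binary pattern, the following Python sketch runs it against a few hand-picked strings. It assumes Python's re module treats the literal spaces the same way the XSD dialect does (no verbose flag is set); BASE64_BINARY and is_base64_binary are hypothetical names.

```python
import re

# Pattern copied verbatim from above; the spaces are literal and significant.
BASE64_BINARY = re.compile(
    r"((([A-Za-z0-9+/] ?){4})*"
    r"(([A-Za-z0-9+/] ?){3}[A-Za-z0-9+/]"
    r"|([A-Za-z0-9+/] ?){2}[AEIMQUYcgkosw048] ?="
    r"|[A-Za-z0-9+/] ?[AQgw] ?= ?="
    r"))?"
)


def is_base64_binary(text: str) -> bool:
    """Return True if the whole string matches the base64Binary pattern."""
    return BASE64_BINARY.fullmatch(text) is not None


if __name__ == "__main__":
    for text in ("", "TWFu", "TWE=", "TQ==", "TWFu TWFu", "T W F u",
                 "TWF", "TWF=", "TR=="):
        print(repr(text), is_base64_binary(text))
```

The last three inputs are rejected: TWF is missing its padding, and TWF= and TR== end in characters whose unused trailing bits would be non-zero.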


~Grauw

-- 
Ushiko-san! Kimi wa doushite, Ushiko-san nan da!!
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Laurens Holst, student, university of Utrecht, the Netherlands.
Comment 1 Dave Peterson 2006-09-06 03:22:06 UTC
(In reply to comment #0)
I haven't checked the others yet, but WRT

> Also, I think I found an error in the grammar; in section 3.3.5.2 it says:
> 
> The ·lexical space· of float is the set of all decimal numerals with or without
> a decimal point, numerals in scientific (exponential) notation, and the
> ·literals· 'INF', '-INF', and 'NaN'
> 
> However, the grammar doesnt contain INF, -INF, and NaN:
> 
> floatRep ::= noDecimalPtNumeral | decimalPtNumeral | scientificNotationNumeral
> | minimalNumericalSpecialRep
> 
> That should be:
> 
> floatRep ::= noDecimalPtNumeral | decimalPtNumeral | scientificNotationNumeral
> | minimalNumericalSpecialRep | 'INF' | '-INF' | 'NaN'

since the production for minimalNumericalSpecialRep is

  minimalNumericalSpecialRep ::= 'INF' | '-INF' | 'NaN'

I think 'INF', '-INF', and 'NaN' were already covered.
Comment 2 Laurens Holst 2006-09-06 08:25:24 UTC
Ah, of course. I missed that.

~Grauw
Comment 3 Laurens Holst 2006-09-07 13:35:04 UTC
One additional improvement:

- (\+|-) can be replaced with [+-]. It's shorter (and probably faster).


~Grauw
Comment 4 Laurens Holst 2006-09-19 12:48:07 UTC
There is another problem in decimal, float and double; the period is not escaped (thus matching any character and not just the period).

The fixed patterns are:

   decimal: /^[+\-]?((\d+(\.\d*)?)|(\.\d+))$/,
   float: /^([+\-]?((\d+(\.\d*)?)|(\.\d+))([eE][+\-]?\d+)?|-?INF|NaN)$/,

Also note the modification of (e|E) into [eE] and (\+|-) into [+\-].
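
The effect of the unescaped period can be sketched in Python as follows. The names UNESCAPED_DECIMAL, ESCAPED_DECIMAL and matches are illustrative, and fullmatch stands in for the ^ and $ anchors used in the patterns above.

```python
import re

# decimal as first posted in this report: "." matches any character.
UNESCAPED_DECIMAL = re.compile(r"(\+|-)?((\d+(.\d*)?)|(.\d+))")
# decimal with the period escaped, as in the fixed pattern above.
ESCAPED_DECIMAL = re.compile(r"[+\-]?((\d+(\.\d*)?)|(\.\d+))")


def matches(pattern: re.Pattern, text: str) -> bool:
    """Return True if the whole string matches the pattern."""
    return pattern.fullmatch(text) is not None


if __name__ == "__main__":
    for text in ("1.5", "+.5", "-3.", "1x5", "x5", "."):
        print(text, matches(UNESCAPED_DECIMAL, text),
              matches(ESCAPED_DECIMAL, text))
```

Both patterns accept 1.5, +.5 and -3., but only the unescaped one also accepts 1x5 and x5.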


~Grauw
Comment 5 Dave Peterson 2007-12-17 02:26:24 UTC
Changing cluster from date/time to regex, because it needs to be dealt with along with the other regex bugs, and it has a few comments about regexes not related to date/time.  Makes for better tracking.
Comment 6 Pete Cordell 2007-12-17 20:24:30 UTC
I was looking at this bug text and one of the things it recommends is using 
\d instead of [0-9].  However, as I understand it, (sadly!) \d is equivalent 
to Unicode \p{Nd}, which matches digits in many languages such as 
ARABIC-INDIC DIGIT ZERO at character code 0660 and so on.  (I'm assuming 
here that there are no special rules in the schema spec that map \d to just 
[0-9].)

I don't think this is the intent and therefore it's appropriate to 
_un_recommend this change!!
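
The concern can be reproduced with Python's re module, which is used here only as an analogy: for str patterns Python also defines \d as any Unicode decimal digit (category Nd) unless re.ASCII is set. This says nothing about any particular XSD processor; ARABIC_INDIC and whole_match are illustrative names.

```python
import re

# ARABIC-INDIC DIGIT ZERO, ONE and TWO (U+0660 to U+0662).
ARABIC_INDIC = "\u0660\u0661\u0662"


def whole_match(pattern: str, text: str, flags: int = 0) -> bool:
    """Return True if the whole string matches the pattern."""
    return re.fullmatch(pattern, text, flags) is not None


if __name__ == "__main__":
    print(whole_match(r"\d+", ARABIC_INDIC))            # Unicode-aware \d
    print(whole_match(r"[0-9]+", ARABIC_INDIC))         # explicit ASCII range
    print(whole_match(r"\d+", ARABIC_INDIC, re.ASCII))  # \d restricted to ASCII
```

Only the first call succeeds, which is the behaviour that makes [0-9] the safer choice when ASCII digits are what the grammar intends.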
Comment 7 C. M. Sperberg-McQueen 2008-06-04 14:33:28 UTC
A wording proposal intended to fix the bugs identified here is at 

  http://www.w3.org/XML/Group/2004/06/xmlschema-2/datatypes.b3659.html
  (member-only link)

A schema document with pattern elements for the various regexes
in the spec, those in this bug report, and some alternatives constructed
by the editors is at 

  http://www.w3.org/XML/2008/xsdl-exx/regexes.xsd

An annotated version of the same schema document, with abstract syntax
trees in the annotation child of each pattern, is at 

  http://www.w3.org/XML/2008/03/xsdl-regex/regexes.annotated.xsd

And PNG images of the parse trees are in 

  http://www.w3.org/XML/2008/03/xsdl-regex/images/

using a naming convention which should be clear.

Note that the wording proposal differs in two ways (so far) from the
sketches in the material just mentioned (as it is shown today; we may
update it to re-synch things):  (1) signs are written [+\-] not
[-+] so as to be compatible with implementations which implemented the
language of the XSD 1.0 PER and have not changed since (if there are any
such), and (2) the rules for time zones have been corrected both in the
EBNF and in the regexes, to allow time zones up to and including 14:00
but not further (so the largest time offset is 14:00, not 14:59); this
aligns the formal definitions with the prose and with the WG's intentions.
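
To illustrate point (2), here is a hedged Python sketch comparing the time zone fragment from the original report with one possible way of capping the offset at 14:00. NEW_TZ is a reconstruction from the description above, not necessarily the exact text of the wording proposal; the sign notation is incidental, and all names are illustrative.

```python
import re

# Time zone fragment from the original report: allows offsets up to 14:59.
OLD_TZ = re.compile(r"(Z|(\+|-)(0\d|1[0-4]):[0-5]\d)")
# Assumed corrected form: hours 00-13 with any minutes, or exactly 14:00.
NEW_TZ = re.compile(r"(Z|(\+|-)((0\d|1[0-3]):[0-5]\d|14:00))")


def is_tz(pattern: re.Pattern, text: str) -> bool:
    """Return True if the whole string matches the time zone pattern."""
    return pattern.fullmatch(text) is not None


if __name__ == "__main__":
    for text in ("Z", "+13:59", "+14:00", "-14:00", "+14:01", "+14:59",
                 "+15:00"):
        print(text, is_tz(OLD_TZ, text), is_tz(NEW_TZ, text))
```

The two fragments differ only on offsets from 14:01 through 14:59, which the old form accepts and the capped form rejects.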
Comment 8 C. M. Sperberg-McQueen 2008-06-07 01:12:26 UTC
The proposal mentioned in comment #7 was adopted by the XML Schema
WG at today's teleconference, with one amendment.  The signs of
numbers and time zones are to be written (\+|-) rather than [+\-], 
for stylistic reasons.  (Some WG members found the latter form 
caused them to wonder about when - does need escaping and when it
doesn't, which is distracting; also, they felt the choice between
+ and - in these contexts feels like a weightier choice than usual
in character-class expressions.)

With this, the issue appears to the WG to have been resolved,
and I'm updating the record to show that.

This update to Bugzilla should cause email to be sent to Laurens Holst,
who is the original source of this comment, and to whom the WG and
the editors offer thanks for his close attention to the regexes and
his useful corrections.  If you are happy with the resolution of the
issue, please indicate the fact by closing the issue; if you are not
happy for some reason, please let us know by reopening the issue.

The essentials of the changes made are visible in the materials referred to
in comment #7, and will be visible in context in the next public
working draft.  If we don't hear from you by a month after the publication
of the next public working draft (no date is set yet but we live in hope),
we will assume that you are content with the resolution of the issue
and will close it ourselves.  Thanks again.