19778 2012-10-30 13:50:20 +0000 Wrong regex for integer 2013-08-05 05:51:37 +0000 1 1 1 Unclassified WebAppsWG WebIDL unspecified PC Windows 3.1 RESOLVED FIXED P2 normal --- 1 robin cam mike nbarth+w3bugzilla public-script-coord public-webapps-bugzilla oldest_to_newest

77416 0 robin 2012-10-30 13:50:20 +0000

The current regex for integer is:

/^-?(0([0-7]*|[Xx][0-9A-Fa-f]+)|[1-9][0-9]*)/

If you feed it something like 0xDEADBEEF, which is valid, it will not tokenise properly. First, the leading 0 will match. Then it will enter the following parentheses, where it will see [0-7]*. Since that is very happy to match nothing, it succeeds and returns 0.

You want:

/^-?(0([Xx][0-9A-Fa-f]+|[0-7]*)|[1-9][0-9]*)/

79611 1 cam 2012-12-07 04:51:09 +0000

But there is also the statement:

When tokenizing, the longest possible match MUST be used.

just below those in the spec. So I think it should be fine. The order of the alternation shouldn't matter if you are just using the regular expressions for testing. But if you are doing something like:

  if ($input =~ s/^float_regex//) {
  } elsif ($input =~ s/^integer_regex//) {
  } elsif ($input =~ s/^identifier_regex//) {
  } ...

then you could run into trouble. I've switched it around in case people are directly using the regexes like this.

88172 2 nbarth+w3bugzilla 2013-05-24 05:32:59 +0000

Thanks for fixing this! Stylistically, I think this would be even clearer (and slightly shorter):

-?([1-9][0-9]*|0[Xx][0-9A-Fa-f]+|0[0-7]*)

...namely having 3 alternatives (decimal, hex, octal) rather than nesting ((hex + octal) + decimal).

To elaborate: The spec was formally correct, since it requires greedy matching (longest possible match), but in practice the previous regex doesn't work. Using regexes (and production rules) in the spec that can be directly used in real-world engines helps avoid errors and makes validation simpler: it's much easier to check that two regexes are identical than equivalent.
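The if/elsif chain described in comment 1 can be sketched in Python to show where the alternation order bites. This is only an illustration, not the spec's or any implementation's tokenizer: the `tokenize` helper and the simplified `IDENTIFIER` pattern are assumptions made for this example, and the float rule is omitted.

```python
import re

# Original ordering from the report: octal alternative listed before hex.
INTEGER_ORIGINAL = r"-?(0([0-7]*|[Xx][0-9A-Fa-f]+)|[1-9][0-9]*)"
# Reordered version proposed in the report: hex alternative listed first.
INTEGER_REORDERED = r"-?(0([Xx][0-9A-Fa-f]+|[0-7]*)|[1-9][0-9]*)"
# Simplified identifier pattern, assumed here for illustration only.
IDENTIFIER = r"[A-Z_a-z][0-9A-Z_a-z]*"


def tokenize(text, integer_pattern):
    """Consume tokens by trying each rule in turn at the current offset."""
    rules = [
        ("integer", re.compile(integer_pattern)),
        ("identifier", re.compile(IDENTIFIER)),
    ]
    tokens = []
    pos = 0
    while pos < len(text):
        for name, regex in rules:
            match = regex.match(text, pos)
            # Accept the first rule that consumes at least one character.
            if match and match.end() > pos:
                tokens.append((name, match.group()))
                pos = match.end()
                break
        else:
            raise ValueError(f"cannot tokenize at offset {pos}: {text[pos:]!r}")
    return tokens


if __name__ == "__main__":
    print(tokenize("0x123", INTEGER_ORIGINAL))
    print(tokenize("0x123", INTEGER_REORDERED))
```

With the original ordering the chain splits `0x123` into an integer followed by an identifier; with the hex alternative first it yields a single integer token.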
This caused the following bug in Chromium, now fixed (using revised regex):

Chromium 243263: IDL lexer mistokenizes hexadecimals
https://code.google.com/p/chromium/issues/detail?id=243263

The problem is that most regex engines (including Perl and Python) have eager matching on alternation, hence matching isn't completely greedy (as required by the spec). In this case the octal pattern always matches before the hex, and thus hex numbers are tokenized as 0 + identifier, e.g., '0x123' becomes integer '0' + identifier 'x123'.

As Robin suggested, this problem is avoided by swapping the patterns, putting the longer hex pattern before the octal pattern, so the eager matching ends up being greedy.

We can make it slightly clearer by splitting by base (instead of combining the leading 0 in hex and octal), which makes it a bit more legible to my eye and slightly shorter (because there is no nesting). I don't know if there's a performance impact either way, and it's not necessary for correctness.

For reference, the Regular Expressions Cookbook has a recipe in 7.3 Numeric Constants (p. 413), where they split by base (in their regex, order doesn't matter because they require word boundaries, but for tokenizing we don't have this).
http://books.google.com/books?id=6k7IfACN_P8C&pg=PA413

Completely unambiguous would be (decimal|hex|octal|0), but combining octal with zero is common and fine.

So the revised regex is fine and works correctly in real-world engines, though I'd suggest a stylistic revision to:

-?([1-9][0-9]*|0[Xx][0-9A-Fa-f]+|0[0-7]*)

88173 3 1363 nbarth+w3bugzilla 2013-05-24 05:35:15 +0000

Created attachment 1363
Python test case for integer regexes

For reference, here's a short Python script showing behavior of the various patterns (original, revised, my proposal).

91564 4 cam 2013-08-03 05:21:00 +0000

OK, I tweaked the regex as you suggested Nils.
http://dev.w3.org/cvsweb/2006/webapi/WebIDL/Overview.xml.diff?r1=1.661;r2=1.662;f=h
http://dev.w3.org/cvsweb/2006/webapi/WebIDL/v1.xml.diff?r1=1.101;r2=1.102;f=h

91632 5 nbarth+w3bugzilla 2013-08-05 05:51:37 +0000

Thanks Cameron! Updating at Chromium:

Issue 22140002: IDL lexer: update integer regex to latest Web IDL spec
https://codereview.chromium.org/22140002/

1363 2013-05-24 05:35:15 +0000 2013-05-24 05:35:15 +0000 Python test case for integer regexes integer_re.py text/x-python 360 nbarth+w3bugzilla

aW1wb3J0IHJlCmludGVnZXJfcmUgICAgICAgPSByJy0/KDAoWzAtN10qfFtYeF1bMC05QS1GYS1m
XSspfFsxLTldWzAtOV0qKScKaW50ZWdlcl9yZV9lYWdlciA9IHInLT8oMChbWHhdWzAtOUEtRmEt
Zl0rfFswLTddKil8WzEtOV1bMC05XSopJwppbnRlZ2VyX3JlX2FsdCAgID0gcictPyhbMS05XVsw
LTldKnwwW1h4XVswLTlBLUZhLWZdK3wwWzAtN10qKScKdGV4dCA9ICcweDEyMycKcHJpbnQgcmUu
bWF0Y2goaW50ZWdlcl9yZSwgICAgICAgdGV4dCkuZ3JvdXAoKQpwcmludCByZS5tYXRjaChpbnRl
Z2VyX3JlX2VhZ2VyLCB0ZXh0KS5ncm91cCgpCnByaW50IHJlLm1hdGNoKGludGVnZXJfcmVfYWx0
LCAgIHRleHQpLmdyb3VwKCkK
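
The attachment above is stored base64-encoded. As a readable companion, here is a minimal Python 3 sketch in the same spirit, comparing the three patterns quoted in this thread against a few inputs. The dictionary keys and the `leading_match` and `compare` helpers are names chosen for this example and are not claimed to match the attached script.

```python
import re

# The three integer patterns quoted in this thread.
PATTERNS = {
    "original": r"-?(0([0-7]*|[Xx][0-9A-Fa-f]+)|[1-9][0-9]*)",
    "reordered": r"-?(0([Xx][0-9A-Fa-f]+|[0-7]*)|[1-9][0-9]*)",
    "split_by_base": r"-?([1-9][0-9]*|0[Xx][0-9A-Fa-f]+|0[0-7]*)",
}


def leading_match(pattern, text):
    """Return the text matched at the start of `text`, or None."""
    match = re.match(pattern, text)
    return match.group() if match else None


def compare(text):
    """Map each pattern name to what it matches at the start of `text`."""
    return {name: leading_match(p, text) for name, p in PATTERNS.items()}


if __name__ == "__main__":
    for sample in ("0x123", "0xDEADBEEF", "-017", "42", "0"):
        print(sample, compare(sample))
```

Only the hexadecimal inputs distinguish the patterns: the original stops after the leading `0`, while the reordered and split-by-base forms consume the whole literal.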