Bug 19778 - Wrong regex for integer
Summary: Wrong regex for integer
Alias: None
Product: WebAppsWG
Classification: Unclassified
Component: WebIDL (show other bugs)
Version: unspecified
Hardware: PC Windows 3.1
: P2 normal
Target Milestone: ---
Assignee: Cameron McCormack
QA Contact: public-webapps-bugzilla
Depends on:
Reported: 2012-10-30 13:50 UTC by Robin Berjon
Modified: 2013-08-05 05:51 UTC (History)
3 users (show)

See Also:

Python test case for integer regexes (360 bytes, text/x-python)
2013-05-24 05:35 UTC, Nils Barth

Note You need to log in before you can comment on or make changes to this bug.
Description Robin Berjon 2012-10-30 13:50:20 UTC
The current regex for integer is:


If you feed it something like 0xDEADBEEF, which is valid, it will not tokenise properly. First, the leading 0 will match. Then it will enter the following parentheses, where it will see [0-7]*. Since that is very happy to match nothing, it succeeds and returns 0. You want:

Comment 1 Cameron McCormack 2012-12-07 04:51:09 UTC
But there is also the statement:

  When tokenizing, the longest possible match MUST be used.

just below those that in the spec.  So I think it should be fine.  The order of the alternation shouldn't matter if you are just using the regular expressions for testing.  But if you are doing something like:

  if ($input =~ s/^float_regex//) {
  } elsif ($input =~ s/^integer_regex//) {
  } elsif ($input =~ s/^identifier_regex//) {
  } ...

then you could run in to trouble.  I've switched it around in case people are directly using the regexes like this.
Comment 2 Nils Barth 2013-05-24 05:32:59 UTC
Thanks for fixing this!

Stylistically, I think this would be even clearer (and slightly shorter):
...namely having 3 alternatives – decimal, hex, octal – rather than nesting ((hex + octal) + decimal).

To elaborate:
The spec was formally correct, since it requires greedy matching (longest possible match), but in practice the previous regex doesn't work.

Using regexes (and production rules) in the spec that can be directly used in real-world engines helps avoid errors and make validation simpler: it's much easier to check that two regexes are identical than equivalent.

This caused the following bug in Chromium, now fixed (using revised regex):
Chromium 243263: IDL lexer mistokenizes hexadecimals

The problem is that most regex engines (including Perl and Python) have eager matching on alternation, hence matching isn't completely greedy (as required by the spec).
In this case the octal pattern always matches before the hex, and thus hex numbers are tokenized as 0 + identifier, e.g., '0x123' becomes integer '0' + identifier 'x123'.

As Robin suggested, this problem is avoided by swapping the patterns, putting the longer hex pattern before the octal pattern, so the eager matching ends up being greedy.

We can make it slightly clearer by splitting by base (instead of combining the leading 0 in hex and octal), which makes it a bit more legible to my eye and slightly shorter (b/c of no nesting).
I don't know if there's performance impact either way, and it's not necessary for correctness.

For reference, the Regular Expressions Cookbook has a recipe in 7.3 Numeric Constants (p. 413), where they split by base (in their re order doesn't matter because they require word boundaries, but for tokenizing we don't have this).

Completely unambiguous would be (decimal|hex|octal|0), but combining octal with zero is common and fine.

So the revised regex in fine and works correctly in real-world engines, though I'd suggest a stylistic revision to:
Comment 3 Nils Barth 2013-05-24 05:35:15 UTC
Created attachment 1363 [details]
Python test case for integer regexes

For reference, here's a short Python script showing behavior of the various patterns (original, revised, my proposal).
Comment 5 Nils Barth 2013-08-05 05:51:37 UTC
Thanks Cameron!

Updating at Chromium:
Issue 22140002: IDL lexer: update integer regex to latest Web IDL spec