<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<!DOCTYPE bugzilla SYSTEM "https://www.w3.org/Bugs/Public/page.cgi?id=bugzilla.dtd">

<bugzilla version="5.0.4"
          urlbase="https://www.w3.org/Bugs/Public/"
          
          maintainer="sysbot+bugzilla@w3.org"
>

    <bug>
          <bug_id>19778</bug_id>
          
          <creation_ts>2012-10-30 13:50:20 +0000</creation_ts>
          <short_desc>Wrong regex for integer</short_desc>
          <delta_ts>2013-08-05 05:51:37 +0000</delta_ts>
          <reporter_accessible>1</reporter_accessible>
          <cclist_accessible>1</cclist_accessible>
          <classification_id>1</classification_id>
          <classification>Unclassified</classification>
          <product>WebAppsWG</product>
          <component>WebIDL</component>
          <version>unspecified</version>
          <rep_platform>PC</rep_platform>
          <op_sys>Windows 3.1</op_sys>
          <bug_status>RESOLVED</bug_status>
          <resolution>FIXED</resolution>
          
          
          <bug_file_loc></bug_file_loc>
          <status_whiteboard></status_whiteboard>
          <keywords></keywords>
          <priority>P2</priority>
          <bug_severity>normal</bug_severity>
          <target_milestone>---</target_milestone>
          
          
          <everconfirmed>1</everconfirmed>
          <reporter name="Robin Berjon">robin</reporter>
          <assigned_to name="Cameron McCormack">cam</assigned_to>
          <cc>mike</cc>
    
    <cc>nbarth+w3bugzilla</cc>
    
    <cc>public-script-coord</cc>
          
          <qa_contact>public-webapps-bugzilla</qa_contact>

      

      

      

          <comment_sort_order>oldest_to_newest</comment_sort_order>  
          <long_desc isprivate="0" >
    <commentid>77416</commentid>
    <comment_count>0</comment_count>
    <who name="Robin Berjon">robin</who>
    <bug_when>2012-10-30 13:50:20 +0000</bug_when>
    <thetext>The current regex for integer is:

    /^-?(0([0-7]*|[Xx][0-9A-Fa-f]+)|[1-9][0-9]*)/

If you feed it something like 0xDEADBEEF, which is valid, it will not tokenise properly. First, the leading 0 will match. Then it will enter the following parentheses, where it will see [0-7]*. Since that is very happy to match nothing, it succeeds and returns 0. You want:

    /^-?(0([Xx][0-9A-Fa-f]+|[0-7]*)|[1-9][0-9]*)/</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>79611</commentid>
    <comment_count>1</comment_count>
    <who name="Cameron McCormack">cam</who>
    <bug_when>2012-12-07 04:51:09 +0000</bug_when>
    <thetext>But there is also the statement:

  When tokenizing, the longest possible match MUST be used.

just below those in the spec. So I think it should be fine. The order of the alternation shouldn&apos;t matter if you are just using the regular expressions for testing. But if you are doing something like:

  if ($input =~ s/^float_regex//) {
  } elsif ($input =~ s/^integer_regex//) {
  } elsif ($input =~ s/^identifier_regex//) {
  } ...

then you could run into trouble. I&apos;ve switched it around in case people are directly using the regexes like this.</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>88172</commentid>
    <comment_count>2</comment_count>
    <who name="Nils Barth">nbarth+w3bugzilla</who>
    <bug_when>2013-05-24 05:32:59 +0000</bug_when>
    <thetext>Thanks for fixing this!

Stylistically, I think this would be even clearer (and slightly shorter):
    -?([1-9][0-9]*|0[Xx][0-9A-Fa-f]+|0[0-7]*)
...namely having 3 alternatives – decimal, hex, octal – rather than nesting ((hex + octal) + decimal).

To elaborate:
The spec was formally correct, since it requires greedy matching (longest possible match), but in practice the previous regex doesn&apos;t work.
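
One way to get longest-match behaviour out of an engine with ordered alternation is to try each alternative on its own and keep the longest result. A minimal Python sketch, assuming the three alternatives of the integer pattern are split out by hand (the names ALTERNATIVES and longest_match are hypothetical, not from the spec):

```python
import re

# The three alternatives of the integer pattern, tried separately.
# Octal is listed first on purpose, as in the original spec regex.
ALTERNATIVES = [
    r"-?0[0-7]*",            # octal (also matches a bare 0)
    r"-?0[Xx][0-9A-Fa-f]+",  # hexadecimal
    r"-?[1-9][0-9]*",        # decimal
]

def longest_match(text):
    """Return the longest prefix of text matched by any alternative, or None."""
    best = None
    for pattern in ALTERNATIVES:
        m = re.match(pattern, text)
        if m and (best is None or len(m.group()) > len(best)):
            best = m.group()
    return best
```

With this approach the listing order no longer matters: longest_match("0x123") returns "0x123" even though the octal alternative is tried first.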

Using regexes (and production rules) in the spec that can be directly used in real-world engines helps avoid errors and makes validation simpler: it&apos;s much easier to check that two regexes are identical than to check that they are equivalent.

This caused the following bug in Chromium, now fixed (using revised regex):
Chromium 243263: IDL lexer mistokenizes hexadecimals
https://code.google.com/p/chromium/issues/detail?id=243263

The problem is that most regex engines (including Perl and Python) have eager matching on alternation, hence matching isn&apos;t completely greedy (as required by the spec).
In this case the octal pattern always matches before the hex, and thus hex numbers are tokenized as 0 + identifier, e.g., &apos;0x123&apos; becomes integer &apos;0&apos; + identifier &apos;x123&apos;.
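
The behaviour described above can be reproduced with the Python re module. This is a small sketch along the lines of the attached test case. The three patterns are the ones quoted in this bug, while the constant names and the first_token helper are hypothetical:

```python
import re

# Patterns quoted in this bug; the names are made up for this sketch.
ORIGINAL = r"-?(0([0-7]*|[Xx][0-9A-Fa-f]+)|[1-9][0-9]*)"  # octal before hex
SWAPPED = r"-?(0([Xx][0-9A-Fa-f]+|[0-7]*)|[1-9][0-9]*)"   # hex before octal
BY_BASE = r"-?([1-9][0-9]*|0[Xx][0-9A-Fa-f]+|0[0-7]*)"    # decimal, hex, octal

def first_token(pattern, text):
    """Return the prefix of text matched by pattern, or None if it fails."""
    m = re.match(pattern, text)
    return m.group() if m else None

if __name__ == "__main__":
    for name, pattern in [("original", ORIGINAL),
                          ("swapped", SWAPPED),
                          ("by base", BY_BASE)]:
        print(name, first_token(pattern, "0x123"))
```

On the input "0x123" the original pattern stops after "0", while the other two consume the whole literal.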

As Robin suggested, this problem is avoided by swapping the patterns, putting the longer hex pattern before the octal pattern, so the eager matching ends up being greedy.


We can make it slightly clearer by splitting by base (instead of combining the leading 0 in hex and octal), which makes it a bit more legible to my eye and slightly shorter (because there is no nesting).
I don&apos;t know if there&apos;s a performance impact either way, and it&apos;s not necessary for correctness.


For reference, the Regular Expressions Cookbook has a recipe in 7.3 Numeric Constants (p. 413), where they split by base (in their regexes the order doesn&apos;t matter because they require word boundaries, but for tokenizing we don&apos;t have this).
http://books.google.com/books?id=6k7IfACN_P8C&amp;pg=PA413
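
To illustrate why word boundaries hide the ordering problem, here is a simplified pattern written for this comment (an assumption, not the recipe from the book), with the octal alternative deliberately listed first. The names BOUNDED and bounded_match are hypothetical:

```python
import re

# Hypothetical pattern, not taken from the book: octal is listed first on purpose.
BOUNDED = r"\b(0[0-7]*|0[Xx][0-9A-Fa-f]+|[1-9][0-9]*)\b"

def bounded_match(text):
    """Return the prefix matched by the word-bounded pattern, or None."""
    m = re.match(BOUNDED, text)
    return m.group() if m else None
```

Here bounded_match("0x123") returns "0x123": the octal alternative matches "0", but the trailing word boundary fails between "0" and "x", so the engine backtracks into the hex alternative. A tokenizer cannot rely on this, since it has to be able to stop in the middle of a run of word characters.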

Completely unambiguous would be (decimal|hex|octal|0), but combining octal with zero is common and fine.

So the revised regex is fine and works correctly in real-world engines, though I&apos;d suggest a stylistic revision to:
    -?([1-9][0-9]*|0[Xx][0-9A-Fa-f]+|0[0-7]*)</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>88173</commentid>
    <comment_count>3</comment_count>
      <attachid>1363</attachid>
    <who name="Nils Barth">nbarth+w3bugzilla</who>
    <bug_when>2013-05-24 05:35:15 +0000</bug_when>
    <thetext>Created attachment 1363
Python test case for integer regexes

For reference, here&apos;s a short Python script showing behavior of the various patterns (original, revised, my proposal).</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>91564</commentid>
    <comment_count>4</comment_count>
    <who name="Cameron McCormack">cam</who>
    <bug_when>2013-08-03 05:21:00 +0000</bug_when>
    <thetext>OK, I tweaked the regex as you suggested Nils.

http://dev.w3.org/cvsweb/2006/webapi/WebIDL/Overview.xml.diff?r1=1.661;r2=1.662;f=h
http://dev.w3.org/cvsweb/2006/webapi/WebIDL/v1.xml.diff?r1=1.101;r2=1.102;f=h</thetext>
  </long_desc><long_desc isprivate="0" >
    <commentid>91632</commentid>
    <comment_count>5</comment_count>
    <who name="Nils Barth">nbarth+w3bugzilla</who>
    <bug_when>2013-08-05 05:51:37 +0000</bug_when>
    <thetext>Thanks Cameron!

Updating at Chromium:
Issue 22140002: IDL lexer: update integer regex to latest Web IDL spec
https://codereview.chromium.org/22140002/</thetext>
  </long_desc>
      
          <attachment
              isobsolete="0"
              ispatch="0"
              isprivate="0"
          >
            <attachid>1363</attachid>
            <date>2013-05-24 05:35:15 +0000</date>
            <delta_ts>2013-05-24 05:35:15 +0000</delta_ts>
            <desc>Python test case for integer regexes</desc>
            <filename>integer_re.py</filename>
            <type>text/x-python</type>
            <size>360</size>
            <attacher name="Nils Barth">nbarth+w3bugzilla</attacher>
            
              <data encoding="base64">aW1wb3J0IHJlCmludGVnZXJfcmUgICAgICAgPSByJy0/KDAoWzAtN10qfFtYeF1bMC05QS1GYS1m
XSspfFsxLTldWzAtOV0qKScKaW50ZWdlcl9yZV9lYWdlciA9IHInLT8oMChbWHhdWzAtOUEtRmEt
Zl0rfFswLTddKil8WzEtOV1bMC05XSopJwppbnRlZ2VyX3JlX2FsdCAgID0gcictPyhbMS05XVsw
LTldKnwwW1h4XVswLTlBLUZhLWZdK3wwWzAtN10qKScKdGV4dCA9ICcweDEyMycKcHJpbnQgcmUu
bWF0Y2goaW50ZWdlcl9yZSwgICAgICAgdGV4dCkuZ3JvdXAoKQpwcmludCByZS5tYXRjaChpbnRl
Z2VyX3JlX2VhZ2VyLCB0ZXh0KS5ncm91cCgpCnByaW50IHJlLm1hdGNoKGludGVnZXJfcmVfYWx0
LCAgIHRleHQpLmdyb3VwKCkK
</data>

          </attachment>
      

    </bug>

</bugzilla>