3

Recently, I started using the email validation regular expression from the JQuery validation plugin in my Rails models.

EMAIL_REGEXP=/^((([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_`{\|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])+(\.([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_`{\|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])+)*)|((\x22)((((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?(([\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|\x21|[\x23-\x5b]|[\x5d-\x7e]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(\\([\x01-\x09\x0b\x0c\x0d-\x7f]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]))))*(((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?(\x22)))@((([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.)+(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))$/i

"[email protected]".match EMAIL_REGEXP  # returns immidiately
"aaaaa.bbbbbb.ccccc.ddddd@gmail".match EMAIL_REGEXP  # takes a long time

The regular expression takes a long time when an invalid email has many dot separated tokens (E.g: first.middle.last@gmail). Same expression works without any noticable delay in JavaScript.

Why is there such a difference in performance between Ruby and JavaScript regular expression parsers? Is there anything I can do to improve the response time?

I am on Ruby 1.8.7. I don't see the same issue on Ruby 1.9.2.

Note

I know the reg-exp is long. Since it is used by jQuery, I thought of using it. I can always change it back to a simpler regexp as shown here. My question is mostly about finding out the reason why the same regular expression is much faster in JS.

Reference:

JQuery Validation Plugin Source

Sample form with jQuery email validation

7
  • 2
    That is one huge regex. Why don't you use an email parser? Commented Nov 29, 2011 at 4:44
  • @Blender comment quality would improve immensely if you supplied a link Commented Nov 29, 2011 at 4:45
  • 2
    Well this regular expression is taken from the official jQuery plugin and it is used widely. So I thought it is a safe route. I might go down the route of custom validation if I can't get the RegExp to work. Commented Nov 29, 2011 at 4:46
  • Another good question is, what are you trying to accomplish by matching the email address? You can't prove if it is valid; because even if a pattern matches, the address might not be live. The accepted way to test if an email is valid is to send a message to that address asking for a reply, and see if you get the response. See stackoverflow.com/q/1910340/128421 Commented Nov 29, 2011 at 5:19
  • which ruby version? 1.9 handles regex very different from 1.8 Commented Nov 29, 2011 at 5:22

2 Answers 2

1

Don't know why regex parser from 1.8.7 is so much slower than the one from JS or Oniguruma from 1.9.2, but may be this particular regex can benefit from wrapping its prefix including @ symbol with atomic group like that:

EMAIL_REGEXP = /
  ^
  (?>(( # atomic group start
    ([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_`{\|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])+
    (\.([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_`{\|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])+)*
   )
   |
   (
     (\x22)
     (
       (((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?
       (
         ([\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|\x21|[\x23-\x5b]|[\x5d-\x7e]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])
         |
         (\\([\x01-\x09\x0b\x0c\x0d-\x7f]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]))
       )
     )*
     (((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?
     (\x22)
   )
  )
  @) # atomic group end
  (
    (
      ([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])
      |
      (
        ([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])
        ([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*
        ([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])
      )
    )
    \.
  )+
  (
    ([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])
    |
    (
      ([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])
      ([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*
      ([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])
    )
  )
  $
  /xi

puts "[email protected]".match EMAIL_REGEXP  # returns immediately
puts "aaaaa.bbbbbb.ccccc.ddddd@gmail".match EMAIL_REGEXP  # takes a long time

Atomic group in that case should prevent parser from returning to the first part of the string when matching the part following @ symbol failed. And it gives significant speed-up. Though, I'm not 100% sure that it doesn't break regexp logic, so I'd appreciate any comments.

Another thing is using non-capturing groups that should be faster in general when you don't need to backreference for groups, but they don't give any noticeable improvement in this case.

Sign up to request clarification or add additional context in comments.

Comments

1

The problem might be in that your Regexp contains a greedy quantifiers, so Ruby as those quantifiers require trying to check all combinations. The solution might be to use Possessive Quantifiers, so the look-up whould be mush faster but it will change regexp so some strings will no longer match. Short Example (from wikipedia):

'aaaaaaaaaaaaaaaaaaaaaaaaa' =~ /(a+a+)/ => match
'aaaaaaaaaaaaaaaaaaaaaaaaa' =~ /(a++a+)/ => not match

The differense is in lookup process and in greedy quantifiers engine trying to look-back if no match, in the case of possessive quantifiers engine never looks back.

4 Comments

I was expecting the regexp to be slower due to greedy quantifiers. What surprises me is the difference in performance between the JS and Ruby execution times for the same reg-exp.
I have tryed your exmaple at tryruby and there was not big delays in code execution. By the way, second string dosnt match.
Tryruby uses 1.9.2 that has new regex engine and handles this regexp very good. On the other hand 1.8.7 simply dies on the second example from the question.
Tryruby uses ruby 1.9.2. I see the issue only on 1.8.7.Second string is not supposed to match.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.