4

I'd like to know whether it's a good idea to avoid regex.

Actually, I have avoided it in every case so far, and some people have advised me not to, since if you know what everything means, like:

[] '|' \A \B \d \D \W \w \S \Z $ * ? ...

it would be easy to read, right? But I feel that by avoiding regex I would have more readable code.

It gets more unreadable as it gets bigger. For example, from validators.py:

email_re = re.compile(
    r"(^[-!#$%&'*+/=?^_`{}|~0-9A-Z]+(\.[-!#$%&'*+/=?^_`{}|~0-9A-Z]+)*"  # dot-atom
    r'|^"([\001-\010\013\014\016-\037!#-\[\]-\177]|\\[\001-011\013\014\016-\177])*"' #     quoted-string
    r')@(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+[A-Z]{2,6}\.?$', re.IGNORECASE)  # domain

So, I'd like to know: is there a reason not to avoid regex?

5
  • an email. kind of badly, if my regex reading skills are still up to par. Commented Aug 30, 2010 at 2:00
  • @sreservoir As in, an email address? Commented Aug 30, 2010 at 2:01
  • 6
    I'd like to avoid coding. I've been avoiding it, but people keep telling me I shouldn't avoid it. But as you know, that means using curly braces and weirdCapitalization and it makes it harder to read. Commented Aug 30, 2010 at 2:01
  • 1
    I'd say having a block of regex makes the code more readable overall than having lots of lines of code that do the equivalent. Even if that short block is extremely unreadable, it's easier to skip over it while you're reading the code than if you have a really long function that does the same thing. (And that function might end up being as unreadable as the regex b/c it has to do the same thing.) Commented Aug 30, 2010 at 2:07
  • 3
    Do not use regular expressions for matching email addresses. They are very complex beasts and I don't know that there even is a regular expression that can match them. Unfortunately, Python seems to lack a standard library that can do that parsing for you. So you are doomed to using regular expressions and getting it wrong so that some subset of people can't use their email address in your form. sigh Commented Aug 30, 2010 at 2:41

6 Answers

19

No, don't avoid regular expressions. They're actually quite a nifty little tool and will save you a lot of work if you use them wisely.

What you do need to avoid is trying to use them for everything, a malaise that appears to strike those new to regular expressions before they become a little more tempered and a little less enamoured :-)

For example, don't use it to validate email addresses. The way you validate an email address is to send an email to it with a link that the receiver has to click on to complete the "transaction".

There are billions of valid email addresses (according to the RFCs) that have no physical email receiver behind them. The only way to be certain that there is a receiver is to send an email and wait for proof positive that it was received and acted upon.
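
As a rough sketch of that confirmation step (not a complete implementation): the names `pending`, `confirmed`, `start_confirmation` and `confirm` are made up for illustration, the store is an in-memory dict, and actually sending the email with the link is left out.

```python
import secrets

# Hypothetical in-memory stores, for illustration only.
pending = {}       # token -> address awaiting confirmation
confirmed = set()  # addresses whose owner clicked the link


def start_confirmation(address):
    """Create a one-time token; the caller would email a link containing it."""
    token = secrets.token_urlsafe(16)
    pending[token] = address
    return token


def confirm(token):
    """Mark the address as confirmed if the token is known; return success."""
    address = pending.pop(token, None)
    if address is None:
        return False
    confirmed.add(address)
    return True
```

The address only counts as valid once `confirm` has succeeded for its token, which is the "proof positive" that no pattern match can give you.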

If I find myself writing a regular expression that's more than, let's say, 60 characters, I step back to see if there's a more readable way. Similarly, if I write a regular expression and come back a week later and can't instantly recognise what it does, I think about replacing it. This particular paragraph consists of my opinions of course, but they've served me well :-)


11 Comments

I agree having an email sent to confirm the existence of the address is great, but it's nice to check if the email address entered is invalid. The user might forget the @ and you can check if it's there and give an error. It's better to do that than accept it and fail at emailing the message. The user wouldn't know why he's not getting his email.
@vlad003 - so then you just use if "@" in email_address... - in which case, a regex is overkill. Anything more complicated than that, and you're asking for trouble...
@vlad, there's a big difference between checking for a "@" and the monstrosity you have to use for a fully validated email address. By all means do a simple check like that, it's at least readable :-)
The @ was just an example. There may be many errors a person could make when typing in their email. If it's invalid and the app accepts it, then the person will hit submit and expect their email (which they'll never get). Resending won't work; and changing the address won't be possible either... And I'm sure servers creating email addresses will need to know if the one the user wants is valid or not.
The point is that you shouldn't avoid helping the user because you are afraid of writing regular expressions. Forcing the user to confirm by sending them an email is great, but it doesn't solve the same problem as checking that the email is valid. Forcing the user to type the email twice is even worse - he will probably just end up copy-and-pasting it, and you still haven't helped catch trivial mistakes.
6

Regular expressions are a tool. They are perfectly suited to some tasks and not to others. Like any tool, use them when they are the right tool for the job. Don't just avoid them because somebody said they were bad. Learn how to use them and then you can decide for yourself rather than depending on someone else's dogma.


3

If you choose to use a more general parsing approach, like pyparsing or PLY, you will never require regular expressions (which can only match a small subset of the languages matchable with such general parsers). However, lexers such as the one in PLY are typically built around regular expressions (which are a perfect match for a lexer's needs!), so you will probably have to avoid that (as well as powerful tools such as BeautifulSoup when any "normal" user would be able to keep using and enjoying it by simply passing a regular expression object as the selector, since BeautifulSoup fully supports that) and will have to recode a lot of such existing parsers with your chosen general-purpose parsing package.

Performance may suffer greatly, of course, by using extremely general tools in cases where simpler, highly optimized and concise ones would be a perfect solution -- and the size of your code may "blow up" to being very large in many common cases. But if you don't mind having programs twice as big and twice as slow, and are determined to avoid regular expressions at all costs, you can do that.
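
To illustrate the remark about lexers being built around regular expressions, here is a minimal sketch of a regex-driven tokenizer. It is not how PLY itself is implemented; the token names and the tiny arithmetic language are assumptions made up for the example.

```python
import re

# Hypothetical token set for a tiny arithmetic language.
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),  # integer or decimal number
    ("NAME",   r"[A-Za-z_]\w*"),   # identifier
    ("OP",     r"[+\-*/()=]"),     # single-character operator
    ("SKIP",   r"\s+"),            # whitespace, discarded
    ("ERROR",  r"."),              # any other character
]
# One alternation of named groups; the first alternative that matches wins.
TOKEN_RE = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))


def tokenize(text):
    """Yield (kind, value) pairs, skipping whitespace."""
    for m in TOKEN_RE.finditer(text):
        kind = m.lastgroup
        if kind == "SKIP":
            continue
        if kind == "ERROR":
            raise ValueError("unexpected character %r at %d" % (m.group(), m.start()))
        yield kind, m.group()
```

For instance, `list(tokenize("x = 3.5 * (y + 2)"))` produces a flat stream of `NAME`, `OP` and `NUMBER` tokens that a general-purpose parser can then consume.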

On the other hand, if your main concern is with readability (quite an understandable and commendable concern, too), then the re.VERBOSE option, by allowing abundant use of whitespace and comments within the RE's pattern, can really do wonders for that goal without removing any of REs' advantages (except by diluting a sometimes-excessive conciseness;-). You WILL want to also keep at least one general-purpose parsing system under your belt, of course (rather than stretch REs to do tasks they're wrong for, as so many people unfortunately do!) -- but a minimal command of REs will serve you well in so many cases (including, for example, full use of BeautifulSoup and many other tools which can accept REs as parameters to apply them appropriately) that I think it's quite to be recommended.
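
A small sketch of what re.VERBOSE buys you. The pattern below is deliberately loose and is only an illustration of the layout; it is not the Django pattern from the question and makes no claim to follow the RFCs. The name `simple_address` is made up.

```python
import re

# Deliberately loose pattern, for illustration only; not RFC-complete.
simple_address = re.compile(r"""
    ^
    [\w.+-]+        # local part: word characters, dots, plus, hyphen
    @
    (?:[\w-]+\.)+   # one or more dot-terminated domain labels
    [A-Za-z]{2,}    # top-level domain: at least two letters
    $
""", re.VERBOSE)

print(bool(simple_address.match("user+tag@mail.example.com")))
print(bool(simple_address.match("no-at-sign")))
```

Whitespace and `#` comments inside the pattern are ignored in verbose mode (except inside a character class or when escaped), so each piece can sit on its own line next to a note saying what it is for.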


1

Just for comparison, here is my version of an email format check that does not use a regexp (with test cases), alongside one readable regexp offered to me as an alternative (though sending an email after the address is accepted is a great idea):

# -*- coding: utf8 -*- 
import string
print("Valid letters in this computer are: "+string.letters)
import re 
def validateEmail(a): 
    sep=[x for x in a if not (x.isalpha() or 
                              x.isdigit() or 
                              x in r"!#$%&'*+-/=?^_`{|}~]") ] 
    sepjoined=''.join(sep) 
    ## sep joined must be ..@.... form 
    if len(a)>255 or sepjoined.strip('.') != '@': return False 
    end=a 
    for i in sep: 
        part,i,end=end.partition(i) 
        if len(part)<2: return False 
    return len(end)>1 

def emailval(address): 
    pattern = r"[\.\w]{2,}[@]\w+[.]\w+"  # raw string, so the backslashes reach the regex engine unchanged
    return re.match(pattern, address)

if __name__ == '__main__': 
    emails = [ "[email protected]","[email protected]", "[email protected]", 
               "[email protected]", "[email protected]","marjaliisa.hämälä[email protected]", 
               "marja-liisa.hämälä[email protected]", "marjaliisah@hel.",'tony@localhost',
               '[email protected]','me@somewhere'] 

    print('\n\t'.join(["Valid emails are:"] + 
                      filter(validateEmail,emails)))

    print('\n\t'.join(["Regexp gives wrong answer:"] + 
                       filter(emailval,emails)))

""" Output:
Valid letters in this computer are: abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
Valid emails are:
        [email protected]
        [email protected]
        tony@localhost
        [email protected]
        me@somewhere
Regexp gives wrong answer:
        [email protected]
        [email protected]
        [email protected]
"""

EDIT: Cleaned up the regex filter function from this ancient code and edited it into a more permissive version based on @detly's link. It is good enough for me as a first check on form input before sending the confirmation email. Finally, added the 255-character length limit check mentioned in the comments.

This code deliberately does not accept the plain a@b as a valid email address, but it does accept me@somewhere. Another thing is that the result depends on what isalpha returns. So this output, which is from Ideone.com, has not accepted the Scandinavian öä even though they are valid nowadays. When run on my home computer, those are accepted. This happens even with the coding line present.


0

(Deleted a regular expression which purported to be an "official" one but is in fact not found in the RFC it claimed to be from.)

This regex may be amusing as it is an attempt to precisely match the e-mail address grammar provided in an older version of the Internet mail standards.

2 Comments

Putting "official" inside quotes is a dead giveaway that it's anything but official :-)
I went looking for how "official" it was and discovered that you were right. So I substituted a link to an even hairier regex that claims to fulfill most of the RFC 822 standards. :-)
-2

Regular expressions are likely the right tool for extracting/validating email addresses...

To extract one or more email addresses from raw text:

import re

pat_e = re.compile(r'(?P<email>[\w.+-]+@(?:[\w-]+\.)+[a-zA-Z]{2,})')

def extract_emails(text):
    # Wrapped in a function so the return is legal; collects every address-like match.
    return [r.group('email') for r in pat_e.finditer(text)]

To see if a single piece of text is a valid email:

import re

pat_m = re.compile(r'([\w.+-]+@(?:[\w-]+\.)+[a-zA-Z]{2,}$)')

def is_valid_email(text):
    # match() anchors at the start of the text; the trailing $ anchors at the end.
    return bool(pat_m.match(text))

5 Comments

This fails on anything with a plus sign (+) before the @, which is perfectly valid for an email address.
What happens when they decide to create a 5-letter TLD?
Ever heard of .museum and .travel TLDs?
Thanks for the comments. I've updated the patterns to accommodate longer TLDs and the character '+' when it appears before the @.
This still does not fit the standards. "a+!=" is a valid local part of the address. As is ".{^_^}."
