avoid regex [python]

Question

I'd like to know if it's a good idea to avoid regex.

Actually, I have avoided it in every case, and some people have been advising me that I shouldn't avoid it, since if you know what every token means, like:

[] '|' \A \B \d \D \W \w \S \Z $ * ? ...

it would be easy to read, right? But I feel that by avoiding regex I would have more readable code.

It gets more unreadable when it's bigger, for example in validators.py:

email_re = re.compile(
    r"(^[-!#$%&'*+/=?^_`{}|~0-9A-Z]+(\.[-!#$%&'*+/=?^_`{}|~0-9A-Z]+)*"  # dot-atom
    r'|^"([\001-\010\013\014\016-\037!#-\[\]-\177]|\\[\001-011\013\014\016-\177])*"' #     quoted-string
    r')@(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+[A-Z]{2,6}\.?$', re.IGNORECASE)  # domain

So, I'd like to know: is there a reason not to avoid regex?

an email. kind of badly, if my regex reading skills are still up to par.
– muhmuhten, Commented Aug 30, 2010 at 2:00
I'd like to avoid coding. I've been avoiding it, but people keep telling me I shouldn't avoid it. But as you know, that means using curly braces and weirdCapitalization and it makes it harder to read.
– Mark Thomas, Commented Aug 30, 2010 at 2:01
I'd say having a block of regex makes the code more readable overall than having lots of lines of code that do the equivalent. Even if that short block is extremely unreadable, it's easier to skip over it while you're reading the code than if you have a really long function that does the same thing. (And that function might end up being as unreadable as the regex b/c it has to do the same thing.)
– avacariu, Commented Aug 30, 2010 at 2:07
Do not use regular expressions for matching email addresses. They are very complex beasts and I don't know that there even is a regular expression that can match them. Unfortunately, Python seems to lack a standard library that can do that parsing for you. So you are doomed to using regular expressions and getting it wrong so that some subset of people can't use their email address in your form. sigh
– Omnifarious, Commented Aug 30, 2010 at 2:41

paxdiablo · Accepted Answer · 2010-08-30 02:05:29Z

19

No, don't avoid regular expressions. They're actually quite a nifty little tool and will save you a lot of work if you use them wisely.

What you do need to avoid is trying to use it for everything, a malaise that appears to strike those new to regular expressions before they become a little more tempered and a little less enamoured :-)

For example, don't use it to validate email addresses. The way you validate an email address is to send an email to it with a link that the receiver has to click on to complete the "transaction".

There are billions of valid email addresses (according to the RFCs) that have no physical email receiver behind them. The only way to be certain that there is a receiver is to send an email and wait for proof positive that it was received and acted upon.
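A minimal sketch of that confirmation step, assuming an in-memory dictionary as the pending-token store and a placeholder URL. The names `pending`, `start_confirmation` and `finish_confirmation` are made up for illustration, and actually sending the message is left out:

```python
import secrets

# Hypothetical in-memory store; a real application would persist this.
pending = {}

def start_confirmation(address):
    """Create a one-time token for the address and return the link to email."""
    token = secrets.token_urlsafe(32)
    pending[token] = address
    # Placeholder base URL, not a real endpoint.
    return "https://example.invalid/confirm?token=" + token

def finish_confirmation(token):
    """Return the confirmed address, or None if the token is unknown or reused."""
    return pending.pop(token, None)
```

The address only counts as valid once `finish_confirmation` returns it, which happens only if somebody received the mail and followed the link.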

If I find myself writing a regular expression that's more than, let's say, 60 characters, I step back to see if there's a more readable way. Similarly, if I write a regular expression and come back a week later and can't instantly recognise what it does, I think about replacing it. This particular paragraph consists of my opinions of course, but they've served me well :-)

answered Aug 30, 2010 at 2:05

paxdiablo


11 Comments

avacariu Over a year ago

I agree having an email sent to confirm the existence of the address is great, but it's nice to check if the email address entered is invalid. The user might forget the @ and you can check if it's there and give an error. It's better to do that than accept it and fail at emailing the message. The user wouldn't know why he's not getting his email.

detly Over a year ago

@vlad003 - so then you just use if "@" in email_address... - in which case, a regex is overkill. Anything more complicated than that, and you're asking for trouble...

paxdiablo Over a year ago

@vlad, there's a big difference between checking for a "@" and the monstrosity you have to use for a fully validated email address. By all means do a simple check like that, it's at least readable :-)

avacariu Over a year ago

The @ was just an example. There may be many errors a person could make when typing in their email. If it's invalid and the app accepts it, then the person will hit submit and expect their email (which they'll never get). Resending won't work; and changing the address won't be possible either... And I'm sure servers creating email addresses will need to know if the one the user wants is valid or not.

Avi Over a year ago

The point is that you shouldn't avoid helping the user because you are afraid of writing regular expressions. Forcing the user to confirm by sending them an email is great, but it doesn't solve the same problem as checking that the email is valid. Forcing the user to type the email twice is even worse - he will probably just end up copy-and-pasting it, and you still haven't helped catch trivial mistakes.


Bryan Oakley · Accepted Answer · 2010-08-30 02:37:38Z

6

Regular expressions are a tool. They are perfectly suited to some tasks and not to others. Like any tool, use them when they are the right tool for the job. Don't just avoid them because somebody said they were bad. Learn how to use them and then you can decide for yourself rather than depending on someone else's dogma.

answered Aug 30, 2010 at 2:37

Bryan Oakley


Alex Martelli · Accepted Answer · 2010-08-30 02:26:10Z

If you choose to use a more general parsing approach, like pyparsing or PLY, you will never require regular expressions (which can only match a small subset of the languages matchable with such general parsers). However, lexers such as the one in PLY are typically built around regular expressions (which are a perfect match for a lexer's needs!), so you will probably have to avoid that (as well as powerful tools such as BeautifulSoup when any "normal" user would be able to keep using and enjoying it by simply passing a regular expression object as the selector, since BeautifulSoup fully supports that) and will have to recode a lot of such existing parsers with your chosen general-purpose parsing package.

Performance may suffer greatly, of course, by using extremely general tools in cases where simpler, highly optimized and concise ones would be a perfect solution -- and the size of your code may "blow up" to being very large in many common cases. But if you don't mind having programs twice as big and twice as slow, and are determined to avoid regular expressions at all costs, you can do that.
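To illustrate the earlier point that lexers are typically built around regular expressions, here is a rough sketch of a tiny tokenizer using only the standard `re` module. The token set and the names `TOKEN_SPEC`, `MASTER_RE` and `tokenize` are assumptions for the example; this is not how PLY itself is implemented:

```python
import re

# Hypothetical token set for a tiny arithmetic language.
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("NAME",   r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=()]"),
    ("SKIP",   r"\s+"),
]
# One alternation of named groups; the matching group's name is the token kind.
MASTER_RE = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))

def tokenize(text):
    """Yield (kind, value) pairs; raise ValueError on an unexpected character."""
    pos = 0
    while pos < len(text):
        m = MASTER_RE.match(text, pos)
        if m is None:
            raise ValueError("unexpected character %r at position %d" % (text[pos], pos))
        pos = m.end()
        if m.lastgroup != "SKIP":
            yield m.lastgroup, m.group()

print(list(tokenize("x = 3.5 * (y + 2)")))
```

Writing the same scanner by hand, character by character, is possible but usually takes noticeably more code.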

On the other hand, if your main concern is with readability (quite an understandable and commendable concern, too), then the re.VERBOSE option, by allowing abundant use of whitespace and comments within the RE's pattern, can really do wonders for that goal without removing any of REs' advantages (except by diluting a sometimes-excessive conciseness;-). You WILL want to also keep at least one general-purpose parsing system under your belt, of course (rather than stretch REs to do tasks they're wrong for, as so many people unfortunately do!) -- but a minimal command of REs will serve you well in so many cases (including, for example, full use of BeautifulSoup and many other tools which can accept REs as parameters to apply them appropriately) that I think it's quite to be recommended.
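As a small sketch of what `re.VERBOSE` buys you, here is a deliberately simplified email-shaped pattern (an assumption for illustration, nowhere near RFC-complete, and not the Django pattern from the question) written with whitespace and comments inside it:

```python
import re

# Simplified pattern for illustration only; re.VERBOSE makes the regex engine
# ignore the layout whitespace and the trailing comments inside the pattern.
simple_email_re = re.compile(r"""
    ^
    [\w.+-]+          # local part: word characters, dot, plus, hyphen
    @
    (?:[\w-]+\.)+     # one or more dot-terminated domain labels
    [A-Za-z]{2,}      # top-level domain: at least two letters
    $
""", re.VERBOSE)

print(bool(simple_email_re.match("user+tag@example.com")))
print(bool(simple_email_re.match("no-at-sign")))
```

Note that a literal space or `#` in a verbose pattern has to be escaped or placed inside a character class, since both are otherwise treated as layout.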

Tony Veijalainen · Accepted Answer · 2010-08-31 08:08:04Z

Just for comparison, here is my version of an email format check without a regexp (with test cases), and one readable regexp offered to me as an alternative (though sending an email after the address is accepted is a great idea):

# -*- coding: utf8 -*-
# Python 2 code: string.letters and a list-returning filter() are Python 2 only.
import re
import string
print("Valid letters in this computer are: "+string.letters)
def validateEmail(a): 
    sep=[x for x in a if not (x.isalpha() or 
                              x.isdigit() or 
                              x in r"!#$%&'*+-/=?^_`{|}~]") ] 
    sepjoined=''.join(sep) 
    ## sep joined must be ..@.... form 
    if len(a)>255 or sepjoined.strip('.') != '@': return False 
    end=a 
    for i in sep: 
        part,i,end=end.partition(i) 
        if len(part)<2: return False 
    return len(end)>1 

def emailval(address): 
    pattern = r"[\.\w]{2,}[@]\w+[.]\w+"  # raw string keeps the backslashes literal
    return re.match(pattern, address)

if __name__ == '__main__': 
    emails = [ "[email protected]","[email protected]", "[email protected]", 
               "[email protected]", "[email protected]","marjaliisa.hämälä[email protected]", 
               "marja-liisa.hämälä[email protected]", "marjaliisah@hel.",'tony@localhost',
               '[email protected]','me@somewhere'] 

    print('\n\t'.join(["Valid emails are:"] + 
                      filter(validateEmail,emails)))

    print('\n\t'.join(["Regexp gives wrong answer:"] + 
                       filter(emailval,emails)))

""" Output:
Valid letters in this computer are: abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
Valid emails are:
        [email protected]
        [email protected]
        tony@localhost
        [email protected]
        me@somewhere
Regexp gives wrong answer:
        [email protected]
        [email protected]
        [email protected]
"""

EDIT: cleaned up the regex filter function from this ancient code and edited it into a more permissive version based on @detly's link. Good enough for me as a first check on form input before sending the confirmation email. Finally added the 255-character length limit check mentioned in the comments.

This code deliberately does not accept the ordinary a@b as a valid email address, but does accept me@somewhere. Another thing is that it depends on what isalpha returns. So this output, which is from Ideone.com, has not accepted the Scandinavian ö and ä even though they are valid nowadays. When run on my home computer, they are accepted. This happens even with the coding line in place.
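For comparison, here is a sketch of how `isalpha` behaves in Python 3, which is an assumption on my part since the code above is Python 2 (where `isalpha` on 8-bit strings is locale-dependent). The helper name `alpha_report` is made up for the example:

```python
# -*- coding: utf-8 -*-

def alpha_report(ch):
    """Return (str_result, bytes_result) of isalpha() for one character."""
    # str.isalpha() is Unicode-aware; bytes.isalpha() only knows ASCII letters.
    return ch.isalpha(), ch.encode("utf-8").isalpha()

for ch in ["a", "ä", "ö", "1", "@"]:
    print(repr(ch), alpha_report(ch))
```

So whether ö and ä pass the check depends on whether the address is handled as text or as raw bytes, not only on the machine it runs on.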

kindall · Accepted Answer · 2010-08-30 05:35:17Z

0

(Deleted a regular expression which purported to be an "official" one but is in fact not found in the RFC it claimed to be from.)

This regex may be amusing as it is an attempt to precisely match the e-mail address grammar provided in an older version of the Internet mail standards.

edited Aug 30, 2010 at 5:35

answered Aug 30, 2010 at 4:47

kindall


2 Comments

paxdiablo Over a year ago

Putting "official" inside quotes is a dead giveaway that it's anything but official :-)

kindall Over a year ago

I went looking for how "official" it was and discovered that you were right. So I substituted a link to an even hairier regex that claims to fulfill most of the RFC 822 standards. :-)

damzam · Accepted Answer · 2010-08-30 15:18:06Z

-2

Regular expressions are likely the right tool for extracting/validating email addresses...

To extract one or more email addresses from raw text:

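import re

pat_e = re.compile(r'(?P<email>[\w.+-]+@(?:[\w-]+\.)+[a-zA-Z]{2,})')

def extract_emails(text):
    # Wrapped in a function so the return statement is valid.
    # Collects every substring that matches the pattern.
    return [m.group('email') for m in pat_e.finditer(text)]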

To see if a single piece of text is a valid email:

import re

pat_m = re.compile(r'([\w.+-]+@(?:[\w-]+\.)+[a-zA-Z]{2,}$)')

def is_email(text):
    # match() anchors at the start; the trailing $ anchors at the end.
    return bool(pat_m.match(text))
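A quick check of the second pattern against a few made-up addresses (the sample strings are assumptions chosen to mirror the cases raised in the comments below):

```python
import re

pat_m = re.compile(r'([\w.+-]+@(?:[\w-]+\.)+[a-zA-Z]{2,}$)')

candidates = [
    "user+tag@example.com",   # plus sign in the local part
    "info@nordic.museum",     # long top-level domain
    "a+!=@example.com",       # unusual local part that the standards permit
    "no-at-sign",
]
results = {c: bool(pat_m.match(c)) for c in candidates}
for c, ok in results.items():
    print(c, ok)
```

The first two are accepted and the last two are rejected, so the pattern works as a rough first filter but still turns away some addresses that are valid according to the standards.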

edited Aug 30, 2010 at 15:18

answered Aug 30, 2010 at 3:10

damzam


5 Comments

detly Over a year ago

This fails on anything with a plus sign (+) before the @, which is perfectly valid for an email address.

Gabe Over a year ago

What happens when they decide to create a 5-letter TLD?

Schnouki Over a year ago

Ever heard of .museum and .travel TLDs?

damzam Over a year ago

Thanks for the comments. I've updated the patterns to accommodate longer TLDs and the character '+' when it appears before the @.

detly Over a year ago

This still does not fit the standards. "a+!=" is a valid local part of the address. As is ".{^_^}."
