2

I'm a beginner-level student of Python. Here is the code I have to find instances of email addresses from a web page.

    page = urllib.request.urlopen("http://website/category")
    reg_ex = re.compile(r'[-a-z0-9._]+@([-a-z0-9]+)(\.[-a-z0-9]+)+', re.IGNORECASE
    m = reg_ex.search_all(page)
    m.group()

When I ran it, Python reported invalid syntax on this line:

    m = reg_ex.search_all(page)

Would anyone tell me why it is invalid?

5 Answers

6

Consider an alternative:

    import re

    # Suppose we have a text with many email addresses
    text = 'purple [email protected], blah monkey [email protected] blah dishwasher'

    # Here re.findall() returns a list of all the found email strings
    emails = re.findall(r'[\w\.-]+@[\w\.-]+', text)
    # ['[email protected]', '[email protected]']
    for email in emails:
        # do something with each found email string
        print(email)

Source: https://developers.google.com/edu/python/regular-expressions
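
As a rough Python 3 sketch of the same `re.findall()` approach: the sample text and the `example.com` / `example.org` addresses below are made up for illustration and are not from the original post. The last address shows that this pattern is permissive, since it accepts a domain with no dot in it.

```python
import re

# Hypothetical sample text; the addresses are placeholders chosen for this sketch.
text = "purple alice@example.com, blah monkey bob@example.org blah name@example dishwasher"

# findall() returns every non-overlapping match as a list of strings.
emails = re.findall(r"[\w.-]+@[\w.-]+", text)

for email in emails:
    print(email)
```

Because `.` is inside the character class, a sentence-ending period right after an address would also be swallowed, so some post-processing may be needed on real pages.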


5 Comments

This might be the solution the OP is looking for, but it does not answer his question...
So if the OP asks a question where he is trying to get a certain output and asks why his code doesn't work, I am only supposed to tell him why his code doesn't work and not give him a better solution?
No, do both. Explain why his didn't work then provide a solution and explain why it does work.
It was explained 4 times why his doesn't work, so I didn't want to be redundant.
This regex can also match an invalid email such as name@example, which has no TLD extension.
2

Besides, reg_ex has no search_all method. And you should pass in page.read().
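
A minimal sketch of what passing `page.read()` might look like in Python 3, where `read()` returns bytes that have to be decoded before a `str` pattern can be applied. The helper name `find_emails`, the UTF-8 default, and the sample bytes are assumptions for illustration; the pattern is the one from the question.

```python
import re

# Pattern from the question, with the closing parenthesis in place.
EMAIL_RE = re.compile(r"[-a-z0-9._]+@([-a-z0-9]+)(\.[-a-z0-9]+)+", re.IGNORECASE)

def find_emails(raw, encoding="utf-8"):
    """Hypothetical helper: decode bytes (e.g. from page.read()) and list matches."""
    text = raw.decode(encoding, errors="replace")
    # finditer() yields match objects; group() is the whole address. With this
    # pattern, findall() would return only the two captured groups instead.
    return [m.group() for m in EMAIL_RE.finditer(text)]

# Stand-in for page.read(); no network request is made in this sketch.
raw = b"<p>Mail alice@example.com or bob@mail.example.org</p>"
found = find_emails(raw)
print(found)
```

In real use, `raw` would come from `urllib.request.urlopen(url).read()`, and the page's actual encoding may differ from the assumed UTF-8.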


2

You are missing the closing ) on this line. It should read:

    reg_ex = re.compile(r'[a-z0-9._]+@([-a-z0-9]+)(\.[-a-z0-9]+)+', re.IGNORECASE)
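
To see why the error can be blamed on the following line, here is a small sketch that compiles a string of source code with the same unclosed call. The source string is a shortened stand-in for the question's code, not the original script. While the parenthesis is open, Python keeps reading onto the next line, so the problem is only noticed there.

```python
# Source text mirroring the question: the re.compile( call is never closed.
broken = (
    "import re\n"
    "reg_ex = re.compile(r'[-a-z0-9._]+@([-a-z0-9]+)', re.IGNORECASE\n"
    "m = reg_ex.findall('x')\n"
)

try:
    compile(broken, "<example>", "exec")
    error = None
except SyntaxError as exc:
    error = exc

# The reported line number and message vary between Python versions.
print(type(error).__name__, "reported on line", error.lineno)
```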

Plus, your regex is not valid, try this instead:

"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+"

FYI, validating email using regex is not that trivial, see these threads:
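
To make the difference between validating one string and finding addresses inside a larger text concrete, here is a hedged sketch using the pattern suggested above. The sample strings are invented for illustration. `re.fullmatch()` behaves as if the pattern were anchored at both ends, while `re.findall()` searches anywhere in the text.

```python
import re

# Pattern from the answer above, written as a raw string.
pattern = r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+"

# Hypothetical input text for this sketch.
text = "write to alice@example.com today, not name@example"

# Searching: finds addresses embedded in surrounding text.
found = re.findall(pattern, text)

# Validating: the whole string must be an address.
single_is_valid = re.fullmatch(pattern, "alice@example.com") is not None
whole_text_is_valid = re.fullmatch(pattern, text) is not None

print(found, single_is_valid, whole_text_is_valid)
```

An anchored pattern is useful for checking one field, but it finds nothing when applied to a whole page, which is why the unanchored form fits this question.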

2 Comments

Your suggested regex makes no sense in this use case. The OP wants to find an email address in a bunch of text, so the anchors are wrong here.
@stema ok, it was just an example, but correct, no need to put boundaries.
1

There is no .search_all method in the re module.

Maybe the one you are looking for is .findall.

You can try:

    re.findall(r"(\w(?:[-.+]?\w+)+\@(?:[a-zA-Z0-9](?:[-+]?\w+)*\.)+[a-zA-Z]{2,})", text)

I assume text is the text to search; in your case it should be text = page.read().

Or you can compile the regex first:

    r = re.compile(r"(\w(?:[-.+]?\w+)+\@(?:[a-z0-9](?:[-+]?\w+)*\.)+[a-z]{2,})", re.I)
    results = r.findall(text)

Note: .findall returns a list of matches.

If you need to iterate and get a match object for each hit, you can use .finditer

(from the example before):

    r = re.compile(r"(\w(?:[-.+]?\w+)+\@(?:[a-z0-9](?:[-+]?\w+)*\.)+[a-z]{2,})", re.I)
    for email_match in r.finditer(text):
        email_addr = email_match.group()  # or anything else you need from the match object

Now the problem is which regex to use :)
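
A short sketch of what `.finditer` gives you beyond `.findall`, using the same compiled pattern. The sample text is a made-up stand-in for the decoded page content.

```python
import re

# Pattern copied from the answer above; re.I makes it case-insensitive.
r = re.compile(r"(\w(?:[-.+]?\w+)+\@(?:[a-z0-9](?:[-+]?\w+)*\.)+[a-z]{2,})", re.I)

# Hypothetical sample text standing in for the decoded page.
text = "Contact alice@example.com or bob.smith@mail.example.org now"

# Each match object carries the matched string and where it was found.
hits = [(m.group(), m.start(), m.end()) for m in r.finditer(text)]
for addr, start, end in hits:
    print(addr, start, end)
```

The positions can be handy if you want to highlight or replace the addresses in the original text. Note that nested quantifiers like the ones in this pattern may backtrack heavily on long runs of word characters, so it is worth testing on realistic input.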


0

Change r'[-a-z0-9._]+@([-a-z0-9]+)(\.[-a-z0-9]+)+' to r'[aA-zZ0-9._]+@([aA-zZ0-9]+)(\.[aA-zZ0-9]+)+'. The - character before a-z is the cause
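
A quick check, offered as a sketch and not a verdict: in Python's `re`, a hyphen placed first in a character class is treated as a literal character, so the original pattern compiles and matches. A range such as `A-z`, on the other hand, also covers the punctuation characters that sit between `Z` and `a` in ASCII. The sample strings are invented for illustration.

```python
import re

# The question's pattern, with the closing parenthesis added.
original = re.compile(r"[-a-z0-9._]+@([-a-z0-9]+)(\.[-a-z0-9]+)+", re.IGNORECASE)

# The leading hyphen simply lets "-" appear in the local part and the domain.
hyphen_match = original.search("mail a-b@my-host.example.com now").group()

# A-z spans more than letters: it also matches [ \ ] ^ _ and the backtick.
extra_chars = re.findall(r"[A-z]", "[\\]^_`")

print(hyphen_match, extra_chars)
```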

