Scraping Email HTML via IMAP

Question

Scraping newbie here. I'm trying to write a scraper with BeautifulSoup that scrapes HTML tables from emails in a Gmail account. The script uses IMAP to check the inbox periodically. I'm not sure how to extract the HTML from each email, which I need in order to scrape the tables. Currently, it extracts the body text, not the raw HTML:

m.select("[Gmail]/All Mail") 

resp, items = m.search(None, "ALL") 
items = items[0].split() 
for emailid in items:
    resp, data = m.fetch(emailid, "(RFC822)") 
    email_body = data[0][1] # getting the mail content
    mail = email.message_from_string(email_body)  
    soup = BeautifulSoup(mail)
    tables = soup.find_all("table", width=900)
    ...

Can't answer this off the top of my head, but you probably want docs.python.org/2/library/… and then look for an item in the list with an HTML-ish MIME type. Generally an HTML email is a multipart message containing both HTML and plain text, so if BeautifulSoup is seeing the "wrong" format with your current code, you need to look for the right one.
– Steve Jessop, Commented Jan 7, 2014 at 2:12
You'll want to fetch (BODY[1]) or (BODY[2]) or so, and qp-decode that. In your case you might just start at 1 and loop upwards until you hit HTML.
– arnt, Commented Jan 7, 2014 at 6:53
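
The first comment's suggestion (find the part with an HTML MIME type) can be sketched roughly as below. This is a minimal illustration rather than anything from the thread: it assumes Python 3, where the fetched `data[0][1]` is `bytes`, and `extract_html` is a hypothetical helper name. It relies only on the standard `email` package.

```python
import email
from typing import Optional


def extract_html(raw_message: bytes) -> Optional[str]:
    """Return the first text/html part of a raw RFC 822 message, or None.

    Hypothetical helper for illustration; assumes Python 3 (bytes input).
    """
    msg = email.message_from_bytes(raw_message)
    # walk() yields the message itself and then every nested MIME part.
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            # decode=True undoes base64 / quoted-printable transfer encoding.
            payload = part.get_payload(decode=True)
            # Fall back to UTF-8 when the part declares no charset (an assumption).
            charset = part.get_content_charset() or "utf-8"
            return payload.decode(charset, errors="replace")
    return None
```

In the loop from the question, this would be called as `html = extract_html(data[0][1])`, and the result (when it is not `None`) passed to `BeautifulSoup`.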

Ben Davidow · Accepted Answer · 2014-01-07 18:26:46Z

Thanks guys! I found a very simple solution after I realized that the HTML was still being extracted, just after the body text.

for emailid in items:
    # "(RFC822)" fetches the whole message; you can also ask for headers only, etc.
    resp, data = m.fetch(emailid, "(RFC822)")
    email_body = data[0][1]  # raw message content (a str in Python 2, bytes in Python 3)
    start = email_body.find('<div')  # the HTML starts at the first <div (find returns -1 if there is none)
    html = email_body[start:]  # named html so the email module is not shadowed
    soup = BeautifulSoup(html)
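
One caveat with slicing the raw message, related to the "qp-decode" comment above: if the HTML part uses quoted-printable transfer encoding, attributes show up as `width=3D"900"` and long lines are broken with a trailing `=`, so a search such as `find_all("table", width=900)` may not match until the text is decoded. Whether a given message is encoded this way is an assumption here; the fragment below is made up for illustration, and `decode_qp_html` is a hypothetical helper built on the standard `quopri` module (Python 3).

```python
import quopri


def decode_qp_html(fragment: bytes, charset: str = "utf-8") -> str:
    """Decode a quoted-printable HTML fragment to text.

    Hypothetical helper; the charset default is an assumption.
    """
    # decodestring() turns =XX escapes back into bytes and joins
    # lines that were split with a soft line break ("=" at end of line).
    return quopri.decodestring(fragment).decode(charset)


# Made-up fragment in the form it could take inside a raw RFC822 body.
raw_fragment = b'<table width=3D"900"><tr><td>Total =E2=82=AC5</td></tr></ta=\r\nble>'
print(decode_qp_html(raw_fragment))
```

Parsing the message with the `email` package and calling `get_payload(decode=True)` on the HTML part performs this decoding as well, which avoids depending on where the first `<div` happens to be.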
