Scraping newbie here. I'm trying to write a scraper with BeautifulSoup that extracts HTML tables from emails in a Gmail account. The script uses IMAP to check the inbox periodically. I'm not sure, though, how to get the HTML out of each email, which I need in order to scrape the tables. Currently it gets the body text, not the raw HTML:

m.select("[Gmail]/All Mail") 

resp, items = m.search(None, "ALL") 
items = items[0].split() 
for emailid in items:
    resp, data = m.fetch(emailid, "(RFC822)") 
    email_body = data[0][1] # getting the mail content
    mail = email.message_from_string(email_body)  
    soup = BeautifulSoup(mail)
    tables = soup.find_all("table", width=900)
    ...
  • Can't answer this off the top of my head, but you probably want docs.python.org/2/library/… and then look for an item in the list with an HTML-ish MIME type. Generally an HTML email is a multipart message containing both HTML and plain text, hence if BeautifulSoup is seeing the "wrong" format with your current code then you need to look for the right one. Commented Jan 7, 2014 at 2:12
  • You'll want to fetch (BODY[1]) or (BODY[2]) or so, and qp-decode that. In your case you might just start at 1 and loop upwards until you hit HTML. Commented Jan 7, 2014 at 6:53
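The approach suggested in these comments can be sketched roughly as follows: parse the fetched message with the standard `email` package, walk its MIME parts, and take the first one whose type is `text/html`. The helper name `extract_html` is hypothetical, and the sketch assumes the whole message was fetched with `(RFC822)` as in the question.

```python
import email


def extract_html(raw_message):
    """Return the first text/html part of a raw RFC 822 message, or None."""
    # imaplib returns bytes on Python 3; accept str as well for convenience.
    if isinstance(raw_message, bytes):
        msg = email.message_from_bytes(raw_message)
    else:
        msg = email.message_from_string(raw_message)

    # walk() yields the message itself and every nested part, so this
    # also covers a non-multipart message that is HTML-only.
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            # decode=True undoes quoted-printable or base64 transfer encoding.
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset() or "utf-8"
            return payload.decode(charset, errors="replace")
    return None
```

Inside the fetch loop, the result of `extract_html(data[0][1])` could then be passed to `BeautifulSoup` in place of the parsed message object, skipping messages for which it returns `None`.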

1 Answer

Thanks guys! I found a very simple solution after I realized that the HTML was already in the fetched message, just after the plain-text body. Slicing the raw message from the first `<div` onward gives BeautifulSoup the HTML:

for emailid in items:
    # "(RFC822)" means "get the whole message"; you can also ask for headers only, etc.
    resp, data = m.fetch(emailid, "(RFC822)")
    email_body = data[0][1]  # raw mail content
    start = email_body.find('<div')  # the HTML starts at the first <div
    if start == -1:
        continue  # no HTML found in this message
    html = email_body[start:]  # named html so the email module is not shadowed
    soup = BeautifulSoup(html)
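One caveat with slicing the raw message: HTML parts are often sent with quoted-printable transfer encoding, in which `=` is written as `=3D` and long lines are wrapped with a trailing `=`. In that case an attribute such as `width=900` would show up as `width=3D900` in the raw text, and an attribute filter could miss it. Below is a minimal sketch of decoding such a fragment with the standard library's `quopri` module; the sample string is made up for illustration, and whether a given email is encoded this way depends on the sender.

```python
import quopri

# Hypothetical raw fragment as it might appear in a quoted-printable part:
# "=3D" stands for "=", and a trailing "=" marks a soft line break.
raw_fragment = b'<div><table width=3D"900"><tr><td>cell</td></=\r\ntr></table></div>'

# decodestring() takes and returns bytes; decode to text afterwards.
decoded = quopri.decodestring(raw_fragment).decode("utf-8")
print(decoded)
```

The decoded text can then be handed to `BeautifulSoup` as before.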