
I am trying to scrape http://emojipedia.org/emoji/ , but I am not sure what the most efficient way to do so is. What I would like to scrape is found inside the table class="emoji_list". I would like to save the contents of each "td" in separate columns. The output will be like the following, where each line represents an emoji:

Col1_Link               Col2_emoji      Col3_Comment        Col4_UTF
"/emoji/%F0%9F%98%80/"       😀        Grinning Face         U+1F600

I have written the following code so far, but I am not sure what is the best way to do that.

import requests
from bs4 import BeautifulSoup 
import urllib
import re    

url = "http://emojipedia.org/emoji/"
html = urllib.urlopen(url)
soup = BeautifulSoup(html)
soup.findAll('tr', limit=2)

Many thanks in advance for your help.

1 Answer


soup.findAll('tr', limit=2) won't do much, considering it just gets the first two trs on the page. You need to first find all the rows of the table, then extract what you want from the two tds inside each tr:

import requests
from bs4 import BeautifulSoup

url = "http://emojipedia.org/emoji/"
html = requests.get(url).content

soup = BeautifulSoup(html, "html.parser")
table = soup.select_one("table.emoji-list")

for row in table.find_all("tr")[:5]:
    td1, td2 = row.find_all("td")       # first td: link, emoji and name; second td: code point
    em, desc = td1.text.split(None, 1)  # split "😀 Grinning Face" into emoji and description
    print(td1.a["href"], em, desc, td2.text)
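To get the four columns from your question into a CSV file, the same loop can feed csv.writer. Here is a minimal, self-contained sketch; the inline HTML fragment and the column names are my assumptions, not taken from the live page:

```python
import csv
import io
from bs4 import BeautifulSoup

# Inline fragment shaped like the emoji-list table (sample markup is an
# assumption about the live page's structure)
html = """
<table class="emoji-list">
  <tr>
    <td><a href="/emoji/%F0%9F%98%80/"><span>\U0001F600</span> Grinning Face</a></td>
    <td>U+1F600</td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.select_one("table.emoji-list")

rows = []
for tr in table.find_all("tr"):
    td1, td2 = tr.find_all("td")
    em, desc = td1.text.split(None, 1)
    rows.append([td1.a["href"], em, desc.strip(), td2.text])

# Write the four columns out; swap the in-memory buffer for
# open("emoji.csv", "w", newline="") to produce a real file
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["link", "emoji", "comment", "utf"])
writer.writerows(rows)
print(buf.getvalue())
```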

Another way to get the text without splitting would be to get the text from the a tag excluding the child text, with find(text=True, recursive=False):

for row in table.find_all("tr"):
    td1, td2 = row.find_all("td")
    print(td1.a["href"], td1.a.span.text, td1.a.find(text=True, recursive=False), td2.text)
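To see what that returns, here is the same extraction run against a single inline td; the sample markup is an assumption about the page's structure:

```python
from bs4 import BeautifulSoup

html = '<td><a href="/emoji/%F0%9F%98%80/"><span>\U0001F600</span> Grinning Face</a></td>'
td1 = BeautifulSoup(html, "html.parser").td

emoji = td1.a.span.text                           # text of the child span only
comment = td1.a.find(text=True, recursive=False)  # direct text of <a>, skipping child tags
print(td1.a["href"], emoji, comment.strip())
```

Because recursive=False restricts the search to the a tag's direct children, the NavigableString " Grinning Face" is returned without the span's emoji character.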

Also, I would stick to using requests over urllib.
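As a rough illustration of the difference (nothing is actually fetched here, and the User-Agent value is just a placeholder):

```python
import requests

# One line to fetch a page with requests:   html = requests.get(url).content
# The Python 2 urllib equivalent:           html = urllib.urlopen(url).read()
# requests also makes headers, timeouts and status checks straightforward:
url = "http://emojipedia.org/emoji/"
req = requests.Request("GET", url, headers={"User-Agent": "my-scraper/0.1"}).prepare()
print(req.method, req.url)
```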


7 Comments

Thanks a lot! The "table = soup.select_one("table.emoji-list") " didn't work for me, but I used "table = soup.find('table', {'class': 'emoji-list'})"
@morfara, interesting, did you use requests to get the source?
I am new to scraping and I have to admit that it is so confusing which library is best to use. Do you know any good resources that explain why requests is better than urllib? P.s. 1: Yes, I used it, but it gives me "TypeError: 'NoneType' object is not callable". P.s. 2: For td1.text, I get u'\U0001f600 Grinning Face' as the output. Is there any easy way to keep only the English and drop the unicode? Thanks again!
What version of bs4 are you using? requests is, just as the documentation states, "HTTP for humans": it makes a lot of complicated HTTP requests very simple. It would be part of the standard lib, but the author does not want that for various reasons. If you are doing HTTP in Python, unless you really have a grasp of it, you should be using requests.
Sorry, yes I just saw the duplicate output, the edit will do the trick
