
I am trying to scrape http://emojipedia.org/emoji/ , but I am not sure what the most efficient way to do so is. What I would like to scrape is found inside the table class="emoji_list". I would like to save the contents of each "td" in separate columns. The output will be like the following, where each line represents an emoji:

Col1_Link               Col2_emoji      Col3_Comment        Col4_UTF
"/emoji/%F0%9F%98%80/"       😀        Grinning Face         U+1F600

I have written the following code so far, but I am not sure what is the best way to do that.

import requests
from bs4 import BeautifulSoup 
import urllib
import re    

url = "http://emojipedia.org/emoji/"
html = urllib.urlopen(url)
soup = BeautifulSoup(html)
soup.findAll('tr', limit=2)

Many thanks in advance for your help.

1 Answer


soup.findAll('tr', limit=2) won't do much, considering it just gets the first two trs on the page. You need to first find all the rows of the table, then extract what you want from the two tds inside each tr:

import requests
from bs4 import BeautifulSoup

url = "http://emojipedia.org/emoji/"
html = requests.get(url).content

soup = BeautifulSoup(html, "html.parser")
table = soup.select_one("table.emoji-list")

for row in table.find_all("tr")[:5]:
    td1, td2 = row.find_all("td")       # first td: link, emoji and name; second td: code point
    em, desc = td1.text.split(None, 1)  # split "😀 Grinning Face" into emoji and description
    print(td1.a["href"], em, desc, td2.text)
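To get the four columns from your question into a CSV file, the same loop can feed csv.writer. Here is a minimal, self-contained sketch; the inline HTML fragment and the column names are my assumptions, not taken from the live page:

```python
import csv
import io
from bs4 import BeautifulSoup

# Inline fragment shaped like the emoji-list table (sample markup is an
# assumption about the live page's structure)
html = """
<table class="emoji-list">
  <tr>
    <td><a href="/emoji/%F0%9F%98%80/"><span>\U0001F600</span> Grinning Face</a></td>
    <td>U+1F600</td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.select_one("table.emoji-list")

rows = []
for tr in table.find_all("tr"):
    td1, td2 = tr.find_all("td")
    em, desc = td1.text.split(None, 1)
    rows.append([td1.a["href"], em, desc.strip(), td2.text])

# Write the four columns out; swap the in-memory buffer for
# open("emoji.csv", "w", newline="") to produce a real file
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["link", "emoji", "comment", "utf"])
writer.writerows(rows)
print(buf.getvalue())
```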

Another way to get the text without splitting would be to get the text from the a tag excluding the child text, with find(text=True, recursive=False):

for row in table.find_all("tr"):
    td1, td2 = row.find_all("td")
    print(td1.a["href"], td1.a.span.text, td1.a.find(text=True, recursive=False), td2.text)
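To see what that returns, here is the same extraction run against a single inline td; the sample markup is an assumption about the page's structure:

```python
from bs4 import BeautifulSoup

html = '<td><a href="/emoji/%F0%9F%98%80/"><span>\U0001F600</span> Grinning Face</a></td>'
td1 = BeautifulSoup(html, "html.parser").td

emoji = td1.a.span.text                           # text of the child span only
comment = td1.a.find(text=True, recursive=False)  # direct text of <a>, skipping child tags
print(td1.a["href"], emoji, comment.strip())
```

Because recursive=False restricts the search to the a tag's direct children, the NavigableString " Grinning Face" is returned without the span's emoji character.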

Also, I would stick to using requests over urllib.
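As a rough illustration of the difference (nothing is actually fetched here, and the User-Agent value is just a placeholder):

```python
import requests

# One line to fetch a page with requests:   html = requests.get(url).content
# The Python 2 urllib equivalent:           html = urllib.urlopen(url).read()
# requests also makes headers, timeouts and status checks straightforward:
url = "http://emojipedia.org/emoji/"
req = requests.Request("GET", url, headers={"User-Agent": "my-scraper/0.1"}).prepare()
print(req.method, req.url)
```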


7 Comments

Thanks a lot! The "table = soup.select_one("table.emoji-list") " didn't work for me, but I used "table = soup.find('table', {'class': 'emoji-list'})"
@morfara, interesting, did you use requests to get the source?
I am new to scraping and I have to admit that it is so confusing which library is best to use. Do you know any good resources that explain why requests is better than urllib? P.s. 1: Yes, I used it, but it gives me "TypeError: 'NoneType' object is not callable". P.s. 2: For td1.text, I get u'\U0001f600 Grinning Face' as the output. Is there any easy way to keep only the English and drop the unicode? Thanks again!
What version of bs4 are you using? requests is, just as the documentation states, "HTTP for humans": it makes a lot of complicated HTTP requests very simple. It would be part of the standard lib, but the author does not want that for various reasons. If you are doing HTTP in Python, unless you really have a grasp of it, you should be using requests.
Sorry, yes I just saw the duplicate output, the edit will do the trick
