2

Maybe my terminology is a bit off here, but I hope you get the gist. I'm trying to scrape data off a food review website that has three ratings: happy, neutral, and unhappy. The count for each rating appears in the page's HTML like this:

<div class="col  PL20">
  <div class="sprite-sr2-face-smile1"></div>
  <div class="sr2_score_l">25</div>
</div>
<div class="col MR20 MT20 ML20">
  <div class="sprite-sr2-face-ok2 MT20"></div>
  <div class="sr2_score_m">17</div>
</div>
<div class="col ML10 MT20">
  <div class="sprite-sr2-face-cry2 MT20"></div>
  <div class="sr2_score_m">2</div>
</div>

So in this case the happy count is 25, neutral is 17, and unhappy is 2. The problem is that with my Python code below I cannot differentiate between the neutral count and the unhappy count, because they share the same class. Is there a way around this?

# using BeautifulSoup4 and lxml
import urllib2
from bs4 import BeautifulSoup
# the URL is split into two adjacent string literals only to keep the line short
soup = BeautifulSoup(urllib2.urlopen(
    'http://www.openrice.com/'
    'en/hongkong/restaurant/central-open-kitchen/136799').read())

happy = soup.find('div', attrs={'class': 'sr2_score_l'})
print "happy rating, " + happy.string

neutral = soup.find('div', attrs={'class': 'sr2_score_m'})
print "neutral rating, " + neutral.string

# same class as the neutral count, so find() returns the neutral div again
unhappy = soup.find('div', attrs={'class': 'sr2_score_m'})
print "unhappy rating, " + unhappy.string

3 Answers

1

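The face-smile, face-ok and face-cry parts of the class names are your indicators. Find each icon div by that part of its class, then read the div that follows it (this needs import re):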

happy = soup.find("div", class_=re.compile(r"face-smile")).find_next_sibling("div").text
ok = soup.find("div", class_=re.compile(r"face-ok")).find_next_sibling("div").text
unhappy = soup.find("div", class_=re.compile(r"face-cry")).find_next_sibling("div").text

Example code (with a nice reusable function):

import re

from bs4 import BeautifulSoup


def print_reviews_count(soup):
    indicators = {
        "happy": "face-smile",
        "ok": "face-ok",
        "unhappy": "face-cry",
    }

    for key, class_name in indicators.iteritems():
        count = soup.find("div", class_=re.compile(class_name)).find_next_sibling("div").text
        print(key, count)


source_code = """
<div class="col  PL20">
  <div class="sprite-sr2-face-smile1"></div>
  <div class="sr2_score_l">25</div>
</div>
<div class="col MR20 MT20 ML20">
  <div class="sprite-sr2-face-ok2 MT20"></div>
  <div class="sr2_score_m">17</div>
</div>
<div class="col ML10 MT20">
  <div class="sprite-sr2-face-cry2 MT20"></div>
  <div class="sr2_score_m">2</div>
</div>
"""

soup = BeautifulSoup(source_code, "lxml")
print_reviews_count(soup)

Prints:

('ok', u'17')
('unhappy', u'2')
('happy', u'25')
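
For comparison, the same idea can be sketched so that it returns the counts as a dictionary instead of printing them, which also avoids relying on the order in which the pairs are printed. This is only a minimal sketch written for Python 3 (`.items()` rather than `.iteritems()`); the names `get_review_counts` and `INDICATORS`, the conversion to `int`, and the use of the built-in `html.parser` instead of lxml are assumptions made for illustration, not part of the code above.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical mapping: rating label -> fragment of the face icon's class name.
INDICATORS = {
    "happy": "face-smile",
    "ok": "face-ok",
    "unhappy": "face-cry",
}


def get_review_counts(soup):
    """Return {label: count}, reading the div that follows each face icon."""
    counts = {}
    for label, fragment in INDICATORS.items():
        icon = soup.find("div", class_=re.compile(fragment))
        if icon is None:
            continue  # this rating's icon is not on the page
        score = icon.find_next_sibling("div")
        if score is not None:
            counts[label] = int(score.get_text(strip=True))
    return counts


sample_html = """
<div class="col  PL20">
  <div class="sprite-sr2-face-smile1"></div>
  <div class="sr2_score_l">25</div>
</div>
<div class="col MR20 MT20 ML20">
  <div class="sprite-sr2-face-ok2 MT20"></div>
  <div class="sr2_score_m">17</div>
</div>
<div class="col ML10 MT20">
  <div class="sprite-sr2-face-cry2 MT20"></div>
  <div class="sr2_score_m">2</div>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
counts = get_review_counts(soup)
print(counts)
```

Because the dictionary and the function are defined together in one block, running it as a whole should not depend on anything defined elsewhere.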

1 Comment

Thanks. The first block of code works, but when I try to execute the second block in a different pane, it keeps saying the variable indicators isn't defined.

0

I see two possible solutions:

  • Add another html class if you can.

or

  • Search for the class "sprite-sr2-face-cry2" in the line before the one where you found "sr2_score_m".

To do this you could create a list of strings from your html file using .splitlines(), then iterate over it and search for both classes.
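
The line-by-line idea could be sketched roughly as follows, using only the standard library. This is a hedged illustration, not a robust parser: the function name `count_for_icon` and the `score_marker` parameter are made up here, and the sketch assumes the icon div and the score div sit on consecutive lines exactly as in the snippet from the question, which a real page (for example a minified one) may not guarantee.

```python
def count_for_icon(html, icon_class, score_marker="sr2_score_"):
    """Return the number on the score line right after the icon line, or None."""
    lines = html.splitlines()
    for i, line in enumerate(lines[:-1]):
        next_line = lines[i + 1]
        if icon_class in line and score_marker in next_line:
            # The count is the text between the opening tag's '>' and '</div>'.
            start = next_line.index(">") + 1
            end = next_line.index("</", start)
            return int(next_line[start:end])
    return None


sample_html = """
<div class="col  PL20">
  <div class="sprite-sr2-face-smile1"></div>
  <div class="sr2_score_l">25</div>
</div>
<div class="col MR20 MT20 ML20">
  <div class="sprite-sr2-face-ok2 MT20"></div>
  <div class="sr2_score_m">17</div>
</div>
<div class="col ML10 MT20">
  <div class="sprite-sr2-face-cry2 MT20"></div>
  <div class="sr2_score_m">2</div>
</div>
"""

neutral = count_for_icon(sample_html, "sprite-sr2-face-ok2")
unhappy = count_for_icon(sample_html, "sprite-sr2-face-cry2")
print(neutral, unhappy)
```

Matching on the icon class first is what separates the two divs that share the "sr2_score_m" class.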

0

Actually, with the help from you guys I've managed to write quite a nice function that I can reuse for a list of website URLs:

import re
import urllib2 
from bs4 import BeautifulSoup

website_list = [urlA, urlB....,urlX]

def ratings(website):
    soup = BeautifulSoup(urllib2.urlopen(website).read())
    # each count sits in the div that follows the matching face icon
    happy = soup.find("div", class_=re.compile(r"face-smile")).find_next_sibling("div").string
    ok = soup.find("div", class_=re.compile(r"face-ok")).find_next_sibling("div").string
    unhappy = soup.find("div", class_=re.compile(r"face-cry")).find_next_sibling("div").string
    # happy, ok and unhappy are already strings, so no second .string is needed
    print "happy rating, " + happy
    print "ok rating, " + ok
    print "unhappy rating, " + unhappy

for website in website_list:
    ratings(website)
