Scraping html in python when you have more than one class with the same name

Question

Maybe my terminology is a bit off here, but I hope you get the gist. I'm trying to scrape data off a food review website which has three ratings: happy, neutral, and unhappy. The count for each rating appears in the page's HTML like this:

<div class="col  PL20">
  <div class="sprite-sr2-face-smile1"></div>
  <div class="sr2_score_l">25</div>
</div>
<div class="col MR20 MT20 ML20">
  <div class="sprite-sr2-face-ok2 MT20"></div>
  <div class="sr2_score_m">17</div>
</div>
<div class="col ML10 MT20">
  <div class="sprite-sr2-face-cry2 MT20"></div>
  <div class="sr2_score_m">2</div>
</div>

So in this case the happy count is 25, the neutral count is 17, and the unhappy count is 2. The problem is that with my Python code below I cannot differentiate between the neutral count and the unhappy count because they share the same class. Is there a way around this?

# using BeautifulSoup4 and lxml
import urllib2 
from bs4 import BeautifulSoup  
soup = BeautifulSoup(urllib2.urlopen(
    'http://www.openrice.com/_'
    'en/hongkong/restaurant/central-open-kitchen/136799').read())

happy = soup.find('div', attrs={'class': 'sr2_score_l'})
print "happy rating, " + happy.string

neutral = soup.find('div', attrs={'class': 'sr2_score_m'})
print "neutral rating, " + neutral.string

unhappy = soup.find('div', attrs={'class': 'sr2_score_m'})
print "unhappy rating, " + unhappy.string

alecxe · Accepted Answer · 2015-09-14 16:14:04Z

1

The face-smile, face-ok and face-cry parts of the class names are your indicators:

happy = soup.find("div", class_=re.compile(r"face-smile")).find_next_sibling("div").text
ok = soup.find("div", class_=re.compile(r"face-ok")).find_next_sibling("div").text
unhappy = soup.find("div", class_=re.compile(r"face-cry")).find_next_sibling("div").text

Example code (with a nice reusable function):

import re

from bs4 import BeautifulSoup


def print_reviews_count(soup):
    indicators = {
        "happy": "face-smile",
        "ok": "face-ok",
        "unhappy": "face-cry",
    }

    for key, class_name in indicators.iteritems():
        count = soup.find("div", class_=re.compile(class_name)).find_next_sibling("div").text
        print(key, count)


source_code = """
<div class="col  PL20">
  <div class="sprite-sr2-face-smile1"></div>
  <div class="sr2_score_l">25</div>
</div>
<div class="col MR20 MT20 ML20">
  <div class="sprite-sr2-face-ok2 MT20"></div>
  <div class="sr2_score_m">17</div>
</div>
<div class="col ML10 MT20">
  <div class="sprite-sr2-face-cry2 MT20"></div>
  <div class="sr2_score_m">2</div>
</div>
"""

soup = BeautifulSoup(source_code, "lxml")
print_reviews_count(soup)

Prints:

('ok', u'17')
('unhappy', u'2')
('happy', u'25')
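The function above targets Python 2 (iteritems, and the tuple-style output of print). A minimal sketch of the same idea for Python 3 follows. The function name get_review_counts, the returned dictionary, the conversion to int, and the use of the built-in html.parser instead of lxml are assumptions made for illustration, not part of the original answer.

```python
import re

from bs4 import BeautifulSoup

# Substrings of the icon class names that identify each rating.
INDICATORS = {
    "happy": "face-smile",
    "ok": "face-ok",
    "unhappy": "face-cry",
}


def get_review_counts(soup):
    """Return a dict mapping each rating name to its count as an int."""
    counts = {}
    for key, class_part in INDICATORS.items():
        # Locate the icon div, then read the score from the div right after it.
        icon = soup.find("div", class_=re.compile(class_part))
        score = icon.find_next_sibling("div")
        counts[key] = int(score.get_text(strip=True))
    return counts


sample_html = """
<div class="col  PL20">
  <div class="sprite-sr2-face-smile1"></div>
  <div class="sr2_score_l">25</div>
</div>
<div class="col MR20 MT20 ML20">
  <div class="sprite-sr2-face-ok2 MT20"></div>
  <div class="sr2_score_m">17</div>
</div>
<div class="col ML10 MT20">
  <div class="sprite-sr2-face-cry2 MT20"></div>
  <div class="sr2_score_m">2</div>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
counts = get_review_counts(soup)
print(counts)
```

Returning a dictionary rather than printing inside the function keeps the counts available for later use, for example when collecting results from several pages.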


1 Comment

pakkunrob Over a year ago

Thanks. The first block of code works, but when I try to execute the second block in a different pane, it keeps saying the variable indicators isn't defined.

Raimund Krämer · Accepted Answer · 2015-09-14 15:49:55Z

0

I see two possible solutions:

Add another html class if you can.

or

Search for the class "sprite-sr2-face-cry2" in the line before the one where you found "sr2_score_m".

To do this you could create a list of strings from your html file using .splitlines(), then iterate over it and search for both classes.
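The line-based approach described above could be sketched roughly as follows (Python 3). It assumes that each score div sits on the line immediately after its icon div, as in the question's markup. The helper name counts_by_line and the regular expression are illustrative assumptions, and this kind of line matching breaks as soon as the page's line layout changes, so an HTML parser is generally the more robust choice.

```python
import re

# Matches a score div such as <div class="sr2_score_m">17</div>
# and captures the number inside it.
SCORE_RE = re.compile(r'class="sr2_score_\w+">\s*(\d+)\s*<')

# Substrings of the icon class names that identify each rating.
INDICATORS = {
    "happy": "face-smile",
    "ok": "face-ok",
    "unhappy": "face-cry",
}


def counts_by_line(html):
    """Pair each icon line with the score found on the following line."""
    lines = html.splitlines()
    counts = {}
    # Stop one line early so that lines[index + 1] always exists.
    for index, line in enumerate(lines[:-1]):
        for key, class_part in INDICATORS.items():
            if class_part in line:
                match = SCORE_RE.search(lines[index + 1])
                if match:
                    counts[key] = int(match.group(1))
    return counts


sample_html = """
<div class="col  PL20">
  <div class="sprite-sr2-face-smile1"></div>
  <div class="sr2_score_l">25</div>
</div>
<div class="col MR20 MT20 ML20">
  <div class="sprite-sr2-face-ok2 MT20"></div>
  <div class="sr2_score_m">17</div>
</div>
<div class="col ML10 MT20">
  <div class="sprite-sr2-face-cry2 MT20"></div>
  <div class="sr2_score_m">2</div>
</div>
"""

line_counts = counts_by_line(sample_html)
print(line_counts)
```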


pakkunrob · Accepted Answer · 2015-09-15 10:58:01Z

Actually, with the help from you guys, I've managed to write quite a nice function that I can reuse for a list of website URLs:

import re
import urllib2 
from bs4 import BeautifulSoup

website_list = [urlA, urlB....,urlX]

def ratings(website):
    soup = BeautifulSoup(urllib2.urlopen(website).read())
    happy = soup.find("div", class_=re.compile(r"face-smile")).find_next_sibling("div").string
    ok = soup.find("div", class_=re.compile(r"face-ok")).find_next_sibling("div").string
    unhappy = soup.find("div", class_=re.compile(r"face-cry")).find_next_sibling("div").string
    print "happy rating, " + happy
    print "ok rating, " + ok
    print "unhappy rating, " + unhappy

for website in website_list:
    ratings(website)
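One caveat with the function above: soup.find returns None when no matching element exists, so the chained find_next_sibling call raises an AttributeError on a page that lacks one of the icons. A hedged Python 3 sketch that reports None for a missing rating instead is shown below. The function name safe_ratings and the sample page are hypothetical, and it takes HTML text rather than a URL so that it runs without network access.

```python
import re

from bs4 import BeautifulSoup

# Substrings of the icon class names that identify each rating.
INDICATORS = {
    "happy": "face-smile",
    "ok": "face-ok",
    "unhappy": "face-cry",
}


def safe_ratings(html):
    """Return rating name -> count text, or None where a rating is absent."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for key, class_part in INDICATORS.items():
        icon = soup.find("div", class_=re.compile(class_part))
        # Only look for the sibling when the icon div was actually found.
        score = icon.find_next_sibling("div") if icon is not None else None
        result[key] = score.get_text(strip=True) if score is not None else None
    return result


# Hypothetical page that has no "unhappy" block at all.
partial_page = """
<div class="col  PL20">
  <div class="sprite-sr2-face-smile1"></div>
  <div class="sr2_score_l">25</div>
</div>
<div class="col MR20 MT20 ML20">
  <div class="sprite-sr2-face-ok2 MT20"></div>
  <div class="sr2_score_m">17</div>
</div>
"""

partial_result = safe_ratings(partial_page)
print(partial_result)
```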
