The URL to scrape: http://aqicn.org/city/chennai//us-consulate/
The goal is to obtain the "pm2.5aqi", "temperature", "humidity" and "pressure" data from the website.

The problem: the data I scraped and the data I see when viewing the website's page source are NOT the same.

The code I used to scrape and display the data:

from bs4 import BeautifulSoup
import urllib2
import cookielib

url="http://aqicn.org/city/chennai//us-consulate/"

# Build an opener that follows redirects and keeps cookies between requests.
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPRedirectHandler(),
        urllib2.HTTPHandler(debuglevel=0),
        urllib2.HTTPSHandler(debuglevel=0),
        urllib2.HTTPCookieProcessor(cj))

# Fetch the HTML exactly as the server sends it (no JavaScript is run) and parse it.
page=opener.open(url)
page_soup=BeautifulSoup(page.read(),'html.parser')

# Each reading sits in a <td> that is looked up by its id attribute.
print "curr, max, min pmi2.5 aqi : ",
print page_soup.find('td',id='cur_pm25').string,"     ",page_soup.find('td',id='max_pm25').string,"  ",page_soup.find('td',id='min_pm25').string

# The temperature cells wrap their value in a <span>.
print "curr, max, min temp : ",
print page_soup.find('td',id='cur_t').span.string,"  ",page_soup.find('td',id='max_t').span.string,"  ",page_soup.find('td',id='min_t').span.string

print "curr, max, min pressure : ",
print page_soup.find('td',id='cur_p').string,"  ",page_soup.find('td',id='max_p').string,"  ",page_soup.find('td',id='min_p').string

print "curr, max, min humidity : ",
print page_soup.find('td',id='cur_h').string,"  ",page_soup.find('td',id='max_h').string,"  ",page_soup.find('td',id='min_h').string



What I did: I manually identified the tags in the page source that contain the values, then printed the values of the same tags from the scraped data.

Surprisingly, the data displayed by my script and the data present in the page's source were different.

My scraped data :

curr, max, min pmi2.5 aqi :  143    157    109
curr, max, min temp :  24    30    24
curr, max, min pressure :  1012    1014    1010
curr, max, min humidity :  100    100    62


The data on the website was (it can be verified from the link, but the values may be outdated by now, since this is real-time data):

curr, max, min pmi2.5 aqi : 108   166   94
curr, max, min temp : 27   30   24
curr, max, min pressure : 1013   1014   1010
curr, max, min humidity : 83   100   62


I checked the same tags again in the page source, and located the same area in the scraped output by making Python display the soup using:

print page_soup.prettify()


But the data was NOT the same.
How is this possible? Can someone please explain why this weird behaviour occurs, and suggest a workaround or solution for this problem?

1 Answer

The real-time data is rendered by a script, which replaces the default data embedded in the page. That default data is what you scraped. I don't know why they put default data in at all: it is misleading, and it is supposed to be replaced every time. In the cases where it isn't replaced, an error message would be better than showing the wrong data.
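
To make the mechanism concrete, here is a minimal sketch using only the Python 3 standard library. The HTML below is a made-up stand-in for the real page (the `cur_pm25` id comes from the question; the inline script and the values are assumptions for illustration). A static parser only reads the markup the server sent, so it reports the placeholder, not the value the script would write in a browser.

```python
from html.parser import HTMLParser

# Hypothetical page: the server sends a placeholder value, and an inline
# script would overwrite it once the page loads in a real browser.
SAMPLE_HTML = """
<table>
  <tr><td id="cur_pm25">143</td></tr>
</table>
<script>
  document.getElementById("cur_pm25").textContent = "108";
</script>
"""


class CellTextParser(HTMLParser):
    """Collect the text inside the <td> whose id matches target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.inside = False
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "td" and dict(attrs).get("id") == self.target_id:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.inside = False

    def handle_data(self, data):
        # Script bodies also arrive here, but only as inert text outside the cell.
        if self.inside:
            self.text += data


def static_value(html, target_id):
    """Return the cell text as a non-JavaScript client sees it."""
    parser = CellTextParser(target_id)
    parser.feed(html)
    return parser.text.strip()


print(static_value(SAMPLE_HTML, "cur_pm25"))  # prints 143: the script never ran
```

Beautiful Soup behaves the same way on such a page, because it also parses markup without executing any script.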

If you want to scrape this, look into a web driver such as Selenium to render the page for you, then run the rendered source through Beautiful Soup.
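
A rough sketch of that approach in Python 3 might look like the following. Treat it as a starting point, not a tested recipe for this site: it assumes Selenium 4 with a local Chrome install, the helper names `fetch_rendered_html` and `extract_readings` are made up for this example, and the wait condition is a guess, since the `cur_pm25` cell already exists in the static HTML and its presence does not prove the script has finished updating it.

```python
from bs4 import BeautifulSoup


def fetch_rendered_html(url, wait_seconds=10):
    """Load url in headless Chrome and return the DOM after scripts have run."""
    # Imported lazily so extract_readings() can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # assumption: a recent Chrome build
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Assumption: waiting for the cell is a reasonable readiness signal.
        # A stricter check may be needed to be sure the value was updated.
        WebDriverWait(driver, wait_seconds).until(
            EC.presence_of_element_located((By.ID, "cur_pm25"))
        )
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on errors


def extract_readings(html, cell_ids):
    """Map each <td> id to its text, or None when the cell is missing."""
    soup = BeautifulSoup(html, "html.parser")
    readings = {}
    for cell_id in cell_ids:
        cell = soup.find("td", id=cell_id)
        # get_text() also covers cells that wrap the value in a <span>.
        readings[cell_id] = cell.get_text(strip=True) if cell is not None else None
    return readings
```

With a browser available, something like `extract_readings(fetch_rendered_html(url), ["cur_pm25", "cur_t", "cur_p", "cur_h"])` should return the values as the browser displays them. Keeping the parsing in its own function means it can be checked against saved HTML without launching Chrome.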

7 Comments

So you mean that when my scraping script fetches the page, the website's server fails to render the real data (a failure of its own script), and so all I am retrieving is the default data?
No. The default data is a red herring that unfortunately sits in the page; it means nothing to you. Beautiful Soup can't run the JavaScript that populates the page. You need something like Selenium, which can run the JavaScript and populate the page.
Aha, so the JavaScript in the returned data has to be run to put the real data in the right place, but the soup doesn't run the script. (I will check the JavaScript in the page source now, assuming I have understood this correctly.)
Exactly, good luck! Usually, printing the soup will show you that the data hasn't been rendered. The confusing thing here is that they put in that default data.
This might help get you started: stackoverflow.com/questions/33580004/…