Python data scraping - Elementary concepts

Question

I am trying to get my head around how data scraping works when you look past HTML (i.e. DOM scraping).

I've been trying to write a simple Python script to automatically retrieve the number of people that have seen a specific ad: the part where it says '3365 people viewed Peter's place this week.'

At first I tried to see if that was displayed in the HTML code but could not find it. I did some research and saw that not everything will be in the HTML source, as it can be processed by the browser through JavaScript or other languages that I don't quite understand yet. I then inspected the element and realised that I would need to use the Python libraries 'requests' and 'lxml.html'. So I wrote this code:

import requests
import lxml.html

response = requests.get('https://www.airbnb.co.uk/rooms/501171')
resptext = lxml.html.fromstring(response.text)
final = resptext.text_content()
finalu = final.encode('utf-8')

# Write the encoded bytes; binary mode works on both Python 2 and Python 3.
with open('file.txt', 'wb') as f:
    f.write(finalu)

With that, I get a file with all the text in the web page, but not the text that I am looking for, which is the magic number 3365.

So my question is: how do I get it? I have thought that maybe I am not using the correct language to get the DOM, maybe it is done with JavaScript and I am only using lxml. However, I have no idea.

score 2 · Accepted Answer · 2015-05-03 15:40:13Z

The DOM element you are looking at is updated after page load with what looks like an AJAX call with the following request URL:

https://www.airbnb.co.uk/rooms/501171/personalization.json

If you GET that URL, it will return the following JSON data:

{
   "extras_price":"&pound;30",
   "preview_bar_phrases":{
      "steps_remaining":"<strong>1 step</strong> to list"
   },
   "flag_info":{

   },
   "user_is_admin":false,
   "is_owned_by_user":false,
   "is_instant_bookable":true,
   "instant_book_reasons":{
      "within_max_lead_time":null,
      "within_max_nights":null,
      "enough_lead_time":true,
      "valid_reservation_status":null,
      "not_country_or_village":true,
      "allowed_noone":null,
      "allowed_everyone":true,
      "allowed_socially_connected":null,
      "allowed_experienced_guest":null,
      "is_instant_book_host":true,
      "guest_has_profile_pic":null
   },
   "instant_book_experiments":{
      "ib_max_nights":14
   },
   "lat":51.5299601405844,
   "lng":-0.12462748035984603,
   "localized_people_pricing_description":"&pound;30 / night after 2 guests",
   "monthly_price":"&pound;4200",
   "nightly_price":"&pound;150",
   "security_deposit":"",
   "social_connections":{
      "connected":null
   },
   "staggered_price":"&pound;4452",
   "weekly_price":"&pound;1050",
   "show_disaster_info":false,
   "cancellation_policy":"Strict",
   "cancellation_policy_link":"/home/cancellation_policies#strict",
   "show_fb_cta":true,
   "should_show_review_translations":false,
   "listing_activity_data":{
      "day":{
         "unique_views":226,
         "total_views":363
      },
      "week":{
         "unique_views":3365,
         "total_views":5000
      }
   },
   "should_hide_action_buttons":false
}

If you look under "listing_activity_data" you will find the information you seek. Appending /personalization.json to any room URL seems to return this data (for now).
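As a minimal sketch of reading that value once the JSON is in hand: the snippet below parses a trimmed copy of the response shown above and pulls out the weekly unique views. The helper name weekly_unique_views is made up for illustration, and the `.get` fallbacks are an assumption in case the endpoint ever omits those keys.

```python
import json

# Trimmed copy of the response shown above; only the relevant keys are kept.
payload = '''
{
   "listing_activity_data": {
      "day":  {"unique_views": 226,  "total_views": 363},
      "week": {"unique_views": 3365, "total_views": 5000}
   }
}
'''


def weekly_unique_views(raw_json):
    """Return the weekly unique view count, or None if the keys are missing.

    Hypothetical helper, not part of any library.
    """
    data = json.loads(raw_json)
    activity = data.get("listing_activity_data", {})
    return activity.get("week", {}).get("unique_views")


print(weekly_unique_views(payload))  # 3365
```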

Update per the user agent issues

It looks like they are filtering requests to this URL based on user agent. I had to set the user agent on the urllib request in order to fix this:

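import urllib2  # Python 2; on Python 3 the equivalent module is urllib.request
import json

# The endpoint appears to reject requests without a browser-like User-Agent.
headers = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request('https://www.airbnb.co.uk/rooms/501171/personalization.json', None, headers)
data = json.load(urllib2.urlopen(req))  # named 'data' so the json module is not shadowed

print(data['listing_activity_data']['week']['unique_views'])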
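For anyone on Python 3, where urllib2 no longer exists, here is a rough equivalent using urllib.request from the standard library. The function names build_request and fetch_weekly_views are made up for this sketch, and it assumes the endpoint still behaves as described above, which may no longer be the case.

```python
import json
import urllib.request


def build_request(room_id, user_agent="Mozilla/5.0"):
    """Build (but do not send) a GET request for a room's personalization.json.

    Hypothetical helper; the URL pattern is the one observed in this answer.
    """
    url = "https://www.airbnb.co.uk/rooms/{}/personalization.json".format(room_id)
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_weekly_views(room_id):
    """Send the request and return the weekly unique view count."""
    with urllib.request.urlopen(build_request(room_id), timeout=10) as resp:
        data = json.loads(resp.read().decode("utf-8"))
    return data["listing_activity_data"]["week"]["unique_views"]
```

Calling fetch_weekly_views(501171) performs the actual network request. Keeping the request construction in its own function makes it possible to inspect the URL and headers without touching the network.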

edited May 3, 2015 at 15:40

answered May 3, 2015 at 15:24

user862319

3 Comments

jmramosfran Over a year ago

Thank you so much! That was so helpful and so much easier than I thought! I just have a couple more questions: how did you know that JSON page existed? And for some reason, when I try the full link in the code that I have, I still do not get the 3365 views number :( It does work, however, when I put it in Chrome.

user862319 Over a year ago

See my update above. I knew there was an AJAX call by looking at the ID field on the DIV and searching through all of the JavaScript for a matching controller. Once you know it is an AJAX call, all you have to do is capture all of the requests in the browser's dev tools and look for the one that has the data you need in the response.

jmramosfran Over a year ago

Perfect! Thank you for the explanation and for helping me solve the problem! I'll keep on coding! :D

score 0 · Answer · 2015-05-03 15:35:10Z

First of all, you need to figure out whether that section of the page has any unique tags or IDs. If you look at the HTML tree you have:

html > body > #room > ....... > #book-it-urgency-commitment > div > div > ... > div#media-body > b

The data you need is stored in a 'b' tag. I'm not sure about using lxml, but I usually use BeautifulSoup for my scraping.

You can refer to http://www.crummy.com/software/BeautifulSoup/bs4/doc/ for details; it's pretty straightforward.
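To illustrate the idea of walking down to the 'b' tag, here is a dependency-free sketch that uses html.parser from the standard library as a stand-in for BeautifulSoup. The sample markup is hypothetical and only mimics the tree above as it would look after the browser has filled it in; since the number is loaded after page load, the raw HTML returned by a plain GET may not contain it. The class name BoldInContainer is also made up for this example.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; skipped so depth counting stays balanced.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class BoldInContainer(HTMLParser):
    """Collect the text of <b> tags nested inside the element with a given id."""

    def __init__(self, container_id):
        super().__init__()
        self.container_id = container_id
        self.depth = 0          # nesting depth inside the container (0 = outside)
        self.in_bold = False
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
            if tag == "b":
                self.in_bold = True
        elif dict(attrs).get("id") == self.container_id:
            self.depth = 1      # entered the container

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if tag == "b":
            self.in_bold = False

    def handle_data(self, data):
        if self.in_bold:
            self.found.append(data.strip())


# Hypothetical rendered markup, modelled on the tree described above.
sample_html = """
<div id="book-it-urgency-commitment">
  <div><div id="media-body"><b>3365 people</b> viewed Peter's place this week.</div></div>
</div>
"""

parser = BoldInContainer("book-it-urgency-commitment")
parser.feed(sample_html)
views = int(parser.found[0].split()[0])
print(views)  # 3365
```

With BeautifulSoup the same lookup is typically much shorter, since it lets you select elements by ID and tag name directly instead of tracking nesting by hand.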

answered May 3, 2015 at 15:35

d123

2 Comments

jmramosfran Over a year ago

Thanks, will give it a try although the other answer did it for me

d123 Over a year ago

:) Yup, looking at the JSON works as well, but if you want to scrape other stuff that's not in the JSON, this would be the way.
