
I am trying to get my head around how data scraping works when you look past HTML (i.e. DOM scraping).

I've been trying to write a simple Python script to automatically retrieve the number of people who have seen a specific ad: the part where it says '3365 people viewed Peter's place this week.'

At first I tried to see if that was displayed in the HTML code but could not find it. I did some research and saw that not everything will be in the HTML, as it can be processed by the browser through JavaScript or other languages that I don't quite understand yet. I then inspected the element and realised that I would need to use the Python libraries 'requests' and 'lxml.html'. So I wrote this code:

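import requests
import lxml.html

# Download the page and parse the returned HTML.
response = requests.get('https://www.airbnb.co.uk/rooms/501171')
resptext = lxml.html.fromstring(response.text)

# Collect all the text nodes of the parsed document.
final = resptext.text_content()

# Write the text as UTF-8; the with block closes the file automatically.
with open('file.txt', 'w', encoding='utf-8') as file:
    file.write(final)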

With that, I get a file with all the text in the web page, but not the text that I am looking for, which is the magic number 3365!

So my question is: how do I get it? I have thought that maybe I am not using the correct language to get the DOM, maybe it is done with JavaScript and I am only using lxml. However, I have no idea.

2 Answers

2

The DOM element you are looking at is updated after page load with what looks like an AJAX call with the following request URL:

https://www.airbnb.co.uk/rooms/501171/personalization.json

If you GET that URL, it will return the following JSON data:

{
   "extras_price":"£30",
   "preview_bar_phrases":{
      "steps_remaining":"<strong>1 step</strong> to list"
   },
   "flag_info":{

   },
   "user_is_admin":false,
   "is_owned_by_user":false,
   "is_instant_bookable":true,
   "instant_book_reasons":{
      "within_max_lead_time":null,
      "within_max_nights":null,
      "enough_lead_time":true,
      "valid_reservation_status":null,
      "not_country_or_village":true,
      "allowed_noone":null,
      "allowed_everyone":true,
      "allowed_socially_connected":null,
      "allowed_experienced_guest":null,
      "is_instant_book_host":true,
      "guest_has_profile_pic":null
   },
   "instant_book_experiments":{
      "ib_max_nights":14
   },
   "lat":51.5299601405844,
   "lng":-0.12462748035984603,
   "localized_people_pricing_description":"&pound;30 / night after 2 guests",
   "monthly_price":"&pound;4200",
   "nightly_price":"&pound;150",
   "security_deposit":"",
   "social_connections":{
      "connected":null
   },
   "staggered_price":"&pound;4452",
   "weekly_price":"&pound;1050",
   "show_disaster_info":false,
   "cancellation_policy":"Strict",
   "cancellation_policy_link":"/home/cancellation_policies#strict",
   "show_fb_cta":true,
   "should_show_review_translations":false,
   "listing_activity_data":{
      "day":{
         "unique_views":226,
         "total_views":363
      },
      "week":{
         "unique_views":3365,
         "total_views":5000
      }
   },
   "should_hide_action_buttons":false
}

If you look under "listing_activity_data" you will find the information you seek. Appending /personalization.json to any room URL seems to return this data (for now).
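
As a rough sketch of that lookup, the snippet below builds the personalization URL from a room URL and reads the view counts out of an already-parsed response. The helper names personalization_url and unique_views are made up for illustration, and SAMPLE is just the "listing_activity_data" part of the JSON shown above, so nothing here touches the network.

```python
import json


def personalization_url(room_url):
    """Append /personalization.json to a room URL (hypothetical helper)."""
    return room_url.rstrip("/") + "/personalization.json"


def unique_views(payload, period="week"):
    """Read the unique view count for 'day' or 'week' (hypothetical helper)."""
    return payload["listing_activity_data"][period]["unique_views"]


# Trimmed copy of the response shown above; only the relevant keys are kept.
SAMPLE = json.loads("""
{
   "listing_activity_data": {
      "day":  {"unique_views": 226,  "total_views": 363},
      "week": {"unique_views": 3365, "total_views": 5000}
   }
}
""")

print(personalization_url("https://www.airbnb.co.uk/rooms/501171"))
print(unique_views(SAMPLE))         # 3365
print(unique_views(SAMPLE, "day"))  # 226
```

Since the endpoint is undocumented, the key names may change; a KeyError from unique_views would be the first sign of that.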

Update per the user agent issues

It looks like they are filtering requests to this URL based on user agent. I had to set the user agent on the urllib request in order to fix this:

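import json
import urllib.request

# The endpoint appears to reject requests without a browser-like user agent.
headers = {'User-Agent': 'Mozilla/5.0'}
req = urllib.request.Request('http://www.airbnb.co.uk/rooms/501171/personalization.json', None, headers)

# Use a separate name for the parsed data so the json module is not shadowed.
with urllib.request.urlopen(req) as resp:
    data = json.load(resp)

print(data['listing_activity_data']['week']['unique_views'])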
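
Since the question already uses requests, the same call could probably be written with that library instead. This is only a sketch: the function name fetch_weekly_unique_views, the optional session argument and the 10 second timeout are my own assumptions, and the endpoint may stop working at any time.

```python
import requests

# Browser-like user agent; without one the endpoint seemed to filter the request.
HEADERS = {"User-Agent": "Mozilla/5.0"}


def fetch_weekly_unique_views(room_url, session=None):
    """Fetch <room_url>/personalization.json and return the weekly unique views.

    `session` is an optional requests.Session (or any object with a compatible
    get method); it is an assumption added here to make the function easy to test.
    """
    http = session or requests
    url = room_url.rstrip("/") + "/personalization.json"
    response = http.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    data = response.json()
    return data["listing_activity_data"]["week"]["unique_views"]


# Example call (performs a real HTTP request, so it is left commented out):
# print(fetch_weekly_unique_views("https://www.airbnb.co.uk/rooms/501171"))
```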

3 Comments

Thank you so much! That was so helpful and so much easier than I thought! I just have a couple more questions: how did you know that JSON page existed? And for some reason, when I try the full link in the code that I have, I still do not get the 3365 views number :( It does work, however, when I put it in Chrome.
See my update above. I knew there was an AJAX call by looking at the ID field on the DIV and searching through all of the JavaScript for a matching controller. Once you know it is an AJAX call, all you have to do is capture all of the requests in the dev tools and look for the one that has the data you need in the response.
Perfect! Thank you for the explanation and for helping me solve the problem! I'll keep on coding! :D
0

First of all, you need to figure out whether that section of the page has any unique tags or IDs. If you look at the HTML tree, you have:

html > body > #room > ....... > #book-it-urgency-commitment > div > div > ... > div#media-body > b

The data you need is stored in a 'b' tag. I'm not sure about using lxml, but I usually use BeautifulSoup for my scraping.

You can reference http://www.crummy.com/software/BeautifulSoup/bs4/doc/; it's pretty straightforward.
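
A minimal sketch of that approach with BeautifulSoup is below. SAMPLE_HTML is a made-up fragment that only imitates the path above (the real markup will differ), and extract_view_count is a hypothetical helper name. Keep in mind that this only works on HTML that already contains the number: if the element is filled in by JavaScript after page load, the HTML returned by a plain requests.get will not include it.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup imitating the path above; not copied from the real page.
SAMPLE_HTML = """
<div id="room">
  <div id="book-it-urgency-commitment">
    <div><div id="media-body"><b>3365 people</b> viewed Peter's place this week.</div></div>
  </div>
</div>
"""


def extract_view_count(html):
    """Return the first integer inside the <b> tag, or None if it is missing."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.select_one("#book-it-urgency-commitment #media-body b")
    if tag is None:
        return None
    match = re.search(r"\d+", tag.get_text())
    return int(match.group()) if match else None


print(extract_view_count(SAMPLE_HTML))  # 3365
```

Selecting by ID rather than by the full chain of nested divs should make the lookup less sensitive to small layout changes.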

2 Comments

Thanks, will give it a try, although the other answer did it for me.
:) Yup, looking at the JSON works as well, but if you want to scrape other stuff that's not in the JSON, this would be the way.
