
I'm wondering how to crawl multiple different pages for one city (e.g. London) from one website using Beautiful Soup, without having to repeat my code over and over.

My goal is to first crawl all pages related to one city.

Here is my code:

session = requests.Session()
session.cookies.get_dict()
url = 'http://www.citydis.com'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = session.get(url, headers=headers)

soup = BeautifulSoup(response.content, "html.parser")
metaConfig = soup.find("meta", property="configuration")


jsonUrl = "https://www.citydis.com/s/results.json?&q=Paris&customerSearch=1&page=0"
response = session.get(jsonUrl, headers=headers)
js_dict = json.loads(response.content.decode('utf-8'))

for item in js_dict:
    headers = js_dict['searchResults']["tours"]
    prices = js_dict['searchResults']["tours"]

for title, price in zip(headers, prices):
    title_final = title.get("title")
    price_final = price.get("price")["original"]

    print("Header: " + title_final + " | " + "Price: " + price_final)

The output is the following:

Header: London Travelcard: 1 Tag lang unbegrenzt reisen | Price: 19,44 €
Header: 105 Minuten London bei Nacht im verdecklosen Bus | Price: 21,21 €
Header: Ivory House London: 4 Stunden mittelalterliches Bankett | Price: 58,92 €
Header: London: Themse Dinner Cruise | Price: 96,62 €

It only gives me back the results of the first page (4 results), but I would like to get all results for London (there must be over 200).

Could you give me any advice? I guess I have to count up the pages in the jsonUrl, but I don't know how to do it.

UPDATE

Thanks to the help, I was able to get one step further.

At the moment I'm only able to crawl one page (page=0), but I would like to crawl the first 10 pages. Hence, my approach is the following:

Relevant snippet from the code:

soup = bs4.BeautifulSoup(response.content, "html.parser")
metaConfig = soup.find("meta", property="configuration")

page = 0
while page <= 11:
    page += 1

    jsonUrl = "https://www.citydis.com/s/results.json?&q=Paris&customerSearch=1&page=" + str(page)
    response = session.get(jsonUrl, headers=headers)
    js_dict = json.loads(response.content.decode('utf-8'))

    for item in js_dict:
        headers = js_dict['searchResults']["tours"]
        prices = js_dict['searchResults']["tours"]

        for title, price in zip(headers, prices):
            title_final = title.get("title")
            price_final = price.get("price")["original"]

            print("Header: " + title_final + " | " + "Price: " + price_final)

I'm getting the results back for one particular page, but not for all of them. In addition, I'm getting an error message back. Is this related to why I don't get back all of the results?

Output:

Traceback (most recent call last):
File "C:/Users/Scripts/new.py", line 19, in <module>
AttributeError: 'list' object has no attribute 'update'
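
Judging from the traceback alone (the failing line itself is not shown), a plausible cause is that headers gets reassigned inside the loop to the list of tours, so a later call such as headers.update(...) operates on a list instead of on the dict of HTTP headers. A minimal sketch of a fix, keeping the tours under a separate name (tours is my own, hypothetical choice):

# 'session' and 'headers' as defined in the snippet above.
for page in range(10):  # pages 0 through 9
    jsonUrl = "https://www.citydis.com/s/results.json?&q=Paris&customerSearch=1&page=" + str(page)
    response = session.get(jsonUrl, headers=headers)
    js_dict = json.loads(response.content.decode('utf-8'))

    # A fresh name, so the HTTP 'headers' dict is never overwritten.
    tours = js_dict['searchResults']["tours"]
    for tour in tours:
        print("Header: " + tour.get("title") + " | " + "Price: " + tour.get("price")["original"])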

Thanks for the help

2 Comments
  • If you want the correct way of crawling web pages, look into XPath; it will shrink your code a lot, to maybe five lines at most for what you are doing above. It's the standard way of doing anything related to crawling and scraping (see the sketch after these comments). Commented Apr 16, 2017 at 18:24
  • Thanks for the info. I will try it out. Nevertheless, could you give me some feedback on how I can tackle the issue described above with the method I'm using? Commented Apr 16, 2017 at 20:02
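
A rough sketch of the XPath approach suggested in the first comment, using lxml. The URL and both XPath expressions are invented for illustration; the real ones depend on the site's actual markup.

import requests
from lxml import html

# Hypothetical listing page; replace with the real city URL.
page = html.fromstring(requests.get('http://www.citydis.com/london').content)

# Both expressions assume markup like <div class="tour"><h3>title</h3><span class="price">price</span></div>.
titles = page.xpath('//div[@class="tour"]/h3/text()')
prices = page.xpath('//div[@class="tour"]//span[@class="price"]/text()')

for title, price in zip(titles, prices):
    print("Header: " + title + " | " + "Price: " + price)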

1 Answer


You really should ensure that your code examples are complete (you have missing imports) and syntactically correct (your code contains indentation issues). In attempting to make a working example I came up with the following.

import requests, json, bs4

session = requests.Session()
session.cookies.get_dict()
url = 'http://www.getyourguide.de'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = session.get(url, headers=headers)

# Pull the CSRF token out of the page's <meta property="configuration"> tag;
# the JSON endpoint needs it.
soup = bs4.BeautifulSoup(response.content, "html.parser")
metaConfig = soup.find("meta", property="configuration")
metaConfigTxt = metaConfig["content"]
csrf = json.loads(metaConfigTxt)["pageToken"]


jsonUrl = "https://www.getyourguide.de/s/results.json?&q=London&customerSearch=1&page=0"
headers.update({'X-Csrf-Token': csrf})
response = session.get(jsonUrl, headers=headers)
js_dict = json.loads(response.content.decode('utf-8'))
print(js_dict.keys())

for item in js_dict:
    headers = js_dict['searchResults']["tours"]
    prices = js_dict['searchResults']["tours"]

    for title, price in zip(headers, prices):
        title_final = title.get("title")
        price_final = price.get("price")["original"]

        print("Header: " + title_final + " | " + "Price: " + price_final)

This gives me way more than four results.

In general you will find that many sites returning JSON will page their replies, offering a fixed number of results per page. In those cases each page but the last will typically contain a key whose value gives you the URL for the next page. It's a simple matter to loop over the pages, and break out of the loop when you detect the absence of that key.
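
A minimal sketch of that pattern, assuming the JSON carries a next-page link under a key like nextPage (the key name is an assumption here; the real one depends on the site):

import requests, json

session = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0'}  # plus the X-Csrf-Token header, as in the example above

jsonUrl = "https://www.getyourguide.de/s/results.json?&q=London&customerSearch=1&page=0"
while jsonUrl:
    response = session.get(jsonUrl, headers=headers)
    js_dict = json.loads(response.content.decode('utf-8'))

    for tour in js_dict['searchResults']["tours"]:
        print("Header: " + tour.get("title") + " | " + "Price: " + tour.get("price")["original"])

    # The last page should lack the next-page key, which ends the loop.
    jsonUrl = js_dict['searchResults'].get('nextPage')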


2 Comments

Thank you very much. I will consider your feedback. At the moment I'm only able to crawl one page (page=0), but I would like to crawl the first 10 pages. I have posted my approach in my initial post. I hope you can guide me to the correct solution. And thanks for your patience :)
A pleasure. I think any further progress will depend on the specifics of the web site, and therefore might fall outside Stack Overflow's scope.
