I'm wondering how to crawl multiple pages for one city (e.g. London) from one website using Beautiful Soup, without having to repeat my code over and over. My goal is ideally to first crawl all pages related to one city.

Here is my code:
import json
import requests
from bs4 import BeautifulSoup

session = requests.Session()
session.cookies.get_dict()

url = 'http://www.citydis.com'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

response = session.get(url, headers=headers)
soup = BeautifulSoup(response.content, "html.parser")
metaConfig = soup.find("meta", property="configuration")

jsonUrl = "https://www.citydis.com/s/results.json?&q=Paris&customerSearch=1&page=0"
response = session.get(jsonUrl, headers=headers)
js_dict = json.loads(response.content.decode('utf-8'))

for item in js_dict:
    headers = js_dict['searchResults']["tours"]
    prices = js_dict['searchResults']["tours"]
    for title, price in zip(headers, prices):
        title_final = title.get("title")
        price_final = price.get("price")["original"]
        print("Header: " + title_final + " | " + "Price: " + price_final)
The output is the following:
Header: London Travelcard: 1 Tag lang unbegrenzt reisen | Price: 19,44 €
Header: 105 Minuten London bei Nacht im verdecklosen Bus | Price: 21,21 €
Header: Ivory House London: 4 Stunden mittelalterliches Bankett | Price: 58,92 €
Header: London: Themse Dinner Cruise | Price: 96,62 €
This gives me back only the results of the first page (4 results), but I would like to get all results for London (there must be over 200).

Could you give me any advice? I guess I have to count up the page parameter in the jsonUrl, but I don't know how to do it.
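To illustrate what I mean, here is a rough, untested sketch of what I imagine, assuming the page query parameter simply counts up from 0 (10 pages is an arbitrary number here):

import requests

session = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0'}  # in practice the same UA string as above

# Untested guess: build one results.json URL per page by counting up
# the page parameter; each URL would then be fetched and parsed as above.
base = "https://www.citydis.com/s/results.json?&q=Paris&customerSearch=1&page="
for page in range(10):
    jsonUrl = base + str(page)
    response = session.get(jsonUrl, headers=headers)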
UPDATE
Thanks to the help, I'm able to get one step further. In this case I'm only able to crawl one page (page=0), but I would like to crawl the first 10 pages. Hence, my approach would be the following.

Relevant snippet from the code:
soup = bs4.BeautifulSoup(response.content, "html.parser")
metaConfig = soup.find("meta", property="configuration")

page = 0
while page <= 11:
    page += 1
    jsonUrl = "https://www.citydis.com/s/results.json?&q=Paris&customerSearch=1&page=" + str(page)
    response = session.get(jsonUrl, headers=headers)
    js_dict = json.loads(response.content.decode('utf-8'))

    for item in js_dict:
        headers = js_dict['searchResults']["tours"]
        prices = js_dict['searchResults']["tours"]
        for title, price in zip(headers, prices):
            title_final = title.get("title")
            price_final = price.get("price")["original"]
            print("Header: " + title_final + " | " + "Price: " + price_final)
I'm getting the results back for one particular page, but not for all of them. In addition, I'm getting an error message. Is this related to why I don't get back all of the results?
Output:
Traceback (most recent call last):
File "C:/Users/Scripts/new.py", line 19, in <module>
AttributeError: 'list' object has no attribute 'update'
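For reference, this is the shape of the loop I am trying to arrive at: a rough, untested sketch. I renamed the inner variable from headers to tours, since my guess is that reassigning headers inside the loop hands a list to the next session.get call instead of the HTTP headers dict:

import json
import requests

session = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0'}  # in practice the same UA string as above

for page in range(10):  # first 10 pages: page=0 .. page=9
    jsonUrl = "https://www.citydis.com/s/results.json?&q=Paris&customerSearch=1&page=" + str(page)
    response = session.get(jsonUrl, headers=headers)
    js_dict = json.loads(response.content.decode('utf-8'))

    # Guess: use a name other than "headers" here so the HTTP headers dict
    # passed to session.get above is not overwritten with a list.
    tours = js_dict['searchResults']["tours"]
    for tour in tours:
        print("Header: " + tour.get("title") + " | " + "Price: " + tour.get("price")["original"])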
Thanks for the help
Use XPath. It will make your code a lot shorter, maybe at most 5 lines for what you are doing above. It's the standard way of doing anything related to crawling and scraping.
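For example, something along these lines with lxml. This is only a rough sketch: the actual citydis.com markup isn't shown in the question, so the XPath expressions below are placeholders.

import requests
from lxml import html

headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get("http://www.citydis.com", headers=headers)
tree = html.fromstring(response.content)

# Placeholder XPath expressions: the real tags/classes depend on the
# actual page markup, which isn't shown in the question.
titles = tree.xpath('//div[@class="tour"]//h2/text()')
prices = tree.xpath('//div[@class="tour"]//span[@class="price"]/text()')

for title, price in zip(titles, prices):
    print("Header: " + title + " | " + "Price: " + price)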