I'm trying to open this website using Python's urllib and BeautifulSoup, but I keep getting an HTTP 403 error. Can someone help me with this error?

My current code is this:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'https://www.cubesmart.com/florida-self-storage/st--petersburg-self-storage/3337.html?utm_source=local&utm_medium=organic&utm_campaign=googlemybusiness&utm_term=3337'

uClient = uReq(my_url)

but I get the 403 error.

I searched around and tried the approach below, but it gives me the same error.

from urllib.request import Request, urlopen

url = "https://www.cubesmart.com/florida-self-storage/st--petersburg-self-storage/3337.html?utm_source=local&utm_medium=organic&utm_campaign=googlemybusiness&utm_term=3337"
# Send a browser-like User-Agent so the server doesn't reject the request outright
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})

web_byte = urlopen(req).read()

webpage = web_byte.decode('utf-8')

Any help is appreciated.

4 Comments
  • Sounds weird; it appears you have to provide some authentication, because 403 means the server is refusing the request: [description][1]. However, those links shouldn't need any! [1]: en.wikipedia.org/wiki/HTTP_403 Commented Jan 3, 2018 at 17:03
  • Any reason you aren't using the requests library, OP? Commented Jan 3, 2018 at 17:14
  • @Petar - no reason. I'm still a beginner with Python and not familiar with the requests library. Could you guide me? Commented Jan 3, 2018 at 18:18
  • The requests module is not installed by default in Python 3.8, which for me is a cause of confusion with urllib2 and urllib3, and basic pointers to disambiguate them are scarce. I was unable to get much joy using the requests module; it's not returning a session. How does it get a session object? Commented Jan 21, 2020 at 16:09

1 Answer


Try using session() from requests, as shown below:

import requests

# Reuse a single session so cookies persist between requests
my_session = requests.session()
# Visit the home page first to pick up the site's cookies
for_cookies = my_session.get("https://www.cubesmart.com")
cookies = for_cookies.cookies
# A browser-like User-Agent keeps the server from returning 403
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0'}
my_url = 'https://www.cubesmart.com/florida-self-storage/st--petersburg-self-storage/3337.html?utm_source=local&utm_medium=organic&utm_campaign=googlemybusiness&utm_term=3337'

response = my_session.get(my_url, headers=headers, cookies=cookies)
print(response.status_code)  # 200
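
Once the request succeeds, you can hand response.text to BeautifulSoup (which the question already imports) to extract data and write it to a CSV file. A minimal sketch, assuming hypothetical class names ('unit-row', 'unit-size', 'unit-price') that you would replace after inspecting the real page's markup:

import csv

from bs4 import BeautifulSoup

page = BeautifulSoup(response.text, 'html.parser')

# 'unit-row', 'unit-size' and 'unit-price' are made-up class names;
# replace them with the selectors you find in the page's actual HTML.
rows = []
for unit in page.find_all('div', class_='unit-row'):
    size = unit.find(class_='unit-size')
    price = unit.find(class_='unit-price')
    if size and price:
        rows.append([size.get_text(strip=True), price.get_text(strip=True)])

with open('units.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['size', 'price'])
    writer.writerows(rows)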

5 Comments

@Andersson - thank you, this is able to open the URL, but where do I go from here? I need to write a loop to grab certain parts of the website. How do I print out the HTML sections containing the data I need, so I can capture it and write it to a CSV file?
I prefer to use lxml.html for web scraping. You can import lxml.html, get the page's HTML with source = lxml.html.fromstring(response.content), and then, for example, get all link text nodes with print([link.text for link in source.xpath("//a[text()]")]) (see the sketch after these comments).
@Andersson - never came across lxml.html. I'll have to look through it and see how it goes. I'm assuming it has functionality for grabbing a section of the page, looping through the specific criteria I'm looking for, then moving to the next section, and so on. For example: grab the Medium 10x10 unit description, price, old price, etc., then move down to the 9x10 and repeat for all medium units, then move on to large.
@D-Ru, it allows you to scrape everything you want to get. Of course, you should be familiar with XPath 1.0 or CSS selector syntax to be able to easily locate the required nodes...
@Andersson - this might be a bit over my head, but I'll read about it. Thanks
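
Following up on the lxml.html suggestion in the comments, here is a minimal sketch building on the response object from the answer. The //div[@class='unit'] XPath is a hypothetical placeholder; inspect the real page's markup for the actual structure:

import lxml.html

# Parse the raw bytes of the earlier requests response
source = lxml.html.fromstring(response.content)

# Print the text of every link on the page, as suggested in the comments
print([link.text for link in source.xpath("//a[text()]")])

# Hypothetical example: the class name 'unit' is a placeholder;
# build your XPath expressions from the page's real markup.
for unit in source.xpath("//div[@class='unit']"):
    print(unit.text_content().strip())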
