I'm trying to open this website using Python's urllib and BeautifulSoup, but I keep getting an HTTP 403 error. Can someone help me with this error?

My current code is this:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'https://www.cubesmart.com/florida-self-storage/st--petersburg-self-storage/3337.html?utm_source=local&utm_medium=organic&utm_campaign=googlemybusiness&utm_term=3337'

uClient = uReq(my_url)

but I get the 403 error.

I searched around and tried the approach below, but it gives me the same error.

from urllib.request import Request, urlopen

url = "https://www.cubesmart.com/florida-self-storage/st--petersburg-self-storage/3337.html?utm_source=local&utm_medium=organic&utm_campaign=googlemybusiness&utm_term=3337"
# Send a browser-like User-Agent so the server doesn't reject the request outright
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})

web_byte = urlopen(req).read()

webpage = web_byte.decode('utf-8')

Any help is appreciated.

4 Comments
  • Sounds weird; it appears you have to provide some authentication, because 403 means the server is refusing the request: [description][1]. However, those links shouldn't need any! [1]: en.wikipedia.org/wiki/HTTP_403 Commented Jan 3, 2018 at 17:03
  • Any reason you aren't using the requests library, OP? Commented Jan 3, 2018 at 17:14
  • @Petar - no reason. I'm still a beginner with Python and not familiar with the requests library. Could you guide me? Commented Jan 3, 2018 at 18:18
  • The requests module is not installed by default in Python 3.8, which for me is a cause of confusion with urllib2 and urllib3, and basic pointers to disambiguate them are scarce. I was unable to get much joy using the requests module; it's not returning a session. How does it get a session object? Commented Jan 21, 2020 at 16:09

1 Answer


Try using session() from requests, as shown below:

import requests

# Reuse a single session so cookies persist between requests
my_session = requests.session()
# Visit the home page first to pick up the site's cookies
for_cookies = my_session.get("https://www.cubesmart.com")
cookies = for_cookies.cookies
# A browser-like User-Agent keeps the server from returning 403
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0'}
my_url = 'https://www.cubesmart.com/florida-self-storage/st--petersburg-self-storage/3337.html?utm_source=local&utm_medium=organic&utm_campaign=googlemybusiness&utm_term=3337'

response = my_session.get(my_url, headers=headers, cookies=cookies)
print(response.status_code)  # 200
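
Once the request succeeds, you can hand response.text to BeautifulSoup (which the question already imports) to extract data and write it to a CSV file. A minimal sketch, assuming hypothetical class names ('unit-row', 'unit-size', 'unit-price') that you would replace after inspecting the real page's markup:

import csv

from bs4 import BeautifulSoup

page = BeautifulSoup(response.text, 'html.parser')

# 'unit-row', 'unit-size' and 'unit-price' are made-up class names;
# replace them with the selectors you find in the page's actual HTML.
rows = []
for unit in page.find_all('div', class_='unit-row'):
    size = unit.find(class_='unit-size')
    price = unit.find(class_='unit-price')
    if size and price:
        rows.append([size.get_text(strip=True), price.get_text(strip=True)])

with open('units.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['size', 'price'])
    writer.writerows(rows)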

5 Comments

@Andersson - thank you, this is able to open the URL, but where do I go from here? I need to write a loop to grab certain parts of the website. How do I print out the HTML sections containing the data I need, so I can capture it and write it to a CSV file?
I prefer to use lxml.html for web scraping. You can import lxml.html, get the page's HTML with source = lxml.html.fromstring(response.content), and then, for example, get all link text nodes with print([link.text for link in source.xpath("//a[text()]")]) (see the sketch after these comments).
@Andersson - never came across lxml.html. I'll have to look through it and see how it goes. I'm assuming it has functionality for grabbing a section of the page, looping through the specific criteria I'm looking for, then moving to the next section, and so on. For example: grab the Medium 10x10 unit description, price, old price, etc., then move down to the 9x10 and repeat for all medium units, then move on to large.
@D-Ru, it allows you to scrape everything you want to get. Of course, you should be familiar with XPath 1.0 or CSS selector syntax to be able to easily locate the required nodes...
@Andersson - this might be a bit over my head, but I'll read about it. Thanks
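
Following up on the lxml.html suggestion in the comments, here is a minimal sketch building on the response object from the answer. The //div[@class='unit'] XPath is a hypothetical placeholder; inspect the real page's markup for the actual structure:

import lxml.html

# Parse the raw bytes of the earlier requests response
source = lxml.html.fromstring(response.content)

# Print the text of every link on the page, as suggested in the comments
print([link.text for link in source.xpath("//a[text()]")])

# Hypothetical example: the class name 'unit' is a placeholder;
# build your XPath expressions from the page's real markup.
for unit in source.xpath("//div[@class='unit']"):
    print(unit.text_content().strip())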
