
I am trying to read all the comments for a given product. This is both to learn Python and for a project; to simplify my task, I chose a product at random to code against.

The link I want to read is an Amazon product page, and I used urllib to open it:

import urllib.request

amazon = urllib.request.urlopen('http://www.amazon.in/United-Colors-Benetton-Flip-Flops-Slippers/dp/B014CZA8P0/ref=pd_rhf_se_s_qp_1?_encoding=UTF8&pd_rd_i=B014CZA8P0&pd_rd_r=04RP223D4SF9BW7S2NP1&pd_rd_w=ZgGL6&pd_rd_wg=0PSZe&refRID=04RP223D4SF9BW7S2NP1')

After reading the link into the amazon variable, displaying it gives the message below:

print(amazon)
<http.client.HTTPResponse object at 0x000000DDB3796A20>

So I read online and found I need to call read() to get the source, but sometimes it gives me something that looks like a web page and other times not:

print(amazon.read())
b''

How do I read the page and pass it to Beautiful Soup?

Edit 1

I did use requests.get, and when I check the text of the retrieved page, I find the content below, which does not match the website link:

print(a2)
<html>
<head>
<title>503 - Service Unavailable Error</title>
</head>
<body bgcolor="#FFFFFF" text="#000000">

<!--
        To discuss automated access to Amazon data please contact [email protected].
        For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.in/ref=rm_5_sv, or our Product Advertising API at https://affiliate-program.amazon.in/gp/advertising/api/detail/main.html/ref=rm_5_ac for advertising use cases.
-->

<center>
<a href="http://www.amazon.in/ref=cs_503_logo/">
<img src="https://images-eu.ssl-images-amazon.com/images/G/31/x-locale/communities/people/logo.gif" width=200 height=45 alt="Amazon.in" border=0></a>
<p align=center>
<font face="Verdana,Arial,Helvetica">
<font size="+2" color="#CC6600"><b>Oops!</b></font><br>
<b>It's rush hour and traffic is piling up on that page. Please try again in a short while.<br>If you were trying to place an order, it will not have been processed at this time.</b><p>

<img src="https://images-eu.ssl-images-amazon.com/images/G/02/x-locale/common/orange-arrow.gif" width=10 height=9 border=0 alt="*">
<b><a href="http://www.amazon.in/ref=cs_503_link/">Go to the Amazon.in home page to continue shopping</a></b>
</font>

</center>
</body>
</html>
  • You're getting that error because you most likely need to pass extra headers with your request. Look up how to set headers with urllib. You need to look like a person in a browser by passing a User-Agent and other attributes. Commented Oct 28, 2016 at 19:23
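As a minimal, untested sketch of that suggestion (the header values are illustrative, and the URL is shortened to the /dp/ path of the same product):

import urllib.request

url = 'http://www.amazon.in/dp/B014CZA8P0/'  # same product, shortened URL

# Build a Request object so browser-like headers travel with the request.
req = urllib.request.Request(url, headers={
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
})

with urllib.request.urlopen(req) as resp:
    html = resp.read()  # bytes; read once and keep the result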

4 Answers


With your current library, urllib, this is what you could do: call .read() to get the HTML, then pass it into BeautifulSoup as shown below. Note that .read() consumes the response stream, so a second call on the same response returns b''; read once and keep the result. Also keep in mind Amazon is a heavily anti-scraping website. The inconsistent results you see may be because the content is rendered by JavaScript, in which case you might have to use Selenium or Dryscrape. You may also need to pass headers, cookies, and other attributes with your request.

import urllib.request
from bs4 import BeautifulSoup

amazon = urllib.request.urlopen('http://www.amazon.in/United-Colors-Benetton-Flip-Flops-Slippers/dp/B014CZA8P0/ref=pd_rhf_se_s_qp_1?_encoding=UTF8&pd_rd_i=B014CZA8P0&pd_rd_r=04RP223D4SF9BW7S2NP1&pd_rd_w=ZgGL6&pd_rd_wg=0PSZe&refRID=04RP223D4SF9BW7S2NP1')
html = amazon.read()
soup = BeautifulSoup(html, 'html.parser')

EDIT ---- Turns out you're using requests now. I could get a 200 response using requests by passing in my headers like this:

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'
}
response = requests.get('http://www.amazon.in/United-Colors-Benetton-Flip-Flops-Slippers/dp/B014CZA8P0/ref=pd_rhf_se_s_qp_1?_encoding=UTF8&pd_rd_i=B014CZA8P0&pd_rd_r=04RP223D4SF9BW7S2NP1&pd_rd_w=ZgGL6&pd_rd_wg=0PSZe&refRID=04RP223D4SF9BW7S2NP1', headers=headers)
print(response.status_code)  # should print 200
soup = BeautifulSoup(response.text, 'html.parser')

--- Using Dryscrape

import dryscrape
from bs4 import BeautifulSoup

sess = dryscrape.Session(base_url='http://www.amazon.in')
# Set the header before visiting so the request goes out with it.
sess.set_header('user-agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36')
sess.visit('http://www.amazon.in/United-Colors-Benetton-Flip-Flops-Slippers/dp/B014CZA8P0/ref=pd_rhf_se_s_qp_1?_encoding=UTF8&pd_rd_i=B014CZA8P0&pd_rd_r=04RP223D4SF9BW7S2NP1&pd_rd_w=ZgGL6&pd_rd_wg=0PSZe&refRID=04RP223D4SF9BW7S2NP1')
html = sess.body()
soup = BeautifulSoup(html, 'html.parser')
print(soup)

This should give you all of Amazon's HTML now. Keep in mind I haven't tested this code. Please refer to the dryscrape documentation for installation: https://dryscrape.readthedocs.io/en/latest/apidoc.html
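Since Selenium was mentioned above as the other option for JavaScript-heavy pages, here is a minimal, untested sketch of the same fetch with it (it assumes chromedriver is installed and on your PATH; the URL is shortened to the /dp/ path):

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()  # assumes chromedriver is on PATH
driver.get('http://www.amazon.in/dp/B014CZA8P0/')
html = driver.page_source  # the HTML after JavaScript has run
driver.quit()

soup = BeautifulSoup(html, 'html.parser')
print(soup.title)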

7 Comments

I did use your requests code, but was not able to retrieve anything; this might be because, as you say, Amazon is an anti-scraping site. Were you able to run the code?
Hey. I'll update the thread to make it use python-requests. This should work!
Check out the Edited version!
@lollerskates That is indeed very true. That is why I mentioned Selenium/Dryscrape at the starting of this thread.
Oops, my bad. I might have replied to the wrong comment.

I personally would use the requests library for this and not urllib; requests has more features.

import requests

From there something like:

resp = requests.get(url)  # You can break up your parameters and pass base_url & params to this as well if you have multiple products to deal with
soup = BeautifulSoup(resp.text, 'html.parser')

That should do the job, as this is a rather simple HTTP request.

Edit: Based on your error, you are going to have to research which parameters to pass to make your request look correct. In general, with requests it will look something like this (obviously with the values you discover; check your browser's debug/developer tools to inspect your network traffic and see what you are sending to Amazon when using a browser):

url = "https://www.base.url.here"
params = {
    'param1': 'value1',
    # ...
}
resp = requests.get(url, params=params)
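For instance, here is a hedged sketch combining that pattern with a browser-like User-Agent; the parameter name and value are placeholders, not real Amazon ones:

import requests
from bs4 import BeautifulSoup

url = 'https://www.amazon.in/dp/B014CZA8P0'  # example product page
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
params = {'param1': 'value1'}  # placeholder; use values from your browser's network tab

resp = requests.get(url, params=params, headers=headers)
resp.raise_for_status()  # fail early on a 4xx/5xx response such as the 503 above
soup = BeautifulSoup(resp.text, 'html.parser')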

Comments


For web scraping, use the requests and BeautifulSoup modules in Python 3.

Installing BeautifulSoup:

pip install beautifulsoup4

Use appropriate headers while sending the request; the browser-like headers used here are shown in the script below.


Scrap.py

from bs4 import BeautifulSoup
import requests

url = "https://www.amazon.in/s/ref=mega_elec_s23_2_3_1_1?rh=i%3Acomputers%2Cn%3A3011505031&ie=UTF8&bbn=976392031"

headers = {
    'authority': 'www.amazon.com',
    'pragma': 'no-cache',
    'cache-control': 'no-cache',
    'dnt': '1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'sec-fetch-site': 'none',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-dest': 'document',
    'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
}

response = requests.get(url, headers=headers)

with open("webpg.html", "w", encoding="utf-8") as file:  # save the HTML file to disk
    file.write(response.text)

bs = BeautifulSoup(response.text, "html.parser")
print(bs)  # display the HTML; use bs.prettify() for a more readable document
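Once the page is parsed, pulling out the review text the original question was after might look like the sketch below. The data-hook attribute values are assumptions about Amazon's markup, which changes often, so inspect the live page first:

# Hypothetical selectors: verify them against the live page before relying on them.
for review in bs.find_all('div', attrs={'data-hook': 'review'}):
    body = review.find('span', attrs={'data-hook': 'review-body'})
    if body is not None:
        print(body.get_text(strip=True))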

Comments


You can't scrape Amazon directly anymore; their scraper detection is strong, and they no longer allow scraping behind the login. You have to use an API that does it legally, or scrape other websites that provide the same information as Amazon and don't mind you scraping their data directly with bs4 and requests. I've built an app for that in Python and will post it to GitHub soon, once I get my PhD this month (I hope). I named my app "Amazon Book Scraper"; you will find it on GitHub in June 2024.

Comments
