web-scraping with python 3.6 and beautifulsoup - getting Invalid URL

Question

I want to work with this page in Python: http://www.sothebys.com/en/search-results.html?keyword=degas%27

This is my code:

from bs4 import BeautifulSoup
import requests

page = requests.get('http://www.sothebys.com/en/search-results.html?keyword=degas%27')

soup = BeautifulSoup(page.content, "lxml")
print(soup)

I'm getting the following output:

<html><head>
<title>Invalid URL</title>
</head><body>
<h1>Invalid URL</h1>
The requested URL "[no URL]", is invalid.<p>
Reference #9.8f4f1502.1494363829.5fae0e0e
</p></body></html>

I can open the page in my browser on the same machine without any error message. When I use the same code with another URL, the correct HTML content is fetched:

from bs4 import BeautifulSoup
import requests

page = requests.get('http://www.christies.com/lotfinder/searchresults.aspx?&searchtype=p&action=search&searchFrom=header&lid=1&entry=degas')

soup = BeautifulSoup(page.content, "lxml")
print(soup)

I also tested other URLs (reddit, google, ecommerce sites) and didn't encounter any issues. So the same code works with one URL but not with the other. Where is the problem?

Use soup = BeautifulSoup(page.text, "lxml") in place of what you have. BeautifulSoup expects a string. page.content gives a byte array.
– Bill Bell, Commented May 10, 2017 at 3:36
Same effect; it didn't change anything. It works with one URL but not with the other.
– Zin Yosrim, Commented May 10, 2017 at 5:02
Interesting. I've found that the "Invalid URL" response happens when I query the page from a US-based IP address. When I did it from a different one, I got the desired page source.
– alecxe, Commented May 10, 2017 at 10:14
Think yourself lucky. My BeautifulSoup couldn't even cope with the Christie's page, no matter whether I used lxml or html5lib. (I can't believe it.)
– Bill Bell, Commented May 10, 2017 at 14:21

Roshni Amber · Accepted Answer · 2018-03-20 14:42:51Z

3

Change your code to:

soup = BeautifulSoup(page.text, "lxml")

page.content is a byte array, so if you use it you would need to convert it to a string yourself. page.text already gives you the decoded string, so you should go with page.text.
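
To make the difference concrete, here is a minimal sketch that uses plain Python values in place of a real response. The sample markup and the UTF-8 encoding are assumptions for illustration only; with requests, page.text is roughly page.content decoded with the encoding that requests detected.

```python
# Stand-in for page.content: the raw bytes of a response body.
# The markup and the UTF-8 encoding are made up for illustration.
content = "<title>Café Degas</title>".encode("utf-8")

# Stand-in for page.text: the same body decoded to a str.
text = content.decode("utf-8")

print(type(content).__name__)  # bytes
print(type(text).__name__)     # str
```

BeautifulSoup accepts either form as input, which may be why the comment above reports that switching to page.text made no difference for this particular URL.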

answered Mar 20, 2018 at 14:42


MD. Khairul Basar · Accepted Answer · 2017-06-24 09:56:53Z

2

This website blocks requests that do not appear to come from a browser, which is why you get the Invalid URL error. Adding a custom User-Agent header to the request works fine.

import requests
from bs4 import BeautifulSoup

# Send a browser-like User-Agent; by default requests identifies itself as python-requests
ua = {"User-Agent": "Mozilla/5.0"}
url = "http://www.sothebys.com/en/search-results.html?keyword=degas%27"
page = requests.get(url, headers=ua)
soup = BeautifulSoup(page.text, "lxml")
print(soup)
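
Because a blocked request comes back as an error page rather than the search results, it can help to check for that page before parsing any further. The following is a minimal sketch, assuming the error page always carries the title "Invalid URL" as in the question. TitleParser and looks_blocked are hypothetical names, and the sketch uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is recorded.
        if tag == "title" and self.title is None:
            self._in_title = True
            self.title = ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def looks_blocked(html):
    """Return True if the page title matches the error page from the question."""
    parser = TitleParser()
    parser.feed(html)
    return (parser.title or "").strip() == "Invalid URL"


# The error page quoted in the question.
blocked_page = """<html><head>
<title>Invalid URL</title>
</head><body>
<h1>Invalid URL</h1>
The requested URL "[no URL]", is invalid.<p>
Reference #9.8f4f1502.1494363829.5fae0e0e
</p></body></html>"""

print(looks_blocked(blocked_page))                      # True
print(looks_blocked("<html><title>ok</title></html>"))  # False
```

In the script above, you would call looks_blocked(page.text) right after requests.get and stop, or retry with different headers, when it returns True. With BeautifulSoup already loaded, checking soup.title serves the same purpose.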

edited Jun 24, 2017 at 9:56

answered Jun 24, 2017 at 9:50
