3

I want to work with this page in Python: http://www.sothebys.com/en/search-results.html?keyword=degas%27

This is my code:

from bs4 import BeautifulSoup
import requests

page = requests.get('http://www.sothebys.com/en/search-results.html?keyword=degas%27')

soup = BeautifulSoup(page.content, "lxml")
print(soup)

I'm getting the following output:

<html><head>
<title>Invalid URL</title>
</head><body>
<h1>Invalid URL</h1>
The requested URL "[no URL]", is invalid.<p>
Reference #9.8f4f1502.1494363829.5fae0e0e
</p></body></html>

I can open the page with my browser from the same machine and don't get any error message. When I use the same code with another URL the correct HTML content is fetched:

from bs4 import BeautifulSoup
import requests

page = requests.get('http://www.christies.com/lotfinder/searchresults.aspx?&searchtype=p&action=search&searchFrom=header&lid=1&entry=degas')

soup = BeautifulSoup(page.content, "lxml")
print(soup)

I also tested other URLs (reddit, google, ecommerce sites) and didn't encounter any issues. So the same code works with one URL but not with the other. What is the problem?

7
  • Use soup = BeautifulSoup(page.text, "lxml") in place of what you have. BeautifulSoup expects a string. page.content gives a byte array. Commented May 10, 2017 at 3:36
  • Same effect; it didn't change anything. It works with one URL but not with the other. Commented May 10, 2017 at 5:02
  • Interesting. I've found that the "Invalid URL" response happens when I query the page from a US-based IP address. When I queried it from a different one, I got the desired page source. Commented May 10, 2017 at 10:14
  • I queried from Germany Commented May 10, 2017 at 11:12
  • Think yourself lucky. My BeautifulSoup couldn't even cope with the Christie's page, no matter whether I used lxml or html5lib. (I can't believe it.) Commented May 10, 2017 at 14:21

2 Answers

3

Change your code to:

soup = BeautifulSoup(page.text, "lxml")

page.content returns the raw bytes of the response, while page.text returns the body decoded to a string. Converting the bytes to a string yourself would also work, but page.text is the simpler choice.
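
As a quick offline check of what this switch changes, here is a minimal sketch that parses the same markup once as bytes and once as a decoded string. The markup is a made-up stand-in for a response body, and "html.parser" is used instead of "lxml" only so the sketch needs no extra dependency; both are assumptions, not part of the original code.

```python
from bs4 import BeautifulSoup

# Stand-in for page.content (bytes); the markup is invented for this sketch.
raw_bytes = b"<html><head><title>Invalid URL</title></head><body></body></html>"

# Roughly what page.text would hold for a UTF-8 response.
decoded = raw_bytes.decode("utf-8")

# Parse both forms and compare the extracted title.
title_from_bytes = BeautifulSoup(raw_bytes, "html.parser").title.string
title_from_text = BeautifulSoup(decoded, "html.parser").title.string

print(title_from_bytes == title_from_text)
```

For plain ASCII markup like this, both inputs yield the same tree, so the choice mostly affects how the encoding gets determined rather than which page the server sends back.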

2

This website blocks requests that do not appear to come from a browser, which is why you get the Invalid URL error. Adding a custom User-Agent header to the request works fine.

import requests
from bs4 import BeautifulSoup

ua = {"User-Agent":"Mozilla/5.0"}
url = "http://www.sothebys.com/en/search-results.html?keyword=degas%27"
page = requests.get(url, headers=ua)
soup = BeautifulSoup(page.text, "lxml")
print(soup)
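
To see what the header change amounts to, the sketch below inspects the User-Agent that requests would send, without contacting the site. By default requests identifies itself as python-requests/<version>; that the server rejects requests on the basis of this header is the claim made above, not something this sketch verifies. The helper name outgoing_user_agent is hypothetical.

```python
import requests

url = "http://www.sothebys.com/en/search-results.html?keyword=degas%27"


def outgoing_user_agent(headers=None):
    """Return the User-Agent requests would send for `url` (no network I/O)."""
    session = requests.Session()
    # prepare_request merges the session's default headers with ours.
    prepared = session.prepare_request(
        requests.Request("GET", url, headers=headers)
    )
    return prepared.headers["User-Agent"]


default_ua = outgoing_user_agent()
custom_ua = outgoing_user_agent({"User-Agent": "Mozilla/5.0"})

print(default_ua)  # the library's own identifier, e.g. python-requests/<version>
print(custom_ua)   # the browser-like value supplied by the caller
```

If the same header is needed for many requests, setting it once with session.headers.update(...) on a requests.Session avoids passing headers= on every call.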
