
I am trying to write code that fetches the HTML of a website the user enters. I am required to do this without using urllib or similar libraries.

from socket import *


url = (input("Please enter url: "))
host=gethostbyname(url)

clientSocket = socket(AF_INET, SOCK_STREAM)
clientSocket.connect((host,80))

clientSocket.send(("GET " + host + "HTTP/1.1\n\n").encode("UTF-8"))

file = clientSocket.recv(1024)
print("The html code: ", file.decode("UTF-8"))
clientSocket.close()

The code runs without errors. However, when I enter a website such as "www.stackoverflow.com", I get a "Bad Request" response from the host:

The html code:  HTTP/1.1 400 Bad Request
Date: Wed, 23 Mar 2016 16:14:27 GMT
Content-Type: text/html
Content-Length: 177
Connection: close
Server: -nginx
CF-RAY: -

<html>
<head><title>400 Bad Request</title></head>
<body bgcolor="white">
<center><h1>400 Bad Request</h1></center>
<hr><center>cloudflare-nginx</center>
</body>
</html>

What would be the correct request to get the actual HTML from the server? Thank you.

2 Answers

1

A hostname is not a URL. Your script appears to be prompting for only a hostname, since you're using gethostbyname(). The GET request expects a URI (a path on the server) as its first argument, not the resolved IP address. Each line must also end with a carriage return followed by a line feed (\r\n), and the request is terminated by an empty line, so you need two \r\n sequences in a row at the end. You should send something like:

clientSocket.send(("GET / HTTP/1.1\r\n\r\n").encode("UTF-8"))

Also, if all you want to do is download a URL, use a library such as urllib.request (urllib2 in Python 2), which takes care of the HTTP protocol details for you. For example:

import urllib.request

r = urllib.request.urlopen('http://google.com/')
print(r.read())

0

You're not speaking HTTP/1.1, yet you're stating so on the first line.

First of all, the token following GET must be an absolute path on the server, so it must start with /.

Second, an HTTP/1.1 request must include the Host: header.

And third, your simple client should probably send Connection: close, since it does not handle persistent (keep-alive) connections.
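The three points above can be folded into one small helper. This is only a minimal sketch: build_request is a hypothetical name, not something from the standard library, and it assumes a plain GET with no extra headers.

```python
def build_request(host, path="/"):
    """Return the bytes of a minimal HTTP/1.1 GET request (hypothetical helper)."""
    if not path.startswith("/"):
        path = "/" + path  # the request target must be an absolute path
    lines = [
        "GET " + path + " HTTP/1.1",  # request line
        "Host: " + host,              # mandatory in HTTP/1.1
        "Connection: close",          # ask the server to close when it is done
        "",                           # the two empty strings produce the
        "",                           # blank line that ends the headers
    ]
    # Every line is terminated by CRLF, as HTTP requires.
    return "\r\n".join(lines).encode("ascii")
```

Calling build_request("stackoverflow.com") should give the same bytes as the hand-written request in the script.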


You might have better success with the following script:

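from socket import *

# Resolve the hostname and open a TCP connection to port 80 (plain HTTP).
host = gethostbyname('stackoverflow.com')
clientSocket = socket(AF_INET, SOCK_STREAM)
clientSocket.connect((host, 80))

# Request line, mandatory Host header, then an empty line to end the request.
clientSocket.sendall((
    "GET / HTTP/1.1\r\n"
    "Host: stackoverflow.com\r\n"
    "Connection: close\r\n\r\n").encode('utf-8'))

# recv(1024) returns at most 1024 bytes, so only the start of the response is printed.
file = clientSocket.recv(1024)
print("The html code: ", file.decode("UTF-8"))
clientSocket.close()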
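Note that a single recv(1024) call returns at most 1024 bytes, so a longer page will be cut off. Because the request says Connection: close, one way to get everything is to keep reading until recv returns an empty bytes object. The following is a rough sketch; recv_all and split_response are hypothetical helper names, and it does not decode a chunked body if the server decides to send one.

```python
import socket


def recv_all(sock, bufsize=4096):
    """Read from sock until the peer closes the connection (hypothetical helper)."""
    chunks = []
    while True:
        chunk = sock.recv(bufsize)
        if not chunk:  # b"" means the other side closed the connection
            break
        chunks.append(chunk)
    return b"".join(chunks)


def split_response(raw):
    """Split a raw HTTP response into (header text, body bytes)."""
    # The header block ends at the first empty line (CRLF CRLF).
    head, _, body = raw.partition(b"\r\n\r\n")
    return head.decode("iso-8859-1"), body
```

In the script you would replace the single recv call with something like raw = recv_all(clientSocket), then split it and decode the body separately.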

4 Comments

Thank you! However, my professor requires the user to input the URL rather than having it hard-coded. This is where I am having issues, because different sites have different paths and I don't know how to generalize it.
Then use urlparse to parse it into its components.
Excuse my ignorance, but I'm not sure how to make that work. I'm only in intro to networking and my professor is not being very helpful. Everything I've done so far I've figured out through my own research, but I feel like I've hit a roadblock because I don't know much.
UPDATE: I was able to get it working with '("GET / HTTP/1.1\r\n" "Host:"+ url +"\r\n" "Connection: close\r\n\r\n")'
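If library helpers such as urlparse are off limits for the assignment, the host and path can probably be separated with plain string methods. This is a minimal sketch under some assumptions: split_url is a hypothetical name, the input is a plain http URL or a bare hostname, and ports, credentials, and fragments are not handled.

```python
def split_url(url):
    """Split user input into (host, path) using only string methods (hypothetical helper)."""
    url = url.strip()
    if "://" in url:  # drop a leading scheme such as http://
        url = url.split("://", 1)[1]
    # Everything before the first slash is the host, the rest is the path.
    host, _, rest = url.partition("/")
    return host, "/" + rest
```

The host part would then go to gethostbyname() and into the Host: header, and the path part would follow GET on the request line, so a bare hostname still ends up requesting /.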
