I'm trying to fetch the visible content of a page (the URL is in the script below) using socket, but I get an error when I run my script. I'm very new to socket programming and can't work out where I'm going wrong.

My code:

import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
host_ip = socket.gethostbyname('data.pr4e.org')
s.connect((host_ip,80))
cmd = "GET http://data.pr4e.org/romeo.txt HTTP/1.0\n\n".encode()
s.send(cmd)

while True:
    data = s.recv(1024)
    if len(data) < 1:
        break
    print(data.decode())
s.close()

Error I'm getting:

400 Bad Request

Your browser sent a request that this server could not understand.
  • This error comes from the server. Your script is correctly reading the response. Commented Oct 27, 2018 at 20:49
  • Might as well just use the requests library. Commented Oct 27, 2018 at 22:17
  • Here is a relevant SO question. Commented Oct 28, 2018 at 0:39

2 Answers

I was able to obtain the desired result by ending the request with \r\n\r\n instead of the original \n\n:

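import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect((socket.gethostbyname('data.pr4e.org'), 80))
# HTTP line endings are CRLF, and a blank line ends the header block.
s.sendall("GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n".encode())
# recv() returns at most 1024 bytes; printing the bytes object shows the raw \r\n sequences.
print(s.recv(1024))
s.close()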

Output:

...
Content-Type: text/plain\r\n\r\nBut soft what light through yonder window breaks\nIt is the east and Juliet is the sun\nArise fair sun and kill the envious moon\nWho is already sick and pale with grief\n'
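
Because a single recv(1024) returns only the first chunk, a fuller version might read until the server closes the connection and then split the headers from the body at the first blank line. The following is a minimal sketch of that idea. The helper names recv_all and split_response are hypothetical, and the sample response is hand-written for illustration rather than captured from the server.

```python
import socket


def recv_all(sock: socket.socket, bufsize: int = 1024) -> bytes:
    """Read until the peer closes the connection and return every byte received."""
    chunks = []
    while True:
        chunk = sock.recv(bufsize)
        if not chunk:  # empty bytes means the peer closed the connection
            break
        chunks.append(chunk)
    return b"".join(chunks)


def split_response(raw: bytes):
    """Split a raw HTTP response into (status_line, headers, body)."""
    # The first blank line (CRLF CRLF) separates the header block from the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body


# Hand-written sample for illustration only (an assumption, not real server output).
sample = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/plain\r\n"
    b"\r\n"
    b"But soft what light through yonder window breaks\n"
)
status, headers, body = split_response(sample)
print(status)
print(headers["content-type"])
print(body.decode(), end="")
```

With a connected socket s, calling split_response(recv_all(s)) after sendall() would replace the single print(s.recv(1024)) above. This relies on the server closing the connection once the response has been sent.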

2 Comments

Just use the requests library, then you won’t also hit a host of other problems that requests has already solved.
@barny I do agree, however, the OP is specifically trying to use socket.

There are multiple problems here:

  1. It is uncommon to include http://data.pr4e.org after GET (see RFC 7230) unless talking to a proxy. You will usually write GET /romeo.txt and provide the hostname in a separate Host: data.pr4e.org header. HTTP/1.1 servers are required to accept the form you used, but some violate the standard and choke on it. This is especially likely if you claim to be using HTTP/1.0, which is stricter and allows this form only when talking to a proxy.
  2. HTTP/1.0 is rarely used any more. All modern browsers and other HTTP clients use HTTP/1.1 or HTTP/2. Some servers still support HTTP/1.0, but it's not mandatory. Note that HTTP/1.1 makes the Host: header mandatory, even when you put the full URL after GET.
  3. HTTP/1.0 uses \r\n ("CRLF") as a newline (see RFC 1945), so \n may not always be understood. Again, some servers will handle it correctly, but it is non-conforming. The use of CRLF has been carried over to HTTP/1.1.
  4. print(data.decode()) adds an extra newline after each chunk of data. This becomes an issue when a large HTTP response arrives in several pieces, so that recv() returns multiple nonempty chunks: each print() then inserts a stray newline in the middle of the output. Use print(data.decode(), end='') instead.
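
Putting the four points together, a request that follows them might look like the sketch below. The function names build_request and fetch, the Connection: close header, and the 10-second timeout are illustrative assumptions, not part of the original script.

```python
import socket


def build_request(host: str, path: str) -> bytes:
    """Build a minimal HTTP/1.1 GET request with CRLF line endings."""
    lines = [
        f"GET {path} HTTP/1.1",  # path only, no scheme or hostname (point 1)
        f"Host: {host}",         # mandatory in HTTP/1.1 (point 2)
        "Connection: close",     # assumption: ask the server to close when done
        "",                      # the two empty items produce the final
        "",                      # CRLF CRLF that ends the header block (point 3)
    ]
    return "\r\n".join(lines).encode("ascii")


def fetch(host: str, path: str, port: int = 80) -> None:
    """Send the request and print the response as it arrives."""
    # The timeout value is an arbitrary choice for this sketch.
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(build_request(host, path))
        while True:
            data = sock.recv(1024)
            if not data:  # empty bytes means the server closed the connection
                break
            print(data.decode(), end="")  # no extra newline per chunk (point 4)


# Inspect the exact bytes that would be sent; no network access is needed here.
print(build_request("data.pr4e.org", "/romeo.txt"))
```

Calling fetch("data.pr4e.org", "/romeo.txt") would then send this request to the host from the question. Without Connection: close, an HTTP/1.1 server may keep the connection open, and the recv() loop would block until the timeout.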

2 Comments

This is all perfectly valid critique of the question, but that doesn’t make it an answer.
@barny: The \r\n issue is the cause of OP's particular error in this case, so I don't see how this is "not an answer" at all.
