I'm trying to build a Python crawler using the requests library. When I use the get method, the result I retrieve looks like: THá» THAO. But when I use curl, I get THỂ THAO, which is the result I expect. Here is my code:

import requests
from bs4 import BeautifulSoup

def get_raw_channel():
    r = requests.get('http://vtv.vn/')
    raw_html = r.text
    soup = BeautifulSoup(raw_html)
    o_tags = soup.find_all("option")
    for o_tag in o_tags:
        print o_tag.text
        # raw_channel = RawChannel(o_tag.text.strip(), o_tag['value'])
        # channels_file.write(raw_channel.__str__() + '\n')

Here is my curl command: curl http://vtv.vn/

Question: why are the results different? How can I get curl's result using requests?

3 Comments
  • What is the encoding of the response body? Commented Feb 9, 2015 at 8:17
  • @LutzHorn The curl response headers are: Date: Mon, 09 Feb 2015 07:59:34 GMT, Content-Type: text/html, Transfer-Encoding: chunked, Connection: close, Vary: Accept-Encoding, Server: vtv-rp. The requests response headers are: {'via': '1.1 TMG', 'proxy-connection': 'Keep-Alive', 'transfer-encoding': 'chunked', 'vary': 'Accept-Encoding', 'server': 'vtv-rp', 'connection': 'Keep-Alive', 'date': 'Mon, 09 Feb 2015 08:19:52 GMT', 'content-type': 'text/html'}. Commented Feb 9, 2015 at 8:20
  • @LutzHorn I don't see an encoding in the response, but I think it is UTF-8. Commented Feb 9, 2015 at 8:22

1 Answer

I tried your code, and in my case the encoding was 'ISO-8859-1'. Try encoding your data as UTF-8 before processing it with BeautifulSoup, something like:

...
raw_html = r.text.encode("utf-8")
soup = BeautifulSoup(raw_html)
...
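
One caveat: re-encoding only helps if `r.text` was decoded correctly in the first place. Here is a minimal offline sketch of what probably happens, assuming the server sends UTF-8 bytes and the body gets decoded as ISO-8859-1 (the encoding reported above). The variable names are made up for illustration and no network call is involved:

```python
# -*- coding: utf-8 -*-
# Assumption: the server sends the page as UTF-8 bytes.
expected = u"THỂ THAO"
body = expected.encode("utf-8")

# Decoding those bytes with the wrong codec produces the garbled text.
wrong = body.decode("iso-8859-1")

# Encoding the garbled text as UTF-8 does not undo the mistake:
# the result differs from the bytes the server sent.
double_encoded = wrong.encode("utf-8")

# Decoding the original bytes as UTF-8 gives the expected text.
right = body.decode("utf-8")

print(repr(wrong))
print(double_encoded == body)
print(right == expected)
```

If this assumption holds for the site, the fix has to happen at decoding time, either by setting `r.encoding` before reading `r.text` or by decoding `r.content` yourself.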

UPDATE: I ran some more tests, and it looks like everything worked for me because I explicitly set the encoding on the response. Take a look:

In [1]: import requests
In [2]: from BeautifulSoup import BeautifulSoup
In [3]: r = requests.get('http://vtv.vn/')
In [4]: r.encoding = "utf-8"
In [5]: raw_html = r.text
In [6]: soup = BeautifulSoup(raw_html)
In [7]: soup.findAll("option")
Out[7]: 
[<option value="1">
 VTV1</option>,
 ... stripped out some output ...

 VTVCab3 - Thể thao TV</option>,
 <option value="13">

 ... stripped out some output ...
]
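
The same idea can be written as a small helper that picks the codec from the Content-Type header and falls back to UTF-8 when no charset is declared, which is the situation in the headers posted in the comments. This is only a sketch: `charset_from_content_type` and `decode_body` are hypothetical names, not part of requests, and the UTF-8 fallback is an assumption about this site.

```python
# -*- coding: utf-8 -*-

def charset_from_content_type(content_type):
    """Return the charset named in a Content-Type value, or None."""
    # Parameters follow the media type, separated by semicolons.
    for part in (content_type or "").split(";")[1:]:
        name, _, value = part.strip().partition("=")
        if name.strip().lower() == "charset" and value:
            return value.strip().strip("\"'")
    return None


def decode_body(body, content_type, fallback="utf-8"):
    """Decode raw response bytes using the declared charset.

    The fallback is an assumption: it is used only when the header
    does not name a charset at all.
    """
    encoding = charset_from_content_type(content_type) or fallback
    return body.decode(encoding, errors="replace")


# Simulated response: UTF-8 bytes with a header that has no charset.
sample = u"VTVCab3 - Thể thao TV".encode("utf-8")
print(decode_body(sample, "text/html") == u"VTVCab3 - Thể thao TV")
```

With requests, you would pass `r.content` (the raw bytes) and `r.headers.get('content-type')` to these helpers instead of reading `r.text`.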

1 Comment

Thanks for your answer, but it does not work for me :(
