Python 2.7, issue with decode('utf-8')

Question

I have:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from urllib2 import urlopen

page2 = urlopen('http://pogoda.yandex.ru/moscow/').read().decode('utf-8')

page = urlopen('http://yasko.by/').read().decode('utf-8')

The line "page = ..." raises "UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 32: invalid continuation byte", but the line "page2 = ..." raises no error. Why?

The Cyrillic characters in yasko.by start at position 32. How do I decode them correctly?

Thanks!

falsetru · Accepted Answer · 2013-11-11 16:00:01Z

2

The content of http://yasko.by/ is encoded with windows-1251, while the content of http://pogoda.yandex.ru/moscow/ is encoded with utf-8.
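A page often declares its encoding in the charset parameter of the Content-Type response header. The sketch below parses such a header value with the standard library's email.message.Message. The function name charset_from_content_type and the sample header string are hypothetical, and whether either site actually sends this parameter is an assumption, not something checked here.

```python
from email.message import Message


def charset_from_content_type(header_value, default='utf-8'):
    """Return the charset parameter of a Content-Type value, or default."""
    msg = Message()
    msg['Content-Type'] = header_value
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset() or default


# Hypothetical header value, used only for illustration.
sample = 'text/html; charset=windows-1251'
print(charset_from_content_type(sample))
print(charset_from_content_type('text/html'))
```

If the header carries no charset, the function falls back to the default, which may still be wrong for a given page.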

The "page = ..." line should become:

page = urlopen('http://yasko.by/').read().decode('windows-1251')
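The error can be reproduced without a network call. In windows-1251 the letter "Г" is the single byte 0xc3, which UTF-8 treats as the start of a two-byte sequence, and the byte that follows is not a valid continuation byte. The sketch below uses a hypothetical helper, decode_with_fallback, that tries UTF-8 first and then windows-1251. The sample bytes stand in for a real page, and the fallback order is a heuristic rather than real encoding detection.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

# Stand-in for the bytes a windows-1251 page would return (illustrative only).
raw = '<title>Главная</title>'.encode('windows-1251')


def decode_with_fallback(data, encodings=('utf-8', 'windows-1251')):
    """Try each candidate encoding in order; return (text, encoding_used)."""
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the candidate encodings matched')


text, used = decode_with_fallback(raw)
print(used)
```

A single-byte codec such as windows-1251 accepts almost any byte sequence, so it should come last in the list. A successful decode does not prove the guess was right.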



2 Comments

user2350206 Over a year ago

And with .decode('windows-1251'), instead of "<title>Главная</title>" I get "<title>\u041e\u0428\u0418\u0411\u041a\u0410</title>".

falsetru Over a year ago

@user2350206, in Python 2.x the repr of a unicode string shows non-ASCII characters in the u'\uxxxx' form. Printing the string will show you what you expected: print(urlopen('http://yasko.by/').read().decode('windows-1251'))
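The \uxxxx form is only a way of displaying the same characters. As a small sketch, the escape sequence quoted in the comment can be turned back into text with the unicode_escape codec. The variable names are illustrative.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# The escape sequence quoted in the comment, held as plain ASCII bytes.
escaped = br'\u041e\u0428\u0418\u0411\u041a\u0410'

# Interpreting the escapes yields the actual unicode text.
text = escaped.decode('unicode_escape')

# Encoding it again reproduces the escaped form, which is roughly what
# Python 2 shows when it echoes a unicode object instead of printing it.
round_trip = text.encode('unicode_escape')

print(len(text))
print(round_trip == escaped)
```

Here the decoded text is "ОШИБКА", which is a different word from "Главная", so in that comment the page content itself differed as well as the display form. Printing text shows the Cyrillic letters as long as the terminal encoding supports them.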
