Getting non-ASCII characters as response from Python urllib

Question

import urllib
from urllib.request import urlopen
import xml.etree.ElementTree as etree
response = urllib.request.urlopen("http://regnskaber.virk.dk/32673592/eGJybHN0b3JlOi8vWC1GNzY5MUY0Ny0yMDE0MDMyOV8xMzQxNThfMTc5L3hicmw.xml")

print (response.getcode())

print (response.readline()) # prints the first line, in case you need to check the output

Please help me fix this encoding problem. I need to parse the XML content.

@haspander: it's not a built-in one. I have some restrictions on installing or using those libraries. — Chinna
– Chinna, Commented Jul 2, 2018 at 13:23
@ 9769953 -the output is : b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x00\xed=\xd9r\x1b9\x92\xef\x1d\xd1\xffP\xeb\x87\x8d\x99\x08\x89\xe2}x=\x8a\x95,\xb9\xc7\xdb\xb6\xe5\xb04\x9e\xddG\x90\x05\x920\x8bU\x1c\x00\xa4\xc5\x0f\xd8Oi\x7f\x83\xdf\xf9c\x9b\x99@\xdd\x07\x8b\x94\xdcR\xefL\x84\xc3\x92X@"\xef\x0b@\xf1\xd5\xfdXz\xe2%\xfe\xef\xdc/=_\xfd\xe5\xc5\\\xeb\xd5\xcb\xb3\xb3\xaf_\xbf6\xf0\xe3F gg\xedf\xb3s&|\xa5\x99?\xe1/\xcc\xc8\x97\xd3h,\x8ds\'\x13\xd6p\x17g*\x18\x87#\xc6\xc5#\xb8\xaf\xe5\xf6\x92y\x08\xecv\xce\xb9\xbe\x98L\x82\xb5\xaf\xdf\x04r\xf9\xd6\x9f\x04K — Chinna
– Chinna, Commented Jul 2, 2018 at 13:33
Add extra / new information into your question, not as a comment (comments should not be necessary). A traceback would also have been useful. But see my other comment, and my answer. — 9769953
– 9769953, Commented Jul 2, 2018 at 13:41

9769953 · Accepted Answer · 2018-07-03 08:03:40Z

4

The magic bytes 0x1f 0x8b at the start of the response indicate gzip compression. Servers often compress data for transport, and browsers decompress it automatically. Here, you'll have to do the second step yourself:

from urllib.request import urlopen
import xml.etree.ElementTree as ET
import gzip

url = "http://regnskaber.virk.dk/32673592/eGJybHN0b3JlOi8vWC1GNzY5MUY0Ny0yMDE0MDMyOV8xMzQxNThfMTc5L3hicmw.xml"
response = urlopen(url)
print(response.getcode())

# The body is gzip-compressed, so decompress it before parsing.
data = response.read()
text = gzip.decompress(data)

# fromstring() accepts bytes and honours the encoding declared in the XML.
tree = ET.fromstring(text)
print(tree)

Output:

200
<Element '{http://www.xbrl.org/2003/instance}xbrl' at 0x104d09098>
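
If the same code may also receive uncompressed responses, it can check for the gzip magic bytes before decompressing. The following is only a minimal sketch: the helper names decompress_if_gzipped and parse_xml_payload and the sample document are made up for illustration, and the demonstration runs offline on locally compressed data rather than on the real response.

```python
import gzip
import xml.etree.ElementTree as ET

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def decompress_if_gzipped(data: bytes) -> bytes:
    """Return the decompressed bytes if data starts with the gzip magic bytes."""
    if data[:2] == GZIP_MAGIC:
        return gzip.decompress(data)
    return data


def parse_xml_payload(data: bytes) -> ET.Element:
    """Parse an XML payload that may or may not be gzip-compressed."""
    return ET.fromstring(decompress_if_gzipped(data))


# Offline demonstration with a made-up document (not the real response).
sample = b"<root><item>42</item></root>"
compressed = gzip.compress(sample)

print(compressed[:2] == GZIP_MAGIC)       # compressed data carries the magic bytes
print(parse_xml_payload(compressed).tag)  # parsed after decompression
print(parse_xml_payload(sample).tag)      # plain XML is parsed unchanged
```

With the code above, this would be called as parse_xml_payload(response.read()). Servers usually also announce compression in the Content-Encoding response header, which could be inspected via response.headers.get("Content-Encoding"); whether this particular server sets it is an assumption that was not checked here.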

edited Jul 3, 2018 at 8:03

answered Jul 2, 2018 at 13:39

9769953
