I thought I had this, but then it all fell apart. I'm starting a scraper that pulls data from a Chinese website. When I isolate and print the elements I am looking for, everything works fine ("print element" and "print text"). However, when I add those elements to a dictionary, append it to a list, and then print the list (print holder), everything goes all "\x85\xe6\xb0" on me. Trying to .encode('utf-8') as part of the appending process just throws new errors. This may not ultimately matter because it is all going to be dumped into a CSV, but it makes troubleshooting really hard. What am I doing when I add the element to the dictionary that messes up the encoding?

thanks!

from bs4 import BeautifulSoup
import urllib
#csv is for the csv writer
import csv

#intended data structure is a list of dictionaries
# holder = [{'headline': TheHeadline, 'url': TheURL, 'date1': Date1, 'date2': Date2, 'date3':Date3}, {'headline': TheHeadline, 'url': TheURL, 'date1': Date1, 'date2': Date2, 'date3':Date3}]


#initializes the list that holds the output dictionaries

holder = []

txt_contents = "http://sousuo.gov.cn/s.htm?q=&n=80&p=&t=paper&advance=true&title=&content=&puborg=&pcodeJiguan=%E5%9B%BD%E5%8F%91&pcodeYear=2016&pcodeNum=&childtype=&subchildtype=&filetype=&timetype=timeqb&mintime=&maxtime=&sort=pubtime&nocorrect=&sortType=1"

#opens the output doc
output_txt = open("output.txt", "w")

def headliner(url):


    #opens the url for read access
    this_url = urllib.urlopen(url).read()
    #creates a new BS holder based on the URL
    soup = BeautifulSoup(this_url, 'lxml')

    #creates the headline section
    headline_text = ''
    #this bundles all of the headlines
    headline = soup.find_all('h3')
    #for each individual headline....
    for element in headline:
            headline_text += ''.join(element.findAll(text = True)).encode('utf-8').strip()
            #this is necessary to turn the findAll output into text
            print element
            text = element.text.encode('utf-8')
            #prints each headline
            print text
            print "*******"
            #creates the dictionary for just that headline
            temp_dict = {}
            #puts the headline in the dictionary
            temp_dict['headline'] = text

            #appends the temp_dict to the main list
            holder.append(temp_dict)

            output_txt.write(str(text))
            #output_txt.write(holder)

headliner(txt_contents)
print holder

output_txt.close()
  • I'm guessing they are unicode strings? Commented Apr 6, 2017 at 0:50
  • It's the difference between the __str__ representation and the __repr__ representation. Commented Apr 6, 2017 at 0:51
  • print(dct) calls __repr__ for each element. If you print them separately you will get the value you expect, as juanpa mentioned. Commented Apr 6, 2017 at 0:53

1 Answer

The encoding isn't being messed up. It's just different ways of representing the same thing:

>>> s = '漢字'
>>> s
'\xe6\xbc\xa2\xe5\xad\x97'
>>> print(s)
漢字
>>> s.__repr__()
"'\\xe6\\xbc\\xa2\\xe5\\xad\\x97'"
>>> s.__str__()
'\xe6\xbc\xa2\xe5\xad\x97'
>>> print(s.__repr__())
'\xe6\xbc\xa2\xe5\xad\x97'
>>> print(s.__str__())
漢字

The last piece of the puzzle is that when you put an object in a container, the container uses each element's repr, not its str, to build its own string representation:

>>> ls = [s]
>>> print(ls)
['\xe6\xbc\xa2\xe5\xad\x97']
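
For troubleshooting, one option is to leave the stored data alone and decode it only when displaying it. The following is a minimal sketch written for Python 3, where the UTF-8 encoded value is a bytes object rather than a Python 2 str; the names container_text and readable are made up for this illustration, and holder only mimics the list from the question.

```python
# Python 3 sketch: an encoded headline stored in a list of dictionaries.
encoded = '漢字'.encode('utf-8')   # stands in for element.text.encode('utf-8')
holder = [{'headline': encoded}]

# The list's string form embeds repr() of every element, so the bytes
# show up as escape sequences even though the data itself is intact.
container_text = str(holder)
print(container_text)

# Decoding each value only for display gives readable text again.
readable = [row['headline'].decode('utf-8') for row in holder]
for line in readable:
    print(line)
```

The stored bytes are never changed here; only the way they are displayed differs.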

Perhaps it will become clearer if we define our own custom object:

>>> class A(object):
...     def __str__(self):
...         return "str"
...     def __repr__(self):
...         return "repr"
...
>>> A()
repr
>>> print(A())
str
>>> ayes  = [A() for _ in range(5)]
>>> ayes
[repr, repr, repr, repr, repr]
>>> print(ayes[0])
str
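
If you do want a container displayed with the __str__ of its elements, you have to build that string yourself. This is a small sketch reusing the class A from the session above; the variable names as_container and as_joined are only illustrative, and the join is one possible approach, not something built into list.

```python
class A(object):
    def __str__(self):
        return "str"

    def __repr__(self):
        return "repr"


ayes = [A() for _ in range(3)]

# str() on the list still calls __repr__ on each element.
as_container = str(ayes)

# Calling str() on each element explicitly and joining the results
# produces a display that uses __str__ instead.
as_joined = "[" + ", ".join(str(a) for a in ayes) + "]"

print(as_container)
print(as_joined)
```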

12 Comments

If you're using unicode literals (s = u'漢字'), you'll get a UnicodeEncodeError if you do s.__str__(), but __repr__ gives you the encoding and print formats it as you would expect.
Thanks! Does that mean that there isn't a way to make the print(ls) actually print(ls.__str__())?
@mweinberg it is printing ls.__str__(), it's just that ls.__str__() is using the __repr__ of the objects it contains to construct the string!
@TemporalWolf yeah, I think that has something to do with the default encoding in Python 2 being ascii. It's been a while since I've used Python 2, so I don't remember the details exactly.
@mweinberg No, you'll still get the __repr__ of objects in a container in the container's __str__, but it makes working with unicode, and some of the things we were discussing with @TemporalWolf, much smoother.
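
Assuming the comment above is about moving to Python 3 (it does not say so explicitly), here is a rough sketch of what "smoother" can look like. In Python 3 the repr of a text string keeps printable non-ASCII characters, so a container of decoded text stays readable even though the container still uses repr. The names headline and container_text are invented for this example.

```python
# Python 3 sketch: store decoded text (str) instead of encoded bytes.
headline = '漢字'
holder = [{'headline': headline}]

# The list still builds its string from repr() of each element, but in
# Python 3 repr() of a str leaves printable characters unescaped.
container_text = str(holder)
print(container_text)
```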