Python - Unicode & double backslashes [duplicate]

Question

I scraped a webpage with BeautifulSoup. I got great output, except that parts of the list look like this after getting the text:

list = [u'that\\u2019s', u'it\\u2019ll', u'It\\u2019s', u'don\\u2019t', u'That\\u2019s', u'we\\u2019re', u'\\u2013']

My question now is how to get rid of these double backslashes, or replace them with the special characters they stand for.

If I print the first element of the example list, the output looks like this:

print list[0]
that\u2019s

I have already read a lot of other questions and threads about this topic, but I ended up even more confused, as I am a beginner when it comes to Unicode, encoding, and decoding.

I hope that someone could help me with this issue.

Thanks! MG

Unless the web page literally contains Unicode escape sequences like that (that\u2019s instead of that’s), BeautifulSoup will not return strings in that form. It returns the text without escaping anything. How are you getting those strings?
– roeland, Commented Jan 4, 2017 at 20:04
I applied a regex at the same time, and it seems that this was the problem. Do you have an explanation for that?
– bootica, Commented Jan 5, 2017 at 8:35
Have you scraped sub-parts of a JSON structure? If so, you should instead read the whole JSON value, parse it with json.loads, and access the pieces you want from there.
– bobince, Commented Jan 5, 2017 at 11:02
I did access them by first loading the file with data = json.load(name_of_file), and then I extracted only the part I want with raw = data['html']. I assume that the next step made my output messy: I tried to get rid of comments (in some cases a few were still left after using BeautifulSoup) with raw = re.sub('(?s)', '', str(raw)).
– bootica, Commented Jan 5, 2017 at 13:17
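As a rough illustration of the json.loads suggestion above, here is a minimal Python 3 sketch. The payload is made up for the example (only the 'html' key comes from the comments), so treat it as an assumption rather than the asker's real data. It shows that a JSON parser decodes \u2019 on its own, so no backslashes should remain in the parsed value:

```python
import json

# Hypothetical payload; only the 'html' key mirrors the comments above.
# In this Python literal, \\u2019 is a backslash followed by "u2019",
# which is how the escape appears in the raw JSON text.
payload = '{"html": "<p>that\\u2019s it</p>"}'

data = json.loads(payload)  # the JSON parser decodes \u2019 itself
raw = data['html']

has_backslash = '\\' in raw   # no escape sequence is left in the parsed text
has_quote = '\u2019' in raw   # the real RIGHT SINGLE QUOTATION MARK is present
```

If the parsed value still shows literal \uXXXX sequences, the escaping most likely happened in a later step, such as a conversion applied before the regex.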

jsbueno · Accepted Answer · 2017-01-04 15:39:04Z

13

Since you are using Python 2 there, it is simply a matter of re-applying the "decode" method, using the special codec "unicode_escape". It "sees" the "physical" backslashes and decodes those sequences into proper Unicode characters:

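# Python 2: each item holds a literal backslash followed by "u2019"
data = [u'that\\u2019s', u'it\\u2019ll', u'It\\u2019s', u'don\\u2019t', u'That\\u2019s', u'we\\u2019re', u'\\u2013']

# unicode_escape turns the literal \uXXXX sequences into the real characters
result = [part.decode('unicode_escape') for part in data]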

To anyone getting here using Python 3: in that version you cannot apply the "decode" method to the str objects delivered by BeautifulSoup. You first have to re-encode those to byte-string objects, and then decode with the unicode_escape codec. For this purpose it is useful to use the latin1 codec as the transparent encoding: each character in the str object is preserved as a single byte in the new bytes object:

result = [part.encode('latin1').decode('unicode_escape') for part in data]
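A self-contained Python 3 sketch of the line above, using a shortened copy of the question's list. One caveat worth knowing: latin1 only covers code points up to U+00FF, so the encode step raises UnicodeEncodeError if a string already contains a real character such as ’. The helper unescape_u below is a hypothetical alternative, not part of the original answer, that rewrites only \uXXXX sequences and leaves everything else alone:

```python
import re

# Shortened sample from the question: a literal backslash plus "u2019",
# not the actual character.
data = ['that\\u2019s', 'it\\u2019ll', '\\u2013']

# Approach from the answer: latin1 round-trip, then unicode_escape.
result = [part.encode('latin1').decode('unicode_escape') for part in data]

# Hypothetical helper: match a backslash, "u", and exactly four hex digits.
_U_ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})')

def unescape_u(text):
    """Replace literal \\uXXXX sequences with the characters they name."""
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

# Works even when real non-Latin-1 characters are already present.
mixed = unescape_u('\u2013 and \\u2013')
```

The regex version does not handle surrogate pairs or other escapes such as \n. If the data is known to contain only ASCII text plus escapes, the one-liner from the answer is simpler.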

edited Jan 4, 2017 at 15:39

answered Jan 4, 2017 at 15:21

jsbueno


4 Comments

Jeanderson Candido Over a year ago

AttributeError: 'str' object has no attribute 'decode'

jsbueno Over a year ago

You are using Python 3, and the OP and this example are both in Python 2. (In Python 2, to start with, a u" " prefixed string is a unicode object, not a str.) Please, the voting system is not meant for personal vendettas; it is meant for marking incorrect answers.

Jeanderson Candido Over a year ago

I don't see any reference to the Python version in the question.

jsbueno Over a year ago

That is just because you aren't used to Python's different versions. There is the "print" statement instead of a function, among other clues.

eLRuLL · Accepted Answer · 2017-01-04 15:15:04Z

4

The problem here is that the site ended up double-encoding those Unicode escapes. Just do the following:

ls = [u'that\\u2019s', u'it\\u2019ll', u'It\\u2019s', u'don\\u2019t', u'That\\u2019s', u'we\\u2019re', u'\\u2013']

ls = map(lambda x: x.decode('unicode-escape'), ls)

Now you have a list of properly decoded unicode strings:

for a in ls:
    print a

answered Jan 4, 2017 at 15:15

eLRuLL


10 Comments

bootica Over a year ago

I first tried your solution on my whole list and it didn't work. Then I copied your 4 lines of code into a script and tried to run it, and it threw the following error: UnicodeEncodeError: 'charmap' codec can't encode character u'\u2019' in position 4: character maps to <undefined>

eLRuLL Over a year ago

You should include your full example so that we can understand your question better. That new error is happening because you have strings inside your list that don't have double backslashes, so they are already decoded. You'll have to remove the good ones first, or use a try/except block.

jsbueno Over a year ago

This is more likely a problem that occurs when you try to print the decoded string in a terminal that can't map this character properly. Check your error message for the line where the error occurs. This answer is correct.

jsbueno Over a year ago

If you are on Windows you simply won't be able to see the correct output for this in the CMD terminal, because it uses an encoding with only 256 characters that does not include the "\u2019" char. Try saving your results to a UTF-8 encoded file and opening that in an editor instead.
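A minimal sketch of the "save to a UTF-8 file" suggestion, assuming Python 3 (io.open with an encoding argument also exists in Python 2). The file name decoded.txt and the temporary directory are placeholders chosen for the example:

```python
import io
import os
import tempfile

# Already-decoded strings, as produced by the unicode_escape step.
result = ['that\u2019s', 'it\u2019ll', '\u2013']

# Hypothetical output path; any writable location would do.
out_path = os.path.join(tempfile.mkdtemp(), 'decoded.txt')

# Writing with an explicit encoding avoids the console's limited code page.
with io.open(out_path, 'w', encoding='utf-8') as fh:
    fh.write('\n'.join(result))

# Reading the file back with the same encoding returns the same text.
with io.open(out_path, encoding='utf-8') as fh:
    round_trip = fh.read().split('\n')
```

Opening decoded.txt in an editor that understands UTF-8 should then show the curly apostrophes and the en dash, regardless of what the terminal can display.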

eLRuLL Over a year ago

@mgruber you just need to encode it to utf-8. Check this answer
