4

I scraped a webpage with BeautifulSoup. I got great output, except that parts of the list look like this after getting the text:

list = [u'that\\u2019s', u'it\\u2019ll', u'It\\u2019s', u'don\\u2019t', u'That\\u2019s', u'we\\u2019re', u'\\u2013']

My question now is how to get rid of these double backslashes, or replace the escape sequences with the special characters they stand for.

If I print the first element of the example list, the output looks like this:

print list[0]
that\u2019s

I have already read a lot of other questions and threads about this topic, but I ended up even more confused, as I am a beginner when it comes to Unicode, encoding, and decoding.

I hope that someone could help me with this issue.

Thanks! MG

5
  • @mgruber remember to accept an answer if it helped you Commented Jan 4, 2017 at 17:17
  • Unless the web page literally contains unicode escape sequences like that (that\u2019s instead of that’s), beautifulsoup will not return strings in that form. It will return the text without escaping anything. How are you getting those strings? Commented Jan 4, 2017 at 20:04
  • I ran a regex at the same time, and it seems that this was the problem. Do you have an explanation for that offhand? Commented Jan 5, 2017 at 8:35
  • Have you scraped sub-parts of a JSON structure? If so you should instead try to read the whole JSON value, parse it using json.loads and access the pieces of it you want from there. Commented Jan 5, 2017 at 11:02
  • I did access them by first loading the file with data = json.load(name_of_file) and then extracting only the part I want with raw = data['html']. I assume that the next step, where I tried to get rid of comments (in some cases a few were still left after using BeautifulSoup) with raw = re.sub('(?s)<!--.*?-->', '', str(raw)), made my output messy. Commented Jan 5, 2017 at 13:17
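
The flow described in the comments above (parse the whole JSON value, pick out one field, strip HTML comments) can be sketched as follows. This is a minimal Python 3 sketch under assumptions: the `payload` string is a made-up stand-in for the file contents, and only the `'html'` key comes from the thread. It applies `re.sub` to the decoded text directly, without a `str()` conversion; whether that conversion was the actual cause of the escaped output cannot be confirmed from what is shown here.

```python
import json
import re

# Hypothetical stand-in for the JSON file; the real document is not shown.
payload = '{"html": "<p>that\\u2019s it</p><!-- a\\ncomment -->"}'

# json.loads turns the JSON escape \u2019 into the real character.
data = json.loads(payload)
raw = data['html']

# (?s) lets "." match newlines, so multi-line comments are removed too.
# raw is already text, so it is passed to re.sub as-is.
cleaned = re.sub(r'(?s)<!--.*?-->', '', raw)
```

After this runs, `cleaned` holds the markup with the comment removed and the apostrophe as a single real character rather than a backslash sequence.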

2 Answers 2

13

Since you are using Python 2 there, it is simply a matter of re-applying the "decode" method, using the special codec "unicode_escape". It "sees" the "physical" backslashes and decodes those sequences into proper unicode characters:

data =  [u'that\\u2019s', u'it\\u2019ll', u'It\\u2019s', u'don\\u2019t', u'That\\u2019s', u'we\\u2019re', u'\\u2013']

result = [part.decode('unicode_escape') for part in data]

To anyone getting here using Python 3: in that version you cannot apply the "decode" method to the str objects delivered by BeautifulSoup. You first have to re-encode them to byte-string objects and then decode with the unicode_escape codec. For this purpose it is useful to use the latin1 codec as a transparent encoding: every character up to U+00FF in the str object maps to the byte with the same value in the new bytes object:

result = [part.encode('latin1').decode('unicode_escape') for part in data]
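
As a self-contained check of the Python 3 variant, here is a small sketch that runs the round trip on the list from the question. The helper name `unescape` is made up for illustration.

```python
# Python 3 sketch: turn literal "\uXXXX" text back into the characters it names.
data = ['that\\u2019s', 'it\\u2019ll', 'It\\u2019s', 'don\\u2019t',
        'That\\u2019s', 'we\\u2019re', '\\u2013']


def unescape(text):
    # latin1 maps code points 0-255 one-to-one onto bytes, so the backslash
    # sequences reach the unicode_escape codec unchanged.
    return text.encode('latin1').decode('unicode_escape')


result = [unescape(part) for part in data]
```

Note that `encode('latin1')` raises `UnicodeEncodeError` if a string already contains a character above U+00FF (for example an apostrophe that was decoded earlier), so this form assumes every element still holds only the escaped text.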

4 Comments

AttributeError: 'str' object has no attribute 'decode'
You are using Python 3, and the OP and this example are both in Python 2. (In Python 2, to start with, a u" " prefixed string is a unicode object, not a str.) Please, the voting system is not meant for personal vendettas; it is meant for marking incorrect answers.
I don't see any reference to the Python version in the question.
That is just because you aren't used to Python's different versions. There is the "print" statement instead of a function, among other clues.
4

The problem here is that the site ended up double-encoding those Unicode characters. Just do the following:

ls = [u'that\\u2019s', u'it\\u2019ll', u'It\\u2019s', u'don\\u2019t', u'That\\u2019s', u'we\\u2019re', u'\\u2013']

ls = map(lambda x: x.decode('unicode-escape'), ls)

Now you have a list of properly decoded unicode strings:

for a in ls:
   print a

10 Comments

I first tried your solution on my whole list and it didn't work. Then I copied your 4 code lines into a script and tried to run it, and it threw the following error: UnicodeEncodeError: 'charmap' codec can't encode character u'\u2019' in position 4: character maps to <undefined>
You should include your full example so that your question can be understood better. That new error is happening because you have strings inside your list that don't have double backslashes, so they are already decoded. You'll have to remove the good ones first, or use a try/except block.
This is more likely a problem that occurs when you try to print the decoded string in a terminal that can't map this character properly. Check your error message for the line where the error occurs. This answer is correct.
If you are on Windows you simply won't be able to see the correct output for this in the CMD terminal, because it uses an encoding with only 256 characters that does not include the "\u2019" character. Try saving your results to a UTF-8 encoded file and opening that in an editor instead.
@mgruber you just need to encode it to utf-8. Check this answer
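
The two suggestions in the comments above (cope with lists where some strings are already decoded, and write the result to a UTF-8 file instead of printing it to a limited console) can be combined in a short Python 3 sketch. This swaps the codec approach for a regex substitution, which is a different technique from the one in the answers. The names `unescape_mixed`, `words`, and `cleaned_words.txt` are hypothetical, and the pattern only covers four-digit `\uXXXX` escapes, not surrogate pairs or other escape forms.

```python
import io
import os
import re
import tempfile

# Matches a literal backslash, "u", and exactly four hex digits.
ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})')


def unescape_mixed(text):
    # Replace each literal \uXXXX with the character it names and leave
    # everything else, including already-decoded characters, untouched.
    return ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


# One escaped string, one already-decoded string, one plain string.
words = ['that\\u2019s', 'it\u2019ll', 'plain']
cleaned = [unescape_mixed(w) for w in words]

# Write to a UTF-8 file rather than printing, so the result does not depend
# on what the console encoding can represent.
out_path = os.path.join(tempfile.gettempdir(), 'cleaned_words.txt')
with io.open(out_path, 'w', encoding='utf-8') as handle:
    handle.write('\n'.join(cleaned))
```

Because already-decoded strings pass through unchanged, no filtering or try/except is needed for a mixed list in this variant.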