I'm using Requests and BeautifulSoup with Python 3.4 to scrape information off a website that may or may not contain Japanese or other special characters.

def startThisPage(url):
    r = requests.get(str(url))
    r.encoding="utf8"
    print(r.content.decode('utf8'))
    soup = BeautifulSoup(r.content,'html.parser')
    print(soup.h2.string)

The h2 contains this: "Fate/kaleid liner Prisma ☆ Ilya Zwei!" and I'm pretty sure the star is what is giving me trouble.

The error being thrown:

UnicodeEncodeError: 'charmap' codec can't encode character '\u2606' in position 25: character maps to <undefined>

The page is encoded as UTF-8, so I tried encoding and decoding the byte string I receive from r.content as UTF-8. I also tried decoding with unicode_escape first, thinking the problem was a doubled backslash, but that wasn't the case. Any ideas?

  • Are you on Windows? Printing Unicode text to the Windows console is notoriously unreliable. Commented Aug 24, 2015 at 22:28
  • I am running Windows 7 64-bit. How would I get around it, since I don't have Ubuntu installed? @OdraEncoded Commented Aug 24, 2015 at 23:22
  • You could write it to a file instead of printing or remove non-ASCII characters. You could also make a GUI for showing the output if you need it real time. Honestly I wouldn't bother trying to get the windows console to display characters right, maybe PowerShell (the new C#-based command prompt) can print them. Commented Aug 24, 2015 at 23:43
  • I still get the same error trying to write to a file... Commented Aug 25, 2015 at 0:07
  • Unrelated: you could use BeautifulSoup(requests.get(url).text) to pass Unicode, or even BeautifulSoup(urllib.request.urlopen(url)) to pass bytes as is (assuming urlopen() works for the url). Commented Aug 25, 2015 at 7:42

1 Answer

soup.h2.string is a Unicode string. A console character encoding such as cp437 can't represent some Unicode characters (here ☆, U+2606 WHITE STAR), which leads to the error. To work around it, see my answer to the "Python, Unicode, and the Windows console" question.
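
As a rough illustration of one possible workaround (not necessarily the approach in the linked answer), the text can be made safe for the console by replacing unencodable characters with backslash escapes before printing. The helper name console_safe is hypothetical, and cp437 is only an assumed console encoding; on a real system the value of sys.stdout.encoding would normally be used:

```python
import sys


def console_safe(text, encoding=None):
    """Return text with characters the target encoding can't represent
    replaced by backslash escapes such as \\u2606.

    `encoding` defaults to the current stdout encoding; this helper is a
    hypothetical sketch, not part of any library.
    """
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    # Encode with a lossy-but-safe error handler, then decode back to str
    # so the result can be passed to print() without raising.
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


title = "Fate/kaleid liner Prisma ☆ Ilya Zwei!"
# cp437 is assumed here purely to reproduce the situation in the question.
print(console_safe(title, "cp437"))
```

The star is lost from the display, but printing no longer raises UnicodeEncodeError, which may be enough for debugging output.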

Regarding the comment "I still get the same error trying to write to a file...":

Files (created using open()) use locale.getpreferredencoding(False), such as cp1252, by default. Use an explicit character encoding that supports the full Unicode range instead:

import io

with io.open('title.txt', 'w', encoding='utf-8') as file:
    file.write(soup.h2.string)
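
To check that the explicit encoding behaves as described, a small round trip can be sketched as follows. The helper names write_title and read_title and the temporary path are assumptions made for illustration, and the title string stands in for soup.h2.string:

```python
import os
import tempfile


def write_title(path, title):
    # An explicit encoding avoids depending on the locale default (e.g. cp1252).
    with open(path, "w", encoding="utf-8") as f:
        f.write(title)


def read_title(path):
    # Read back with the same encoding that was used for writing.
    with open(path, encoding="utf-8") as f:
        return f.read()


title = "Fate/kaleid liner Prisma ☆ Ilya Zwei!"  # stand-in for soup.h2.string
path = os.path.join(tempfile.mkdtemp(), "title.txt")
write_title(path, title)

# cp1252 has no mapping for U+2606, which is why a default-encoded file fails.
try:
    title.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False
```

Reading the file back with the same encoding should return the original string, star included.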