How to handle utf-8 text with Python 3?

Question

I need to parse various text sources and then print them or store them somewhere.

Every time a non-ASCII character is encountered, I can't print it correctly because it gets converted to bytes, and I have no idea how to view the correct characters.

(I'm quite new to Python, I come from PHP where I never had any utf-8 issues)

The following is a code example:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import codecs
import feedparser

url = "http://feeds.bbci.co.uk/japanese/rss.xml"
feeds = feedparser.parse(url)
title = feeds['feed'].get('title').encode('utf-8')

print(title)

file = codecs.open("test.txt", "w", "utf-8")
file.write(str(title))
file.close()

I'd like to print the RSS title (BBC Japanese - ホーム) and write it to a file, but instead the result is this:

b'BBC Japanese - \xe3\x83\x9b\xe3\x83\xbc\xe3\x83\xa0'

Both on screen and in the file. Is there a proper way to do this?

Chris_Rands · Accepted Answer · 2018-03-21 21:10:42Z

10

In Python 3, bytes and str are two different types. str represents text (any Unicode string), while bytes represents raw encoded data. When you encode() something, you convert it from its str representation to its bytes representation in a specific encoding.
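To make the distinction concrete, here is a minimal sketch of the round trip between the two types. The sample title is taken from the question; the variable names are only illustrative.

```python
# str holds Unicode text; bytes holds the encoded form of that text.
title = "BBC Japanese - ホーム"

encoded = title.encode("utf-8")    # str -> bytes
decoded = encoded.decode("utf-8")  # bytes -> str

# Calling str() on bytes does not decode them; it returns the repr,
# which is where the b'...' output in the question comes from.
as_repr = str(encoded)
```

Decoding with the same encoding that was used for encoding returns the original text.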

In your case, in order to get the decoded strings, you just need to remove the encode('utf-8') part:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import codecs
import feedparser

url = "http://feeds.bbci.co.uk/japanese/rss.xml"
feeds = feedparser.parse(url)
title = feeds['feed'].get('title')

print(title)

file = codecs.open("test.txt", "w", encoding="utf-8")
file.write(title)
file.close()
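In Python 3 the built-in open() accepts an encoding argument, so codecs.open is not strictly needed. A minimal sketch, assuming the title is already a str; write_title is a hypothetical helper and the path is a temporary file chosen for illustration:

```python
import os
import tempfile

def write_title(path, title):
    """Write a str to path as UTF-8 and return what is read back."""
    # The with block closes the file even if write() raises.
    with open(path, "w", encoding="utf-8") as f:
        f.write(title)
    with open(path, "r", encoding="utf-8") as f:
        return f.read()

# Hypothetical usage with a temporary file instead of test.txt.
path = os.path.join(tempfile.mkdtemp(), "test.txt")
round_trip = write_title(path, "BBC Japanese - ホーム")
```

Passing the encoding explicitly avoids depending on the platform default, which is not always UTF-8.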

edited Mar 21, 2018 at 21:10

Chris_Rands


answered Jul 13, 2016 at 9:31

Dean Fenster



5 Comments

Omiod Over a year ago

Thanks Dean, but in this case print raises this exception, which I encountered before but could not fix: UnicodeEncodeError: 'charmap' codec can't encode characters in position 15-17: character maps to <undefined>

Dean Fenster Over a year ago

Which version of Python are you using exactly? This works for me on the latest version of Python...

Omiod Over a year ago

Python 3.5.1 (v3.5.1:37a07cee5969, Dec 6 2015, 01:54:25) on Windows

Dean Fenster Over a year ago

The default consoles on Windows aren't very friendly to printing Unicode characters. Is the text written to the file properly?

Omiod Over a year ago

You are right, the file is correctly written. Seems easier than expected! Now I have to fix the Windows console issue.
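For the console issue, one hedged option is to make the text safe for whatever encoding the console reports before printing it. The sketch below assumes the console encoding is available as sys.stdout.encoding; console_safe is a hypothetical helper, not part of any library, and cp1252 is only an example of a legacy Windows code page. On Python 3.7 and later, sys.stdout.reconfigure(encoding="utf-8") is another route, provided the terminal itself can display UTF-8.

```python
import sys

def console_safe(text, encoding=None):
    """Return text with characters the target encoding cannot represent
    replaced by backslash escapes, so print() does not raise UnicodeEncodeError."""
    # Fall back to UTF-8 when stdout has no encoding (e.g. it was replaced).
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    return text.encode(encoding, errors="backslashreplace").decode(encoding)

# cp1252 cannot represent the katakana, so they become \uXXXX escapes.
safe = console_safe("BBC Japanese - ホーム", "cp1252")
print(safe)
```

The escapes are not pretty, but no information is lost and the script keeps running; the file written with an explicit UTF-8 encoding is unaffected.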

rizerphe · Accepted Answer · 2021-07-05 03:01:40Z

2

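To write JSON data that contains Unicode text such as Japanese characters, open the file as UTF-8 and pass ensure_ascii=False to json.dump:

import json

def jsonFileCreation(messageData, fileName):
    # ensure_ascii=False writes non-ASCII characters as-is instead of \uXXXX escapes
    with open(fileName, "w", encoding="utf-8") as outfile:
        json.dump(messageData, outfile, indent=8, sort_keys=False, ensure_ascii=False)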
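The effect of ensure_ascii can be seen with json.dumps, which takes the same argument as json.dump. A small sketch with made-up sample data:

```python
import json

data = {"title": "BBC Japanese - ホーム"}

escaped = json.dumps(data)                       # default: ensure_ascii=True
readable = json.dumps(data, ensure_ascii=False)  # keeps the characters as-is

# Both forms decode back to the same Python object.
same = json.loads(escaped) == json.loads(readable) == data
```

Either output is valid JSON; ensure_ascii=False only makes the file readable by a human without decoding the escapes.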

edited Jul 5, 2021 at 3:01

rizerphe


answered Jul 1, 2021 at 10:07

Tarkeshwar Prasad


1 Comment

arun n a Over a year ago

Please paste code snippet inside ``` and close it.

qjx · Accepted Answer · 2020-11-20 08:48:08Z

1

In Python 3, print(A) encodes the string A with the console's encoding before writing it out; in this case that encoding is 'gbk'. A character that cannot be encoded in gbk makes print fail. So if you want to print A without an error, first round-trip it through gbk, ignoring the characters gbk cannot represent, as follows:

print(A.encode('gbk','ignore').decode('gbk'))
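Note that the 'ignore' error handler silently drops every character the target encoding cannot represent. A small sketch comparing it with 'replace'; the sample string is made up, and the emoji is used because it falls outside gbk:

```python
text = "OK 😀"

dropped = text.encode("gbk", "ignore").decode("gbk")    # unencodable character removed
replaced = text.encode("gbk", "replace").decode("gbk")  # unencodable character becomes "?"
```

With 'replace' the output at least shows that something was there, which can make missing text easier to notice.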

edited Nov 20, 2020 at 8:48

answered Nov 20, 2020 at 7:30

qjx


2 Comments

liakoyras Over a year ago

Isn't UTF-8 the default in Python3?

qjx Over a year ago

"UTF-8 is the default in Python 3" means that the source you are editing is in UTF-8, but on Windows, print can only print strings that can be encoded in gbk. The reason is what I described in my answer.
