5

I need to parse various text sources and then print the results or store them somewhere.

Every time a non-ASCII character is encountered, I can't print it correctly: it gets converted to bytes, and I have no idea how to view the correct characters.

(I'm quite new to Python; I come from PHP, where I never had any UTF-8 issues.)

The following is a code example:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import codecs
import feedparser

url = "http://feeds.bbci.co.uk/japanese/rss.xml"
feeds = feedparser.parse(url)
title = feeds['feed'].get('title').encode('utf-8')

print(title)

file = codecs.open("test.txt", "w", "utf-8")
file.write(str(title))
file.close()

I'd like to print the RSS title (BBC Japanese - ホーム) and write it to a file, but instead the result is this:

b'BBC Japanese - \xe3\x83\x9b\xe3\x83\xbc\xe3\x83\xa0'

Both on screen and in the file. Is there a proper way to do this?

3 Answers

10

In Python 3, bytes and str are two different types. str represents any text (including Unicode), and when you encode() something, you convert it from its str representation to its bytes representation in a specific encoding.
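
As a rough illustration of that round trip, here is a minimal sketch. The title is hard-coded instead of being fetched with feedparser, and the variable names are only for illustration. It also shows why the b'...' form appears: calling str() on a bytes object returns its repr rather than decoding it.

```python
# Hard-coded stand-in for the feed title (assumption: no network access).
title = "BBC Japanese - ホーム"

encoded = title.encode("utf-8")    # str -> bytes
decoded = encoded.decode("utf-8")  # bytes -> str

# str() on bytes does not decode; it returns the b'...' representation.
as_text = str(encoded)

print(type(title).__name__, type(encoded).__name__)
print(as_text)
print(decoded == title)
```

So once the text has been encoded, the only way back to readable characters is decode(), not str().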

In your case, to get the decoded strings, you just need to remove the encode('utf-8') part:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import codecs
import feedparser

url = "http://feeds.bbci.co.uk/japanese/rss.xml"
feeds = feedparser.parse(url)
title = feeds['feed'].get('title')

print(title)

file = codecs.open("test.txt", "w", encoding="utf-8")
file.write(title)
file.close()

5 Comments

Thanks Dean, but in this case the print has this exception, that I encountered before, but could not fix: UnicodeEncodeError: 'charmap' codec can't encode characters in position 15-17: character maps to <undefined>
Which version of Python are you using exactly? This works for me on the latest version of Python...
Python 3.5.1 (v3.5.1:37a07cee5969, Dec 6 2015, 01:54:25) on Windows
The default consoles on Windows aren't very friendly to printing Unicode characters. Is the text written to the file properly?
You are right, the file is written correctly. Seems easier than expected! Now I have to fix the Windows console issue.
2

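To write JSON data that contains Japanese characters as readable Unicode, open the file with encoding="utf-8" and pass ensure_ascii=False:

import json

def jsonFileCreation(messageData, fileName):
    # ensure_ascii=False keeps non-ASCII characters instead of \uXXXX escapes
    with open(fileName, "w", encoding="utf-8") as outfile:
        json.dump(messageData, outfile, indent=8, sort_keys=False, ensure_ascii=False)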
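
To see what ensure_ascii changes, here is a small sketch using json.dumps on an in-memory dict. The message dict is made-up sample data, not something from the question.

```python
import json

# Hypothetical sample data for illustration.
message = {"title": "ホーム"}

# Default (ensure_ascii=True): non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(message)

# ensure_ascii=False: the characters are kept as-is in the output string.
readable = json.dumps(message, ensure_ascii=False)

print(escaped)
# Both forms load back to the same data; only the on-disk text differs.
print(json.loads(escaped) == json.loads(readable))
```

Either form is valid JSON, so ensure_ascii=False mostly matters when a person needs to read the file.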

1 Comment

Please paste code snippet inside ``` and close it.
1

In Python 3, print(A) encodes the string A with the console's encoding (sys.stdout.encoding), which is 'gbk' on a Chinese-locale Windows console. Characters that this encoding cannot represent raise a UnicodeEncodeError. To print A without the error, you can first drop the characters that gbk cannot encode, as follows:

print(A.encode('gbk','ignore').decode('gbk'))
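
The same idea can be sketched without depending on your own console. The helper name console_safe and its encoding parameter are made up for this example, and it uses errors="replace" rather than 'ignore' so that lost characters stay visible as '?'. On Python 3.7 and later, sys.stdout.reconfigure(encoding="utf-8") is another option, assuming the console itself can display UTF-8.

```python
import sys

def console_safe(text, encoding=None):
    """Replace characters the target encoding cannot represent with '?'.

    Hypothetical helper: falls back to the console encoding, then to UTF-8.
    """
    target = encoding or sys.stdout.encoding or "utf-8"
    return text.encode(target, errors="replace").decode(target)

title = "BBC Japanese - ホーム"

# With an encoding that lacks the Japanese characters, they become '?'.
print(console_safe(title, "ascii"))

# With UTF-8 nothing is lost, so the text comes back unchanged.
print(console_safe(title, "utf-8") == title)
```

This only makes print() safe; it does not make the console display the characters, so writing to a UTF-8 file is still the way to keep the original text.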

2 Comments

Isn't UTF-8 the default in Python 3?
"UTF-8 is the default in Python 3" means that the source you are editing is in UTF-8, but on Windows, print can only output strings that can be encoded in the console's encoding (gbk here). The reason is explained in my answer.
