Python Error; UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026'

Question

I am trying to extract some data from a JSON file that contains tweets and write it to a CSV file. The file contains all kinds of characters, which I'm guessing is why I get this error message:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026'

I guess I have to convert the output to UTF-8 before writing the CSV file, but I have not been able to do that. I have found similar questions here on Stack Overflow, but I've not been able to adapt the solutions to my problem. (I should add that I am not really familiar with Python. I'm a social scientist, not a programmer.)

import csv
import json

fieldnames = ['id', 'text']

with open('MY_SOURCE_FILE', 'r') as f, open('MY_OUTPUT', 'a') as out:

    writer = csv.DictWriter(
                    out, fieldnames=fieldnames, delimiter=',', quoting=csv.QUOTE_ALL)

    for line in f:
        tweet = json.loads(line)
        user = tweet['user']
        output = {
            'text': tweet['text'],
            'id': tweet['id'],
        }
        writer.writerow(output)

Could you try import codecs, and then: with codecs.open('MY_SOURCE_FILE', 'r', encoding='utf-8') as f, codecs.open('MY_OUTPUT', 'a', encoding='utf-8') as out: The codecs module will handle the decoding and encoding for you.
– EdChum, Commented Apr 26, 2015 at 10:12
The examples section in the csv documentation says "The csv module doesn’t directly support reading and writing Unicode, but it is 8-bit-clean save for some problems with ASCII NUL characters. So you can write functions or classes that handle the encoding and decoding for you as long as you avoid encodings like UTF-16 that use NULs. UTF-8 is recommended." Note this is only part of your problem. The other is trying to read the JSON file line-by-line.
– martineau, Commented Apr 26, 2015 at 10:44
@martineau, the file has a separate JSON object on each line. All that is needed is an encode.
– Padraic Cunningham, Commented Apr 26, 2015 at 10:46
@Padraic: How do you know that each line is a different JSON object?
– martineau, Commented Apr 26, 2015 at 10:49

Padraic Cunningham · Accepted Answer · 2015-04-26 10:54:26Z

You just need to encode the text to utf-8:

for line in f:
    tweet = json.loads(line)
    user = tweet['user']
    output = {
        'text': tweet['text'].encode("utf-8"),
        'id': tweet['id'],
    }
    writer.writerow(output)
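
To see why the explicit encode helps, here is a minimal sketch (the sample string is made up for illustration, and the snippet is written to run on both Python 2 and Python 3). The ASCII codec has no byte for U+2026, the horizontal ellipsis named in the error message, whereas UTF-8 represents it as three bytes:

```python
# Hypothetical sample text containing U+2026 (horizontal ellipsis),
# the character named in the error message.
text = u"to be continued\u2026"

# Encoding to ASCII fails, because ASCII only covers code points 0-127.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding to UTF-8 succeeds; U+2026 becomes the three bytes E2 80 A6.
utf8_bytes = text.encode("utf-8")

print(ascii_ok)
print(repr(utf8_bytes))
```

In Python 2 the csv writer attempts that ASCII conversion implicitly when handed a unicode object, which is why passing already-encoded UTF-8 bytes avoids the error.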

The csv module does not support writing unicode in Python 2:

Note: This version of the csv module doesn’t support Unicode input. Also, there are currently some issues regarding ASCII NUL characters. Accordingly, all input should be UTF-8 or printable ASCII to be safe; see the examples in section Examples.
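
For readers on Python 3, where the csv module works with text rather than bytes, the explicit encode should not be needed; the encoding is given when the files are opened instead. A minimal sketch, assuming (as in the question) one JSON object per line with 'id' and 'text' keys. The function name tweets_to_csv and its path arguments are placeholders, not part of the original code:

```python
import csv
import json

FIELDNAMES = ['id', 'text']


def tweets_to_csv(source_path, output_path):
    """Copy 'id' and 'text' from a one-JSON-object-per-line file to a CSV file."""
    # newline='' is what the csv documentation recommends for files
    # handed to a csv writer; encoding='utf-8' makes the codec explicit.
    with open(source_path, 'r', encoding='utf-8') as f, \
            open(output_path, 'a', encoding='utf-8', newline='') as out:
        writer = csv.DictWriter(
            out, fieldnames=FIELDNAMES, delimiter=',', quoting=csv.QUOTE_ALL)
        for line in f:
            if not line.strip():
                continue  # skip blank lines (an assumption about the input)
            tweet = json.loads(line)
            # No .encode() here: Python 3's csv writer accepts str directly.
            writer.writerow({'id': tweet['id'], 'text': tweet['text']})
```

It would be called as tweets_to_csv('MY_SOURCE_FILE', 'MY_OUTPUT'). Like the original loop, it appends rows without writing a header line.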

edited Apr 26, 2015 at 10:54

answered Apr 26, 2015 at 10:42


1 Comment

5mark Over a year ago

This finally worked for me! Thanks to everyone who contributed.
