I am trying to extract some data from a JSON file that contains tweets and write it to a CSV file. The file contains all kinds of characters, which I'm guessing is why I get this error message:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026'

I guess I have to convert the output to UTF-8 before writing the CSV file, but I have not been able to do that. I have found similar questions here on Stack Overflow, but I have not been able to adapt the solutions to my problem. (I should add that I am not really familiar with Python. I'm a social scientist, not a programmer.)

import csv
import json

fieldnames = ['id', 'text']

with open('MY_SOURCE_FILE', 'r') as f, open('MY_OUTPUT', 'a') as out:

    writer = csv.DictWriter(
                    out, fieldnames=fieldnames, delimiter=',', quoting=csv.QUOTE_ALL)

    for line in f:
        tweet = json.loads(line)
        user = tweet['user']
        output = {
            'text': tweet['text'],
            'id': tweet['id'],
        }
        writer.writerow(output)
  • Could you try import codecs, and then with codecs.open('MY_SOURCE_FILE', 'r', encoding='utf-8') as f, codecs.open('MY_OUTPUT', 'a', encoding='utf-8') as out: so that the codecs module handles the decoding and encoding for you? Commented Apr 26, 2015 at 10:12
  • Actually, can you show what your file looks like? Commented Apr 26, 2015 at 10:17
  • The examples section in the csv documentation says "The csv module doesn’t directly support reading and writing Unicode, but it is 8-bit-clean save for some problems with ASCII NUL characters. So you can write functions or classes that handle the encoding and decoding for you as long as you avoid encodings like UTF-16 that use NULs. UTF-8 is recommended." Note this is only part of your problem. The other is trying to read the JSON file line-by-line. Commented Apr 26, 2015 at 10:44
  • @martineau, each line of the file is a separate JSON object. All that is needed is an encode. Commented Apr 26, 2015 at 10:46
  • @Padraic: How do you know that each line is a different JSON object? Commented Apr 26, 2015 at 10:49

1 Answer

You just need to encode the text to utf-8:

for line in f:
    tweet = json.loads(line)
    user = tweet['user']
    output = {
        # Python 2's csv module expects byte strings, so encode the unicode text
        'text': tweet['text'].encode("utf-8"),
        'id': tweet['id'],
    }
    writer.writerow(output)
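
To see what the encode call does, here is a minimal illustration using the character from the error message (the variable names are only for this example):

```python
# U+2026 is the horizontal ellipsis that the 'ascii' codec could not encode.
ellipsis = u'\u2026'

# Encoding to UTF-8 turns the single character into three bytes,
# which the Python 2 csv module can write without complaint.
encoded = ellipsis.encode('utf-8')
assert encoded == b'\xe2\x80\xa6'

# Decoding the bytes gives back the original character.
assert encoded.decode('utf-8') == ellipsis
```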

The csv module does not support writing Unicode in Python 2, as its documentation notes:

Note: This version of the csv module doesn’t support Unicode input. Also, there are currently some issues regarding ASCII NUL characters. Accordingly, all input should be UTF-8 or printable ASCII to be safe; see the examples in section Examples.
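
If switching to Python 3 is an option, the csv module there works with text directly, so the manual encode is not needed and the encoding is chosen when the file is opened. The following is only a sketch under that assumption; the function name tweets_to_csv is made up for illustration, and it assumes, as in the question, that every line of the source file is one JSON object with 'id' and 'text' keys:

```python
import csv
import json

FIELDNAMES = ['id', 'text']


def tweets_to_csv(source_path, output_path):
    """Append the 'id' and 'text' of each tweet (one JSON object per line) to a CSV file."""
    # newline='' is what the csv documentation recommends for files passed to a writer.
    with open(source_path, 'r', encoding='utf-8') as f, \
            open(output_path, 'a', encoding='utf-8', newline='') as out:
        writer = csv.DictWriter(
            out, fieldnames=FIELDNAMES, delimiter=',', quoting=csv.QUOTE_ALL)
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines between tweets
            tweet = json.loads(line)
            writer.writerow({'id': tweet['id'], 'text': tweet['text']})
```

It would be called as tweets_to_csv('MY_SOURCE_FILE', 'MY_OUTPUT'), with the same placeholder file names as in the question.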

1 Comment

This finally worked for me! Thanks to everyone who contributed.
