3

I am reading a txt file into Python, extract parts of it and then output the results as CSV.

Problem is, I get encoding hiccups I don't know how to resolve. Here is what happens:

  1. The input file is an export from Adobe Acrobat, where I output a "plain text file" after changing the setting to "UTF-8".

  2. I then read this into Python like so

    inputfile=codecs.open(inputfile, "r", "utf-8")
    
  3. I run a regex to extract parts of it, make them into a pandas DataFrame (called 'dataframe' here).

  4. Then it write the dataframe out as a csv file, but no matter how I do it I run into problems. I tried

    outputfile=codecs.open(outputfile, "w", "utf-8")
    dataframe.to_csv(outputfile, encoding="utf-8")
    

but that gives me an

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 23:  ordinal not in range(128)

Question:

This is the first thing I don't get: why is there an 'ascii'-codec involved if I am setting the to_csv encoding="utf-8"? Acc to the docs, this is a

A string representing the encoding to use in the output file, defaults to ‘ascii’ on Python 2...

I can avoid this error by not specifying "utf-8" in codecs.open(). But then, once I import the file into Excel (setting import to "Unicode-Utf-8"), all ' characters show up as __. There are no other encoding errors as far as I can see and if I open the csv file in TextWrangler, everything works out just fine.

  • Where is the problem here -- is it the output from pandas, is it Excel, or what else?

  • Is there a better way of tackling this?

I am working in Python 2 on a Mac. I did not use the Python csv module because it doesn't do UTF without a workaround.

Any help is appreciated!

EDIT: Here is what the inputfile looks like in WordWrangler:

23 It’s lying down there on the floor.

Here is Excel:

It__s lying down there on the floor.

After Fawful's helpful comment, I also tried to open the original text file in Excel. Seems like it already encodes the ' as __ in that one.

0

3 Answers 3

1

Not a clean solution, but for a quick fix just use .replace('\0xe2', ' ').

Sign up to request clarification or add additional context in comments.

1 Comment

Thanks, that would indeed work! I'd rather avoid that though as I have a couple more files to process and I'm afraid the hacks might accumulate...
1

When writing the file from your pandas dataframe, do not use a codecs file object. pandas.to_csv() already encodes your data, and the codecs file object then has to try to decode (as ASCII) in order to be able to re-encode it.

Just use a regular file:

with open(outputfile, "w") as outputfile:
    dataframe.to_csv(outputfile, encoding="utf-8")

You can use the csv module as well, but then you have to encode all your row data to UTF-8 before passing the row to the csv.writer().writerow() function. The csv module examples section includes code that automates this for you.

3 Comments

Thanks! I have a quick question re (outputfile, "w", "utf-8") -- this does not work for me, it asks for an "integer" instead. Can open really take the "utf-8" setting? I couldn't find it in the docs either... If I get rid of the utf-8 part, things work but Excel still messes up in the same way. I'd think it's an issue with Acrobat's export, but then everything works just fine in WordWrangler...
@patrick: sorry, that was my mistake, a copy and paste error.
ah, okay! It's good to know about the pandas-encoding thing though. I'll give the csv module a try but beginning to think it's not really something that I am doing but Excel or Acrobat.
0

So looks like this is an Excel (at least Mac Excel 2011) issue; appears that while it does offer an "Unicode UTF-8" setting, it actually can't deal with unicode characters.

See e.g. this SO post and this post from Microsoft. My files open just fine in Numbers. Didn't even know I had that on this machine...

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.