3

I am trying to write data in a StringIO object using Python and then ultimately load this data into a postgres database using psycopg2's copy_from() function.

First when I did this, the copy_from() was throwing an error: ERROR: invalid byte sequence for encoding "UTF8": 0xc92 So I followed this question.

I figured out that my Postgres database has UTF8 encoding.

The file/StringIO object I am writing my data into shows its encoding as the following: setgid Non-ISO extended-ASCII English text, with very long lines, with CRLF line terminators

I tried to encode every string that I am writing to the intermediate file/StringIO object into UTF8 format. To do this used .encode(encoding='UTF-8',errors='strict')) for every string.

This is the error I got now: UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 47: ordinal not in range(128)

What does it mean? How do I fix it?

EDIT: I am using Python 2.7 Some pieces of my code:

I read from a MySQL database that has data encoded in UTF-8 as per MySQL Workbench. This is a few lines code for writing my data (that's obtained from MySQL db) to StringIO object:

# Populate the table_data variable with rows delimited by \n and columns delimited by \t
row_num=0
for row in cursor.fetchall() :

    # Separate rows in a table by new line delimiter
    if(row_num!=0):
        table_data.write("\n")

    col_num=0
    for cell in row:    
        # Separate cells in a row by tab delimiter
        if(col_num!=0):
            table_data.write("\t") 

        table_data.write(cell.encode(encoding='UTF-8',errors='strict'))
        col_num = col_num+1

    row_num = row_num+1   

This is the code that writes to Postgres database from my StringIO object table_data:

cursor = db_connection.cursor()
cursor.copy_from(table_data, <postgres_table_name>)
7
  • Show your code please Commented Oct 28, 2014 at 23:17
  • Which MySQL wrapper are you using? Commented Oct 28, 2014 at 23:59
  • Also, instead of showing "some pieces of my code", create a self-contained minimal, complete, verifiable example and post that here. Commented Oct 29, 2014 at 0:02
  • I am using MySQLdb as the python package to retrieve data from MySQL Commented Oct 29, 2014 at 0:28
  • Meanwhile, after your updates and responses, there's still no MCVE, and other critical information is only visible in comments, which means anyone searching because they want to help people like you or because they have a similar problem won't see it. Commented Oct 29, 2014 at 22:34

1 Answer 1

7

The problem is that you're calling encode on a str object.

A str is a byte string, usually representing text encoded in some way like UTF-8. When you call encode on that, it first has to be decoded back to text, so the text can be re-encoded. By default, Python does that by calling s.decode(sys.getgetdefaultencoding()), and getdefaultencoding() usually returns 'ascii'.

So, you're talking UTF-8 encoded text, decoding it as if it were ASCII, then re-encoding it in UTF-8.

The general solution is to explicitly call decode with the right encoding, instead of letting Python use the default, and then encode the result.

But when the right encoding is already the one you want, the easier solution is to just skip the .decode('utf-8').encode('utf-8') and just use the UTF-8 str as the UTF-8 str that it already is.

Or, alternatively, if your MySQL wrapper has a feature to let you specify an encoding and get back unicode values for CHAR/VARCHAR/TEXT columns instead of str values (e.g., in MySQLdb, you pass use_unicode=True to the connect call, or charset='UTF-8' if your database is too old to auto-detect it), just do that. Then you'll have unicode objects, and you can call .encode('utf-8') on them.

In general, the best way to deal with Unicode problems is the last one—decode everything as early as possible, do all the processing in Unicode, and then encode as late as possible. But either way, you have to be consistent. Don't call str on something that might be a unicode; don't concatenate a str literal to a unicode or pass one to its replace method; etc. Any time you mix and match, Python is going to implicitly convert for you, using your default encoding, which is almost never what you want.

As a side note, this is one of the many things that Python 3.x's Unicode changes help with. First, str is now Unicode text, not encoded bytes. More importantly, if you have encoded bytes, e.g., in a bytes object, calling encode will give you an AttributeError instead of trying to silently decode so it can re-encode. And, similarly, trying to mix and match Unicode and bytes will give you an obvious TypeError, instead of an implicit conversion that succeeds in some cases and gives a cryptic message about an encode or decode you didn't ask for in others.

Sign up to request clarification or add additional context in comments.

7 Comments

I tried passing in charset='UTF-8' in my MYSQL DB connection with use_unicode=True. Also realized that some of the tables in source MySQL database are latin1_swedish_ci while some are utf_8. I get an error like this now: db_connection = MySQLdb.connect(host=host,user=user,passwd=passwd,db=db, charset="utf8", use_unicode=True) cell = str(cell).replace("\r", " ").replace("\n", " ").replace("\t", '').replace("\"", "") UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 47: ordinal not in range(128)
@user3422637: OK, if you want to use unicode objects instead of str—which is a good idea—then you can't go calling str on them, because that will immediately try to encode them to your default charset. And you shouldn't mix and match unicode and str objects like you're doing, because that also has to implicitly encode or decode one or the other; use replace(u"\r", u" ") and the like. (Actually, "\r" is bad in another way—if you have backslashes, either escape them, or use a raw string literal.)
@user3422637: More generally, if you're trying to bash on things until they work without actually understanding it, you're just going to get more and more lost. If you haven't read the Unicode HOWTO, read that.
@user3422637: Finally, depending on your version of MySQL and MySQLdb, you may still have problems with those latin1_swedish_ci tables (with newer versions, the problems may go away if you leave off the charset= and let it auto-detect, but they may not). But don't worry about that unless/until you get there; first get your code working with the UTF-8 tables, then test on the Swedish ones.
Thanks a lot for your continuous help. So, I figured out that ultimately, I need to write the data in str format because I am writing it to StringIO object (which then copies data to Postgres using copy_from). However, I had to deal with all those unicode characters like u'u\2018', u'\xc9' etc that couldn't be handled by the str() function. Hence I imported the unicode python package and used unicode() function to convert all these unicode characters to the closes ASCII characters. Then I did an str() on that. This solved my problems. Thanks.
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.