
Error:

pymysql.err.InternalError: (1366, "Incorrect string value: '\\xEF\\xBF\\xBD 20...' for column 'history' at row 1")

I've received a few variations of this as I've tried to tweak my dictionary, always in the history column; the only variation is the characters it tells me are the issue.

I can't post the dictionary because it's got sensitive information, but here is the gist:

  • I started with 200 addresses (including state, zip, etc.) that needed to be validated, normalized and standardized for DB insertion.
  • I spent a lot of time on Google Maps validating and standardizing.
  • I decided to get fancy and put in all the crazy accented letters these world addresses use (often copied from Google, because I don't know how to type an A with an o over it, lol), Singapore to Brazil, everywhere.
  • I ended up with 120 unique addresses in my dictionary after processing.
  • Everything works 100% perfectly when INSERTING the data into SQLite and OUTPUTTING to a CSV. The issue is exclusively with MySQL and some sneaky un-viewable characters.

Note: I used this to remove the accents, after 7 hours of copy/pasting to Notepad, re-encoding with Notepad++ and just trying to process the data in a way that made it all the correct encoding. I think I lost the version with the accents and only have this tool's output now.
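(I no longer have that exact tool, but as far as I can tell the effect was the usual unicodedata-style accent stripping, roughly like this:)

import unicodedata

def strip_accents(text):
    # Decompose accented letters into base letter + combining mark (NFKD),
    # then drop the combining marks.
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents('São Paulo'))   # -> Sao Paulo
# Note: a few letters (e.g. 'Ł') have no decomposition and pass through unchanged.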

I do not see "\xEF\xBF\xBD 20..." in my dictionary; I only see text. Currently I don't even see the "20"... those two characters helped me find the previous issues.

Code I can show:

def insert_tables(cursor, assets_final, ips_final):
    #Insert Asset data into asset table
    field_names_dict = get_asset_field_names(assets_final)
    sql_field_names = ",".join(field_names_dict.keys())
    for key, row in assets_final.items():
        insert_sql = 'INSERT INTO asset(' + sql_field_names + ') VALUES ("' + '","'.join(field_value.replace('"', "'") for field_value in list(row.values())) + '")'
        print(insert_sql)
        cursor.execute(insert_sql)

    #Insert IP data into IP table
    field_names_dict = get_ip_field_names(ips_final)
    sql_field_names = ",".join(field_names_dict.keys())
    for hostname_key, ip_dict in ips_final.items():
        for ip_key, ip_row in ip_dict.items():
            insert_sql = 'INSERT INTO ip(' + sql_field_names + ') VALUES ("' + '","'.join(field_value.replace('"', "'") for field_value in list(ip_row.values())) + '")'
            print(insert_sql)
            cursor.execute(insert_sql)

def output_sqlite_db(sqlite_file, assets_final, ips_final):
    conn = sqlite3.connect(sqlite_file)
    cursor = conn.cursor()
    insert_tables(cursor, assets_final, ips_final)
    conn.commit()
    conn.close()

def output_mysql_db(assets_final, ips_final):
    conn = mysql.connect(host=config.mysql_ip, port=config.mysql_port, user=config.mysql_user, password=config.mysql_password, charset="utf8mb4", use_unicode=True)
    cursor = conn.cursor()
    cursor.execute('USE ' + config.mysql_DB)
    insert_tables(cursor, assets_final, ips_final)
    conn.commit()
    conn.close()

EDIT: Could this have something to do with the fact that I'm using Cygwin as my terminal? HA! I added this line and got a different message (now using the accented version again):

cursor.execute('SET NAMES utf8')

Error:

pymysql.err.InternalError: (1366, "Incorrect string value: '\\xC5\\x81A II...' for column 'history' at row 1")

1 Answer


I can shine a bit of light on the messages that you have supplied:

Case 1:

>>> import unicodedata as ucd
>>> s1 = b"\xEF\xBF\xBD"
>>> s1
b'\xef\xbf\xbd'
>>> u1 = s1.decode('utf8')
>>> u1
'\ufffd'
>>> ucd.name(u1)
'REPLACEMENT CHARACTER'
>>>

Looks like you have obtained some bytes encoded in something other than utf8 (e.g. cp1252), then tried bytes.decode(encoding='utf8', errors='strict'), which detected some errors. You then decoded again with errors="replace", which raised no exceptions; however, your data has had the error bytes replaced by the replacement character (U+FFFD). Then you encoded your data using str.encode so that you could write it to a file or database. Each replacement character turns up as the 3 bytes EF BF BD.
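For example (cp1252 and the sample string here are only stand-ins for whatever your original encoding and data actually were):

>>> raw = 'Zürich'.encode('cp1252')                # pretend the source really was cp1252
>>> raw
b'Z\xfcrich'
>>> fixed = raw.decode('utf8', errors='replace')   # 0xFC is not valid utf8, so it becomes U+FFFD
>>> fixed
'Z\ufffdrich'
>>> fixed.encode('utf8')                           # U+FFFD is written out as EF BF BD
b'Z\xef\xbf\xbdrich'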

... more to come

Case 2:

>>> s2 = b"\xC5\x81A II"
>>> s2
b'\xc5\x81A II'
>>> u2 = s2.decode('utf8')
>>> u2
'\u0141A II'
>>> ucd.name(u2[0])
'LATIN CAPITAL LETTER L WITH STROKE'
>>>
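So in this case the bytes are already valid UTF-8 (Ł turns up in Polish names, for example); MySQL is rejecting well-formed UTF-8 rather than mangled data. A likely suspect then is the character set of the history column (or its table default) rather than your Python data. As a sketch only, one way to look: the connection details below are placeholders (yours come from config), and I am guessing the column lives in the asset table.

import pymysql

conn = pymysql.connect(host='localhost', user='me', password='secret',
                       db='mydb', charset='utf8mb4')
with conn.cursor() as cursor:
    # The Collation field in the result reveals the column's character set.
    cursor.execute("SHOW FULL COLUMNS FROM asset LIKE 'history'")
    print(cursor.fetchone())
conn.close()

If that shows a latin1 (or other non-utf8) collation, the column cannot store characters like Ł no matter what the connection charset is.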