3

Does anyone know how to set up a MySQL table column to hold the utf8mb4 charset with sqlalchemy, or specifically with flask_sqlalchemy? The connection setup is described here https://docs.sqlalchemy.org/en/14/dialects/mysql.html but I can't find out how to change the table properties so that the Flask migrations create the table properly.

The problem I'm seeing is that some email content (which I'm trying to save) can contain 4 byte Unicode and fails to insert into the table with this error:

 (pymysql.err.DataError) (1366, "Incorrect string value: '\\xEF\\xBB\\xBF\\x0D\\x0A>...' for column `test`.`email`.`content_text` at row 1")

which corresponds to this string snippet in email

>=20
> =EF=BB=BF
>=20

and can be replicated with a test string like this, which recreates the error if sent to the db:

print(f'bad chr >> {chr(65279)} <<')
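As a side check, the quoted-printable bytes in the snippet and chr(65279) can be compared using only the standard library. This is a minimal sketch; the variable names are illustrative and not part of the test script.

```python
import quopri

# The quoted-printable sequence from the email snippet.
raw = quopri.decodestring(b"=EF=BB=BF")

# Decoding those bytes as UTF-8 yields a single code point.
decoded = raw.decode("utf-8")

print(raw)                     # the raw bytes behind =EF=BB=BF
print(ord(decoded))            # the code point number
print(decoded == chr(65279))   # same character as in the test string
```

If the comparison holds, the test string exercises the same character that the email content carries.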

To enable the table to store this char you can use ALTER TABLE, which is not great because it happens after the table is declared.

ALTER TABLE email
  DEFAULT CHARACTER SET utf8mb4,
  MODIFY content_text MEDIUMTEXT
    CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL

So I would like to have that in the declaration of the table, but I cannot figure out how this is done with the Flask ORM interface.

Here is a test script to recreate the problem (it assumes a test db exists and the test user has grant all on it to create the table):

from flask import Flask
from flask_sqlalchemy import SQLAlchemy
from flask_migrate import Migrate
from sqlalchemy.dialects.mysql import MEDIUMTEXT

app = Flask(__name__)

app.config['SQLALCHEMY_DATABASE_URI'] = 'mysql+pymysql://test:test@localhost/test?charset=utf8mb4'

db = SQLAlchemy(app)
Migrate(app, db)

class Email(db.Model):
    __tablename__ = 'email'
    id = db.Column(db.Integer, primary_key=True)
    content_text = db.Column(MEDIUMTEXT)


@app.cli.command("test_schema")
def test_schema():
    # Print each column of the DESCRIBE output as "field : value".
    for i, row in enumerate(db.session.execute('desc email'), 1):
        for field, col in zip(row._fields, row):
            print(f'row {i} {field:<10s} : {str(col):<20s}')
        print()


@app.cli.command("test_problem")
def test_problem():
    # chr(65279) is the character that triggers error 1366.
    new_mail = Email(content_text=f'bad chr >> {chr(65279)} <<')
    db.session.add(new_mail)
    db.session.commit()


@app.cli.command("drop")
def drop():
    db.session.execute('drop table email')

I call it several times: first to create the schema, then to test the problem with the command line functions I added.

# creates the table provided test db is created and user has grant all set to that test db
FLASK_APP=test_schema.py flask db init
FLASK_APP=test_schema.py flask db migrate
FLASK_APP=test_schema.py flask db upgrade

# test the problem
FLASK_APP=test_schema.py flask test_problem

So I would like to be able to define content_text = db.Column(MEDIUMTEXT) with the right charset, without doing an ALTER TABLE afterwards.

These are the database objects prior to any ALTER TABLE:

> SHOW CREATE DATABASE test;
CREATE DATABASE `test` /*!40100 DEFAULT CHARACTER SET latin1 */

> SHOW CREATE TABLE email;

CREATE TABLE `email` (
   `id` int(11) NOT NULL AUTO_INCREMENT,
   `content_text` mediumtext DEFAULT NULL,
   PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
  • So if I understand correctly, you want to know how you can have the column created with the correct charset automatically, without having to do an ALTER TABLE after the table is created? Would creating the database with charset utf8mb4 be an acceptable solution? Commented Sep 16, 2021 at 15:10
  • You mean so that all tables default to utf8mb4? Altering the database would push app-specific responsibility to the DB admin, and you may not want this on all tables for other reasons. So personally I think the solution should be in the declaration of the table(s). Commented Sep 16, 2021 at 16:02
  • I searched through the sqlalchemy handling of dialects and found that this works: content_text = db.Column(MEDIUMTEXT(unicode=True)). I'm not sure if it has any side effects. Commented Sep 16, 2021 at 16:09

2 Answers

2

Ah, I was confused by the doc https://docs.sqlalchemy.org/en/14/dialects/mysql.html#mysql-data-types

These are the classes that implement the types. The docstring of MEDIUMTEXT shows the arguments it accepts. Here is the rendered docstring: https://docs.sqlalchemy.org/en/14/dialects/mysql.html#sqlalchemy.dialects.mysql.MEDIUMTEXT

So I first tried content_text = db.Column(MEDIUMTEXT(unicode=True)), which sets the column to ucs2 (16 bit). With help from the comments we got this right: content_text = db.Column(MEDIUMTEXT(charset='utf8mb4')) sets the column to utf8mb4, which stores up to 4 bytes per character.

Just to be sure, I verified with an md5 checksum that the content with the 16 bit char that went into the table came out the same.
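To see what DDL a declaration like this produces without touching a database, the table can be compiled against the MySQL dialect. This is a minimal sketch using plain SQLAlchemy Core rather than the db.Model class from the question; the collation argument and the mysql_charset table option are added here as assumptions, mirroring the ALTER TABLE statement in the question.

```python
from sqlalchemy import Column, Integer, MetaData, Table
from sqlalchemy.dialects import mysql
from sqlalchemy.dialects.mysql import MEDIUMTEXT
from sqlalchemy.schema import CreateTable

metadata = MetaData()

# Same columns as the Email model, declared as a Core Table.
email = Table(
    "email",
    metadata,
    Column("id", Integer, primary_key=True),
    Column(
        "content_text",
        # Column-level charset and collation (collation is an assumption).
        MEDIUMTEXT(charset="utf8mb4", collation="utf8mb4_unicode_ci"),
    ),
    # Table-level default charset (an assumption, not part of the answer).
    mysql_charset="utf8mb4",
)

# Render the CREATE TABLE statement for MySQL; no connection is needed.
ddl = str(CreateTable(email).compile(dialect=mysql.dialect()))
print(ddl)
```

In a Flask-SQLAlchemy model the same type object goes inside db.Column(...), and table-level options such as mysql_charset would normally be passed through the model's __table_args__ dictionary.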


7 Comments

You could define the column as MEDIUMTEXT(charset='utf8mb4') to cover the entire UTF-8 range.
Setting charset='utf8mb4' actually creates a table with ucs2. I briefly looked through the mysql dialect to see why, but it's not clear how that would end up as 16 bit.
@PeterMoore - charset="utf8mb4" works for me with SQLAlchemy 1.4.23. Are you using an older version?
Ah, I updated from SQLAlchemy==1.4.20 and Flask-Migrate==3.0.1 to SQLAlchemy==1.4.23 and Flask-Migrate==3.1.0 and now it does create a utf8mb4 column. OK, I'll update the answer. Thanks for the feedback, guys.
@PeterMoore - utf8mb4 characters are 1,2,3, or 4 bytes. ucs2 characters are 2 or 4 bytes; they are not the same.
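The difference in byte lengths can be illustrated in Python. This is only a rough sketch: it uses Python's utf-8 and utf-16-be codecs as stand-ins for the MySQL character sets, which is an assumption, and the sample characters are arbitrary.

```python
# Sample characters: ASCII, Latin accented letter, euro sign, emoji.
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]

# Encoded length of each sample in both encodings.
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
utf16_lengths = [len(ch.encode("utf-16-be")) for ch in samples]

for ch, n8, n16 in zip(samples, utf8_lengths, utf16_lengths):
    print(f"U+{ord(ch):04X}: utf-8 {n8} bytes, utf-16 {n16} bytes")
```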
-1

That's a BOM: http://mysql.rjweb.org/doc.php/charcoll#bom_byte_order_mark

Either remove the first 3 bytes of the file, or have the program building that file not include the BOM.

What encoding did you specify in the connection string?

What encoding is the column in question (in MySQL)? It looks like it is latin1, which will not know what to do with the UTF-8 BOM. Do you have any accented letters? If not, switch to utf8mb4.
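A quick way to see the mismatch outside the database is to try both encodings in Python. This is a minimal sketch; Python's latin-1 codec is used only as a stand-in for MySQL's latin1 character set, which is an assumption, and the variable names are illustrative.

```python
import codecs

bom = chr(65279)  # the character from the question's test string

# In UTF-8 this character is exactly the byte order mark sequence.
utf8_bytes = bom.encode("utf-8")
is_utf8_bom = utf8_bytes == codecs.BOM_UTF8

# A single-byte Latin-1 encoding has no slot for this code point.
try:
    bom.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

print(utf8_bytes, is_utf8_bom, latin1_ok)
```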

Search forward in the above link to find my notes on "SQLAlchemy".

1 Comment

Modifying bytes in the original data buffer seems like a bad idea. This is email data, and you have no control over its content. The question was how to make SQLAlchemy store utf8mb4 in MySQL, specifically so that char code 65279 persists and re-emerges.
