
I have built Turbodbc 5.1.2 from source with simdutf 7.3.0 and Python 3.11. When inserting 150,000 rows of 46 columns into a MySQL 8.0 InnoDB table, Turbodbc takes about 190s, compared to 15s with my existing method. I modeled my attempt on the advice in the advanced usage section, https://turbodbc.readthedocs.io/en/latest/pages/advanced_usage.html#using-numpy-arrays-as-query-parameters:

import turbodbc
import polars as pl
import polars.selectors as cs

# get() is my own config helper
options = turbodbc.make_options(
    use_async_io=True,
)
conn = turbodbc.connect(
    driver="MySQL Driver",
    server=get("host"),
    port=3306,
    uid=get("user"),
    pwd=get("pw"),
    plugin_dir="/usr/local/lib/plugin",
    turbodbc_options=options,
)
cursor = conn.cursor()
cols = str(insert_df.columns).replace("'", "").replace("[", "(").replace("]", ")")
params = "(" + ", ".join(["?" for _ in insert_df.columns]) + ")"
insert_df = insert_df.with_columns(
    cs.float().cast(pl.Float64),
    cs.integer().cast(pl.Int64)
)
values = [x.to_numpy() for x in insert_df.iter_columns()]
on_duplicate_key_update_stmts = (
    str([i + " = VALUES(" + i + ")" for i in insert_df.columns])
    .replace("[", "")
    .replace("]", "")
    .replace("'", "")
)
query = (
    f"INSERT INTO {table_name} {cols} VALUES {params} "
    "ON DUPLICATE KEY UPDATE "
    "updated_at = IF(COALESCE(data_change_hash, 0) <> VALUES(data_change_hash), NOW(), updated_at), "
    "updated_at_micro_ts = IF(COALESCE(data_change_hash, 0) <> VALUES(data_change_hash), NOW(6), updated_at_micro_ts), "
    + on_duplicate_key_update_stmts
    + ", modified_at = NOW();"
)
cursor.executemanycolumns(query, values)
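To make the string building concrete, here is what those expressions produce for a toy three-column frame (the names are placeholders, not my real schema):

```python
# Stand-in for insert_df.columns; the real frame has 46 columns.
columns = ["a", "b", "data_change_hash"]

cols = str(columns).replace("'", "").replace("[", "(").replace("]", ")")
params = "(" + ", ".join(["?" for _ in columns]) + ")"
on_duplicate_key_update_stmts = (
    str([i + " = VALUES(" + i + ")" for i in columns])
    .replace("[", "")
    .replace("]", "")
    .replace("'", "")
)

print(cols)    # (a, b, data_change_hash)
print(params)  # (?, ?, ?)
print(on_duplicate_key_update_stmts)
# a = VALUES(a), b = VALUES(b), data_change_hash = VALUES(data_change_hash)
```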

I also tried a row-wise insert like:

cursor.executemany(
    f"INSERT INTO {table_name} {cols} VALUES {params} "
    "ON DUPLICATE KEY UPDATE "
    "updated_at = IF(COALESCE(data_change_hash, 0) <> VALUES(data_change_hash), NOW(), updated_at), "
    "updated_at_micro_ts = IF(COALESCE(data_change_hash, 0) <> VALUES(data_change_hash), NOW(6), updated_at_micro_ts), "
    + on_duplicate_key_update_stmts
    + ", modified_at = NOW();",
    list(insert_df.to_numpy()),
)

but this also took about 190s.

My existing (fast) method looks just like the cursor.executemany(query, data) call above, except the connection comes from a SQLAlchemy engine and %s denotes the parameter substitutions:

INSERT INTO schema.table(my, forty, six, column, names, here, ...) VALUES ( %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s) ON DUPLICATE KEY UPDATE updated_at = if(coalesce(data_change_hash,0) <> values(data_change_hash), NOW(),updated_at), updated_at_micro_ts = if(coalesce(data_change_hash,0) <> values(data_change_hash),NOW(6),updated_at_micro_ts), col1 = VALUES(col1), col2 = VALUES(col2), ...,  modified_at=NOW();
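Stripped down, that method just assembles the statement and hands executemany a list of row tuples. This sketch only builds the strings (the column names are placeholders), since the actual call needs a live server:

```python
# Placeholder column list; the real table has 46 columns.
columns = ["col1", "col2", "data_change_hash"]

placeholders = "(" + ", ".join(["%s"] * len(columns)) + ")"
query = (
    f"INSERT INTO schema.table ({', '.join(columns)}) VALUES {placeholders} "
    "ON DUPLICATE KEY UPDATE "
    + ", ".join(f"{c} = VALUES({c})" for c in columns)
    + ", modified_at = NOW();"
)

# With a real engine (e.g. mysql+pymysql://...) this is roughly:
# with engine.begin() as conn:
#     conn.exec_driver_sql(query, data)  # data: list of row tuples
```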

Any help would be much appreciated.
