I have built turbodbc 5.1.2 from source with simdutf 7.3.0 on Python 3.11. When inserting 150,000 rows of 46 columns into a MySQL 8.0 InnoDB table, turbodbc takes about 190 s, compared to 15 s with my existing method. I modeled my attempt on the advanced usage section of the documentation, https://turbodbc.readthedocs.io/en/latest/pages/advanced_usage.html#using-numpy-arrays-as-query-parameters:
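import polars as pl
import polars.selectors as cs
import turbodbc

# get() is my own helper that looks up connection settings
options = turbodbc.make_options(
    use_async_io=True,
)
conn = turbodbc.connect(
    driver="MySQL Driver",
    server=get("host"),
    port=3306,
    uid=get("user"),
    pwd=get("pw"),
    plugin_dir="/usr/local/lib/plugin",
    turbodbc_options=options,
)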
cursor = conn.cursor()
cols = str(insert_df.columns).replace("'", "").replace("[", "(").replace("]", ")")
params = "(" + ", ".join(["?" for _ in insert_df.columns]) + ")"
insert_df = insert_df.with_columns(
    cs.float().cast(pl.Float64),
    cs.integer().cast(pl.Int64),
)
values = [x.to_numpy() for x in insert_df.iter_columns()]
on_duplicate_key_update_stmts = (
    str([i + " = VALUES(" + i + ")" for i in insert_df.columns])
    .replace("[", "")
    .replace("]", "")
    .replace("'", "")
)
cursor.executemanycolumns(
    f"INSERT INTO {table_name} {cols} VALUES {params} "
    "ON DUPLICATE KEY UPDATE "
    "updated_at = if(coalesce(data_change_hash,0) <> values(data_change_hash), NOW(),updated_at), "
    "updated_at_micro_ts = if(coalesce(data_change_hash,0) <> values(data_change_hash),NOW(6),updated_at_micro_ts), "
    + on_duplicate_key_update_stmts
    + ", modified_at=NOW();",
    values,
)
I also tried a row-wise insert like:
cursor.executemany(
    f"INSERT INTO {table_name} {cols} VALUES {params} "
    "ON DUPLICATE KEY UPDATE "
    "updated_at = if(coalesce(data_change_hash,0) <> values(data_change_hash), NOW(),updated_at), "
    "updated_at_micro_ts = if(coalesce(data_change_hash,0) <> values(data_change_hash),NOW(6),updated_at_micro_ts), "
    + on_duplicate_key_update_stmts
    + ", modified_at=NOW();",
    list(insert_df.to_numpy()),
)
but this also took about 190s.
My SQLAlchemy attempt looks just like the cursor.executemany(query, data) call above, except that the engine is created with SQLAlchemy and the query uses %s instead of ? as the parameter placeholder:
INSERT INTO schema.table(my, forty, six, column, names, here, ...) VALUES ( %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s) ON DUPLICATE KEY UPDATE updated_at = if(coalesce(data_change_hash,0) <> values(data_change_hash), NOW(),updated_at), updated_at_micro_ts = if(coalesce(data_change_hash,0) <> values(data_change_hash),NOW(6),updated_at_micro_ts), col1 = VALUES(col1), col2 = VALUES(col2), ..., modified_at=NOW();
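To make the two statements easier to compare, here is a minimal sketch of how the same query text could be built for either placeholder style. The helper name build_upsert_sql and its placeholder argument are hypothetical (they are not part of turbodbc or SQLAlchemy); it only assembles the string and does not touch the database, so it says nothing about where the time is going.

```python
def build_upsert_sql(table_name, columns, placeholder="?"):
    """Build the INSERT ... ON DUPLICATE KEY UPDATE text used in this post.

    Hypothetical helper: placeholder is "?" for the turbodbc calls and
    "%s" for the SQLAlchemy-based call.
    """
    # "(col1, col2, ...)" without going through str(list).replace(...)
    col_list = "(" + ", ".join(columns) + ")"
    # One placeholder per column: "(?, ?, ...)" or "(%s, %s, ...)"
    params = "(" + ", ".join([placeholder] * len(columns)) + ")"
    # "col1 = VALUES(col1), col2 = VALUES(col2), ..."
    updates = ", ".join(f"{c} = VALUES({c})" for c in columns)
    # Condition shared by both timestamp columns
    changed = "coalesce(data_change_hash,0) <> values(data_change_hash)"
    return (
        f"INSERT INTO {table_name} {col_list} VALUES {params} "
        "ON DUPLICATE KEY UPDATE "
        f"updated_at = if({changed}, NOW(), updated_at), "
        f"updated_at_micro_ts = if({changed}, NOW(6), updated_at_micro_ts), "
        f"{updates}, modified_at=NOW();"
    )


if __name__ == "__main__":
    # Example with made-up table and column names
    print(build_upsert_sql("schema.table", ["id", "data_change_hash"]))
    print(build_upsert_sql("schema.table", ["id", "data_change_hash"], "%s"))
```

With this, the turbodbc calls would take build_upsert_sql(table_name, insert_df.columns) and the SQLAlchemy call the same with placeholder="%s", so the only difference between the runs is the driver path.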
Any help would be much appreciated.