
I have built Turbodbc 5.1.2 from source with simdutf 7.3.0 and Python 3.11. When inserting 150,000 rows of 46 columns into a MySQL 8.0 InnoDB table, Turbodbc takes about 190s, compared to 15s with my existing method. I modeled my attempt on the advice in the advanced usage section, https://turbodbc.readthedocs.io/en/latest/pages/advanced_usage.html#using-numpy-arrays-as-query-parameters:

import turbodbc
import polars as pl
import polars.selectors as cs

# get() is my own config helper
options = turbodbc.make_options(
    use_async_io=True,
)
conn = turbodbc.connect(
    driver="MySQL Driver",
    server=get("host"),
    port=3306,
    uid=get("user"),
    pwd=get("pw"),
    plugin_dir="/usr/local/lib/plugin",
    turbodbc_options=options,
)
cursor = conn.cursor()
cols = str(insert_df.columns).replace("'", "").replace("[", "(").replace("]", ")")
params = "(" + ", ".join(["?" for _ in insert_df.columns]) + ")"
insert_df = insert_df.with_columns(
    cs.float().cast(pl.Float64),
    cs.integer().cast(pl.Int64)
)
values = [x.to_numpy() for x in insert_df.iter_columns()]
on_duplicate_key_update_stmts = (
    str([i + " = VALUES(" + i + ")" for i in insert_df.columns])
    .replace("[", "")
    .replace("]", "")
    .replace("'", "")
)
query = (
    f"INSERT INTO {table_name} {cols} VALUES {params} "
    "ON DUPLICATE KEY UPDATE "
    "updated_at = IF(COALESCE(data_change_hash, 0) <> VALUES(data_change_hash), NOW(), updated_at), "
    "updated_at_micro_ts = IF(COALESCE(data_change_hash, 0) <> VALUES(data_change_hash), NOW(6), updated_at_micro_ts), "
    + on_duplicate_key_update_stmts
    + ", modified_at = NOW();"
)
cursor.executemanycolumns(query, values)
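To make the string building concrete, here is what those expressions produce for a toy three-column frame (the names are placeholders, not my real schema):

```python
# Stand-in for insert_df.columns; the real frame has 46 columns.
columns = ["a", "b", "data_change_hash"]

cols = str(columns).replace("'", "").replace("[", "(").replace("]", ")")
params = "(" + ", ".join(["?" for _ in columns]) + ")"
on_duplicate_key_update_stmts = (
    str([i + " = VALUES(" + i + ")" for i in columns])
    .replace("[", "")
    .replace("]", "")
    .replace("'", "")
)

print(cols)    # (a, b, data_change_hash)
print(params)  # (?, ?, ?)
print(on_duplicate_key_update_stmts)
# a = VALUES(a), b = VALUES(b), data_change_hash = VALUES(data_change_hash)
```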

I also tried a row-wise insert like:

cursor.executemany(
    f"INSERT INTO {table_name} {cols} VALUES {params} "
    "ON DUPLICATE KEY UPDATE "
    "updated_at = IF(COALESCE(data_change_hash, 0) <> VALUES(data_change_hash), NOW(), updated_at), "
    "updated_at_micro_ts = IF(COALESCE(data_change_hash, 0) <> VALUES(data_change_hash), NOW(6), updated_at_micro_ts), "
    + on_duplicate_key_update_stmts
    + ", modified_at = NOW();",
    list(insert_df.to_numpy()),
)

but this also took about 190s.

My existing (fast) method looks just like the cursor.executemany(query, data) call above, except the connection comes from a SQLAlchemy engine and %s denotes the parameter substitutions:

INSERT INTO schema.table(my, forty, six, column, names, here, ...) VALUES ( %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s) ON DUPLICATE KEY UPDATE updated_at = if(coalesce(data_change_hash,0) <> values(data_change_hash), NOW(),updated_at), updated_at_micro_ts = if(coalesce(data_change_hash,0) <> values(data_change_hash),NOW(6),updated_at_micro_ts), col1 = VALUES(col1), col2 = VALUES(col2), ...,  modified_at=NOW();
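Stripped down, that method just assembles the statement and hands executemany a list of row tuples. This sketch only builds the strings (the column names are placeholders), since the actual call needs a live server:

```python
# Placeholder column list; the real table has 46 columns.
columns = ["col1", "col2", "data_change_hash"]

placeholders = "(" + ", ".join(["%s"] * len(columns)) + ")"
query = (
    f"INSERT INTO schema.table ({', '.join(columns)}) VALUES {placeholders} "
    "ON DUPLICATE KEY UPDATE "
    + ", ".join(f"{c} = VALUES({c})" for c in columns)
    + ", modified_at = NOW();"
)

# With a real engine (e.g. mysql+pymysql://...) this is roughly:
# with engine.begin() as conn:
#     conn.exec_driver_sql(query, data)  # data: list of row tuples
```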

Any help would be much appreciated.
