
I am running into performance issues writing Pandas DataFrames to a SQL database. To be as fast as possible I use memSQL (it's MySQL wire-compatible, so my code doesn't have to change). I benchmarked my instance just now:

    docker run --rm -it --link=memsql:memsql memsql/quickstart simple-benchmark
    Creating database simple_benchmark
    Warming up workload
    Launching 10 workers
    Workload will take approximately 30 seconds.
    Stopping workload
    42985000 rows inserted using 10 threads
    1432833.3 rows per second

That isn't glorious, but then it's just my local laptop. I know I'm also using the root user, but it's a throw-away Docker container.

Here is the code which writes my DataFrame to the DB:

    import MySQLdb
    import mysql.connector
    from sqlalchemy import create_engine
    from pandas.util.testing import test_parallel

    engine = create_engine('mysql+mysqlconnector://root@localhost:3306/netflow_test', echo=False)
    # max_allowed_packet = 1000M in mysql.conf
    # no effect

    # @test_parallel(num_threads=8)
    def commit_flows(netflow_df2):
        %time netflow_df2.to_sql(name='netflow_ids', con=engine, if_exists='append', index=False, chunksize=500)
    commit_flows(netflow_df2)

Below is the %time measurement of the function.

Multi-threading does not make this faster; it stays between 7,000 and 8,000 rows per second.

    CPU times: user 2min 6s, sys: 1.69 s, total: 2min 8s
    Wall time: 2min 18s

(Screenshot: the memSQL dashboard showing the insert speed.)

I also increased max_allowed_packet so that larger chunks can be committed in bulk, and used a larger chunksize. Still not faster.
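(For readers hitting the same wall: newer pandas versions add a method='multi' argument to to_sql() that packs many rows into a single INSERT statement per chunk. A rough sketch, assuming pandas >= 0.24 and the SQLAlchemy engine from above:)

    # Sketch only: `method='multi'` requires pandas >= 0.24.
    netflow_df2.to_sql(
        name='netflow_ids',
        con=engine,
        if_exists='append',
        index=False,
        chunksize=10000,   # larger chunks amortize per-statement overhead
        method='multi',    # one multi-row INSERT ... VALUES (...), (...) per chunk
    )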

Here is the shape of the DataFrame:

    netflow_df2.shape
    (1015391, 20)

Does anyone know how I can make this faster?


1 Answer


In case someone runs into a similar situation:

I removed SQLAlchemy and used the (deprecated) MySQL flavor for Pandas' to_sql() function. The speedup is more than 120%. I don't recommend this, but it works for me at the moment.

    import MySQLdb
    import mysql.connector
    from sqlalchemy import create_engine
    from pandas.util.testing import test_parallel

    engine = MySQLdb.connect("127.0.0.1", "root", "", "netflow_test")
    # engine = create_engine('mysql+mysqlconnector://root@localhost:3306/netflow_test', echo=False)

    # @test_parallel(num_threads=8)
    def commit_flows(netflow_df2):
        %time netflow_df2.to_sql(name='netflow_ids', flavor='mysql', con=engine, if_exists='append', index=False, chunksize=50000)
    commit_flows(netflow_df2)

If I find out how to convince memSQL to accept a large query (similar to MySQL's max_allowed_packet = 1000M in mysql.conf), this should get even faster. I should be able to hit more than 50,000 rows per second here.

    CPU times: user 28.7 s, sys: 797 ms, total: 29.5 s
    Wall time: 38.2 s

126s before. 38.2s now.
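For the packet limit, the direction I would try is raising it on the server. A minimal sketch, assuming memSQL honors the MySQL-style variable name and lets a privileged user change it at runtime (otherwise it goes into memsql.cnf; check the memSQL docs):

    # Sketch only: assumes memSQL accepts the MySQL-style variable and allows
    # setting it at runtime; verify against the memSQL docs first.
    cur = engine.cursor()   # `engine` is the MySQLdb connection from above
    cur.execute("SET GLOBAL max_allowed_packet = 1073741824")  # 1 GB
    cur.close()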


3 Comments

MemSQL also has the max_allowed_packet setting - docs.memsql.com/docs/memsqlcnf.
I have been testing this, but it seems I also need to adapt some other variables. If I choose a chunk size of 60,000 I get SQL syntax errors back and the query is cut off.
Even to_sql is quite slow. I have moved to DuckDB (github.com/cwida/duckdb). Reading and writing DataFrames with it is blazingly fast. It is an embedded database, though. Check it out to see if it meets your needs.
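(A rough sketch of the DuckDB route from the last comment, assuming the duckdb Python package; DuckDB can query a pandas DataFrame that is in scope by its variable name:)

    import duckdb
    import pandas as pd

    # Stand-in DataFrame; in the question this would be netflow_df2.
    netflow_df2 = pd.DataFrame({"src_ip": ["10.0.0.1"], "bytes": [123]})

    con = duckdb.connect("netflow.duckdb")  # embedded, file-backed database
    # DuckDB's replacement scan picks up the local DataFrame by name.
    con.execute("CREATE TABLE netflow_ids AS SELECT * FROM netflow_df2")
    con.close()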
