
I am running into performance issues writing Pandas DataFrames to a SQL database. To be as fast as possible I use memSQL (it's MySQL wire-compatible, so my code doesn't have to change). I benchmarked my instance just now:

    docker run --rm -it --link=memsql:memsql memsql/quickstart simple-benchmark
    Creating database simple_benchmark
    Warming up workload
    Launching 10 workers
    Workload will take approximately 30 seconds.
    Stopping workload
    42985000 rows inserted using 10 threads
    1432833.3 rows per second

That isn't glorious, but then it's just my local laptop. I know I'm also using the root user, but it's a throw-away Docker container.

Here is the code which writes my DataFrame to the DB:

    import MySQLdb
    import mysql.connector
    from sqlalchemy import create_engine
    from pandas.util.testing import test_parallel

    engine = create_engine('mysql+mysqlconnector://root@localhost:3306/netflow_test', echo=False)
    # max_allowed_packet = 1000M in mysql.conf
    # no effect

    # @test_parallel(num_threads=8)
    def commit_flows(netflow_df2):
        %time netflow_df2.to_sql(name='netflow_ids', con=engine, if_exists='append', index=False, chunksize=500)
    commit_flows(netflow_df2)

Below is the %time measurement of the function.

Multi-threading does not make this faster; it stays between 7,000 and 8,000 rows per second.

    CPU times: user 2min 6s, sys: 1.69 s, total: 2min 8s
    Wall time: 2min 18s

(Screenshot: the memSQL dashboard showing the insert speed.)

I also increased max_allowed_packet so that larger chunks can be committed in bulk, and used a larger chunksize. Still not faster.
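(For readers hitting the same wall: newer pandas versions add a method='multi' argument to to_sql() that packs many rows into a single INSERT statement per chunk. A rough sketch, assuming pandas >= 0.24 and the SQLAlchemy engine from above:)

    # Sketch only: `method='multi'` requires pandas >= 0.24.
    netflow_df2.to_sql(
        name='netflow_ids',
        con=engine,
        if_exists='append',
        index=False,
        chunksize=10000,   # larger chunks amortize per-statement overhead
        method='multi',    # one multi-row INSERT ... VALUES (...), (...) per chunk
    )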

Here is the shape of the DataFrame:

    netflow_df2.shape
    (1015391, 20)

Does anyone know how I can make this faster?


1 Answer


In case someone runs into a similar situation:

I removed SQLAlchemy and used the (deprecated) MySQL flavor for Pandas' to_sql() function. The speedup is more than 120%. I don't recommend this, but it works for me at the moment.

    import MySQLdb
    import mysql.connector
    from sqlalchemy import create_engine
    from pandas.util.testing import test_parallel

    engine = MySQLdb.connect("127.0.0.1", "root", "", "netflow_test")
    # engine = create_engine('mysql+mysqlconnector://root@localhost:3306/netflow_test', echo=False)

    # @test_parallel(num_threads=8)
    def commit_flows(netflow_df2):
        %time netflow_df2.to_sql(name='netflow_ids', flavor='mysql', con=engine, if_exists='append', index=False, chunksize=50000)
    commit_flows(netflow_df2)

If I find out how to convince memSQL to accept a large query (similar to MySQL's max_allowed_packet = 1000M in mysql.conf), this should get even faster. I should be able to hit more than 50,000 rows per second here.

    CPU times: user 28.7 s, sys: 797 ms, total: 29.5 s
    Wall time: 38.2 s

126s before. 38.2s now.
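For the packet limit, the direction I would try is raising it on the server. A minimal sketch, assuming memSQL honors the MySQL-style variable name and lets a privileged user change it at runtime (otherwise it goes into memsql.cnf; check the memSQL docs):

    # Sketch only: assumes memSQL accepts the MySQL-style variable and allows
    # setting it at runtime; verify against the memSQL docs first.
    cur = engine.cursor()   # `engine` is the MySQLdb connection from above
    cur.execute("SET GLOBAL max_allowed_packet = 1073741824")  # 1 GB
    cur.close()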


3 Comments

MemSQL also has the max_allowed_packet setting - docs.memsql.com/docs/memsqlcnf.
I have been testing this, but it seems I also need to adapt some other variables. If I choose a chunk size of 60,000 I get SQL syntax errors back and the query is cut off.
Even to_sql is quite slow. I have moved to DuckDB (github.com/cwida/duckdb). Reading and writing DataFrames with it is blazingly fast. It is an embedded database, though. Check it out to see if it meets your needs.
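(A rough sketch of the DuckDB route from the last comment, assuming the duckdb Python package; DuckDB can query a pandas DataFrame that is in scope by its variable name:)

    import duckdb
    import pandas as pd

    # Stand-in DataFrame; in the question this would be netflow_df2.
    netflow_df2 = pd.DataFrame({"src_ip": ["10.0.0.1"], "bytes": [123]})

    con = duckdb.connect("netflow.duckdb")  # embedded, file-backed database
    # DuckDB's replacement scan picks up the local DataFrame by name.
    con.execute("CREATE TABLE netflow_ids AS SELECT * FROM netflow_df2")
    con.close()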
