9

I have a 1,000,000 x 50 Pandas DataFrame that I am currently writing to a SQL table using:

df.to_sql('my_table', con, index=False)

It takes an incredibly long time. I've seen various explanations about how to speed up this process online, but none of them seem to work for MSSQL.

  1. If I try the method in:

    Bulk Insert A Pandas DataFrame Using SQLAlchemy

    then I get a "no attribute copy_from" error.

  2. If I try the multithreading method from:

    http://techyoubaji.blogspot.com/2015/10/speed-up-pandas-tosql-with.html

    then I get a "QueuePool limit of size 5 overflow 10 reached, connection timed out" error.

Is there any easy way to speed up to_sql() to an MSSQL table? Either via BULK COPY or some other method, but entirely from within Python code?

2
  • Are you writing into an existing table, or will it be created? Commented Jan 9, 2017 at 19:03
  • I would use this or a similar approach - BCP should be very fast Commented Jan 9, 2017 at 19:06

3 Answers

8

In pandas 0.24 you can use method='multi' with a chunk size of 1000, which is the SQL Server limit on rows per INSERT statement:

df.to_sql('my_table', con, index=False, chunksize=1000, method='multi')

https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-sql-method

New in version 0.24.0.

The parameter method controls the SQL insertion clause used. Possible values are:

  • None: Uses standard SQL INSERT clause (one per row).
  • 'multi': Pass multiple values in a single INSERT clause. It uses a special SQL syntax not supported by all backends. This usually provides better performance for analytic databases like Presto and Redshift, but has worse performance for traditional SQL backends if the table contains many columns. For more information check the SQLAlchemy documentation.
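Putting it together with an engine, here is a minimal sketch; the pyodbc connection string and table name are placeholders, not taken from the question:

from sqlalchemy import create_engine

# placeholder connection string -- substitute your own server, database and ODBC driver
engine = create_engine('mssql+pyodbc://user:password@server/mydb?driver=ODBC+Driver+17+for+SQL+Server')

# SQL Server caps a multi-row VALUES clause at 1000 rows; with many columns the
# 2100-parameter limit may also apply, in which case lower chunksize further
df.to_sql('my_table', engine, index=False, chunksize=1000, method='multi')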

2 Comments

I keep getting PrestoUserError when I try to_sql with a Presto connection. Could you please share an example of pandas writing to Presto? I obtained the connection to Presto using the prestodb.dbapi.connect API
It tries this query first: SELECT name FROM sqlite_master WHERE type='table' AND name=?; and complains about the ; at the end :-/
6

I've used ctds to do a bulk insert and it's a lot faster with SQL Server. In the example below, df is the pandas DataFrame. The column order in the DataFrame must match the schema of the target table in mydb.

import ctds

# connect with ctds (FreeTDS-based) and stream the rows via a TDS bulk insert
conn = ctds.connect('server', user='user', password='password', database='mydb')
conn.bulk_insert('table', df.to_records(index=False).tolist())
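
If the DataFrame is too large to materialize as one list of records, the rows can also be streamed lazily. This is a sketch under the assumption that bulk_insert accepts any iterable of row tuples (it is documented to take an iterable of sequences), reusing the same conn as above:

# yield plain tuples lazily instead of building one big list in memory
rows = df.itertuples(index=False, name=None)
conn.bulk_insert('table', rows)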

1 Comment

Using this every day and it is fast, very fast!
3

I had the same issue, so I used SQLAlchemy with fast_executemany.

from sqlalchemy import event, create_engine

engine = create_engine('connection_string_with_database')

@event.listens_for(engine, 'before_cursor_execute')
def plugin_bef_cursor_execute(conn, cursor, statement, params, context, executemany):
    if executemany:
        cursor.fast_executemany = True  # switch pyodbc from executemany to fast_executemany

Always make sure the listener function is defined after the engine is created and before any cursor execution.

conn = engine.connect()
df.to_sql('table', con=conn, if_exists='append', index=False)  # see the pandas to_sql documentation for details
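
On SQLAlchemy 1.3+ the same behaviour can also be enabled without the event listener, by passing fast_executemany directly to create_engine for the mssql+pyodbc dialect. A minimal sketch; the connection string is a placeholder:

from sqlalchemy import create_engine

# fast_executemany=True enables the same pyodbc fast path at the dialect level (SQLAlchemy 1.3+)
engine = create_engine(
    'mssql+pyodbc://user:password@server/mydb?driver=ODBC+Driver+17+for+SQL+Server',
    fast_executemany=True,
)
df.to_sql('table', con=engine, if_exists='append', index=False)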

1 Comment

Adding the decorator will cause this issue: ('HY090', '[HY090] [Microsoft][ODBC Driver Manager] Invalid string or buffer length (0) (SQLBindParameter)')
