
I get a MemoryError when trying to write a pandas DataFrame, read from a CSV file, into an SQLite database. The CSV file is 430 MB and has 6 000 000 lines.

For smaller files it works fine. However, I would like to know how to avoid the MemoryError for bigger files.

Reading in chunks works fine and correctly prints all 6 000 000 lines in chunks of 20 000. However, when the script tries to transfer all 6 000 000 lines into the SQLite table at once, it fails with the following error:

Traceback (most recent call last):
  File "C:/SQLITELOAD1.py", line 42, in <module>
    .rename(columns=dict(zip(big_data.columns, listofcol)))
  File "C:\Python37\site-packages\pandas\util\_decorators.py", line 197, in wrapper
    return func(*args, **kwargs)
  File "C:\Python37\site-packages\pandas\core\frame.py", line 4025, in rename
    return super(DataFrame, self).rename(**kwargs)
  File "C:\Python37\site-packages\pandas\core\generic.py", line 1091, in rename
    level=level)
  File "C:\Python37\site-packages\pandas\core\internals\managers.py", line 170, in rename_axis
    obj = self.copy(deep=copy)
  File "C:\Python37\site-packages\pandas\core\internals\managers.py", line 734, in copy
    do_integrity_check=False)
  File "C:\Python37\site-packages\pandas\core\internals\managers.py", line 395, in apply
    applied = getattr(b, f)(**kwargs)
  File "C:\Python37\site-packages\pandas\core\internals\blocks.py", line 753, in copy
    values = values.copy()
MemoryError

The code:

import csv, sqlite3, time, os, ctypes
from sqlalchemy import create_engine
import pandas as pd

datab = 'NORTHWIND'
con = sqlite3.connect(datab + '.db')
con.text_factory = str
cur = con.cursor()
koko = 'C:\\NORTHWIND'
print(koko)
directory = koko
print(directory)

for file in os.listdir(directory):
    for searchfile, listofcol, table in zip(
            ['1251_FINAL.csv'],
            [['SYS', 'MANDT', 'AGR_NAME', 'OBJECT', 'AUTH', 'FIELD', 'LOW', 'HIGH', 'DELETED']],
            ['AGR_1251_ALL2']):

        if file.endswith(searchfile):
            fileinsert = directory + '\\' + searchfile
            my_list = []
            for chunk in pd.read_csv(fileinsert, sep=",", error_bad_lines=False,
                                     encoding='latin-1', low_memory=False, chunksize=20000):
                my_list.append(chunk)
                print(chunk)
            big_data = pd.concat(my_list, axis=0)
            print(big_data)
            del my_list
            (big_data
             .rename(columns=dict(zip(big_data.columns, listofcol)))
             .to_sql(name=table,
                     con=con,
                     if_exists="replace",
                     chunksize=20000,
                     index=False,
                     index_label=None))

2 Answers


When you insert records into an SQL database, two sizes have to be considered:

  • the size of an individual INSERT
  • the total size between consecutive COMMITs

Until the batch of requests is committed, the database has to be able to roll back everything, so nothing is definitively written.

From the description of the symptoms, I would guess that to_sql uses the chunksize parameter as the size of an INSERT but issues one single COMMIT when the whole operation is finished.

There is no direct fix, but the common approach when loading a large record set into a database is to issue intermediate COMMITs to allow the database to do some cleanup. Put differently, you should call to_sql once per chunk, as sketched below. This forces you to explicitly drop the table before the loop, use if_exists="append", and be ready to clean everything up if things go wrong, but I know of no better way...
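Here is a minimal sketch of that approach, reusing the file, table, and column names from the question (the CSV-reading options are simplified, so treat it as an outline rather than a drop-in replacement):

# Sketch: write each CSV chunk straight to SQLite instead of
# concatenating everything in memory first.
import sqlite3
import pandas as pd

con = sqlite3.connect('NORTHWIND.db')

fileinsert = 'C:\\NORTHWIND\\1251_FINAL.csv'
table = 'AGR_1251_ALL2'
listofcol = ['SYS', 'MANDT', 'AGR_NAME', 'OBJECT', 'AUTH', 'FIELD',
             'LOW', 'HIGH', 'DELETED']

# Start from a clean table, then append chunk by chunk.
cur = con.cursor()
cur.execute('DROP TABLE IF EXISTS ' + table)
con.commit()

for chunk in pd.read_csv(fileinsert, sep=",", encoding='latin-1',
                         chunksize=20000):
    chunk = chunk.rename(columns=dict(zip(chunk.columns, listofcol)))
    chunk.to_sql(name=table, con=con, if_exists="append", index=False)
    con.commit()  # commit after every chunk so nothing huge stays pending

con.close()

The key point is that each chunk is written and committed on its own, so at no point does the whole 6 000 000-line DataFrame have to live in memory.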



I guess your implied question is "how do I fix this?" Consider rephrasing it to make that explicit.

Anyway, I think it is simply failing because you are hitting a memory limit, nothing more.

Consider using:

if_exists="append"
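For illustration, the parameter goes into the to_sql call from the question; this is just a sketch reusing the question's variable names:

big_data.to_sql(name=table,
                con=con,
                if_exists="append",  # append to the table instead of replacing it
                chunksize=20000,
                index=False)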

1 Comment

if_exists="append" unfortunately does not solve anything. I guess the question now is how to write it in chunks, the same way it is read in chunks. Any ideas?
