
I get a MemoryError when trying to write a pandas DataFrame, read from a CSV file, into an SQLite database. The CSV file is 430 MB and has 6 000 000 lines.

For smaller files it works fine. However, I would like to know how to avoid the MemoryError for bigger files.

Reading in chunks works fine and correctly prints all 6 000 000 lines in chunks of 20 000. However, when the script tries to transfer all 6 000 000 lines into the SQLite table at once, it fails with the following error:

Traceback (most recent call last):
  File "C:/SQLITELOAD1.py", line 42, in <module>
    .rename(columns=dict(zip(big_data.columns, listofcol)))
  File "C:\Python37\site-packages\pandas\util\_decorators.py", line 197, in wrapper
    return func(*args, **kwargs)
  File "C:\Python37\site-packages\pandas\core\frame.py", line 4025, in rename
    return super(DataFrame, self).rename(**kwargs)
  File "C:\Python37\site-packages\pandas\core\generic.py", line 1091, in rename
    level=level)
  File "C:\Python37\site-packages\pandas\core\internals\managers.py", line 170, in rename_axis
    obj = self.copy(deep=copy)
  File "C:\Python37\site-packages\pandas\core\internals\managers.py", line 734, in copy
    do_integrity_check=False)
  File "C:\Python37\site-packages\pandas\core\internals\managers.py", line 395, in apply
    applied = getattr(b, f)(**kwargs)
  File "C:\Python37\site-packages\pandas\core\internals\blocks.py", line 753, in copy
    values = values.copy()
MemoryError

The code:

import csv, sqlite3, time, os, ctypes
from sqlalchemy import create_engine
import pandas as pd

datab = 'NORTHWIND'
con = sqlite3.connect(datab + '.db')
con.text_factory = str
cur = con.cursor()
koko = 'C:\\NORTHWIND'
print(koko)
directory = koko
print(directory)

for file in os.listdir(directory):
    for searchfile, listofcol, table in zip(
            ['1251_FINAL.csv'],
            [['SYS', 'MANDT', 'AGR_NAME', 'OBJECT', 'AUTH', 'FIELD', 'LOW', 'HIGH', 'DELETED']],
            ['AGR_1251_ALL2']):

        if file.endswith(searchfile):
            fileinsert = directory + '\\' + searchfile
            my_list = []
            for chunk in pd.read_csv(fileinsert, sep=",", error_bad_lines=False,
                                     encoding='latin-1', low_memory=False, chunksize=20000):
                my_list.append(chunk)
                print(chunk)
            big_data = pd.concat(my_list, axis=0)
            print(big_data)
            del my_list
            (big_data
             .rename(columns=dict(zip(big_data.columns, listofcol)))
             .to_sql(name=table,
                     con=con,
                     if_exists="replace",
                     chunksize=20000,
                     index=False,
                     index_label=None))

2 Answers


When you insert records into an SQL database, two sizes have to be considered:

  • the size of an individual INSERT
  • the total size between consecutive COMMITs

Until the batch of requests is committed, the database has to be able to roll back everything, so nothing is definitively written.

From the description of the symptoms, I would guess that to_sql uses the chunksize parameter as the size of an INSERT but issues one single COMMIT when the whole operation is finished.

There is no direct fix, but the common approach when loading a large record set into a database is to issue intermediate COMMITs to allow the database to do some cleanup. Put differently, you should call to_sql once per chunk, as sketched below. This forces you to explicitly drop the table before the loop, use if_exists="append", and be ready to clean everything up if things go wrong, but I know of no better way...
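Here is a minimal sketch of that approach, reusing the file, table, and column names from the question (the CSV-reading options are simplified, so treat it as an outline rather than a drop-in replacement):

# Sketch: write each CSV chunk straight to SQLite instead of
# concatenating everything in memory first.
import sqlite3
import pandas as pd

con = sqlite3.connect('NORTHWIND.db')

fileinsert = 'C:\\NORTHWIND\\1251_FINAL.csv'
table = 'AGR_1251_ALL2'
listofcol = ['SYS', 'MANDT', 'AGR_NAME', 'OBJECT', 'AUTH', 'FIELD',
             'LOW', 'HIGH', 'DELETED']

# Start from a clean table, then append chunk by chunk.
cur = con.cursor()
cur.execute('DROP TABLE IF EXISTS ' + table)
con.commit()

for chunk in pd.read_csv(fileinsert, sep=",", encoding='latin-1',
                         chunksize=20000):
    chunk = chunk.rename(columns=dict(zip(chunk.columns, listofcol)))
    chunk.to_sql(name=table, con=con, if_exists="append", index=False)
    con.commit()  # commit after every chunk so nothing huge stays pending

con.close()

The key point is that each chunk is written and committed on its own, so at no point does the whole 6 000 000-line DataFrame have to live in memory.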



I guess your implied question is "how do I fix this?" Consider rephrasing it to make that explicit.

Anyway, I think it is simply failing because you are hitting a memory limit, nothing more.

Consider using:

if_exists="append"
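For illustration, the parameter goes into the to_sql call from the question; this is just a sketch reusing the question's variable names:

big_data.to_sql(name=table,
                con=con,
                if_exists="append",  # append to the table instead of replacing it
                chunksize=20000,
                index=False)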

1 Comment

if_exists="append" unfortunately does not solve anything. I guess the question now is how to write it in chunks, the same way it is read in chunks. Any ideas?
