
I want to migrate data from a large CSV file to an SQLite3 database.

My code (Python 3.5, using pandas):

import sqlite3
import pandas as pd

con = sqlite3.connect(DB_FILENAME)
df = pd.read_csv(MLS_FULLPATH)
df.to_sql(con=con, name="MLS", if_exists="replace", index=False)

Is it possible to print the current status (a progress bar) during the execution of the to_sql method?

I looked at the article about tqdm, but couldn't find how to do this.

3 Answers


Unfortunately, DataFrame.to_sql does not provide a chunk-by-chunk callback, which tqdm would need to update its status. However, you can process the DataFrame chunk by chunk yourself:

import sqlite3
import pandas as pd
from tqdm import tqdm

DB_FILENAME='/tmp/test.sqlite'

def chunker(seq, size):
    # from http://stackoverflow.com/a/434328
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))

def insert_with_progress(df, dbfile):
    con = sqlite3.connect(dbfile)
    chunksize = int(len(df) / 10) # 10%
    with tqdm(total=len(df)) as pbar:
        for i, cdf in enumerate(chunker(df, chunksize)):
            replace = "replace" if i == 0 else "append"
            cdf.to_sql(con=con, name="MLS", if_exists=replace, index=False)
            pbar.update(chunksize)
            
df = pd.DataFrame({'a': range(0,100000)})
insert_with_progress(df, DB_FILENAME)

Note that I'm generating the DataFrame inline here for the sake of a complete, workable example without external dependencies.

The result is quite stunning:

[screenshot: tqdm progress bar advancing during the insert]


2 Comments

My CSV file takes up 1.7 GB on disk, so df = pd.read_csv(csv_filename, ...) is very slow. But I found a solution here: stackoverflow.com/a/28371706/5856795, so your answer and the answer by @sebastian-raschka helped me get this task done.
With range() instead of xrange() this also works in Python 3. Very nicely, I must say!
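For reference, a rough sketch of the chunked pd.read_csv approach the first comment points to, combined with the progress bar from this answer; the file path and chunk size below are placeholders, not values from the question:

import sqlite3
import pandas as pd
from tqdm import tqdm

CSV_FILENAME = '/tmp/big.csv'    # hypothetical input file
DB_FILENAME = '/tmp/test.sqlite'
CHUNKSIZE = 100000               # rows per read, tune to your memory budget

con = sqlite3.connect(DB_FILENAME)
with tqdm(desc="rows written") as pbar:
    # read_csv with chunksize returns an iterator of DataFrames,
    # so the whole CSV never has to sit in memory at once
    for i, chunk in enumerate(pd.read_csv(CSV_FILENAME, chunksize=CHUNKSIZE)):
        chunk.to_sql(con=con, name="MLS",
                     if_exists="replace" if i == 0 else "append",
                     index=False)
        pbar.update(len(chunk))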

I wanted to share a variant of the solution posted by miraculixx, which I had to alter for SQLAlchemy:

# these need to be customized: myDataFrame, myDBEngine, myDBTable
from tqdm import tqdm

df = myDataFrame

def chunker(seq, size):
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))

def insert_with_progress(df):
    con = myDBEngine.connect()
    chunksize = int(len(df) / 10)
    with tqdm(total=len(df)) as pbar:
        for i, cdf in enumerate(chunker(df, chunksize)):
            replace = "replace" if i == 0 else "append"
            cdf.to_sql(name="myDBTable", con=con, if_exists=replace, index=False)
            pbar.update(chunksize)
            tqdm._instances.clear()

insert_with_progress(df)
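For completeness, one way to create the engine this snippet assumes (the connection string is a placeholder; adjust it for your own database):

from sqlalchemy import create_engine

# placeholder connection URL pointing at a local SQLite file
myDBEngine = create_engine("sqlite:///test.sqlite")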

1 Comment

You defined the variable replace but don't use it. Did you mean if_exists=replace?

User miraculixx has a nice example above, thank you for that. But if you want to use it with files of all sizes, you should add something like this:

chunksize = int(len(df) / 10)
if chunksize == 0:
    # the DataFrame has fewer than 10 rows, so insert it in one go
    df.to_sql(con=con, name="MLS", if_exists="replace", index=False)
else:
    with tqdm(total=len(df)) as pbar:
        ...
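Putting the guard together with miraculixx's function, a complete version might look like the following sketch (same table and chunking scheme as above; updating the bar by len(cdf) instead of chunksize is a small tweak so it ends exactly at 100%):

import sqlite3
import pandas as pd
from tqdm import tqdm

def chunker(seq, size):
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))

def insert_with_progress(df, dbfile):
    con = sqlite3.connect(dbfile)
    chunksize = int(len(df) / 10)
    if chunksize == 0:
        # fewer than 10 rows: a single insert, nothing worth tracking
        df.to_sql(con=con, name="MLS", if_exists="replace", index=False)
        return
    with tqdm(total=len(df)) as pbar:
        for i, cdf in enumerate(chunker(df, chunksize)):
            replace = "replace" if i == 0 else "append"
            cdf.to_sql(con=con, name="MLS", if_exists=replace, index=False)
            pbar.update(len(cdf))  # advance by the actual chunk length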

1 Comment

Is there any way you could finish the example you posted above? When I set the integer in the chunksize variable, I only get that many rows into my db, e.g. with chunksize = int(len(df) / 10) only 1/10 of the total records end up in my db.
