
I'm going to need to import 30k rows of data from a CSV file into a Vertica database. The code I've tried is taking more than an hour to do so. I'm wondering if there's a faster way to do it? I've tried to import using csv and also by looping through a dataframe to insert, but it just isn't fast enough. In fact, it's way too slow. Could you please help me?

rownum = df.shape[0]
for x in range(rownum):
    a = df['AccountName'].values[x]
    b = df['ID'].values[x]
    ss = "INSERT INTO Table (AccountName,ID) VALUES (%s,%s)"
    val = (a, b)
    cur.execute(ss, val)

connection.commit()
  • An immediate attempt might be to build a list of row tuples, e.g. val = list(zip(df['AccountName'], df['ID'])), and then use cur.executemany(ss, val) without the for loop (see the sketch below). That should be faster, but I'm not sure if there might be further improvements. Also, 1 space of indentation makes this code difficult to read; are you sure that connection.commit() is definitely not inside the for loop? It only takes 1 space to make that mistake. I suggest you follow PEP 8 and use 4 spaces. Commented Jan 28, 2019 at 16:56
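
A minimal sketch of that executemany approach (not from the original post), assuming the same DataFrame df, cursor cur, and connection connection as in the question; note that executemany expects a sequence of parameter tuples, one per row:

# Build one parameter tuple per row and let the driver batch the inserts.
ss = "INSERT INTO Table (AccountName,ID) VALUES (%s,%s)"
val = list(zip(df['AccountName'], df['ID']))

cur.executemany(ss, val)
connection.commit()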

1 Answer


You want to use Vertica's COPY command.

COPY Table FROM '/path/to/csv/file.csv' DELIMITER ',';

This is much faster than inserting one row at a time.

Since you are using Python, I would recommend the vertica_python module, as it has a very convenient copy method on its cursor object (see the vertica-python GitHub page).

The syntax for using COPY with vertica-python is as follows:

with open('file.csv', 'r') as file:
    csv_file = file.read()
    copy_cmd = "COPY Table FROM STDIN DELIMITER ','"
    cur.copy(copy_cmd, csv_file)
    connection.commit()
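
If the file is large, you can also pass the open file object to copy instead of reading the whole file into a string first; the vertica-python README documents that copy accepts either a string or a file-like object. A minimal sketch under that assumption, reusing the same Table, cur, and connection names as above:

# Stream the file to Vertica rather than loading it all into memory first.
with open('file.csv', 'rb') as file:
    copy_cmd = "COPY Table FROM STDIN DELIMITER ','"
    cur.copy(copy_cmd, file)
    connection.commit()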

Another thing you can do to speed up the process is to compress the CSV file. Vertica can read gzip, bzip, and lzo compressed files.

# open in binary mode ('rb') since the file is gzip-compressed
with open('file.csv.gz', 'rb') as file:
    gzipped_csv_file = file.read()
    copy_cmd = "COPY Table FROM STDIN GZIP DELIMITER ','"
    cur.copy(copy_cmd, gzipped_csv_file)
    connection.commit()

Copying compressed files reduces network time, so you have to determine whether the extra time it takes to compress the CSV file is made up for by the time saved copying the compressed file. In most cases I've dealt with, it is worth it to compress the file.
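
For reference, one way to do that compression in Python itself is with the standard gzip module. This is a sketch (not part of the original answer) that compresses the CSV in memory and hands the result to cursor.copy as a file-like object, again reusing the same Table, cur, and connection names:

import gzip
import io

# Compress the CSV contents in memory, then send the gzipped bytes to Vertica.
with open('file.csv', 'rb') as file:
    gzipped = io.BytesIO(gzip.compress(file.read()))

copy_cmd = "COPY Table FROM STDIN GZIP DELIMITER ','"
cur.copy(copy_cmd, gzipped)
connection.commit()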


4 Comments

Hey, thanks so much for your help. I'm able to get the time to less than a minute! This is awesome, thanks so much!
It's just that not all rows are getting copied, though. I have about 21,900 rows and I only see 21,300 in my SQL table. What do you think about this?
Hi A.Saunders, could you help me out as to why some rows are getting removed? Thanks!
@LaxMandis You can use REJECTED DATA and EXCEPTIONS to find out which rows are missing and why. You need to specify a path for each: rejected data will show which rows were not copied, and exceptions will show what the error was. The syntax is COPY Table FROM STDIN GZIP DELIMITER ',' REJECTED DATA '/path/to/rejections.txt' EXCEPTIONS '/path/to/exceptions.txt';. I always include these options when I use the copy statement, but left them out of my explanation above for simplicity (see the sketch below).
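
A sketch of that same call from vertica-python; the rejection and exception paths are the placeholders from the comment, and Table, cur, and connection are the same assumed names as in the answer:

with open('file.csv.gz', 'rb') as file:
    gzipped_csv_file = file.read()
    # Write rejected rows and the reasons they were rejected to the given paths.
    copy_cmd = ("COPY Table FROM STDIN GZIP DELIMITER ',' "
                "REJECTED DATA '/path/to/rejections.txt' "
                "EXCEPTIONS '/path/to/exceptions.txt'")
    cur.copy(copy_cmd, gzipped_csv_file)
    connection.commit()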
