
I'm going to need to import 30k rows of data from a CSV file into a Vertica database. The code I've tried is taking more than an hour to do so. I'm wondering if there's a faster way to do it? I've tried to import using csv and also by looping through a dataframe to insert, but it just isn't fast enough. In fact, it's way too slow. Could you please help me?

rownum = df.shape[0]
for x in range(rownum):
    a = df['AccountName'].values[x]
    b = df['ID'].values[x]
    ss = "INSERT INTO Table (AccountName,ID) VALUES (%s,%s)"
    val = (a, b)
    cur.execute(ss, val)

connection.commit()
  • An immediate attempt might be to build a list of row tuples, e.g. val = list(zip(df['AccountName'], df['ID'])), and then use cur.executemany(ss, val) without the for loop (see the sketch below). That should be faster, but I'm not sure if there might be further improvements. Also, 1 space of indentation makes this code difficult to read; are you sure that connection.commit() is definitely not inside the for loop? It only takes 1 space to make that mistake. I suggest you follow PEP 8 and use 4 spaces. Commented Jan 28, 2019 at 16:56
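
A minimal sketch of that executemany approach (not from the original post), assuming the same DataFrame df, cursor cur, and connection connection as in the question; note that executemany expects a sequence of parameter tuples, one per row:

# Build one parameter tuple per row and let the driver batch the inserts.
ss = "INSERT INTO Table (AccountName,ID) VALUES (%s,%s)"
val = list(zip(df['AccountName'], df['ID']))

cur.executemany(ss, val)
connection.commit()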

1 Answer


You want to use Vertica's COPY command.

COPY Table FROM '/path/to/csv/file.csv' DELIMITER ',';

This is much faster than inserting one row at a time.

Since you are using Python, I would recommend the vertica_python module, as it has a very convenient copy method on its cursor object (see the vertica-python GitHub page).

The syntax for using COPY with vertica-python is as follows:

with open('file.csv', 'r') as file:
    csv_file = file.read()
    copy_cmd = "COPY Table FROM STDIN DELIMITER ','"
    cur.copy(copy_cmd, csv_file)
    connection.commit()
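
If the file is large, you can also pass the open file object to copy instead of reading the whole file into a string first; the vertica-python README documents that copy accepts either a string or a file-like object. A minimal sketch under that assumption, reusing the same Table, cur, and connection names as above:

# Stream the file to Vertica rather than loading it all into memory first.
with open('file.csv', 'rb') as file:
    copy_cmd = "COPY Table FROM STDIN DELIMITER ','"
    cur.copy(copy_cmd, file)
    connection.commit()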

Another thing you can do to speed up the process is to compress the CSV file. Vertica can read gzip, bzip, and lzo compressed files.

# open in binary mode ('rb') since the file is gzip-compressed
with open('file.csv.gz', 'rb') as file:
    gzipped_csv_file = file.read()
    copy_cmd = "COPY Table FROM STDIN GZIP DELIMITER ','"
    cur.copy(copy_cmd, gzipped_csv_file)
    connection.commit()

Copying compressed files reduces network time, so you have to determine whether the extra time it takes to compress the CSV file is made up for by the time saved copying the compressed file. In most cases I've dealt with, it is worth it to compress the file.
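
For reference, one way to do that compression in Python itself is with the standard gzip module. This is a sketch (not part of the original answer) that compresses the CSV in memory and hands the result to cursor.copy as a file-like object, again reusing the same Table, cur, and connection names:

import gzip
import io

# Compress the CSV contents in memory, then send the gzipped bytes to Vertica.
with open('file.csv', 'rb') as file:
    gzipped = io.BytesIO(gzip.compress(file.read()))

copy_cmd = "COPY Table FROM STDIN GZIP DELIMITER ','"
cur.copy(copy_cmd, gzipped)
connection.commit()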


4 Comments

Hey, thanks so much for your help. I'm able to get the time to less than a minute! This is awesome, thanks so much!
It's just that not all rows are getting copied, though. I have about 21,900 rows and I only see 21,300 in my SQL table. What do you think about this?
Hi A.Saunders, could you help me out as to why some rows are getting removed? Thanks!
@LaxMandis You can use REJECTED DATA and EXCEPTIONS to find out which rows are missing and why. You need to specify a path for each: rejected data will show which rows were not copied, and exceptions will show what the error was. The syntax is COPY Table FROM STDIN GZIP DELIMITER ',' REJECTED DATA '/path/to/rejections.txt' EXCEPTIONS '/path/to/exceptions.txt';. I always include these options when I use the copy statement, but left them out of my explanation above for simplicity (see the sketch below).
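
A sketch of that same call from vertica-python; the rejection and exception paths are the placeholders from the comment, and Table, cur, and connection are the same assumed names as in the answer:

with open('file.csv.gz', 'rb') as file:
    gzipped_csv_file = file.read()
    # Write rejected rows and the reasons they were rejected to the given paths.
    copy_cmd = ("COPY Table FROM STDIN GZIP DELIMITER ',' "
                "REJECTED DATA '/path/to/rejections.txt' "
                "EXCEPTIONS '/path/to/exceptions.txt'")
    cur.copy(copy_cmd, gzipped_csv_file)
    connection.commit()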
