Using COPY FROM stdin to load tables, reading input file only once

Question

I've got a large (~60 million row) fixed width source file with ~1800 records per row.

I need to load this file into 5 different tables on an instance of Postgres 8.3.9.

My dilemma is that, because the file is so large, I'd like to have to read it only once.

This is straightforward enough using INSERT or COPY as normal, but I'm trying to get a load speed boost by including my COPY FROM statements in a transaction that includes a TRUNCATE--avoiding logging, which is supposed to give a considerable load speed boost (according to http://www.cirrusql.com/node/3). As I understand it, you can disable logging in Postgres 9.x--but I don't have that option on 8.3.9.

The script below has me reading the input file twice, which I want to avoid... any ideas on how I could accomplish this by reading the input file only once? Doesn't have to be bash--I also tried using psycopg2, but couldn't figure out how to stream file output into the COPY statement as I'm doing below. I can't COPY FROM file because I need to parse it on the fly.

#!/bin/bash

table1="copytest1"
table2="copytest2"

#note: $1 refers to the first argument used when invoking this script
#which should be the location of the file one wishes to have python
#parse and stream out into psql to be copied into the data tables

( echo 'BEGIN;'
  echo "TRUNCATE TABLE ${table1};"
  echo "COPY ${table1} FROM STDIN WITH NULL AS '';"
  python2.5 ~/parse_${table1}.py < "$1"
  echo '\.'
  echo "TRUNCATE TABLE ${table2};"
  echo "COPY ${table2} FROM STDIN WITH NULL AS '';"
  python2.5 ~/parse_${table2}.py < "$1"
  echo '\.'
  echo 'COMMIT;'
) | psql -U postgres -h chewy.somehost.com -p 5473 -d db_name

exit 0

Thanks!

I have implemented something similar, parsing in Python and streaming via STDIN into PostgreSQL. I did it in an ugly way, though, without psycopg2. My problem is that I do not understand your code. How is this code able to load something into 5 different tables? What is happening in your Python programs?
– David, Commented Mar 5, 2011 at 19:40
Hi @David, I'm sorry, I should have been clearer. The code above is a simplified example, using only 2 tables instead of 5. I ended up abandoning it in favor of an approach based on @johnbaum's very helpful tip (see below). That said, I've added some comments to the code that will hopefully clarify it a little. The Python scripts just read fixed-width input from the standard input stream, split it into a string of tab-delimited values, and write it to the standard output stream, from which it is piped to psql using "|".
– Stew, Commented Mar 9, 2011 at 16:22
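
To make the parsing step described in this comment concrete, here is a minimal Python 3 sketch of such a filter. The real parse scripts are not shown in the question, so the field layout `FIELDS` and the names `parse_line` and `filter_stream` are hypothetical, and the sketch assumes the data contains no tabs or backslashes (COPY's text format treats both specially).

```python
import io

# Hypothetical layout: (start offset, width) of each field in a fixed-width line.
FIELDS = [(0, 5), (5, 10), (15, 3)]


def parse_line(line, fields=FIELDS):
    """Slice one fixed-width line into a tab-delimited row for COPY."""
    line = line.rstrip("\n")
    # Blank fields become empty strings, which COPY ... WITH NULL AS ''
    # loads as NULL.
    values = [line[start:start + width].strip() for start, width in fields]
    return "\t".join(values)


def filter_stream(src, dst, fields=FIELDS):
    """Convert lines one at a time so the file is never held in memory."""
    for line in src:
        dst.write(parse_line(line, fields) + "\n")


# Example with in-memory streams standing in for stdin and stdout.
demo_out = io.StringIO()
filter_stream(io.StringIO("12345abcdefghijXYZ\n"), demo_out)
```

In a standalone script, the last step would be `filter_stream(sys.stdin, sys.stdout)` after `import sys`.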

johnbaum · Accepted Answer · 2011-03-05 23:10:45Z

2

You could use named pipes instead of your anonymous pipe. With this approach, your Python script can fill the tables through separate psql processes, each receiving the data for its own table.

Create pipes:

mkfifo fifo_table1
mkfifo fifo_table2

Run psql instances:

psql db_name < fifo_table1 &
psql db_name < fifo_table2 &

Your Python script would look roughly like this (pseudocode):

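# Preamble sent to each psql process before the data rows. It must end
# right after the COPY statement: any extra blank line would be read as data.
SQL_BEGIN = """BEGIN;
TRUNCATE TABLE %s;
COPY %s FROM STDIN WITH NULL AS '';
"""
# "\." ends the COPY data stream; only then can COMMIT be sent as SQL.
SQL_END = "\\.\nCOMMIT;\n"

# Opening a FIFO for writing blocks until a reader (psql) has opened it.
fifo1 = open('fifo_table1', 'w')
fifo2 = open('fifo_table2', 'w')

bigfile = open('mybigfile', 'r')

# With Python 2.6+ you could use str.format() instead of %.
fifo1.write(SQL_BEGIN % ('table1', 'table1'))
fifo2.write(SQL_BEGIN % ('table2', 'table2'))

for line in bigfile:
    # Your code, which decides where the data belongs and builds `data`
    # (one tab-delimited row, without a trailing newline).
    # if the data belongs to table1:
    fifo1.write(data + '\n')
    # else:
    fifo2.write(data + '\n')

fifo1.write(SQL_END)
fifo2.write(SQL_END)

bigfile.close()
fifo1.close()
fifo2.close()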

Maybe this is not the most elegant solution, but it should work.
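
For the case described in the comments on this question, where every input line carries one row for each target table, the same loop can write to all pipes on each iteration. Below is a minimal Python 3 sketch of that fan-out. The name `load_in_one_pass` and the per-table parser callables are hypothetical, and the sinks can be any writable text streams; in-memory buffers stand in for the FIFOs here.

```python
import io

PREAMBLE = "BEGIN;\nTRUNCATE TABLE {t};\nCOPY {t} FROM STDIN WITH NULL AS '';\n"
TRAILER = "\\.\nCOMMIT;\n"  # "\." ends the COPY data, then the commit follows


def load_in_one_pass(source, targets):
    """Read `source` once and feed every target table.

    `targets` maps a table name to (sink, parser): `sink` is a writable text
    stream and `parser` turns one input line into one tab-delimited row.
    """
    for table, (sink, _) in targets.items():
        sink.write(PREAMBLE.format(t=table))
    for line in source:
        # Each input line yields one row per table.
        for sink, parser in targets.values():
            sink.write(parser(line) + "\n")
    for sink, _ in targets.values():
        sink.write(TRAILER)


# Usage with in-memory streams standing in for the FIFOs.
source = io.StringIO("aaa111\nbbb222\n")
out1, out2 = io.StringIO(), io.StringIO()
load_in_one_pass(source, {
    "table1": (out1, lambda line: line[0:3]),
    "table2": (out2, lambda line: line[3:6]),
})
```

With real FIFOs, each sink would be opened with `open(path, 'w')` after the matching psql reader has been started, since opening a FIFO for writing blocks until a reader exists.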

answered Mar 5, 2011 at 23:10

johnbaum


2 Comments

Stew Over a year ago

this is what I was wondering about how to do, and I ended up implementing it very successfully. the previous version of this import script took several hours to run. the one I wrote with this approach took 54 minutes. thanks!

Stew Over a year ago

for those unfamiliar with named pipes: they are persistent, like files, so you should consider building in a statement which removes them when you're done using them: os.system("rm fifo_table1") for example, using the os module.
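
One way to make that cleanup automatic is to create and remove the pipes from Python itself. The following is a minimal Python 3 sketch for POSIX systems; the context manager name `named_pipes` is hypothetical, and it uses `os.remove` rather than shelling out to `rm`.

```python
import contextlib
import os
import stat
import tempfile


@contextlib.contextmanager
def named_pipes(*names):
    """Create FIFOs in a private temp directory and remove them afterwards."""
    directory = tempfile.mkdtemp()
    paths = [os.path.join(directory, name) for name in names]
    try:
        for path in paths:
            os.mkfifo(path)  # POSIX only
        yield paths
    finally:
        # Runs even if the load fails, so no stale pipes are left behind.
        for path in paths:
            if os.path.exists(path):
                os.remove(path)
        os.rmdir(directory)


with named_pipes("fifo_table1", "fifo_table2") as (fifo1, fifo2):
    # Start the psql readers and write to the pipes here.
    created = [stat.S_ISFIFO(os.stat(p).st_mode) for p in (fifo1, fifo2)]
```

Because the removal sits in a `finally` clause, the pipes disappear whether the load succeeds or raises.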

user330315 · Answer · 2011-03-09 15:58:57Z

2

Why use COPY for the second table? I would assume that doing a:

INSERT INTO table2 (...)
SELECT ...
FROM table1;

would be faster than using COPY.

Edit
If you need to import different rows into different tables but from the same source file, maybe inserting everything into a staging table and then inserting the rows from there into the target tables is faster:

Import the whole text file into one staging table:

COPY staging_table FROM STDIN ...;

After that step, the whole input file is in staging_table.

Then copy the rows from the staging table to the individual target tables by selecting only those that qualify for the corresponding table:

INSERT INTO table_1 (...)
SELECT ...
FROM staging_table
WHERE (conditions for table_1);

INSERT INTO table_2 (...)
SELECT ...
FROM staging_table
WHERE (conditions for table_2);

This is of course only feasible if you have enough space in your database to keep the staging table around.

edited Mar 9, 2011 at 15:58

answered Mar 5, 2011 at 14:48

user330315

6 Comments

Stew Over a year ago

I think you're right, @a_horse_with_no_name, but I wasn't clear in my question; each line of the file contains 1 row of data for each of the 5 tables--e.g. table2 is not derived from table1 data--it's non-overlapping. That's why the lines are being parsed with different python scripts. Re: the 2nd truncate--you're right, I left that out by accident. Thanks!

user330315 Over a year ago

@Stew: Ah, I see. So then this is obviously not a solution.

Stew Over a year ago

@a_horse_with_no_name actually, when I was thinking about it, this could have been a potential solution, as I could have done something like:

Stew Over a year ago

1. create staging_table(1-5)
2. begin transaction for staging_table1, truncate staging_table1
3. copy staging_table1 from stdin
4. print tab-delimited batch of data to the sole psql instance
5. commit
6. insert select * from staging_table1 into permanent table
7. (repeat steps 2-6 for remaining tables)
8. (repeat steps 2-7 for the next batch, until you have loaded all of your data)

Stew Over a year ago

@a_horse_with_no_name gah! and now that I re-read your suggestion, I suspect that I originally misunderstood it. you are suggesting that I copy my data in whole lines, into the staging table, then select from that the portions that need to be inserted into the permanent tables. this would work well, I think!
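
The whole-line variant discussed in this comment could be sketched as follows in Python 3: a helper that builds the INSERT ... SELECT statement slicing fixed-width fields out of each staged line. Everything here is an assumption for illustration: the helper name `split_statement`, a staging table with a single text column called `line`, and the example column layout. The slices are plain text, so typed target columns would still need casts, and lines containing tabs or backslashes would need escaping before COPY in text format.

```python
def split_statement(table, columns, staging="staging_table", line_col="line"):
    """Build an INSERT ... SELECT that slices fields out of whole lines.

    `columns` is a list of (column name, 1-based start, width) tuples.
    """
    names = ", ".join(name for name, _, _ in columns)
    # substr() is 1-based in PostgreSQL; blank fields are turned into NULL.
    slices = ",\n       ".join(
        "nullif(trim(substr({col}, {start}, {width})), '')".format(
            col=line_col, start=start, width=width)
        for _, start, width in columns
    )
    return "INSERT INTO {table} ({names})\nSELECT {slices}\nFROM {staging};".format(
        table=table, names=names, slices=slices, staging=staging)


# Hypothetical layout for one of the target tables.
statement = split_statement("table_1", [("id", 1, 5), ("name", 6, 10)])
print(statement)
```

One such statement per target table, run after the single COPY into the staging table, keeps the source file to one read.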
