I've got a large (~60 million row) fixed width source file with ~1800 records per row.
I need to load this file into 5 different tables on an instance of Postgres 8.3.9.
My dilemma is that, because the file is so large, I'd like to have to read it only once.
This is straightforward enough using INSERT or COPY as normal, but I'm trying to get a load speed boost by including my COPY FROM statements in a transaction that includes a TRUNCATE--avoiding logging, which is supposed to give a considerable load speed boost (according to http://www.cirrusql.com/node/3). As I understand it, you can disable logging in Postgres 9.x--but I don't have that option on 8.3.9.
The script below has me reading the input file twice, which I want to avoid... any ideas on how I could accomplish this by reading the input file only once? Doesn't have to be bash--I also tried using psycopg2, but couldn't figure out how to stream file output into the COPY statement as I'm doing below. I can't COPY FROM file because I need to parse it on the fly.
#!/bin/bash
table1="copytest1"
table2="copytest2"
#note: $1 refers to the first argument used when invoking this script
#which should be the location of the file one wishes to have python
#parse and stream out into psql to be copied into the data tables
( echo 'BEGIN;'
echo 'TRUNCATE TABLE ' ${table1} ';'
echo 'COPY ' ${table1} ' FROM STDIN'
echo "WITH NULL AS '';"
cat $1 | python2.5 ~/parse_${table1}.py
echo '\.'
echo 'TRUNCATE TABLE ' ${table2} ';'
echo 'COPY ' ${table2} ' FROM STDIN'
echo "WITH NULL AS '';"
cat $1 | python2.5 ~/parse_${table2}.py
echo '\.'
echo 'COMMIT;'
) | psql -U postgres -h chewy.somehost.com -p 5473 -d db_name
exit 0
Thanks!