
I have a PostgreSQL database with a certain structure, and I have several million XML files. I have to parse each file, extract certain data, and fill the tables in the database. What I want to know is the best language/framework/algorithm to perform this routine.
I wrote a program in C# (Mono) using the DbLinq ORM. It does not use threading; it just parses file by file, fills table objects, and submits a group of objects (for example, 200) to the database. It appears to be rather slow: it processes about 400 files per minute, so it will take about a month to finish the job.
I'd appreciate your thoughts and tips.

  • I would figure out if your program is bottlenecked on reading & parsing the XML files, or submitting data to the database. Unless you have massive amounts of text data, I would guess the former. Commented Jan 26, 2011 at 16:19

2 Answers


I think it would be faster if you used small programs in a pipe that:

  • join your files into one big stream;

  • parse the input stream and generate an output stream in PostgreSQL COPY format (the same format pg_dump uses when creating backups, similar to tab-separated values), which looks like this:

COPY table_name (table_id, table_value) FROM stdin;
1   value1
2   value2
3   value3
\.
  • load the COPY stream into PostgreSQL, started temporarily with the -F option to disable fsync calls.

For example on Linux:

find -name \*.xml -print0 | xargs -0 cat \
  | parse_program_generating_copy \
  | psql dbname

Using COPY is much faster than inserting with an ORM. Joining the files into one stream lets reading and writing to the database happen in parallel. Disabling fsync allows a big speedup, but will require restoring the database from a backup if the server crashes during loading.
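A rough sketch of what parse_program_generating_copy could look like in Python. It reuses the table_name/table_id/table_value names from the COPY example above; the <id> and <value> XML elements, and the assumption that each concatenated file starts with an <?xml ...?> declaration, are placeholders you would adapt to your real schema and files:

#!/usr/bin/env python3
# Hypothetical parse_program_generating_copy: read concatenated XML
# documents from stdin, write PostgreSQL COPY text-format rows to stdout.
# Table, column and element names are placeholders for the real schema.
import sys
import xml.etree.ElementTree as ET

def emit_row(doc_bytes):
    root = ET.fromstring(doc_bytes)
    table_id = root.findtext("id", default="\\N")        # \N means NULL in COPY
    table_value = root.findtext("value", default="\\N")
    # Real data would also need COPY escaping of tabs, newlines and backslashes.
    sys.stdout.write(f"{table_id}\t{table_value}\n")

def main():
    sys.stdout.write("COPY table_name (table_id, table_value) FROM stdin;\n")
    buf = []
    for line in sys.stdin.buffer:
        # A new "<?xml" declaration marks the start of the next file;
        # flush the previous document before collecting this one.
        if line.lstrip().startswith(b"<?xml") and buf:
            emit_row(b"".join(buf))
            buf = []
        buf.append(line)
    if buf:
        emit_row(b"".join(buf))
    sys.stdout.write("\\.\n")  # COPY end-of-data marker

if __name__ == "__main__":
    main()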



Generally I believe Perl is a good option for parsing tasks, though I do not know Perl myself. It sounds like your performance demands are so extreme that you might need to write your own XML parser, as a standard one might become the bottleneck (test this before you start implementing). I myself use Python and psycopg2 to communicate with Postgres.

Whichever language you choose, you certainly want to use COPY FROM, probably feeding the data into Postgres via stdin from Perl/Python/whatever language you pick.
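For example, with Python and psycopg2, a minimal sketch of streaming already-parsed rows into Postgres through COPY ... FROM STDIN could look like this (the DSN and the table_name/table_id/table_value names are placeholders):

# Minimal sketch: stream parsed rows into Postgres via COPY FROM STDIN.
# The connection string, table name and columns are placeholders.
import io
import psycopg2

def load_rows(rows):
    # rows: iterable of (table_id, table_value) tuples extracted from the XML
    buf = io.StringIO()
    for table_id, table_value in rows:
        buf.write(f"{table_id}\t{table_value}\n")   # COPY text format
    buf.seek(0)

    conn = psycopg2.connect("dbname=mydb user=me")  # placeholder DSN
    try:
        with conn.cursor() as cur:
            cur.copy_expert(
                "COPY table_name (table_id, table_value) FROM STDIN", buf)
        conn.commit()
    finally:
        conn.close()

load_rows([(1, "value1"), (2, "value2")])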

Instead of spending a lot of time optimizing everything, you could also take a suboptimal solution and run it massively in parallel on, say, 100 EC2 instances. That would be a lot cheaper than spending hours and hours on finding the optimal solution.

Without knowing anything about the size of the files, 400 files per minute does not sound too bad. Ask yourself whether it is worth spending a week of development to cut the time to a third, or just running it now and waiting a month.
