
I have a PostgreSQL database with a certain structure, and I have several million XML files. I have to parse each file, extract certain data, and fill the tables in the database. What I want to know is the best language/framework/algorithm to perform this routine.
I wrote a program in C# (Mono) using the DbLinq ORM. It does not use threading; it just parses file by file, fills table objects, and submits a group of objects (for example, 200) to the database. It appears to be rather slow: it processes about 400 files per minute, so it will take about a month to finish the job.
I'd appreciate your thoughts and tips.

  • I would figure out if your program is bottlenecked on reading & parsing the XML files, or submitting data to the database. Unless you have massive amounts of text data, I would guess the former. Commented Jan 26, 2011 at 16:19

2 Answers


I think it would be faster if you used small programs in a pipe that:

  • join your files into one big stream;

  • parse the input stream and generate an output stream in PostgreSQL COPY format (the same format pg_dump uses when creating backups, similar to tab-separated values), which looks like this:

COPY table_name (table_id, table_value) FROM stdin;
1   value1
2   value2
3   value3
\.
  • load the COPY stream into PostgreSQL, started temporarily with the -F option to disable fsync calls.

For example on Linux:

find -name \*.xml -print0 | xargs -0 cat \
  | parse_program_generating_copy \
  | psql dbname

Using COPY is much faster than inserting with an ORM. Joining the files into one stream lets reading and writing to the database happen in parallel. Disabling fsync allows a big speedup, but will require restoring the database from a backup if the server crashes during loading.
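A rough sketch of what parse_program_generating_copy could look like in Python. It reuses the table_name/table_id/table_value names from the COPY example above; the <id> and <value> XML elements, and the assumption that each concatenated file starts with an <?xml ...?> declaration, are placeholders you would adapt to your real schema and files:

#!/usr/bin/env python3
# Hypothetical parse_program_generating_copy: read concatenated XML
# documents from stdin, write PostgreSQL COPY text-format rows to stdout.
# Table, column and element names are placeholders for the real schema.
import sys
import xml.etree.ElementTree as ET

def emit_row(doc_bytes):
    root = ET.fromstring(doc_bytes)
    table_id = root.findtext("id", default="\\N")        # \N means NULL in COPY
    table_value = root.findtext("value", default="\\N")
    # Real data would also need COPY escaping of tabs, newlines and backslashes.
    sys.stdout.write(f"{table_id}\t{table_value}\n")

def main():
    sys.stdout.write("COPY table_name (table_id, table_value) FROM stdin;\n")
    buf = []
    for line in sys.stdin.buffer:
        # A new "<?xml" declaration marks the start of the next file;
        # flush the previous document before collecting this one.
        if line.lstrip().startswith(b"<?xml") and buf:
            emit_row(b"".join(buf))
            buf = []
        buf.append(line)
    if buf:
        emit_row(b"".join(buf))
    sys.stdout.write("\\.\n")  # COPY end-of-data marker

if __name__ == "__main__":
    main()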



Generally I believe Perl is a good option for parsing tasks, though I do not know Perl myself. It sounds like your performance demands are so extreme that you might need to write your own XML parser, as a standard one might become the bottleneck (test this before you start implementing). I myself use Python and psycopg2 to communicate with Postgres.

Whichever language you choose, you certainly want to use COPY FROM, probably feeding the data into Postgres via stdin from Perl/Python/whatever language you pick.
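For example, with Python and psycopg2, a minimal sketch of streaming already-parsed rows into Postgres through COPY ... FROM STDIN could look like this (the DSN and the table_name/table_id/table_value names are placeholders):

# Minimal sketch: stream parsed rows into Postgres via COPY FROM STDIN.
# The connection string, table name and columns are placeholders.
import io
import psycopg2

def load_rows(rows):
    # rows: iterable of (table_id, table_value) tuples extracted from the XML
    buf = io.StringIO()
    for table_id, table_value in rows:
        buf.write(f"{table_id}\t{table_value}\n")   # COPY text format
    buf.seek(0)

    conn = psycopg2.connect("dbname=mydb user=me")  # placeholder DSN
    try:
        with conn.cursor() as cur:
            cur.copy_expert(
                "COPY table_name (table_id, table_value) FROM STDIN", buf)
        conn.commit()
    finally:
        conn.close()

load_rows([(1, "value1"), (2, "value2")])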

Instead of spending a lot of time optimizing everything, you could also take a suboptimal solution and run it massively in parallel on, say, 100 EC2 instances. That would be a lot cheaper than spending hours and hours on finding the optimal solution.

Without knowing anything about the size of the files, 400 files per minute does not sound too bad. Ask yourself whether it is worth spending a week of development to cut the time to a third, or just running it now and waiting a month.
