
I have 200,000 XML files I want to parse and store in a database.

Here is an example of one: https://gist.github.com/902292

This is about as complex as the XML files get. This will also run on a small VPS (Linode), so memory is tight.

What I am wondering is:

1) Should I use a DOM or SAX parser? DOM seems easier and faster since each XML is small.

2) Where is a simple tutorial on said parser? (DOM or SAX)

Thanks

EDIT

I tried the DOM route even though everyone suggested SAX, mainly because I found an "easier" tutorial for DOM and figured that, with an average file size of about 3k-4k, each file would easily fit in memory.

However, I wrote a recursive routine to handle all 200k files and it gets about 40% of the way through them and then Java runs out of memory.

Here is part of the project. https://gist.github.com/905550#file_xm_lparser.java

Should I ditch DOM now and just use SAX? It just seems like DOM should be able to handle files this small.

Also, the speed is "fast enough". It's taking about 19 seconds to parse 2000 XML files (before the Mongo insert).

Thanks

2 Comments
  • Perhaps the memory problem is not caused by DOM objects. In the example you don't show the database code. If you are using JDBC, that can use up memory if resources are not released correctly. Java DOM and other objects should be GC'd when you no longer reference them, so look for objects to which references are still being held. A memory profiler would help. Commented Apr 6, 2011 at 13:25
  • There's actually no database code yet. Commented Apr 7, 2011 at 14:09

6 Answers


Why not use a proper XML database (like Berkeley DB XML)? Then you can just dump the documents in directly, and create indices as needed (e.g. on the HotelID).
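For illustration, a minimal sketch with the BDB XML Java bindings; the container name, document name, and index-specification string here are assumptions to check against the BDB XML docs for your version:

```java
import com.sleepycat.dbxml.XmlContainer;
import com.sleepycat.dbxml.XmlManager;
import com.sleepycat.dbxml.XmlUpdateContext;

public class BdbXmlSketch {
    public static void main(String[] args) throws Exception {
        XmlManager manager = new XmlManager();
        // "hotels.dbxml" is an illustrative container name
        XmlContainer container = manager.createContainer("hotels.dbxml");
        XmlUpdateContext uc = manager.createUpdateContext();

        // Index the HotelID element so lookups don't scan every document;
        // the index description string follows BDB XML's index syntax
        container.addIndex("", "HotelID", "node-element-equality-string", uc);

        // Dump one document in verbatim -- no mapping to your own model needed
        String xml = "<Hotel><HotelID>123</HotelID></Hotel>";
        container.putDocument("hotel-123.xml", xml, uc);

        // BDB XML objects wrap native resources and must be freed explicitly
        container.delete();
        manager.delete();
    }
}
```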


1 Comment

I'm a huge fan of MongoDB but I will certainly check that out. Always interested in learning something new.

Divide and conquer

Split the 200,000 files into multiple buckets and parallelize the parse/insert. Look at Java 5 Executors if you want to keep it simple, or use spring-batch if this is a recurring task, in which case you can benefit from a high-level framework.
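A minimal sketch with a fixed thread pool; the directory layout and the parseAndInsert hook are placeholders for whatever parse-then-insert routine you already have:

```java
import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelLoader {
    public static void main(String[] args) throws InterruptedException {
        // A handful of workers; tune for the VPS's cores and memory
        ExecutorService pool = Executors.newFixedThreadPool(4);

        // "xml-root" stands in for the directory holding the 100 folders
        for (File bucket : new File("xml-root").listFiles()) {
            for (final File xml : bucket.listFiles()) {
                pool.submit(new Runnable() {
                    public void run() {
                        parseAndInsert(xml); // your existing parse + DB insert
                    }
                });
            }
        }

        pool.shutdown();                          // accept no new tasks
        pool.awaitTermination(1, TimeUnit.HOURS); // wait for the queue to drain
    }

    static void parseAndInsert(File f) { /* hypothetical - plug in your code */ }
}
```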

API

Using SAX can help but isn't necessary, since you are not going to keep the parsed model around (i.e., all you are doing is parsing, inserting, and then letting go of the parsed data, at which point the objects are eligible for GC). Look into a simple API like JDOM.
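For illustration, parsing one small file with JDOM takes only a few lines; the file name is made up, and HotelID assumes the element from the question's sample:

```java
import java.io.File;
import org.jdom.Document;
import org.jdom.Element;
import org.jdom.input.SAXBuilder;

public class JdomSketch {
    public static void main(String[] args) throws Exception {
        SAXBuilder builder = new SAXBuilder();
        // A whole ~3-4k file comfortably fits in memory as a tree
        Document doc = builder.build(new File("hotel.xml"));
        Element root = doc.getRootElement();

        // Pull out a child element's text; "HotelID" matches the sample XML
        System.out.println(root.getChildText("HotelID"));
        // doc, root, etc. become eligible for GC once this method returns
    }
}
```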

Other ideas

You can implement a producer/consumer model in which the producer emits the POJOs created by parsing and the consumer takes the POJOs and inserts them into the DB. The advantage here is that you can batch the inserts to gain more performance.
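A sketch of the consumer side using a BlockingQueue shared with the parser threads; HotelPojo, BATCH_SIZE, and insertBatch are hypothetical stand-ins for your own types and Mongo insert code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;

public class BatchConsumer implements Runnable {
    static final HotelPojo POISON = new HotelPojo(); // end-of-stream marker
    static final int BATCH_SIZE = 100;               // tune against memory/latency

    private final BlockingQueue<HotelPojo> queue;

    BatchConsumer(BlockingQueue<HotelPojo> queue) { this.queue = queue; }

    public void run() {
        List<HotelPojo> batch = new ArrayList<HotelPojo>(BATCH_SIZE);
        try {
            while (true) {
                HotelPojo pojo = queue.take();   // blocks until a producer offers one
                if (pojo == POISON) break;       // producers are done
                batch.add(pojo);
                if (batch.size() == BATCH_SIZE) {
                    insertBatch(batch);          // one round-trip for 100 documents
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) insertBatch(batch); // flush the tail
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    void insertBatch(List<HotelPojo> batch) { /* hypothetical Mongo bulk insert */ }
}

class HotelPojo { /* fields parsed from one XML file */ }
```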

1 Comment

Nice suggestions. Fortunately, the XML files are split equally into 100 folders.

SAX always beats DOM on speed, but since you say the XML files are small, you may proceed with a DOM parser. One thing you can do to speed things up is create a thread pool and do the database operations in it. Multithreaded updates will significantly improve the performance.
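If you stay with DOM in a thread pool, note that DocumentBuilder is not thread-safe; a minimal sketch that gives each worker its own builder and reuses it across files:

```java
import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class DomWorker {
    // One builder per thread: DocumentBuilder instances are not thread-safe
    private static final ThreadLocal<DocumentBuilder> BUILDER =
        new ThreadLocal<DocumentBuilder>() {
            protected DocumentBuilder initialValue() {
                try {
                    return DocumentBuilderFactory.newInstance().newDocumentBuilder();
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            }
        };

    static Document parse(File f) throws Exception {
        DocumentBuilder b = BUILDER.get();
        b.reset();         // clear state left over from the previous file
        return b.parse(f); // a ~3-4k tree is fine if the reference is dropped after use
    }
}
```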


2 Comments

SAX also has a better memory footprint.
I'm accepting this as the answer because I also believe a good thread pool could rip through these much faster.

Go with SAX, or if you want, StAX. Forget about DOM. Use an efficient library like Aalto.
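A minimal StAX cursor loop, for illustration; Aalto implements the standard javax.xml.stream API, so XMLInputFactory.newInstance() will pick it up when its jar is on the classpath (HotelID is from the question's sample, the file name is made up):

```java
import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxSketch {
    public static void main(String[] args) throws Exception {
        // Returns Aalto's implementation if its jar is on the classpath
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader =
            factory.createXMLStreamReader(new FileInputStream("hotel.xml"));

        // Cursor-style pull parsing: only the current event is held in memory
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && "HotelID".equals(reader.getLocalName())) {
                System.out.println(reader.getElementText());
            }
        }
        reader.close();
    }
}
```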

I am sure that parsing will be quite cheap compared to making the database requests.

But 200k is not such a big number if you only need to do this once.

4 Comments

Thanks for the tip. I will actually need to run this nightly at some point. But weekly or even monthly in the beginning.
In that case, you can also consider transforming your data into either a more efficient storage format or a more efficient database import format. See github.com/eishay/jvm-serializers/wiki. Although I recommend doing incremental updates as the data comes in, rather than nightly jobs.
Thanks. But the data comes in a format that I can't control. It is dumped to these 200k XML files every night. I am simply parsing them and storing them in a MongoDB to be searchable.
You might want to check out how performance goes if you convert to a file import format the DB understands (CSV, etc.). And go with Aalto and the multithreading suggestion as well.

SAX will be faster than DOM, and that could well matter when you have 200,000 files to parse.



StAX is faster than SAX, and both are much faster than DOM. If performance is super critical, you can also think about building a special compiler to parse the XML files. But usually lexing and parsing is not that much of an issue with StAX; the "after-processing" is.

