
I have 200,000 XML files I want to parse and store in a database.

Here is an example of one: https://gist.github.com/902292

This is about as complex as the XML files get. This will also run on a small VPS (Linode), so memory is tight.

What I am wondering is:

1) Should I use a DOM or SAX parser? DOM seems easier and faster since each XML is small.

2) Where is a simple tutorial on said parser? (DOM or SAX)

Thanks

EDIT

I tried the DOM route even though everyone suggested SAX, mainly because I found an "easier" tutorial for DOM and figured that, with an average file size of about 3k-4k, each file would easily fit in memory.

However, I wrote a recursive routine to handle all 200k files and it gets about 40% of the way through them and then Java runs out of memory.

Here is part of the project. https://gist.github.com/905550#file_xm_lparser.java

Should I ditch DOM now and just use SAX? It just seems like DOM should be able to handle files this small.

Also, the speed is "fast enough". It's taking about 19 seconds to parse 2000 XML files (before the Mongo insert).

Thanks

2 Comments
  • Perhaps the memory problem is not caused by DOM objects. In the example you don't show the database code. If you are using JDBC, that can use up memory if resources are not released correctly. Java DOM and other objects should be GC'd when you no longer reference them, so look for objects to which references are still being held. A memory profiler would help. Commented Apr 6, 2011 at 13:25
  • There's actually no database code yet. Commented Apr 7, 2011 at 14:09

6 Answers


Why not use a proper XML database (like Berkeley DB XML)? Then you can just dump the documents in directly, and create indices as needed (e.g. on the HotelID).
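For illustration, a minimal sketch with the BDB XML Java bindings; the container name, document name, and index-specification string here are assumptions to check against the BDB XML docs for your version:

```java
import com.sleepycat.dbxml.XmlContainer;
import com.sleepycat.dbxml.XmlManager;
import com.sleepycat.dbxml.XmlUpdateContext;

public class BdbXmlSketch {
    public static void main(String[] args) throws Exception {
        XmlManager manager = new XmlManager();
        // "hotels.dbxml" is an illustrative container name
        XmlContainer container = manager.createContainer("hotels.dbxml");
        XmlUpdateContext uc = manager.createUpdateContext();

        // Index the HotelID element so lookups don't scan every document;
        // the index description string follows BDB XML's index syntax
        container.addIndex("", "HotelID", "node-element-equality-string", uc);

        // Dump one document in verbatim -- no mapping to your own model needed
        String xml = "<Hotel><HotelID>123</HotelID></Hotel>";
        container.putDocument("hotel-123.xml", xml, uc);

        // BDB XML objects wrap native resources and must be freed explicitly
        container.delete();
        manager.delete();
    }
}
```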


1 Comment

I'm a huge fan of MongoDB but I will certainly check that out. Always interested in learning something new.

Divide and conquer

Split the 200,000 files into multiple buckets and parallelize the parse/insert. Look at Java 5 Executors if you want to keep it simple, or use spring-batch if this is a recurring task, in which case you can benefit from a high-level framework.
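A minimal sketch with a fixed thread pool; the directory layout and the parseAndInsert hook are placeholders for whatever parse-then-insert routine you already have:

```java
import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelLoader {
    public static void main(String[] args) throws InterruptedException {
        // A handful of workers; tune for the VPS's cores and memory
        ExecutorService pool = Executors.newFixedThreadPool(4);

        // "xml-root" stands in for the directory holding the 100 folders
        for (File bucket : new File("xml-root").listFiles()) {
            for (final File xml : bucket.listFiles()) {
                pool.submit(new Runnable() {
                    public void run() {
                        parseAndInsert(xml); // your existing parse + DB insert
                    }
                });
            }
        }

        pool.shutdown();                          // accept no new tasks
        pool.awaitTermination(1, TimeUnit.HOURS); // wait for the queue to drain
    }

    static void parseAndInsert(File f) { /* hypothetical - plug in your code */ }
}
```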

API

Using SAX can help but isn't necessary, since you are not going to keep the parsed model around (i.e., all you are doing is parsing, inserting, and then letting go of the parsed data, at which point the objects are eligible for GC). Look into a simple API like JDOM.
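For illustration, parsing one small file with JDOM takes only a few lines; the file name is made up, and HotelID assumes the element from the question's sample:

```java
import java.io.File;
import org.jdom.Document;
import org.jdom.Element;
import org.jdom.input.SAXBuilder;

public class JdomSketch {
    public static void main(String[] args) throws Exception {
        SAXBuilder builder = new SAXBuilder();
        // A whole ~3-4k file comfortably fits in memory as a tree
        Document doc = builder.build(new File("hotel.xml"));
        Element root = doc.getRootElement();

        // Pull out a child element's text; "HotelID" matches the sample XML
        System.out.println(root.getChildText("HotelID"));
        // doc, root, etc. become eligible for GC once this method returns
    }
}
```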

Other ideas

You can implement a producer/consumer model in which the producer emits the POJOs created by parsing and the consumer takes the POJOs and inserts them into the DB. The advantage here is that you can batch the inserts to gain more performance.
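A sketch of the consumer side using a BlockingQueue shared with the parser threads; HotelPojo, BATCH_SIZE, and insertBatch are hypothetical stand-ins for your own types and Mongo insert code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;

public class BatchConsumer implements Runnable {
    static final HotelPojo POISON = new HotelPojo(); // end-of-stream marker
    static final int BATCH_SIZE = 100;               // tune against memory/latency

    private final BlockingQueue<HotelPojo> queue;

    BatchConsumer(BlockingQueue<HotelPojo> queue) { this.queue = queue; }

    public void run() {
        List<HotelPojo> batch = new ArrayList<HotelPojo>(BATCH_SIZE);
        try {
            while (true) {
                HotelPojo pojo = queue.take();   // blocks until a producer offers one
                if (pojo == POISON) break;       // producers are done
                batch.add(pojo);
                if (batch.size() == BATCH_SIZE) {
                    insertBatch(batch);          // one round-trip for 100 documents
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) insertBatch(batch); // flush the tail
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    void insertBatch(List<HotelPojo> batch) { /* hypothetical Mongo bulk insert */ }
}

class HotelPojo { /* fields parsed from one XML file */ }
```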

1 Comment

Nice suggestions. Fortunately, the XML files are split equally into 100 folders.

SAX always beats DOM on speed, but since you say the XML files are small, you may proceed with a DOM parser. One thing you can do to speed things up is create a thread pool and do the database operations in it. Multithreaded updates will significantly improve the performance.
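If you stay with DOM in a thread pool, note that DocumentBuilder is not thread-safe; a minimal sketch that gives each worker its own builder and reuses it across files:

```java
import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class DomWorker {
    // One builder per thread: DocumentBuilder instances are not thread-safe
    private static final ThreadLocal<DocumentBuilder> BUILDER =
        new ThreadLocal<DocumentBuilder>() {
            protected DocumentBuilder initialValue() {
                try {
                    return DocumentBuilderFactory.newInstance().newDocumentBuilder();
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            }
        };

    static Document parse(File f) throws Exception {
        DocumentBuilder b = BUILDER.get();
        b.reset();         // clear state left over from the previous file
        return b.parse(f); // a ~3-4k tree is fine if the reference is dropped after use
    }
}
```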


2 Comments

SAX also has a better memory footprint.
I'm accepting this as the answer because I also believe a good thread pool could rip through these much faster.

Go with SAX, or if you want, StAX. Forget about DOM. Use an efficient library like Aalto.
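A minimal StAX cursor loop, for illustration; Aalto implements the standard javax.xml.stream API, so XMLInputFactory.newInstance() will pick it up when its jar is on the classpath (HotelID is from the question's sample, the file name is made up):

```java
import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxSketch {
    public static void main(String[] args) throws Exception {
        // Returns Aalto's implementation if its jar is on the classpath
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader =
            factory.createXMLStreamReader(new FileInputStream("hotel.xml"));

        // Cursor-style pull parsing: only the current event is held in memory
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && "HotelID".equals(reader.getLocalName())) {
                System.out.println(reader.getElementText());
            }
        }
        reader.close();
    }
}
```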

I am sure that parsing will be quite cheap compared to making the database requests.

But 200k is not such a big number if you only need to do this once.

4 Comments

Thanks for the tip. I will actually need to run this nightly at some point. But weekly or even monthly in the beginning.
In that case, you can also consider transforming your data into either a more efficient storage format or a more efficient database import format. See github.com/eishay/jvm-serializers/wiki. Although I recommend doing incremental updates as the data comes in, rather than nightly jobs.
Thanks. But the data comes in a format that I can't control. It is dumped to these 200k XML files every night. I am simply parsing them and storing them in a MongoDB to be searchable.
You might want to check out how performance goes if you convert to a file import format the DB understands (CSV, etc.). And go with Aalto and the multithreading suggestion as well.

SAX will be faster than DOM, and that could well matter when you have 200,000 files to parse.



StAX is faster than SAX, and both are much faster than DOM. If performance is super critical, you can also think about building a special compiler to parse the XML files. But usually lexing and parsing is not that much of an issue with StAX; the "after-processing" is.

