
I downloaded the Stack Overflow posts dump file to do my work. When I unpacked the .7z file, the .xml dump file exceeded 65 GB.

I want to parse the .xml file, because it contains a lot of content I don't need, and then store the useful content in a MySQL database (Java or Python are both fine).

But the file is too large for me to handle: it overflows my memory (8 GB).

What can I do to solve this tricky problem?

  • You can use a SAX parser to perform (certain) queries; this will handle the file as a stream, and thus not parse the (full) file into memory. Commented Sep 29, 2018 at 10:42

1 Answer


There are essentially two kinds of XML parsers, DOM parsers and SAX parsers.

DOM parsers parse the whole XML into a DOM (a representation of the XML document in memory), which is easy to use and manipulate, but the entire document must be loaded into memory.

SAX parsers are stream parsers: they read the XML file and emit events at the starts and ends of XML elements, so the file is never loaded into memory all at once. This makes handling the XML more complicated in most cases, but it lets you work on files that do not fit into memory.
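In Python, for example, the standard-library `xml.sax` module drives such an event-based handler. A minimal sketch (the `<row>` element name matches the Stack Overflow dump format, but this runs on a tiny inline snippet rather than the real 65 GB file):

```python
import xml.sax

class RowHandler(xml.sax.ContentHandler):
    """Counts <row> elements; Stack Overflow dumps store one post per <row>."""
    def __init__(self):
        super().__init__()
        self.rows = 0

    def startElement(self, name, attrs):
        # Called once per opening tag; only the current element is in memory.
        if name == "row":
            self.rows += 1
            # Attributes like attrs.get("Id") or attrs.get("Body") are
            # available here — this is where you would filter and write
            # the fields you need to the database.

handler = RowHandler()
xml.sax.parseString(b'<posts><row Id="1"/><row Id="2"/></posts>', handler)
print(handler.rows)  # 2
```

For the real dump you would call `xml.sax.parse("Posts.xml", handler)` instead, which streams the file from disk.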

So pick whichever language you like more and use a SAX parser. Python has one built in; I'm not sure about Java (I have not worked with it for years), but there are probably plenty of options.
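If the callback style of SAX feels awkward, Python's standard library also offers `xml.etree.ElementTree.iterparse`, which streams the file but lets you handle each element with ordinary loop code; calling `clear()` after each element keeps memory bounded. A sketch on an inline snippet (again assuming the dump's `<row>` layout):

```python
import io
import xml.etree.ElementTree as ET

# Stand-in for open("Posts.xml", "rb") on the real dump.
xml_data = io.BytesIO(b'<posts><row Id="1" Score="5"/><row Id="2" Score="3"/></posts>')

ids = []
for event, elem in ET.iterparse(xml_data, events=("end",)):
    if elem.tag == "row":
        ids.append(elem.get("Id"))  # pull out just the attributes you need
        elem.clear()  # free the finished element so memory stays bounded

print(ids)  # ['1', '2']
```

Inside that loop you could batch the extracted fields and insert them into MySQL every few thousand rows, so neither the XML nor the pending inserts accumulate in memory.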

