
I downloaded the Stack Overflow posts dump file to do my work. When I unpacked the .7z file, the .xml dump file exceeded 65 GB.

I want to parse the .xml file, because it contains a lot of content I don't need, and then store the useful content in a MySQL database (Java or Python are both fine).

But the file is too large for me to handle: it overflows my memory (8 GB).

What can I do to solve this tricky problem?

  • You can use a SAX parser to perform (certain) queries; this will handle the file as a stream, and thus not parse the (full) file into memory. Commented Sep 29, 2018 at 10:42

1 Answer


There are essentially two kinds of XML parsers, DOM parsers and SAX parsers.

DOM parsers parse the whole XML into a DOM (a representation of the XML document in memory), which is easy to use and manipulate, but the entire document must be loaded into memory.

SAX parsers are stream parsers: they read the XML file and emit events at the starts and ends of XML elements, so the file is never loaded into memory all at once. This makes handling the XML more complicated in most cases, but it lets you work on files that do not fit into memory.
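In Python, for example, the standard-library `xml.sax` module drives such an event-based handler. A minimal sketch (the `<row>` element name matches the Stack Overflow dump format, but this runs on a tiny inline snippet rather than the real 65 GB file):

```python
import xml.sax

class RowHandler(xml.sax.ContentHandler):
    """Counts <row> elements; Stack Overflow dumps store one post per <row>."""
    def __init__(self):
        super().__init__()
        self.rows = 0

    def startElement(self, name, attrs):
        # Called once per opening tag; only the current element is in memory.
        if name == "row":
            self.rows += 1
            # Attributes like attrs.get("Id") or attrs.get("Body") are
            # available here — this is where you would filter and write
            # the fields you need to the database.

handler = RowHandler()
xml.sax.parseString(b'<posts><row Id="1"/><row Id="2"/></posts>', handler)
print(handler.rows)  # 2
```

For the real dump you would call `xml.sax.parse("Posts.xml", handler)` instead, which streams the file from disk.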

So pick whichever language you like more and use a SAX parser. Python has one built in; I'm not sure about Java (I have not worked with it for years), but there are probably plenty of options.
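If the callback style of SAX feels awkward, Python's standard library also offers `xml.etree.ElementTree.iterparse`, which streams the file but lets you handle each element with ordinary loop code; calling `clear()` after each element keeps memory bounded. A sketch on an inline snippet (again assuming the dump's `<row>` layout):

```python
import io
import xml.etree.ElementTree as ET

# Stand-in for open("Posts.xml", "rb") on the real dump.
xml_data = io.BytesIO(b'<posts><row Id="1" Score="5"/><row Id="2" Score="3"/></posts>')

ids = []
for event, elem in ET.iterparse(xml_data, events=("end",)):
    if elem.tag == "row":
        ids.append(elem.get("Id"))  # pull out just the attributes you need
        elem.clear()  # free the finished element so memory stays bounded

print(ids)  # ['1', '2']
```

Inside that loop you could batch the extracted fields and insert them into MySQL every few thousand rows, so neither the XML nor the pending inserts accumulate in memory.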

