1

I need to convert a very large XML file (30 GB) to CSV, and I have only 15 GB of RAM available. I have looked at some alternatives, for instance xmltodict, which has a streaming option, but it still builds a dictionary that I cannot hold in memory in order to save it as CSV.

What tools are available (preferably in Python) to convert such large XML files to CSV when the file cannot be loaded or processed in memory because of RAM limits?

3 Answers

3

One possibility is a streaming XSLT 3.0 processor, which given your constraints means in practice Saxon/C Enterprise Edition (this has a Python language binding).

There is actually a CSV-to-XML stylesheet published as a worked example in the XSLT 3.0 specification, but sadly no counterpart to do the reverse. However, you can see the principle in some of the answers here:

https://stackoverflow.com/questions/365312/xml-to-csv-using-xslt

or here:

https://stackoverflow.com/questions/15226194/xml-to-csv-using-xslt

To make the code streamable, the key constraint is that any template rule or for-each instruction that processes a particular element can only make one traversal of the element's children. That means you can't, for example, do one pass of the source XML to discover the field names and then another pass to process the values.

Note: Saxon-EE is a commercial product and I have a commercial interest in it.

1

The XML Utilities library (xmlutils.py) is worth a try, assuming a valid and flat XML structure. It even comes with a command-line xml2csv utility.

It specifically states:

xmlutils.py is a set of Python utilities for processing xml files serially for converting them to various formats (SQL, CSV, JSON). The scripts use ElementTree.iterparse() to iterate through nodes in an XML document, thus not needing to load the entire DOM into memory. The scripts can be used to churn through large XML files (albeit taking long :P) without memory hiccups.
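The iterparse() approach described in that quote can also be used directly from the standard library. What follows is a minimal sketch, not the xmlutils implementation: it assumes the records are direct children of the root element and that each record holds one flat child element per field. The function name `xml_to_csv` and its parameters `record_tag` and `fields` are hypothetical and need adapting to the real schema.

```python
import csv
import xml.etree.ElementTree as ET


def xml_to_csv(xml_source, csv_target, record_tag, fields):
    """Stream flat <record_tag> elements from xml_source into CSV rows.

    Assumptions (hypothetical, adapt to the real file): records are direct
    children of the root, and each field is a child element of the record.
    Returns the number of data rows written.
    """
    writer = csv.writer(csv_target)
    writer.writerow(fields)  # header row; field names must be known up front

    # Request "start" events as well, so the first event yields the root.
    context = ET.iterparse(xml_source, events=("start", "end"))
    _, root = next(context)

    rows = 0
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            # findtext returns the default when the child element is absent.
            writer.writerow([elem.findtext(name, default="") for name in fields])
            rows += 1
            # Detach finished records from the root so memory stays bounded.
            root.clear()
    return rows
```

To use it, open the XML file in binary mode and the CSV file in text mode with `newline=""`, then call for example `xml_to_csv(src, dst, "record", ["id", "name"])`. Because the field names are passed in rather than discovered from the data, the file is read only once.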

0

To convert a large 30 GB XML file into CSV, you might want to consider Sonra Flexter. Flexter is designed to handle massive XML datasets and convert them into a more manageable CSV format, which can simplify your data processing when dealing with very large XML files. For more information on how Flexter can assist with your specific needs, visit Sonra Flexter.

1
  • Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center. Commented Jul 7, 2024 at 19:35
