6

I have an XML file of size 4 GB. I want to parse it and convert it to a Data Frame to work on it. But because the file size is too large the following code is unable to convert the file to a Pandas Data Frame. The code just keeps loading and does not provide any output. But when I use it for a similar file of smaller size I obtain the correct output.

Can anyone suggest any solution to this. Maybe a code that speeds up the process of conversion from XML to Data Frame or splitting of the XML file into smaller sub sets.

Any suggestion whether I should work with such large XML files on my personal system (2 GB RAM) or I should use Google Colab. If Google Colab, then is there any way to upload such large files quicker to drive and thus to Colab?

Following is the code I had used:

import xml.etree.ElementTree as ET
tree = ET.parse("Badges.xml")
root = tree.getroot()

#Column names for DataFrame
columns = ['row Id',"UserId",'Name','Date','Class','TagBased']

#Creating DataFrame
df = pd.DataFrame(columns = columns)

#Converting XML Tree to a Pandas DataFrame

for node in root: 
    
    row_Id = node.attrib.get("Id")
    UserId = node.attrib.get("UserId")
    Name = node.attrib.get("Name")
    Date = node.attrib.get("Date")
    Class = node.attrib.get("Class")
    TagBased = node.attrib.get("TagBased")
    
    df = df.append(pd.Series([row_Id,UserId,Name,Date,Class,TagBased], index = columns), ignore_index = True)

Following is my XML File:

<badges>
  <row Id="82946" UserId="3718" Name="Teacher" Date="2008-09-15T08:55:03.923" Class="3" TagBased="False" />
  <row Id="82947" UserId="994" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
  <row Id="82949" UserId="3893" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
  <row Id="82950" UserId="4591" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
  <row Id="82951" UserId="5196" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
  <row Id="82952" UserId="2635" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
  <row Id="82953" UserId="1113" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
5
  • An other approach instead of parsing the whole XML as a whole, is to first create chunks of say 250MB large, and parse them in parallel. Or course, this only works if the XML is a long list-like structure of say transactions, people, or items where you know what to expect. You could have a reader and jump to around 250MB in the file and find the right cutting point, and re-jump again etc... Commented Jun 25, 2020 at 15:33
  • One serious performance issue you experience: Never call DataFrame.append or pd.concat inside a for-loop. It leads to quadratic copying. Commented Jun 25, 2020 at 15:36
  • @user3184950 Can you please write the code for converting the XML file into small chunks. I tried it but I am facing problem. Commented Jun 25, 2020 at 16:12
  • @Parfait can you please suggest an alternative code to what you mentioned. Commented Jun 25, 2020 at 16:13
  • Splitting the large files largely depends on the structure of your XML. A topic very searchable and find example code on, like here on stack Commented Jun 30, 2020 at 13:35

2 Answers 2

6

Consider iterparse for fast streaming processing that builds tree incrementally. In each iteration build a list of dictionaries that you can then pass into pandas.DataFrame constructor once outside loop. Adjust below to name of repeating nodes of root's children:

from xml.etree.ElementTree import iterparse
#from cElementTree import iterparse
import pandas as pd

file_path = r"/path/to/Input.xml"
dict_list = []

for _, elem in iterparse(file_path, events=("end",)):
    if elem.tag == "row":
        dict_list.append({'rowId': elem.attrib['Id'],
                          'UserId': elem.attrib['UserId'],
                          'Name': elem.attrib['Name'],
                          'Date': elem.attrib['Date'],
                          'Class': elem.attrib['Class'],
                          'TagBased': elem.attrib['TagBased']})

        # dict_list.append(elem.attrib)      # ALTERNATIVELY, PARSE ALL ATTRIBUTES

        elem.clear()

df = pd.DataFrame(dict_list)
Sign up to request clarification or add additional context in comments.

5 Comments

I used the code given by you, it took 40 mins to load but I have got many error which are mentioned below.
I have added the XML file, please have a look.
The code which I had written was giving no output because it was taking too long time to append everytime in the DataFrame, using your approach to append in the dictionary not only gave me an output but it took fairly less amount of time.
I used the same code for an almost same XML file but am getting a key error which should not happen. Please have a look to the question if possible. Link: stackoverflow.com/questions/62660270/…
If you are only interested in looking at 'row' tags then you can specify that in the call to interparse using for _, elem in iterparse(file_path, events=('end',), tag='row'): Then you can get rid of your if elem.tag test.
4

Consider using cElementTree instead of ElementTree

https://effbot.org/zone/celementtree.htm

The cElementTree module is a C implementation of the ElementTree API, optimized for fast parsing and low memory use. On typical documents, cElementTree is 15-20 times faster than the Python version of ElementTree, and uses 2-5 times less memory.

The cElementTree module is designed to replace the ElementTree module from the standard elementtree package. In theory, you should be able to simply change:

from elementtree import ElementTree

to

import cElementTree as ElementTree

1 Comment

I tried with the cElementTree, but I am facing the same problem. The code is stuck and the output is not being produced. It just keeps loading. Can you suggest some other version of the code or a code to convert my xml file into smaller files?

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.