Large XML File Parsing in Python

Question

I have an XML file of size 4 GB. I want to parse it and convert it to a Data Frame to work on it. But because the file size is too large the following code is unable to convert the file to a Pandas Data Frame. The code just keeps loading and does not provide any output. But when I use it for a similar file of smaller size I obtain the correct output.

Can anyone suggest any solution to this. Maybe a code that speeds up the process of conversion from XML to Data Frame or splitting of the XML file into smaller sub sets.

Any suggestion whether I should work with such large XML files on my personal system (2 GB RAM) or I should use Google Colab. If Google Colab, then is there any way to upload such large files quicker to drive and thus to Colab?

Following is the code I had used:

import xml.etree.ElementTree as ET
tree = ET.parse("Badges.xml")
root = tree.getroot()

#Column names for DataFrame
columns = ['row Id',"UserId",'Name','Date','Class','TagBased']

#Creating DataFrame
df = pd.DataFrame(columns = columns)

#Converting XML Tree to a Pandas DataFrame

for node in root: 
    
    row_Id = node.attrib.get("Id")
    UserId = node.attrib.get("UserId")
    Name = node.attrib.get("Name")
    Date = node.attrib.get("Date")
    Class = node.attrib.get("Class")
    TagBased = node.attrib.get("TagBased")
    
    df = df.append(pd.Series([row_Id,UserId,Name,Date,Class,TagBased], index = columns), ignore_index = True)

Following is my XML File:

<badges>
  <row Id="82946" UserId="3718" Name="Teacher" Date="2008-09-15T08:55:03.923" Class="3" TagBased="False" />
  <row Id="82947" UserId="994" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
  <row Id="82949" UserId="3893" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
  <row Id="82950" UserId="4591" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
  <row Id="82951" UserId="5196" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
  <row Id="82952" UserId="2635" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />
  <row Id="82953" UserId="1113" Name="Teacher" Date="2008-09-15T08:55:03.957" Class="3" TagBased="False" />

An other approach instead of parsing the whole XML as a whole, is to first create chunks of say 250MB large, and parse them in parallel. Or course, this only works if the XML is a long list-like structure of say transactions, people, or items where you know what to expect. You could have a reader and jump to around 250MB in the file and find the right cutting point, and re-jump again etc... — Willem Hendriks
– Willem Hendriks, Commented Jun 25, 2020 at 15:33
One serious performance issue you experience: Never call DataFrame.append or pd.concat inside a for-loop. It leads to quadratic copying. — Parfait
– Parfait, Commented Jun 25, 2020 at 15:36
@user3184950 Can you please write the code for converting the XML file into small chunks. I tried it but I am facing problem. — Ishan Dutta
– Ishan Dutta, Commented Jun 25, 2020 at 16:12
@Parfait can you please suggest an alternative code to what you mentioned. — Ishan Dutta
– Ishan Dutta, Commented Jun 25, 2020 at 16:13
Splitting the large files largely depends on the structure of your XML. A topic very searchable and find example code on, like here on stack — Willem Hendriks
– Willem Hendriks, Commented Jun 30, 2020 at 13:35

Julien Palard · Accepted Answer · 2022-02-14 13:16:36Z

6

Consider iterparse for fast streaming processing that builds tree incrementally. In each iteration build a list of dictionaries that you can then pass into pandas.DataFrame constructor once outside loop. Adjust below to name of repeating nodes of root's children:

from xml.etree.ElementTree import iterparse
#from cElementTree import iterparse
import pandas as pd

file_path = r"/path/to/Input.xml"
dict_list = []

for _, elem in iterparse(file_path, events=("end",)):
    if elem.tag == "row":
        dict_list.append({'rowId': elem.attrib['Id'],
                          'UserId': elem.attrib['UserId'],
                          'Name': elem.attrib['Name'],
                          'Date': elem.attrib['Date'],
                          'Class': elem.attrib['Class'],
                          'TagBased': elem.attrib['TagBased']})

        # dict_list.append(elem.attrib)      # ALTERNATIVELY, PARSE ALL ATTRIBUTES

        elem.clear()

df = pd.DataFrame(dict_list)

edited Feb 14, 2022 at 13:16

Julien Palard

12.1k3 gold badges42 silver badges48 bronze badges

answered Jun 25, 2020 at 16:19

Parfait

108k19 gold badges103 silver badges138 bronze badges

Sign up to request clarification or add additional context in comments.

5 Comments

Ishan Dutta Over a year ago

I used the code given by you, it took 40 mins to load but I have got many error which are mentioned below.

Ishan Dutta Over a year ago

I have added the XML file, please have a look.

Ishan Dutta Over a year ago

The code which I had written was giving no output because it was taking too long time to append everytime in the DataFrame, using your approach to append in the dictionary not only gave me an output but it took fairly less amount of time.

Ishan Dutta Over a year ago

I used the same code for an almost same XML file but am getting a key error which should not happen. Please have a look to the question if possible. Link: stackoverflow.com/questions/62660270/…

Tom Johnson Over a year ago

If you are only interested in looking at 'row' tags then you can specify that in the call to interparse using for _, elem in iterparse(file_path, events=('end',), tag='row'): Then you can get rid of your if elem.tag test.

Mads Hansen · Accepted Answer · 2020-06-25 15:24:04Z

4

Consider using cElementTree instead of ElementTree

https://effbot.org/zone/celementtree.htm

The cElementTree module is a C implementation of the ElementTree API, optimized for fast parsing and low memory use. On typical documents, cElementTree is 15-20 times faster than the Python version of ElementTree, and uses 2-5 times less memory.

The cElementTree module is designed to replace the ElementTree module from the standard elementtree package. In theory, you should be able to simply change:

from elementtree import ElementTree

to

import cElementTree as ElementTree

answered Jun 25, 2020 at 15:24

Mads Hansen

67.6k12 gold badges119 silver badges154 bronze badges

1 Comment

Ishan Dutta Over a year ago

I tried with the cElementTree, but I am facing the same problem. The code is stuck and the output is not being produced. It just keeps loading. Can you suggest some other version of the code or a code to convert my xml file into smaller files?

Collectives™ on Stack Overflow

Large XML File Parsing in Python

2 Answers 2

5 Comments

1 Comment

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

5 Comments

1 Comment

Your Answer

Sign up or log in

Post as a guest

Linked

Related