Prune some elements from large xml file [closed]

Question

Closed 9 years ago.

I have an XML file of more than 1 GB and I want to reduce its size by removing unwanted children of a parent tag, either by creating a new XML file or by rewriting the existing one. How can this be done in Python? Because the file is so large, a simple tree = ElementTree.parse(xmlfile) won't work.

XML file

In the file, for every parent tag TasksReportNode I want to keep only the child TableRow whose RowCount attribute has the value 0 and reject all other TableRow children of that parent.

Sample XML code:

<TasksReportNode Name="Task15">
    <TableData NumRows="97" NumColumns="15">
        <TableRow RowCount="0">
            <TableColumn Name="Task"><![CDATA[   Task15 [GET - /PULSEV31/appView/projectFeedHidden.jsp - 200]]]></TableColumn>
            <TableColumn Name="Status"><![CDATA[Success]]></TableColumn>
            <TableColumn Name="Successful"><![CDATA[96]]></TableColumn>
            <TableColumn Name="Failed"><![CDATA[0]]></TableColumn>
            <TableColumn Name="Timedout"><![CDATA[0]]></TableColumn>
            <TableColumn Name="Total"><![CDATA[96]]></TableColumn>
            <TableColumn Name="Min(ms)"><![CDATA[15]]></TableColumn>
            <TableColumn Name="Avg(ms)"><![CDATA[24.20]]></TableColumn>
            <TableColumn Name="Avg-90%(ms)"><![CDATA[54.55]]></TableColumn>
            <TableColumn Name="90%ile(ms)"><![CDATA[89.98]]></TableColumn>
            <TableColumn Name="95%ile(ms)"><![CDATA[95.24]]></TableColumn>
            <TableColumn Name="99%ile(ms)"><![CDATA[99.45]]></TableColumn>
            <TableColumn Name="Max(ms)"><![CDATA[94]]></TableColumn>
            <TableColumn Name="Std. Dev."><![CDATA[15.74]]></TableColumn>
            <TableColumn Name="Bytes Recd(KB)"><![CDATA[192]]></TableColumn>
        </TableRow>
        <TableRow RowCount="1">
            <TableColumn Name="Task"><![CDATA[      VirtualUser1]]></TableColumn>
            <TableColumn Name="Status"><![CDATA[Success]]></TableColumn>
            <TableColumn Name="Successful"><![CDATA[1]]></TableColumn>
            <TableColumn Name="Failed"><![CDATA[0]]></TableColumn>
            <TableColumn Name="Timedout"><![CDATA[0]]></TableColumn>
            <TableColumn Name="Total"><![CDATA[1]]></TableColumn>
            <TableColumn Name="Min(ms)"><![CDATA[934]]></TableColumn>
            <TableColumn Name="Avg(ms)"><![CDATA[934.00]]></TableColumn>
            <TableColumn Name="Avg-90%(ms)"><![CDATA[950.00]]></TableColumn>
            <TableColumn Name="90%ile(ms)"><![CDATA[1,000.50]]></TableColumn>
            <TableColumn Name="95%ile(ms)"><![CDATA[1,000.50]]></TableColumn>
            <TableColumn Name="99%ile(ms)"><![CDATA[1,000.50]]></TableColumn>
            <TableColumn Name="Max(ms)"><![CDATA[934]]></TableColumn>
            <TableColumn Name="Std. Dev."><![CDATA[0.00]]></TableColumn>
            <TableColumn Name="Bytes Recd(KB)"><![CDATA[0]]></TableColumn>
        </TableRow>
    </TableData>
    <TableData NumRows="1" NumColumns="2">
        <TableRow RowCount="0">
            <TableColumn Name="Response Time Interval (ms)"><![CDATA[0 - 99]]></TableColumn>
            <TableColumn Name="Frequency"><![CDATA[96]]></TableColumn>
        </TableRow>
    </TableData>
</TasksReportNode>
<TasksReportNode Name="Task16">
    <TableData NumRows="97" NumColumns="15">
        <TableRow RowCount="0">
            <TableColumn Name="Task"><![CDATA[   Task16 [GET - /PULSEV31/appView/projectCommentHidden.jsp - 200]]]></TableColumn>
            <TableColumn Name="Status"><![CDATA[Success]]></TableColumn>
            <TableColumn Name="Successful"><![CDATA[96]]></TableColumn>
            <TableColumn Name="Failed"><![CDATA[0]]></TableColumn>
            <TableColumn Name="Timedout"><![CDATA[0]]></TableColumn>
            <TableColumn Name="Total"><![CDATA[96]]></TableColumn>
            <TableColumn Name="Min(ms)"><![CDATA[15]]></TableColumn>
            <TableColumn Name="Avg(ms)"><![CDATA[22.73]]></TableColumn>
            <TableColumn Name="Avg-90%(ms)"><![CDATA[54.55]]></TableColumn>
            <TableColumn Name="90%ile(ms)"><![CDATA[90.93]]></TableColumn>
            <TableColumn Name="95%ile(ms)"><![CDATA[96.25]]></TableColumn>
            <TableColumn Name="99%ile(ms)"><![CDATA[100.50]]></TableColumn>
            <TableColumn Name="Max(ms)"><![CDATA[109]]></TableColumn>
            <TableColumn Name="Std. Dev."><![CDATA[14.76]]></TableColumn>
            <TableColumn Name="Bytes Recd(KB)"><![CDATA[192]]></TableColumn>
        </TableRow>
    </TableData>
</TasksReportNode>

Here is what I have tried:

# assumed import (not shown in the original post); lxml.etree offers the same iterparse call
import xml.etree.ElementTree as etree

xmL = 'F:\\Reports\\Logs\\Result_TG1_V16.xml'

context = etree.iterparse(xmL, events=("start", "end"))
for event, element in context:
    if element.tag == 'TasksReportNode':
        for child1 in element:          # TableData
            for child2 in child1:       # TableRow
                if child2.get("RowCount") == "0":
                    for child3 in child2:   # TableColumn
                        print(child3.tag, child3.text)
    element.clear()  # discard the element
del context

Now we have all the TableRow elements with RowCount value '0'; these can be added to the parent, leaving out all other siblings.

Wondering why this question is being downvoted without any comments explaining the reasoning. Is it due to the Google Drive link? Interested to know.
– Tejas Pendse, Commented Mar 17, 2016 at 11:20
@TejasPendse Is anything wrong with the question? Should I remove the Drive link?
– siddhu619, Commented Mar 17, 2016 at 12:09
@siddhu619 To me your question seems clear, and since you have provided a short XML example in the question, the link to the larger file on Google Drive is fine. Keep it as it is.
– Jan Vlcinsky, Commented Mar 17, 2016 at 14:38

Jan Vlcinsky · Accepted Answer · 2016-03-17 17:13:47Z

I would recommend using lxml, as it is in most regards more efficient than the stdlib xml.etree.ElementTree.

Do not attempt to parse the whole document at once, as it is too large; process the source document iteratively instead.

The lxml documentation has a section on event-driven parsing.

There are two options:

etree.iterparse
using a custom parser target that receives SAX-like events

I personally prefer etree.iterparse, as it hands you the parsed elements in a much more convenient way. But you must not forget to clean up the parts you have already processed, otherwise you will not save any memory compared to parsing the whole document at once.
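For comparison, the second option could look roughly like the sketch below. It is only an illustration, not code from this answer: it uses the standard library's xml.etree.ElementTree so that it runs without lxml installed (lxml's XMLParser accepts a target object with the same start/end/data/close methods), and the RowCounter class and the inline sample document are made up for the example.

```python
import xml.etree.ElementTree as ET


class RowCounter:
    """Hypothetical parser target: gets SAX-like callbacks, builds no tree."""

    def __init__(self):
        self.kept = 0
        self.dropped = 0

    def start(self, tag, attrib):
        # Called for every opening tag, with its attributes as a dict.
        if tag == "TableRow":
            if attrib.get("RowCount") == "0":
                self.kept += 1
            else:
                self.dropped += 1

    def end(self, tag):
        pass  # closing tags are not needed for counting

    def data(self, text):
        pass  # character data is ignored here

    def close(self):
        # Whatever is returned here becomes the result of parser.close().
        return self.kept, self.dropped


parser = ET.XMLParser(target=RowCounter())
parser.feed(
    b'<TableData><TableRow RowCount="0"/>'
    b'<TableRow RowCount="1"/><TableRow RowCount="2"/></TableData>'
)
kept, dropped = parser.close()
print(kept, dropped)
```

Because no tree is built, memory use stays flat, but you have to track context (which element you are inside) yourself, which is why iterparse tends to be more convenient for pruning.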

EDIT: added real example

An example speaks louder than tons of theory. Here is my attempt:

from lxml import etree

# fname = "large.xml"  # 78 MB
fname = "verylarge.xml"  # 773 MB

toremove = []

for event, element in etree.iterparse(fname):
    if element.tag == "TableRow":
        if element.attrib["RowCount"] != "0":
            element.clear()
            # removing current element causes segmentation fault
            # element.getparent().remove(element)
            toremove.append(element)
    if element.tag == "TableData":
        for rowelm in toremove:
            element.remove(rowelm)
        toremove = []

# the last processed element is the root
with open("out.xml", "wb") as f:  # etree.tostring() returns bytes
    f.write(etree.tostring(element))

To test the performance, I took your large sample file (73 MB), repeated its inner part 10 times to get a 773 MB XML file, and processed that.

The processing took 24 seconds (Zenbook, Core i7 with 4 GB RAM) and the resulting file was 4.7 MB.

Example explained

By default, iterparse provides only "end" events, each fired when an element has been completely parsed.
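The difference between the default and an explicit events tuple can be seen on a tiny in-memory document. This sketch uses the standard library's xml.etree.ElementTree so that it runs without lxml; the inline document is made up for the illustration.

```python
import io
import xml.etree.ElementTree as ET

doc = b'<TableData><TableRow RowCount="0"/><TableRow RowCount="1"/></TableData>'

# Default: only "end" events, children before their parent.
default_events = [
    (event, elem.tag) for event, elem in ET.iterparse(io.BytesIO(doc))
]

# Explicitly asking for both: "start" fires before any children are parsed.
both_events = [
    (event, elem.tag)
    for event, elem in ET.iterparse(io.BytesIO(doc), events=("start", "end"))
]

print(default_events)
print(both_events)
```

With the default, the parent TableData is reported last, which is what lets the pruning code act on a parent only after all of its rows have been seen.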

This solution relies on the fact that, even with iterparse, the elements are kept in memory. This is used in the following places:

during iterparse, elements that are not needed are cleared (element.clear()) and removed (element.remove(rowelm)). clear() empties the inner content of the element, but the element itself still exists. remove() is called on the parent element and removes the given child from it.
elements which are to be kept are neither cleared nor removed, so at the end we find them still present in the root element.
finally, when everything is processed, the last processed element is the root. It is still in memory, so I can write it to a file as a string.
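The difference between clear() and remove() described above can be checked on a small tree. This is a minimal sketch using the standard library's xml.etree.ElementTree (both methods exist there as well as in lxml); the sample data is invented for the illustration.

```python
import xml.etree.ElementTree as ET

table = ET.fromstring(
    '<TableData>'
    '<TableRow RowCount="0"><TableColumn>keep</TableColumn></TableRow>'
    '<TableRow RowCount="1"><TableColumn>drop</TableColumn></TableRow>'
    '</TableData>'
)
unwanted = table[1]

# clear() empties the row (children, text, attributes) but leaves it in place.
unwanted.clear()
rows_after_clear = len(table)
children_after_clear = len(unwanted)
attrib_after_clear = dict(unwanted.attrib)

# remove() is called on the parent and detaches the row itself.
table.remove(unwanted)
rows_after_remove = len(table)

print(rows_after_clear, children_after_clear, attrib_after_clear, rows_after_remove)
```

Note that clear() also wipes the attributes, so any check on RowCount has to happen before the element is cleared.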

One has to be careful about when to remove() an element. Trying to remove an element from its parent while it was still the currently iterated element caused a segmentation fault. For this reason the code postpones the remove() of a "TableRow" element until parsing of its parent TableData element is complete.

The variable toremove collects the "TableRow" elements to be dropped and is consumed as soon as the parent "TableData" element is completely parsed. Note that remove() works only on the element's real parent, so we must be sure to call it at the proper time.

Ideas for even larger files

For even larger files, this solution would be limited by the size of the resulting XML document, as it is kept in memory until pruning of the source XML is completed.

For such scenarios, we would have to write the output during parsing and discard every element that has already been processed. In practice, you would write out the opening tag (like <TaskReportSummary att="a" otheratt="bb">) when the "start" event appears, and write the closing tag (</TaskReportSummary>) at the "end" event.
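One way this idea could be sketched is shown below. It is an untested-at-scale illustration under several assumptions, not a finished tool: every TasksReportNode is assumed to be a direct child of the root, each TableData a direct child of its TasksReportNode, and nothing else under the root needs to be kept. It uses the standard library's xml.etree.ElementTree so it runs without lxml, and the function name prune_streaming and the sample document are made up. Instead of writing every single tag by hand, it writes only the root's opening and closing tags manually and serializes each pruned TasksReportNode as soon as it is complete.

```python
import io
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr


def prune_streaming(source, sink):
    """Copy XML from source to sink (binary file objects), keeping only
    TableRow elements with RowCount="0" inside each TasksReportNode."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the root's "start"

    # Write the root's opening tag by hand; attributes are known at "start".
    attrs = "".join(" %s=%s" % (k, quoteattr(v)) for k, v in root.attrib.items())
    sink.write(("<%s%s>" % (root.tag, attrs)).encode("utf-8"))

    for event, elem in context:
        if event == "end" and elem.tag == "TasksReportNode":
            for table in elem.findall("TableData"):
                for row in table.findall("TableRow"):  # findall returns a list
                    if row.get("RowCount") != "0":
                        table.remove(row)
            sink.write(ET.tostring(elem))  # bytes, without an XML declaration
            root.clear()  # detach the root's children parsed so far

    sink.write(("</%s>" % root.tag).encode("utf-8"))


sample = (
    b'<Report Name="demo"><TasksReportNode Name="Task15"><TableData>'
    b'<TableRow RowCount="0"><TableColumn Name="Task">summary</TableColumn></TableRow>'
    b'<TableRow RowCount="1"><TableColumn Name="Task">detail</TableColumn></TableRow>'
    b'</TableData></TasksReportNode></Report>'
)
out = io.BytesIO()
prune_streaming(io.BytesIO(sample), out)
print(out.getvalue().decode("utf-8"))
```

With this approach, memory use is bounded by the largest single TasksReportNode rather than by the whole output. CDATA sections come out as ordinary escaped text, which is equivalent for any XML consumer.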

siddhu619 · Accepted Answer · 2016-03-17 14:05:21Z

-1

Here is what I have tried:

# assumed import (not shown in the original post); lxml.etree offers the same iterparse call
import xml.etree.ElementTree as etree

xmL = 'F:\\Reports\\Logs\\Result_TG1_V16.xml'

context = etree.iterparse(xmL, events=("start", "end"))
for event, element in context:
    if element.tag == 'TasksReportNode':
        for child1 in element:          # TableData
            for child2 in child1:       # TableRow
                if child2.get("RowCount") == "0":
                    for child3 in child2:   # TableColumn
                        print(child3.tag, child3.text)
    element.clear()  # discard the element
del context

Now we have all the TableRow elements with RowCount value '0'; these can be added to the parent, leaving out all other siblings.

edited Mar 17, 2016 at 14:05

answered Mar 17, 2016 at 13:58

4 Comments

Jan Vlcinsky Over a year ago

Your iterparse registers both "start" and "end" events. For this reason you probably get an empty list at for child1 in element, as the children are not parsed yet at that moment. Note that your attempts are better added to your question (unless you arrive at a real working solution that can serve as an answer). People trying to answer your question can then see your effort more easily.

Tejas Pendse Over a year ago

If this isn't a solution you're happy with, you should add it to the question instead of posting it as an answer.

siddhu619 Over a year ago

@TejasPendse This solved my problem.

siddhu619 Over a year ago

@JanVlcinsky I have executed the above code and I am not getting an empty list at for child1 in element.
