I have an XML file of more than 1 GB and I want to reduce its size by removing unwanted children of a parent tag, either by creating a new XML file or by rewriting the existing one. How can this be done in Python? Because the file is so large, a simple tree = ElementTree.parse(xmlfile) won't work.
For every parent tag TasksReportNode in the file, I want to keep only the TableRow elements whose RowCount attribute has the value 0 and discard all other TableRow elements under that parent.
Sample XML code:
<TasksReportNode Name="Task15">
<TableData NumRows="97" NumColumns="15">
<TableRow RowCount="0">
<TableColumn Name="Task"><![CDATA[ Task15 [GET - /PULSEV31/appView/projectFeedHidden.jsp - 200]]]></TableColumn>
<TableColumn Name="Status"><![CDATA[Success]]></TableColumn>
<TableColumn Name="Successful"><![CDATA[96]]></TableColumn>
<TableColumn Name="Failed"><![CDATA[0]]></TableColumn>
<TableColumn Name="Timedout"><![CDATA[0]]></TableColumn>
<TableColumn Name="Total"><![CDATA[96]]></TableColumn>
<TableColumn Name="Min(ms)"><![CDATA[15]]></TableColumn>
<TableColumn Name="Avg(ms)"><![CDATA[24.20]]></TableColumn>
<TableColumn Name="Avg-90%(ms)"><![CDATA[54.55]]></TableColumn>
<TableColumn Name="90%ile(ms)"><![CDATA[89.98]]></TableColumn>
<TableColumn Name="95%ile(ms)"><![CDATA[95.24]]></TableColumn>
<TableColumn Name="99%ile(ms)"><![CDATA[99.45]]></TableColumn>
<TableColumn Name="Max(ms)"><![CDATA[94]]></TableColumn>
<TableColumn Name="Std. Dev."><![CDATA[15.74]]></TableColumn>
<TableColumn Name="Bytes Recd(KB)"><![CDATA[192]]></TableColumn>
</TableRow>
<TableRow RowCount="1">
<TableColumn Name="Task"><![CDATA[ VirtualUser1]]></TableColumn>
<TableColumn Name="Status"><![CDATA[Success]]></TableColumn>
<TableColumn Name="Successful"><![CDATA[1]]></TableColumn>
<TableColumn Name="Failed"><![CDATA[0]]></TableColumn>
<TableColumn Name="Timedout"><![CDATA[0]]></TableColumn>
<TableColumn Name="Total"><![CDATA[1]]></TableColumn>
<TableColumn Name="Min(ms)"><![CDATA[934]]></TableColumn>
<TableColumn Name="Avg(ms)"><![CDATA[934.00]]></TableColumn>
<TableColumn Name="Avg-90%(ms)"><![CDATA[950.00]]></TableColumn>
<TableColumn Name="90%ile(ms)"><![CDATA[1,000.50]]></TableColumn>
<TableColumn Name="95%ile(ms)"><![CDATA[1,000.50]]></TableColumn>
<TableColumn Name="99%ile(ms)"><![CDATA[1,000.50]]></TableColumn>
<TableColumn Name="Max(ms)"><![CDATA[934]]></TableColumn>
<TableColumn Name="Std. Dev."><![CDATA[0.00]]></TableColumn>
<TableColumn Name="Bytes Recd(KB)"><![CDATA[0]]></TableColumn>
</TableRow>
</TableData>
<TableData NumRows="1" NumColumns="2">
<TableRow RowCount="0">
<TableColumn Name="Response Time Interval (ms)"><![CDATA[0 - 99]]></TableColumn>
<TableColumn Name="Frequency"><![CDATA[96]]></TableColumn>
</TableRow>
</TableData>
</TasksReportNode>
<TasksReportNode Name="Task16">
<TableData NumRows="97" NumColumns="15">
<TableRow RowCount="0">
<TableColumn Name="Task"><![CDATA[ Task16 [GET - /PULSEV31/appView/projectCommentHidden.jsp - 200]]]></TableColumn>
<TableColumn Name="Status"><![CDATA[Success]]></TableColumn>
<TableColumn Name="Successful"><![CDATA[96]]></TableColumn>
<TableColumn Name="Failed"><![CDATA[0]]></TableColumn>
<TableColumn Name="Timedout"><![CDATA[0]]></TableColumn>
<TableColumn Name="Total"><![CDATA[96]]></TableColumn>
<TableColumn Name="Min(ms)"><![CDATA[15]]></TableColumn>
<TableColumn Name="Avg(ms)"><![CDATA[22.73]]></TableColumn>
<TableColumn Name="Avg-90%(ms)"><![CDATA[54.55]]></TableColumn>
<TableColumn Name="90%ile(ms)"><![CDATA[90.93]]></TableColumn>
<TableColumn Name="95%ile(ms)"><![CDATA[96.25]]></TableColumn>
<TableColumn Name="99%ile(ms)"><![CDATA[100.50]]></TableColumn>
<TableColumn Name="Max(ms)"><![CDATA[109]]></TableColumn>
<TableColumn Name="Std. Dev."><![CDATA[14.76]]></TableColumn>
<TableColumn Name="Bytes Recd(KB)"><![CDATA[192]]></TableColumn>
</TableRow>
</TableData>
</TasksReportNode>
Here is what I have tried:
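import xml.etree.ElementTree as etree  # lxml.etree offers the same iterparse call

xmL = 'F:\\Reports\\Logs\\Result_TG1_V16.xml'
# Listen only for "end" events so each element is complete when it is inspected
context = etree.iterparse(xmL, events=("end",))
for event, element in context:
    if element.tag == 'TasksReportNode':
        for child1 in element:              # TableData
            for child2 in child1:           # TableRow
                if child2.get("RowCount") == "0":
                    for child3 in child2:   # TableColumn
                        print(child3.tag, child3.text)
        element.clear()  # discard the element once it has been processed
del context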
This gives me every TableRow whose RowCount is '0'. What remains is to keep those rows under their parent and drop all of their sibling rows.
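One possible way to finish this is sketched below, as an untested-on-real-data illustration rather than a definitive solution. It streams the input with iterparse, removes every TableRow whose RowCount is not "0" as soon as its TableData closes, and writes each finished TasksReportNode straight to a new file. The function name keep_summary_rows and the wrapper element FilteredReport are hypothetical; the sample does not show the real root element, so anything outside TasksReportNode is assumed to be unneeded and is not copied.

```python
import xml.etree.ElementTree as ET


def keep_summary_rows(src, dst, wrapper="FilteredReport"):
    """Stream src and write each TasksReportNode to dst with only RowCount="0" rows.

    `wrapper` is a hypothetical root element name for the output file.
    """
    with open(dst, "wb") as out:
        out.write(f"<{wrapper}>\n".encode("utf-8"))
        # "end" events guarantee the element and its children are fully parsed
        for _, elem in ET.iterparse(src, events=("end",)):
            if elem.tag == "TableData":
                # Copy the child list first because we remove while iterating
                for row in list(elem):
                    if row.tag == "TableRow" and row.get("RowCount") != "0":
                        elem.remove(row)
            elif elem.tag == "TasksReportNode":
                elem.tail = "\n"  # one node per line in the output
                out.write(ET.tostring(elem, encoding="utf-8"))
                elem.clear()  # free the subtree; only an empty shell stays in memory
        out.write(f"</{wrapper}>\n".encode("utf-8"))


# Example call with hypothetical output path:
# keep_summary_rows('F:\\Reports\\Logs\\Result_TG1_V16.xml',
#                   'F:\\Reports\\Logs\\Result_TG1_V16_small.xml')
```

Note that ElementTree writes the column values as ordinary escaped text, so the CDATA wrappers are not preserved in the output even though the values themselves are.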