Convert Deeply Nested XML to CSV in Python [duplicate]

Question

I'm new to Python and have heard that it is one of the best ways to parse fairly large XML files (150MB). I can't get my head around how to iterate through the tags and extract only the <hw> and <defunit> tags as it's fairly deeply nested.

I have some XML formatted as below, and I need to extract the "hw" and "defunit" tags from it using Python and convert them into a .csv format.

<?xml version="1.0" encoding="UTF-8"?>
<dps-data xmlns="urn:DPS2-metadata" project="SCRABBLELARGE" guid="7d6b7164fde1e064:34368a61:14306b637ab:-8000--4a25ae5c-c104-4c7a-bba5-b434dd4d9314">
    <superentry xmlns="urn:COLL" xmlns:d="urn:COLL" xmlns:e="urn:IDMEE" e:id="u583c10bfdbd326ba.31865a51.12110e76de1.-336">
        <entry publevel="1" id="a000001" e:id="u583c10bfdbd326ba.31865a51.12110e76de1.-335">
            <hwblk>
                <hwgrp>
                    <hwunit>
                        <hw>aa</hw>
                        <ulsrc>edsh</ulsrc>
                    </hwunit>
                </hwgrp>
            </hwblk>
            <datablk>
                <gramcat publevel="1" id="a000001.001">
                    <pospgrp>
                        <pospunit>
                            <posp value="noun" />
                        </pospunit>
                    </pospgrp>
                    <sensecat id="a000001.001.01" publevel="1">
                        <defgrp>
                            <defunit>
                                <def>volcanic rock</def>
                            </defunit>
                        </defgrp>
                    </sensecat>
                </gramcat>
            </datablk>
        </entry>
    </superentry>
  </dps-data>

The .csv format I'd like to see it in is simply:

hw, defunit
aa, volcanic rock

What are the repeating elements: <superentry>, <entry>? Usually it's best to post first few repeating items of XML so we understand the tree. — Parfait
– Parfait, Commented Sep 19, 2016 at 16:32
@Parfait Apologies, I was only allowed so much code. <superentry> and <entry> are the two repeating elements, yes! — Powellellogram
– Powellellogram, Commented Sep 26, 2016 at 8:10

VergeA · Accepted Answer · 2016-09-26 15:06:02Z

3

The lxml library is capable of very powerful XML parsing, and can be used to iterate over an XML tree to search for specific elements.

from lxml import etree

with open(r'path/to/xml', 'r') as xml:
    text = xml.read()
tree = lxml.etree.fromstring(text)
row = ['', '']
for item in tree.iter('hw', 'def'):
    if item.tag == 'hw':
       row[0] = item.text
    elif item.tag == 'def':
       row[1] = item.text

line = ','.join(row)

with open(r'path/to/csv', 'a') as csv:
     csv.write(line + '\n')

How you build the CSV file is largely based upon preference, but I have provided a trivial example above. If there are multiple <dps-data> tags, you could extract those elements first (which can be done with the same tree.iter method shown above), and then apply the above logic to each of them.

EDIT: I should point out that this particular implementation reads the entire XML file into memory. If you are working with a single 150mb file at a time, this should not be a problem, but it's just something to be aware of.

edited Sep 26, 2016 at 15:06

answered Sep 19, 2016 at 15:43

VergeA

896 bronze badges

Sign up to request clarification or add additional context in comments.

6 Comments

Powellellogram Over a year ago

I get the error "can only concatenate list (not "str") to list" when using this script. I understand I need to convert the value of row to string but am unsure how!

VergeA Over a year ago

Can you post your full error message? Additionally, I realized that the script should be searching for def, not defunit, as the def tag contains the text that you want.

Powellellogram Over a year ago

C:\Users\...\Local\Programs\Python\Python35-32\python.exe C:/Users/.../PycharmProjects/xmltocsv2/xmltocsv.py Traceback (most recent call last):   File "C:/Users/mbpowell/PycharmProjects/xmltocsv2/xmltocsv.py", line 18, in <module>     csv.write(row + '\n') TypeError: can only concatenate list (not "str") to list

is the error message I get.

VergeA Over a year ago

I have found my mistake and edited the code. The issues was that when I called ','.join(row), I did not assign the result to a variable, meaning that the string produced was never saved for later use. Use the new line variable instead.

Powellellogram Over a year ago

Yes I tried that variable fix too. The script then runs with no errors, but the csv file has nothing written to it.

|

picmate 涅 · Accepted Answer · 2016-09-19 16:01:27Z

2

How about this:

from xml.dom import minidom

xmldoc = minidom.parse('your.xml')
hw_lst = xmldoc.getElementsByTagName('hw')
defu_lst = xmldoc.getElementsByTagName('def')

with open('your.csv', 'a') as out_file:
    for i in range(len(hw_lst)):
        out_file.write('{0}, {1}\n'.format(hw_lst[i].firstChild.data, defu_lst[i].firstChild.data))

answered Sep 19, 2016 at 16:01

picmate 涅

4,2875 gold badges50 silver badges52 bronze badges

2 Comments

Powellellogram Over a year ago

This script works up to a point, but renders my computer unusable whilst the script runs. Then I get an ascii error part way through and it doesn't finish the entire file. Thanks though!

picmate 涅 Over a year ago

@Powellellogram, do you have non -ascii characters in your xml file? Then you need to handle them accordingly. When running this program, did you check why your computer freeze? Did you try to debug and step through the code to see what could be wrong? If not, I would recommend you to do that.

Parfait · Accepted Answer · 2016-09-19 16:55:39Z

2

Consider XSLT, the XML transformation language that can manipulate source .xml files to various end use structures including text files like .csv, specifying method="text" in <xsl:output>.

Python's lxml module can run XSLT 1.0 scripts. Below assumes the <entry> tag and its children repeat with different data. And two undeclared namespaces had to be handled in the xsl. Also, XSLT tends to be very efficient on smaller sized XML but varies depending on computer environments.

XSLT Script (save as .xsl to be referenced below)

<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
               xmlns:ns1="urn:DPS2-metadata" xmlns="urn:COLL">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" method="text"/>
<xsl:strip-space elements="*"/>

<xsl:template match="/ns1:dps-data/ns1:superentry">
   <xsl:text>hw,defunit</xsl:text><xsl:text>&#xa;</xsl:text>
   <xsl:apply-templates select="ns1:entry"/>
</xsl:template>

<xsl:template match="ns1:entry" namespace="urn:COLL">    
   <xsl:value-of select="descendant::ns1:hw" namespace="urn:COLL"/><xsl:text>,</xsl:text>
   <xsl:value-of select="descendant::ns1:defunit" namespace="urn:COLL"/>
   <xsl:text>&#xa;</xsl:text>
</xsl:template>

Pyton Script

import lxml.etree as ET

// LOAD XML AND XSL SOURCES
xml = ET.parse('Input.xml')
xsl = ET.parse('XSLTScript.xsl')

// TRANSFORM SOURCE
transform = ET.XSLT(xsl)
newdom = transform(xml)

// SAVE AS .CSV
with open('Output.csv'), 'wb') as f:
    f.write(newdom)

# hw,defunit
# aa,volcanic rock

edited Sep 19, 2016 at 16:55

answered Sep 19, 2016 at 16:50

Parfait

108k19 gold badges103 silver badges138 bronze badges

2 Comments

Powellellogram Over a year ago

This seems to write every word to the csv file, but I'm having an issue that there are no line breaks or commas for each row or column.

Parfait Over a year ago

Did you run the script? The xslt contains commas and line break entity, &#xa in <xsl:text> tags.

Collectives™ on Stack Overflow

Convert Deeply Nested XML to CSV in Python [duplicate]

3 Answers 3

6 Comments

2 Comments

2 Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

6 Comments

2 Comments

2 Comments

Linked

Related