1

I'm new to Python and have heard that it is one of the best ways to parse fairly large XML files (150MB). I can't get my head around how to iterate through the tags and extract only the <hw> and <defunit> tags as it's fairly deeply nested.

I have some XML formatted as below, and I need to extract the "hw" and "defunit" tags from it using Python and convert them into a .csv format.

<?xml version="1.0" encoding="UTF-8"?>
<dps-data xmlns="urn:DPS2-metadata" project="SCRABBLELARGE" guid="7d6b7164fde1e064:34368a61:14306b637ab:-8000--4a25ae5c-c104-4c7a-bba5-b434dd4d9314">
    <superentry xmlns="urn:COLL" xmlns:d="urn:COLL" xmlns:e="urn:IDMEE" e:id="u583c10bfdbd326ba.31865a51.12110e76de1.-336">
        <entry publevel="1" id="a000001" e:id="u583c10bfdbd326ba.31865a51.12110e76de1.-335">
            <hwblk>
                <hwgrp>
                    <hwunit>
                        <hw>aa</hw>
                        <ulsrc>edsh</ulsrc>
                    </hwunit>
                </hwgrp>
            </hwblk>
            <datablk>
                <gramcat publevel="1" id="a000001.001">
                    <pospgrp>
                        <pospunit>
                            <posp value="noun" />
                        </pospunit>
                    </pospgrp>
                    <sensecat id="a000001.001.01" publevel="1">
                        <defgrp>
                            <defunit>
                                <def>volcanic rock</def>
                            </defunit>
                        </defgrp>
                    </sensecat>
                </gramcat>
            </datablk>
        </entry>
    </superentry>
  </dps-data>

The .csv format I'd like to see it in is simply:

hw, defunit
aa, volcanic rock
3
  • docs.python.org/2/library/xml.etree.elementtree.html Commented Sep 19, 2016 at 15:32
  • What are the repeating elements: <superentry>, <entry>? Usually it's best to post first few repeating items of XML so we understand the tree. Commented Sep 19, 2016 at 16:32
  • @Parfait Apologies, I was only allowed so much code. <superentry> and <entry> are the two repeating elements, yes! Commented Sep 26, 2016 at 8:10

3 Answers 3

3

The lxml library is capable of very powerful XML parsing, and can be used to iterate over an XML tree to search for specific elements.

from lxml import etree

with open(r'path/to/xml', 'r') as xml:
    text = xml.read()
tree = lxml.etree.fromstring(text)
row = ['', '']
for item in tree.iter('hw', 'def'):
    if item.tag == 'hw':
       row[0] = item.text
    elif item.tag == 'def':
       row[1] = item.text

line = ','.join(row)

with open(r'path/to/csv', 'a') as csv:
     csv.write(line + '\n')

How you build the CSV file is largely based upon preference, but I have provided a trivial example above. If there are multiple <dps-data> tags, you could extract those elements first (which can be done with the same tree.iter method shown above), and then apply the above logic to each of them.

EDIT: I should point out that this particular implementation reads the entire XML file into memory. If you are working with a single 150mb file at a time, this should not be a problem, but it's just something to be aware of.

Sign up to request clarification or add additional context in comments.

6 Comments

I get the error "can only concatenate list (not "str") to list" when using this script. I understand I need to convert the value of row to string but am unsure how!
Can you post your full error message? Additionally, I realized that the script should be searching for def, not defunit, as the def tag contains the text that you want.
C:\Users\...\Local\Programs\Python\Python35-32\python.exe C:/Users/.../PycharmProjects/xmltocsv2/xmltocsv.py Traceback (most recent call last): File "C:/Users/mbpowell/PycharmProjects/xmltocsv2/xmltocsv.py", line 18, in <module> csv.write(row + '\n') TypeError: can only concatenate list (not "str") to list is the error message I get.
I have found my mistake and edited the code. The issues was that when I called ','.join(row), I did not assign the result to a variable, meaning that the string produced was never saved for later use. Use the new line variable instead.
Yes I tried that variable fix too. The script then runs with no errors, but the csv file has nothing written to it.
|
2

How about this:

from xml.dom import minidom

xmldoc = minidom.parse('your.xml')
hw_lst = xmldoc.getElementsByTagName('hw')
defu_lst = xmldoc.getElementsByTagName('def')

with open('your.csv', 'a') as out_file:
    for i in range(len(hw_lst)):
        out_file.write('{0}, {1}\n'.format(hw_lst[i].firstChild.data, defu_lst[i].firstChild.data)) 

2 Comments

This script works up to a point, but renders my computer unusable whilst the script runs. Then I get an ascii error part way through and it doesn't finish the entire file. Thanks though!
@Powellellogram, do you have non -ascii characters in your xml file? Then you need to handle them accordingly. When running this program, did you check why your computer freeze? Did you try to debug and step through the code to see what could be wrong? If not, I would recommend you to do that.
2

Consider XSLT, the XML transformation language that can manipulate source .xml files to various end use structures including text files like .csv, specifying method="text" in <xsl:output>.

Python's lxml module can run XSLT 1.0 scripts. Below assumes the <entry> tag and its children repeat with different data. And two undeclared namespaces had to be handled in the xsl. Also, XSLT tends to be very efficient on smaller sized XML but varies depending on computer environments.

XSLT Script (save as .xsl to be referenced below)

<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
               xmlns:ns1="urn:DPS2-metadata" xmlns="urn:COLL">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" method="text"/>
<xsl:strip-space elements="*"/>

<xsl:template match="/ns1:dps-data/ns1:superentry">
   <xsl:text>hw,defunit</xsl:text><xsl:text>&#xa;</xsl:text>
   <xsl:apply-templates select="ns1:entry"/>
</xsl:template>

<xsl:template match="ns1:entry" namespace="urn:COLL">    
   <xsl:value-of select="descendant::ns1:hw" namespace="urn:COLL"/><xsl:text>,</xsl:text>
   <xsl:value-of select="descendant::ns1:defunit" namespace="urn:COLL"/>
   <xsl:text>&#xa;</xsl:text>
</xsl:template>

Pyton Script

import lxml.etree as ET

// LOAD XML AND XSL SOURCES
xml = ET.parse('Input.xml')
xsl = ET.parse('XSLTScript.xsl')

// TRANSFORM SOURCE
transform = ET.XSLT(xsl)
newdom = transform(xml)

// SAVE AS .CSV
with open('Output.csv'), 'wb') as f:
    f.write(newdom)

# hw,defunit
# aa,volcanic rock

2 Comments

This seems to write every word to the csv file, but I'm having an issue that there are no line breaks or commas for each row or column.
Did you run the script? The xslt contains commas and line break entity, &#xa in <xsl:text> tags.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.