Programmatically clean/ignore namespaces in XML - python

Question

I'm trying to write a simple program to read my financial XML files from GNUCash, and learn Python in the process.

The XML looks like this:

<?xml version="1.0" encoding="utf-8" ?>
<gnc-v2
     xmlns:gnc="http://www.gnucash.org/XML/gnc"
     xmlns:act="http://www.gnucash.org/XML/act"
     xmlns:book="http://www.gnucash.org/XML/book"
     {...}
     xmlns:vendor="http://www.gnucash.org/XML/vendor">
<gnc:count-data cd:type="book">1</gnc:count-data>
<gnc:book version="2.0.0">
<book:id type="guid">91314601aa6afd17727c44657419974a</book:id>
<gnc:count-data cd:type="account">80</gnc:count-data>
<gnc:count-data cd:type="transaction">826</gnc:count-data>
<gnc:count-data cd:type="budget">1</gnc:count-data>
<gnc:commodity version="2.0.0">
  <cmdty:space>ISO4217</cmdty:space>
  <cmdty:id>BRL</cmdty:id>
  <cmdty:get_quotes/>
  <cmdty:quote_source>currency</cmdty:quote_source>
  <cmdty:quote_tz/>
</gnc:commodity>

Right now, I'm able to iterate and get results using

import xml.etree.ElementTree as ET 
r = ET.parse("file.xml").findall('.//')

after manually cleaning the namespaces, but I'm looking for a solution that could either read the entries regardless of their namespaces OR remove the namespaces before parsing.

Note that I'm a complete noob in Python. I've read "Python and GnuCash: Extract data from GnuCash files", "Cleaning an XML file in Python before parsing" and "python: xml.etree.ElementTree, removing "namespaces"", along with the ElementTree docs, and I'm still lost...

I've come up with this solution:

def strip_namespaces(self, tree):

    nspOpen = re.compile("<\w*:", re.IGNORECASE)
    nspClose = re.compile("<\/\w*:", re.IGNORECASE)

    for i in tree:
        start = re.sub(nspOpen, '<', tree.tag)          
        end = re.sub(nspOpen, '<\/', tree.tag)

    # pprint(finaltree)
    return

But I'm failing to apply it. I can't seem to be able to retrieve the tag names as they appear on the file.

it is not clear from your question what is your expected output or what kind of data you are trying to extract.
– elyase, Commented May 20, 2013 at 0:35
I want to either be able to parse the file removing prefixes and namespaces (eg.: <gnc:commodity> becomes <commodity>) or reference the elements ignoring the prefix (eg.: element.findall('book/transaction') for <gnc:book><act:transaction>)
– moraleida, Commented May 20, 2013 at 0:43
Try lxml. It's a different XML library for python and understands namespaces.
– tdelaney, Commented May 20, 2013 at 4:31
This answer might help: stackoverflow.com/a/11227304/407651.
– mzjn, Commented May 23, 2013 at 5:58
If you want to use python for gnucash, I would recommend exploring my package piecash piecash.readthedocs.io/en/latest. It works with gnucash books saved in one of the SQL formats
– sdementen, Commented Nov 26, 2017 at 5:53

Ahito · Accepted Answer · 2018-12-21 10:57:17Z

I think the Python code below will be helpful to you.

sample.xml

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<gnc:prodinfo xmlns:gnc="http://www.gnucash.org/XML/gnc"
     xmlns:act="http://www.gnucash.org/XML/act"
     xmlns:book="http://www.gnucash.org/XML/book"
     xmlns:vendor="http://www.gnucash.org/XML/vendor">
    <gnc:change>
        <gnc:lastUpdate>2018-12-21
        </gnc:lastUpdate>
    </gnc:change>
    <gnc:bill>
        <gnc:billAccountNumber>1234</gnc:billAccountNumber>
        <gnc:roles>
            <gnc:id>111111</gnc:id>
            <gnc:pos>2</gnc:pos>
            <gnc:genid>15</gnc:genid>
        </gnc:roles>
    </gnc:bill>
    <gnc:prodtyp>sales and service</gnc:prodtyp>
</gnc:prodinfo>

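Python code to strip the namespace from the root tag. ElementTree stores a prefixed tag such as gnc:prodinfo as {namespace-uri}prodinfo, so the function below drops the {...} part: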

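# xml.etree.cElementTree was removed in Python 3.9; ElementTree uses the
# C accelerator automatically when it is available.
import xml.etree.ElementTree as ET


def xmlns(tag):
    """Return the tag with every '{namespace-uri}' part removed."""
    parts = []
    for piece in tag.split('{'):
        if '}' in piece:
            # Keep only what follows the closing brace (the local name).
            parts.append(piece.split('}')[1])
        else:
            parts.append(piece)
    return ''.join(parts)


tree = ET.parse('sample.xml')
root = tree.getroot()
print(root.tag)         # root tag with the '{namespace-uri}' prefix
print(xmlns(root.tag))  # root tag without the prefix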

Output:

{http://www.gnucash.org/XML/gnc}prodinfo
prodinfo
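
The function above only cleans a single tag string. To cover what the question asks for (reading entries regardless of namespace, or removing namespaces before searching), the same idea can be applied to the whole tree. The following is a minimal sketch, not a definitive implementation: the helper name strip_namespaces, the inline SAMPLE string and the gnc:kind attribute are made up for illustration, and the {*} wildcard assumes Python 3.8 or newer.

```python
import xml.etree.ElementTree as ET

# Shortened version of sample.xml; the gnc:kind attribute is hypothetical
# and only there to show that attribute names are handled too.
SAMPLE = """<gnc:prodinfo xmlns:gnc="http://www.gnucash.org/XML/gnc">
    <gnc:bill gnc:kind="demo">
        <gnc:billAccountNumber>1234</gnc:billAccountNumber>
    </gnc:bill>
    <gnc:prodtyp>sales and service</gnc:prodtyp>
</gnc:prodinfo>"""


def strip_namespaces(root):
    """Remove the '{namespace-uri}' part from all tags and attribute names, in place."""
    for elem in root.iter():
        if isinstance(elem.tag, str) and '}' in elem.tag:
            elem.tag = elem.tag.split('}', 1)[1]
        # Iterate over a copy of the keys because the dict is modified.
        for key in list(elem.attrib):
            if '}' in key:
                elem.attrib[key.split('}', 1)[1]] = elem.attrib.pop(key)
    return root


root = ET.fromstring(SAMPLE)

# Option 1 (Python 3.8+): leave the tree alone and match any namespace.
number = root.find('{*}bill/{*}billAccountNumber')
print(number.text)  # 1234

# Option 2: strip the namespaces once, then use plain paths.
strip_namespaces(root)
print(root.tag)                                # prodinfo
print(root.find('bill/billAccountNumber').text)  # 1234
print(root.find('bill').attrib)                # {'kind': 'demo'}
```

Keep in mind that after stripping, elements that share a local name but come from different namespaces can no longer be told apart, so the wildcard form is the safer choice when that distinction matters.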
