Python parse XML file into pandas dataframe

Question

I have the XML structure below and am trying to convert it into a structured pandas DataFrame. I have read a number of Stack Overflow posts about XML conversion using both xml.etree.ElementTree and BeautifulSoup, but none seem to handle the case where I want not just tags, attributes, or text, but all of them.

For example, what I am hoping to obtain from the XML below is columns like:

abr_record_last_updated_date, abr_replaced, abn_status, abn_status_from_date, abn

Note that abn in the list above is the element's actual text, while the other columns come from attributes, and I am not sure how to collect them all together.

<?xml version="1.0"?><Transfer error="none" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="BulkExtract.xsd"><TransferInfo><FileSequenceNumber>1</FileSequenceNumber><RecordCount>714100</RecordCount><ExtractTime>2019-06-19T12:22:15</ExtractTime></TransferInfo>


<ABR recordLastUpdatedDate="20180216" replaced="N"><ABN status="ACT" ABNStatusFromDate="19991101">11000000948</ABN><EntityType><EntityTypeInd>PUB</EntityTypeInd><EntityTypeText>Australian Public Company</EntityTypeText></EntityType><MainEntity><NonIndividualName type="MN"><NonIndividualNameText>QBE INSURANCE (INTERNATIONAL) LTD</NonIndividualNameText></NonIndividualName><BusinessAddress><AddressDetails><State>NSW</State><Postcode>2000</Postcode></AddressDetails></BusinessAddress></MainEntity><ASICNumber ASICNumberType="undetermined">000000948</ASICNumber><GST status="ACT" GSTStatusFromDate="20000701" /><OtherEntity><NonIndividualName type="TRD"><NonIndividualNameText>QBE INSURANCE (INTERNATIONAL) LIMITED</NonIndividualNameText></NonIndividualName></OtherEntity></ABR>



<ABR recordLastUpdatedDate="20190531" replaced="N"><ABN status="CAN" ABNStatusFromDate="20190501">11000002568</ABN><EntityType><EntityTypeInd>PRV</EntityTypeInd><EntityTypeText>Australian Private Company</EntityTypeText></EntityType><MainEntity><NonIndividualName type="MN"><NonIndividualNameText>TOOHEYS PTY LIMITED</NonIndividualNameText></NonIndividualName><BusinessAddress><AddressDetails><State>NSW</State><Postcode>2141</Postcode></AddressDetails></BusinessAddress></MainEntity><ASICNumber ASICNumberType="undetermined">000002568</ASICNumber></ABR>



</Transfer>

I started going down the path of using root.iter on each of the items, but I couldn't work out how to use that logic to get all the columns I want.

import xml.etree.ElementTree as et
root = et.parse('sample.xml').getroot()

dict_new = {}

for each in root.iter('ABN'):

    #abr_last_updated_date = 
    print(each.tag)
    print(each.attrib)
    print(each.items())
    print(each.text)

Ultimately, if someone can share how to iterate over each XML "block" (not sure of the correct term) and obtain the first few columns, I am sure I can work out the rest.

It can be done using lxml, if that's available to you.

– Jack Fleeting, Commented Jun 19, 2019 at 12:50

Andrej Kesely · Accepted Answer · 2019-06-19 13:37:45Z

Even though this is an XML file, you can use BeautifulSoup's CSS selectors and the text property:

data = '''<?xml version="1.0"?><Transfer error="none" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="BulkExtract.xsd"><TransferInfo><FileSequenceNumber>1</FileSequenceNumber><RecordCount>714100</RecordCount><ExtractTime>2019-06-19T12:22:15</ExtractTime></TransferInfo>


<ABR recordLastUpdatedDate="20180216" replaced="N"><ABN status="ACT" ABNStatusFromDate="19991101">11000000948</ABN><EntityType><EntityTypeInd>PUB</EntityTypeInd><EntityTypeText>Australian Public Company</EntityTypeText></EntityType><MainEntity><NonIndividualName type="MN"><NonIndividualNameText>QBE INSURANCE (INTERNATIONAL) LTD</NonIndividualNameText></NonIndividualName><BusinessAddress><AddressDetails><State>NSW</State><Postcode>2000</Postcode></AddressDetails></BusinessAddress></MainEntity><ASICNumber ASICNumberType="undetermined">000000948</ASICNumber><GST status="ACT" GSTStatusFromDate="20000701" /><OtherEntity><NonIndividualName type="TRD"><NonIndividualNameText>QBE INSURANCE (INTERNATIONAL) LIMITED</NonIndividualNameText></NonIndividualName></OtherEntity></ABR>



<ABR recordLastUpdatedDate="20190531" replaced="N"><ABN status="CAN" ABNStatusFromDate="20190501">11000002568</ABN><EntityType><EntityTypeInd>PRV</EntityTypeInd><EntityTypeText>Australian Private Company</EntityTypeText></EntityType><MainEntity><NonIndividualName type="MN"><NonIndividualNameText>TOOHEYS PTY LIMITED</NonIndividualNameText></NonIndividualName><BusinessAddress><AddressDetails><State>NSW</State><Postcode>2141</Postcode></AddressDetails></BusinessAddress></MainEntity><ASICNumber ASICNumberType="undetermined">000002568</ASICNumber></ABR>



</Transfer>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'xml')

z = zip(soup.select('ABR[recordLastUpdatedDate]'),
    soup.select('ABR[replaced]'),
    soup.select('ABN[status]'),
    soup.select('ABN[ABNStatusFromDate]'),
    soup.select('ABN'))

for (c1, c2, c3, c4, c5) in z:
    print(c1['recordLastUpdatedDate'], c2['replaced'], c3['status'], c4['ABNStatusFromDate'], c5.text.strip())

Prints:

20180216 N ACT 19991101 11000000948
20190531 N CAN 20190501 11000002568
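The zip above lines up five independent selections, so it stays aligned only while every ABR record carries every attribute. A minimal sketch of a per-record alternative follows. It swaps in xml.etree.ElementTree (the module from the question) for BeautifulSoup, and the `abr_rows` helper, the `SAMPLE` string, and the column names are illustrative assumptions based on the columns the question asks for.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's sample; assumes real files share this layout.
SAMPLE = """<?xml version="1.0"?>
<Transfer error="none">
<ABR recordLastUpdatedDate="20180216" replaced="N"><ABN status="ACT" ABNStatusFromDate="19991101">11000000948</ABN></ABR>
<ABR recordLastUpdatedDate="20190531" replaced="N"><ABN status="CAN" ABNStatusFromDate="20190501">11000002568</ABN></ABR>
</Transfer>"""


def abr_rows(xml_text):
    """Return one dict per <ABR> element (hypothetical helper, not from the answer)."""
    root = ET.fromstring(xml_text)
    rows = []
    for abr in root.iter("ABR"):
        abn = abr.find("ABN")  # direct child of this record, or None if absent
        rows.append({
            "abr_record_last_updated_date": abr.get("recordLastUpdatedDate"),
            "abr_replaced": abr.get("replaced"),
            "abn_status": abn.get("status") if abn is not None else None,
            "abn_status_from_date": abn.get("ABNStatusFromDate") if abn is not None else None,
            "abn": (abn.text or "").strip() if abn is not None else None,
        })
    return rows


rows = abr_rows(SAMPLE)
for row in rows:
    print(row)
```

Because each row is built from a single ABR element, a record with a missing attribute yields None in that column instead of shifting later values. Passing `rows` to `pandas.DataFrame(rows)` should then give one DataFrame row per ABR record.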

Very nicely done! Greatly appreciated, this is more than enough for me to work out the rest. Thank you, I now know something new :)

KunduK · Accepted Answer · 2019-06-19 13:38:02Z

Using BeautifulSoup, you can fetch all of the following items:

Tag
Tag Text
Attribute Name
Attribute value

    from bs4 import BeautifulSoup

    data='''<?xml version="1.0"?><Transfer error="none" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="BulkExtract.xsd"><TransferInfo><FileSequenceNumber>1</FileSequenceNumber><RecordCount>714100</RecordCount><ExtractTime>2019-06-19T12:22:15</ExtractTime></TransferInfo>


    <ABR recordLastUpdatedDate="20180216" replaced="N"><ABN status="ACT" ABNStatusFromDate="19991101">11000000948</ABN><EntityType><EntityTypeInd>PUB</EntityTypeInd><EntityTypeText>Australian Public Company</EntityTypeText></EntityType><MainEntity><NonIndividualName type="MN"><NonIndividualNameText>QBE INSURANCE (INTERNATIONAL) LTD</NonIndividualNameText></NonIndividualName><BusinessAddress><AddressDetails><State>NSW</State><Postcode>2000</Postcode></AddressDetails></BusinessAddress></MainEntity><ASICNumber ASICNumberType="undetermined">000000948</ASICNumber><GST status="ACT" GSTStatusFromDate="20000701" /><OtherEntity><NonIndividualName type="TRD"><NonIndividualNameText>QBE INSURANCE (INTERNATIONAL) LIMITED</NonIndividualNameText></NonIndividualName></OtherEntity></ABR>



    <ABR recordLastUpdatedDate="20190531" replaced="N"><ABN status="CAN" ABNStatusFromDate="20190501">11000002568</ABN><EntityType><EntityTypeInd>PRV</EntityTypeInd><EntityTypeText>Australian Private Company</EntityTypeText></EntityType><MainEntity><NonIndividualName type="MN"><NonIndividualNameText>TOOHEYS PTY LIMITED</NonIndividualNameText></NonIndividualName><BusinessAddress><AddressDetails><State>NSW</State><Postcode>2141</Postcode></AddressDetails></BusinessAddress></MainEntity><ASICNumber ASICNumberType="undetermined">000002568</ASICNumber></ABR>



    </Transfer>'''

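    # Note: 'lxml' is the HTML parser, so tag and attribute names come out
    # lowercased; pass 'xml' as the parser name to preserve their case.
    soup = BeautifulSoup(data, 'lxml')
    for tag in soup.select('ABN'):
        print("Tag:" + str(tag))
        print("Tag Text " + tag.text)
        # Each attribute name and its value
        for attr in tag.attrs:
            print("Attribute name : " + attr)
            print("Attribute value : " + tag[attr])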

Output printed on the console:

Tag:<abn abnstatusfromdate="19991101" status="ACT">11000000948</abn>
Tag Text 11000000948
Attribute name : abnstatusfromdate
Attribute value : 19991101
Attribute name : status
Attribute value : ACT
Tag:<abn abnstatusfromdate="20190501" status="CAN">11000002568</abn>
Tag Text 11000002568
Attribute name : abnstatusfromdate
Attribute value : 20190501
Attribute name : status
Attribute value : CAN
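To get from printed tags, text, and attributes to DataFrame columns, the same three pieces can be collected into a dict instead. The following is a rough sketch using xml.etree.ElementTree rather than BeautifulSoup, which keeps the original case of names. The `flatten` helper and its path-based column naming are assumptions made for illustration, not part of the answer.

```python
import xml.etree.ElementTree as ET

# One record from the question's sample, shortened for the example.
RECORD = (
    '<ABR recordLastUpdatedDate="20180216" replaced="N">'
    '<ABN status="ACT" ABNStatusFromDate="19991101">11000000948</ABN>'
    '<MainEntity><BusinessAddress><AddressDetails>'
    '<State>NSW</State><Postcode>2000</Postcode>'
    '</AddressDetails></BusinessAddress></MainEntity>'
    '</ABR>'
)


def flatten(elem, prefix=""):
    """Collect attributes and text of elem and its descendants into one dict.

    Keys are built from the path of tag names, so the same tag under
    different parents gets different columns. Repeated sibling tags would
    overwrite each other; that case is not handled in this sketch.
    """
    name = prefix + elem.tag
    out = {}
    for attr, value in elem.attrib.items():
        out[name + "_" + attr] = value   # attribute name and value
    text = (elem.text or "").strip()
    if text:
        out[name] = text                 # tag text
    for child in elem:
        out.update(flatten(child, prefix=name + "_"))
    return out


row = flatten(ET.fromstring(RECORD))
for column, value in row.items():
    print(column, "=", value)
```

Calling `flatten` on every ABR element and passing the resulting list of dicts to `pandas.DataFrame` should give one wide row per record, with missing fields left empty.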
