Parsing an XML File to CSV without hardcoding values

Question

Is there a way to parse an XML file, collect all of its tags (or as many as possible), and write them out as columns without hardcoding the tag names?

Take the eventType tag in my XML, for example. I would like the script to create a column named "eventType" the first time it sees that tag and put the tag's value underneath it. Every later "eventType" tag it parses would go into the same column.

Here is roughly what I am trying to make it look like:

Here is the XML sample:

<?xml version="1.0" encoding="UTF-8"?>

<faults version="1" xmlns="urn:nortel:namespaces:mcp:faults" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:nortel:namespaces:mcp:faults NortelFaultSchema.xsd ">
    <family longName="1OffMsgr" shortName="OOM"/>
    <family longName="ACTAGENT" shortName="ACAT">
        <logs>
           <log>
                <eventType>RES</eventType>
                <number>1</number>
                <severity>INFO</severity>
                <descTemplate>
                     <msg>Accounting is enabled upon this NE.</msg>
               </descTemplate>
               <note>This log is generated when setting a Session Manager's AM from &lt;none&gt; to a valid AM.</note>
               <om>On all instances of this Session Manager, the &lt;NE_Inst&gt;:&lt;AM&gt;:STD:acct OM row in the  StdRecordStream group will appear and start counting the recording units sent to the configured AM.
                   On the configured AM, the &lt;NE_inst&gt;:acct OM rows in RECSTRMCOLL group will appear and start counting the recording units received from this Session Manager's instances.
               </om>
            </log>
           <log>
                <eventType>RES</eventType>
                <number>2</number>
                <severity>ALERT</severity>
                <descTemplate>
                     <msg>Accounting is disabled upon this NE.</msg>
               </descTemplate>
               <note>This log is generated when setting a Session Manager's AM from a valid AM to &lt;none&gt;.</note>
               <action>If you do not intend for the Session Manager to produce accounting records, then no action is required.  If you do intend for the Session Manager to produce accounting records, then you should set the Session Manager's AM to a valid AM.</action>
               <om>On all instances of this Session Manager, the &lt;NE_Inst&gt;:&lt;AM&gt;:STD:acct OM row in the StdRecordStream group that matched the previous datafilled AM will disappear.
                   On the previously configured AM, the  &lt;NE_inst&gt;:acct OM rows in RECSTRMCOLL group will disappear.
               </om>
            </log>
        </logs>
    </family>
    <family longName="ACODE" shortName="AC">
        <alarms>
            <alarm>
                <eventType>ADMIN</eventType>
                <number>1</number>
                <probableCause>INFORMATION_MODIFICATION_DETECTED</probableCause>
                <descTemplate>
                    <msg>Configured data for audiocode server updated: $1</msg>
                     <param>
                         <num>1</num>
                         <description>AudioCode configuration data got updated</description>
                         <exampleValue>acgwy1</exampleValue>
                     </param>
               </descTemplate>
               <manualClearable></manualClearable>
               <correctiveAction>None. Acknowledge/Clear alarm and deploy the audiocode server if appropriate.</correctiveAction>
               <alarmName>Audiocode Server Updated</alarmName>
               <severities>
                     <severity>MINOR</severity>
               </severities>               
            </alarm>
            <alarm>
                <eventType>ADMIN</eventType>
                <number>2</number>
                <probableCause>CONFIG_OR_CUSTOMIZATION_ERROR</probableCause>
                <descTemplate>
                    <msg>Deployment for audiocode server failed: $1. Reason: $2.</msg>
                     <param>
                         <num>1</num>
                         <description>AudioCode Name</description>
                         <exampleValue>audcod</exampleValue>
                     </param>
                     <param>
                         <num>2</num>
                         <description>AudioCode Deployment failed reason</description>
                         <exampleValue>Failed to parse audiocode configuration data</exampleValue>
                     </param>
               </descTemplate>
               <manualClearable></manualClearable>
               <correctiveAction>Check the configuration of audiocode server. Acknowledge/Clear alarm and deploy the audiocode server if appropriate.</correctiveAction>
               <alarmName>Audiocode Server Deploy Failed</alarmName>
               <severities>
                     <severity>MINOR</severity> 
                     <severity>MAJOR</severity>
               </severities>               
            </alarm>
            <alarm>
                <eventType>COMM</eventType>
                <number>2</number>
                <probableCause>LOSS_OF_FRAME</probableCause>
                <descTemplate>
                    <msg>Far end LOF (a.k.a., Yellow Alarm). Trunk (DS1 Number): $1.</msg>
                     <param>
                         <num>1</num>
                         <description>Trunk Number of Trunk with configuration problem</description>
                         <exampleValue>2</exampleValue>
                     </param>
               </descTemplate>
               <clearCondition>Far end is correctly configured for proper framing.</clearCondition>
               <correctiveAction>Check that the far end is configured for the proper framing.</correctiveAction>
               <alarmName>Far end LOF</alarmName>
               <severities>
                     <severity>CRITICAL</severity>
               </severities>
               <note>This alarm indicates the Trunk Framing settings on the connected PSTN switch do not match those provisioned on the Audiocodes Mediant 2k.</note>
            </alarm>
            <alarm>
                <eventType>COMM</eventType>
                <number>3</number>
                <probableCause>LOSS_OF_FRAME</probableCause>
                <descTemplate>
                    <msg>Near end sending LOF Indication. Trunk (DS1 Number): $1.</msg>
                     <param>
                         <num>1</num>
                         <description>Trunk Number of Trunk with configuration problem</description>
                         <exampleValue>2</exampleValue>
                     </param>
               </descTemplate>
               <clearCondition>Gateway is correctly configured for proper framing.</clearCondition>
               <correctiveAction>Check that the Audiocodes gateway is configured for the proper framing.</correctiveAction>
               <alarmName>Near end sending LOF Indication</alarmName>
               <severities>
                     <severity>CRITICAL</severity>
               </severities>               
            </alarm>
        </alarms>
    </family>
</faults>

This is my code; as you can see, the tag names are hardcoded:

from xml.etree import ElementTree
import csv
from copy import copy


tree = ElementTree.parse('FaultFamilies.xml')


sitescope_data = open('Out.csv', 'w', newline='', encoding='utf-8')
csvwriter = csv.writer(sitescope_data)

# Create all needed columns here, in order, and write them to the CSV file
col_names = ['longName', 'shortName', 'eventType', 'ProbableCause', 'Severity', 'alarmName', 'clearCondition',
             'correctiveAction', 'note', 'action', 'om']
csvwriter.writerow(col_names)



def recurse(root, props):

    # Finds every single tag in the xml file
    for child in root:
        #print(child.text)
        if child.tag == '{urn:nortel:namespaces:mcp:faults}family':
            # copy of the dictionary
            p2 = copy(props)

            # add longName and shortName to the dictionary
            p2['longName'] = child.attrib.get('longName', '')
            p2['shortName'] = child.attrib.get('shortName', '')
            recurse(child, p2)
        else:
            recurse(child, props)

    # FIND ALL NEEDED ALARMS INFORMATION
    for event in root.findall('{urn:nortel:namespaces:mcp:faults}alarm'):

        event_data = [props.get('longName',''), props.get('shortName', '')]

        # Find eventType and appends it
        event_id = event.find('{urn:nortel:namespaces:mcp:faults}eventType')
        if event_id != None:
            event_id = event_id.text
        # append it to the row list
        event_data.append(event_id)

        # Find probableCause and appends it
        probableCause = event.find('{urn:nortel:namespaces:mcp:faults}probableCause')
        if probableCause != None:
            probableCause = probableCause.text
        event_data.append(probableCause)

        # Find severities and appends it
        severities = event.find('{urn:nortel:namespaces:mcp:faults}severities')
        if severities is not None:
            severity_data = ','.join(
                [sv.text for sv in severities.findall('{urn:nortel:namespaces:mcp:faults}severity')])
            event_data.append(severity_data)
        else:
            event_data.append("")

        # Find alarmName and appends it
        alarmName = event.find('{urn:nortel:namespaces:mcp:faults}alarmName')
        if alarmName != None:
            alarmName = alarmName.text
        event_data.append(alarmName)

        clearCondition = event.find('{urn:nortel:namespaces:mcp:faults}clearCondition')
        if clearCondition != None:
            clearCondition = clearCondition.text
        event_data.append(clearCondition)

        correctiveAction = event.find('{urn:nortel:namespaces:mcp:faults}correctiveAction')
        if correctiveAction != None:
            correctiveAction = correctiveAction.text
        event_data.append(correctiveAction)

        note = event.find('{urn:nortel:namespaces:mcp:faults}note')
        if note != None:
            note = note.text
        event_data.append(note)

        action = event.find('{urn:nortel:namespaces:mcp:faults}action')
        if action != None:
            action = action.text
        event_data.append(action)

        csvwriter.writerow(event_data)

    # FIND ALL LOGS INFORMATION
    for event in root.findall('{urn:nortel:namespaces:mcp:faults}log'):
        event_data = [props.get('longName', ''), props.get('shortName', '')]

        event_id = event.find('{urn:nortel:namespaces:mcp:faults}eventType')
        if event_id != None:
            event_id = event_id.text
        event_data.append(event_id)

        probableCause = event.find('{urn:nortel:namespaces:mcp:faults}probableCause')
        if probableCause != None:
            probableCause = probableCause.text
        event_data.append(probableCause)

        severities = event.find('{urn:nortel:namespaces:mcp:faults}severity')
        if severities != None:
            severities = severities.text
        event_data.append(severities)

        alarmName = event.find('{urn:nortel:namespaces:mcp:faults}alarmName')
        if alarmName != None:
            alarmName = alarmName.text
        event_data.append(alarmName)

        # Find clearCondition and append it
        clearCondition = event.find('{urn:nortel:namespaces:mcp:faults}clearCondition')
        if clearCondition != None:
            clearCondition = clearCondition.text
        event_data.append(clearCondition)

        correctiveAction = event.find('{urn:nortel:namespaces:mcp:faults}correctiveAction')
        if correctiveAction != None:
            correctiveAction = correctiveAction.text
        event_data.append(correctiveAction)

        note = event.find('{urn:nortel:namespaces:mcp:faults}note')
        if note != None:
            note = note.text
        event_data.append(note)

        action = event.find('{urn:nortel:namespaces:mcp:faults}action')
        if action != None:
            action = action.text
        event_data.append(action)
        csvwriter.writerow(event_data)


root = tree.getroot()
recurse(root, {})  # root + empty dictionary
print("File successfully converted to CSV")
sitescope_data.close()

When running @tdelaney's solution:

Why did you copy-paste all those blocks with alarmName? You can loop over the names it has to look for, right?
– Robin De Schepper, Commented Oct 12, 2020 at 18:32
Yes, this was just a test. I'll fix it up for sure if hardcoding is the only way to parse. I am trying to find a way to get all the tags into columns without hardcoding, as this XML will change over time with new tags.
– marcorivera8, Commented Oct 12, 2020 at 18:38
I'm sadly not familiar with the library you're using here, but I see that you are recursing over the nodes. I think that's enough to keep a set() of all the unique values you encounter, right?
– Robin De Schepper, Commented Oct 12, 2020 at 19:37
Aargh, you've completely changed the XML since I looked last night and started trying to code a solution.
– DisappointedByUnaccountableMod, Commented Oct 18, 2020 at 11:02
Hey @barny, this is an older question. Please see this one: stackoverflow.com/questions/64407201/…
– marcorivera8, Commented Oct 18, 2020 at 16:52
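
The set() idea from the comments above can be sketched with the standard library alone. This is only an illustration under assumptions: the SAMPLE string is a cut-down stand-in for the real file, and local_name and unique_tags are hypothetical helper names, not part of the original code.

```python
import xml.etree.ElementTree as ET

# Cut-down stand-in for FaultFamilies.xml (assumed structure).
SAMPLE = """<faults xmlns="urn:nortel:namespaces:mcp:faults">
  <family longName="ACODE" shortName="AC">
    <alarms>
      <alarm>
        <eventType>ADMIN</eventType>
        <number>1</number>
        <alarmName>Audiocode Server Updated</alarmName>
      </alarm>
    </alarms>
  </family>
</faults>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def unique_tags(root):
    """Return the set of local tag names found at or below root."""
    return {local_name(el.tag) for el in root.iter()}


tags = unique_tags(ET.fromstring(SAMPLE))
print(sorted(tags))
```

A set loses the order in which tags were first seen, so a list plus a membership check may be a better fit if column order matters.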

tdelaney · Accepted Answer · 2020-10-13 17:24:41Z

2

You could build a list of lists to represent the rows of the table. Whenever it's time for a new row, build a new list with all known columns defaulted to "" and append it to the bottom of the outer list. When a new column needs to be inserted, it's just a case of spinning through the existing inner lists and appending a default "" cell. Keep a map of known column names to their index in the row. Now when you spin through the events, you use the tag name to find the column index and add its value to the latest row in the table.

It looks like you want "log" and "alarm" tags, but I wrote the element selector to take any element that has an "eventType" child element. Since "longName" and "shortName" are common to all events under a given <family>, there is an outer loop to grab those and apply them to each new row of the table. I switched to XPath so that I could set up namespaces and write the selectors more tersely. That is personal preference, but I think it makes the selectors more readable.

import csv
import lxml.etree
from lxml.etree import QName
import operator

class ExpandingTable:
    """A 2 dimensional table where columns are expanded as new column
    types are discovered"""

    def __init__(self):
        """Create table that can expand rows and columns"""
        self.name_to_col = {}
        self.table = []
    
    def add_column(self, name):
        """Add column named `name` unless already included"""
        if name not in self.name_to_col:
            self.name_to_col[name] = len(self.name_to_col)
            for row in self.table:
                row.append('')
    
    def add_cell(self, name, value):
        """Add value to named column in the current row"""
        if value:
            self.add_column(name)
            # Collapse embedded newlines and runs of whitespace into single spaces
            self.table[-1][self.name_to_col[name]] = " ".join(value.split())
            
    def new_row(self):
        """Create a new row and make it current"""
        self.table.append([''] * len(self.name_to_col))

    def header(self):
        """Gather discovered column names into a header list"""
        idx_1 = operator.itemgetter(1)
        return [name for name, _ in sorted(self.name_to_col.items(), key=idx_1)]

    def prepend_header(self):
        """Gather discovered column names into a header and
        prepend it to the list"""
        self.table.insert(0, self.header())

def events_to_table(elem):
    """ Builds table from <family> child elements and their contained alarms and
    logs."""
    ns = {"f":"urn:nortel:namespaces:mcp:faults"}
    table = ExpandingTable()
    for family in elem.xpath("f:family", namespaces=ns):
        longName = family.get("longName")
        shortName = family.get("shortName")
        for event in family.xpath("*/*[f:eventType]", namespaces=ns):
            table.new_row()
            table.add_cell("longName", longName)
            table.add_cell("shortName", shortName)
            for cell in event:
                tag = QName(cell.tag).localname
                if tag == "severities":
                    tag = "severity"
                    text = ",".join(severity.text for severity in cell.xpath("*"))
                else:
                    text = cell.text
                table.add_cell(tag, text)
    table.prepend_header()
    return table.table
    
def main(filename):
    doc = lxml.etree.parse(filename)
    table = events_to_table(doc.getroot())
    with open('test.csv', 'w', newline='', encoding='utf-8') as fileobj:
        csv.writer(fileobj).writerows(table)

main('test.xml')
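
As a possible alternative to the hand-rolled table above, the same "discover columns as you go" idea can be sketched with csv.DictWriter from the standard library, whose restval argument fills in cells that a row does not have. This is a sketch under assumptions, not the answer's method: SAMPLE is a shortened stand-in for the real file, events_to_rows is a hypothetical helper, and joining the text of child elements for containers such as <severities> is a simplification.

```python
import csv
import io
import xml.etree.ElementTree as ET

NS = "{urn:nortel:namespaces:mcp:faults}"

# Shortened stand-in for the real XML (assumed structure).
SAMPLE = """<faults xmlns="urn:nortel:namespaces:mcp:faults">
  <family longName="ACTAGENT" shortName="ACAT">
    <logs>
      <log>
        <eventType>RES</eventType>
        <severity>INFO</severity>
      </log>
    </logs>
  </family>
  <family longName="ACODE" shortName="AC">
    <alarms>
      <alarm>
        <eventType>ADMIN</eventType>
        <alarmName>Audiocode Server Updated</alarmName>
        <severities>
          <severity>MINOR</severity>
          <severity>MAJOR</severity>
        </severities>
      </alarm>
    </alarms>
  </family>
</faults>"""


def events_to_rows(root):
    """Return (fieldnames, rows); columns are listed in order of discovery."""
    fieldnames = ["longName", "shortName"]
    rows = []
    for family in root.findall(NS + "family"):
        for group in family:        # <logs>, <alarms>, ...
            for event in group:     # <log>, <alarm>, ...
                if event.find(NS + "eventType") is None:
                    continue        # not an event element
                row = {"longName": family.get("longName", ""),
                       "shortName": family.get("shortName", "")}
                for cell in event:
                    name = cell.tag.rsplit("}", 1)[-1]  # drop the namespace
                    if name == "severities":
                        name = "severity"
                    if len(cell):
                        # Container element: join the text of its children.
                        value = ",".join((c.text or "").strip() for c in cell)
                    else:
                        # Leaf element: collapse newlines and extra spaces.
                        value = " ".join((cell.text or "").split())
                    if name not in fieldnames:
                        fieldnames.append(name)
                    row[name] = value
                rows.append(row)
    return fieldnames, rows


fieldnames, rows = events_to_rows(ET.fromstring(SAMPLE))
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Because the rows are dictionaries, nothing needs to be back-filled when a new column turns up later; the writer pads missing cells when the file is written.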

edited Oct 13, 2020 at 17:24

answered Oct 12, 2020 at 20:51

tdelaney


18 Comments

tdelaney Over a year ago

Interesting. It loaded fine in LibreOffice on Linux. I don't have Excel so I can't test it directly. The lines terminate with \r\n and there are terminators in the text itself too. They should be escaped and the csv reader should figure it out, but... I added a bit of code to scrub the value by removing embedded newlines before it's added. Does that work better?

tdelaney Over a year ago

Could be a rookie mistake on my part. I didn't add newline=None when opening the file. On Windows that can cause the newlines to be odd things like "\r\r\n", confusing everything.

tdelaney Over a year ago

And that's after changing to open('test.csv', 'w', newline=None, encoding='utf-8')? csv.writer defaults to the Excel dialect, which should make Excel happy. Not sure what the issue is. Maybe encoding='utf-8-sig' to add a BOM on Windows would help.

tdelaney Over a year ago

Oh, right! I thought newline=None did the same thing, but that actually gives the platform default of \r\n on Windows. Yours is the right way and I'll fix the example.

tdelaney Over a year ago

Updated to grab the names.
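
The newline and BOM points from the comments above can be sketched as follows. This is a minimal illustration, assuming a throwaway temporary file and made-up rows; whether the BOM helps depends on the Excel version, so treat encoding='utf-8-sig' as something to try and not a guaranteed fix.

```python
import csv
import os
import tempfile

# Made-up rows; the second one has a newline embedded inside a cell.
rows = [
    ["eventType", "note"],
    ["RES", "first line\nsecond line"],
]

path = os.path.join(tempfile.mkdtemp(), "example.csv")

# newline='' stops Python from translating the "\r\n" row terminators that
# csv.writer emits itself, which is what produces "\r\r\n" on Windows.
# encoding='utf-8-sig' writes a BOM at the start of the file.
with open(path, "w", newline="", encoding="utf-8-sig") as fileobj:
    csv.writer(fileobj).writerows(rows)

# Inspect the raw bytes that ended up on disk.
with open(path, "rb") as fileobj:
    raw = fileobj.read()

# Reading back with the same settings recovers the embedded newline intact.
with open(path, newline="", encoding="utf-8-sig") as fileobj:
    round_trip = list(csv.reader(fileobj))

print(raw)
print(round_trip)
```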
