
I want to convert the following XML file:

<data>
  <level_1 name="employment">
    <level_2 name="sub-employment">
      <indicator>ind1</indicator>
      <indicator>ind2</indicator>
    </level_2>
    <level_2 name="sub-employment2">
      <indicator>ind3</indicator>
    </level_2>
  </level_1>
  <level_1 name="health">
    <level_2 name="sub-health">
      <level_3 name="sub-sub-health">
        <indicator>ind4</indicator>
      </level_3>
    </level_2>
  </level_1>
</data>

into a Pandas dataframe with a result similar to:

   level_1     level_2          level_3         indicator
0  employment  sub-employment   None            ind1
1  employment  sub-employment   None            ind2
2  employment  sub-employment2  None            ind3
3  health      sub-health       sub-sub-health  ind4

I have used the following code after import xml.etree.cElementTree as et and import pandas as pd:

def getvalueofnode(node):
    """ return node text or None """
    return node.text if node is not None else None          
def main():
    """ main """
    parsed_xml = et.parse("tree.xml")
    dfcols = ['level_1', 'level_2', 'level_3', 'indicator']
    df_xml = pd.DataFrame(columns=dfcols)

    for node in parsed_xml.getroot():
        name = node.attrib.get('name')
        level_2 = node.find('level_2')
        level_3 = node.find('level_3')
        indicator = node.find('indicator')

        df_xml = df_xml.append(
            pd.Series([name, getvalueofnode(level_2), getvalueofnode(level_3),
                       getvalueofnode(indicator)], index=dfcols),
            ignore_index=True)     
    print(df_xml)     
main()

but I am getting the wrong result:

      level_1   level_2 level_3 indicator
0  employment  \n          None      None
1      health  \n          None      None

What am I doing wrong here?

  • Please edit to describe how your result is different from your desired result. Commented Jan 24, 2020 at 18:46
  • I would recommend reading the following article: ericlippert.com/2014/03/05/how-to-debug-small-programs. Commented Jan 24, 2020 at 22:28
  • I'm not sure if it's related to the issue you're having, but it's best not to repeatedly append to a DataFrame. You can make a list of tuples, a list of lists, etc. and then convert the entire thing to a DataFrame. Commented Jan 24, 2020 at 22:39

1 Answer

Define the following function, which walks upward from node and builds a dictionary mapping each ancestor's tag to its name attribute:

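def parNames(node, root):
    # Collect {tag: name attribute} for every ancestor of node below root.
    names = {}
    while True:
        node = parentMap[node]  # step up to the parent (parentMap is a global)
        if node is root:
            return names  # the root element itself is not recorded
        names[node.tag] = node.attrib['name']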

It is used in a later step and relies on the parentMap dictionary, which is created below.

Read your input file:

import xml.etree.ElementTree as et  # cElementTree was removed in Python 3.9
import pandas as pd
tree = et.parse('tree.xml')
root = tree.getroot()

Processing starts by building a parent map: a dictionary that maps each node to its parent. This is needed because ElementTree elements do not keep a reference to their parent:

parentMap = {}
for parent in root.iter():
    for child in parent:
        parentMap[child] = parent
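
To see what the parent map contains, here is a small sketch that builds it from an inline, trimmed copy of the XML in the question (the xml_text string and the parent_map / chain names are only for illustration) and then follows it upward from one indicator:

```python
import xml.etree.ElementTree as et

# A trimmed copy of the XML from the question, inlined for illustration.
xml_text = """
<data>
  <level_1 name="health">
    <level_2 name="sub-health">
      <level_3 name="sub-sub-health">
        <indicator>ind4</indicator>
      </level_3>
    </level_2>
  </level_1>
</data>
"""
root = et.fromstring(xml_text)

# Same child -> parent mapping as the nested loop, as a dict comprehension.
parent_map = {child: parent for parent in root.iter() for child in parent}

# Walk upward from the indicator, collecting the tag of each ancestor.
node = root.find('.//indicator')
chain = []
while node in parent_map:  # the root has no entry, so the walk stops there
    node = parent_map[node]
    chain.append(node.tag)

print(chain)  # ['level_3', 'level_2', 'level_1', 'data']
```

The root element never appears as a key, which is why a walk like this (or the one in parNames) needs a stopping condition at the root.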

To create source data for your DataFrame, run:

rows = []
for it in root.iter('indicator'):
    row = parNames(it, root)
    row[it.tag] = it.text
    rows.append(row)

This loop creates a list of dictionaries, one per row. Each dictionary contains:

  • under the indicator key, the text of the respective node,
  • under the "parent" keys (level_...), the name attributes of all its ancestors (returned by the parNames function).
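
As a rough, self-contained check of what these row dictionaries look like for the XML in the question, the steps can be run against an inline copy of the data. In this sketch the helper is named par_names and takes the parent map as an explicit argument instead of using a global; that is a variation for illustration, not the function as defined in this answer:

```python
import xml.etree.ElementTree as et

# The XML from the question, inlined so the sketch runs without tree.xml.
xml_text = """
<data>
  <level_1 name="employment">
    <level_2 name="sub-employment">
      <indicator>ind1</indicator>
      <indicator>ind2</indicator>
    </level_2>
    <level_2 name="sub-employment2">
      <indicator>ind3</indicator>
    </level_2>
  </level_1>
  <level_1 name="health">
    <level_2 name="sub-health">
      <level_3 name="sub-sub-health">
        <indicator>ind4</indicator>
      </level_3>
    </level_2>
  </level_1>
</data>
"""
root = et.fromstring(xml_text)
parent_map = {child: parent for parent in root.iter() for child in parent}


def par_names(node, root, parent_map):
    """Return {tag: name attribute} for each ancestor of node below root."""
    names = {}
    while True:
        node = parent_map[node]
        if node is root:
            return names
        names[node.tag] = node.attrib['name']


rows = []
for it in root.iter('indicator'):
    row = par_names(it, root, parent_map)
    row[it.tag] = it.text
    rows.append(row)

for row in rows:
    print(row)
```

Only the last row has a level_3 key, so the rows have different sets of keys; the DataFrame constructor fills the gaps with NaN.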

The next step is to create the DataFrame, filling missing levels with empty strings and sorting the columns by name:

df2 = pd.DataFrame(rows).fillna('').sort_index(axis=1)

The only remaining step is to move the indicator column to the last position:

df2 = df2.reindex(df2.columns.drop('indicator')
    .append(pd.Index(['indicator'])), axis=1)
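
If the level names are known in advance, a possibly simpler alternative is to pass the desired column order straight to the DataFrame constructor, which avoids the sort and reindex steps. The sketch below assumes a fixed list of columns (cols) and uses hand-written row dictionaries shaped like the ones built for the question's XML; missing levels are left as NaN rather than replaced with empty strings:

```python
import pandas as pd

# Hand-written rows for illustration, shaped like the per-indicator
# dictionaries described in this answer.
rows = [
    {'level_2': 'sub-employment', 'level_1': 'employment', 'indicator': 'ind1'},
    {'level_2': 'sub-employment', 'level_1': 'employment', 'indicator': 'ind2'},
    {'level_2': 'sub-employment2', 'level_1': 'employment', 'indicator': 'ind3'},
    {'level_3': 'sub-sub-health', 'level_2': 'sub-health',
     'level_1': 'health', 'indicator': 'ind4'},
]

# Assumed, fixed column order; keys missing from a row become NaN.
cols = ['level_1', 'level_2', 'level_3', 'indicator']
df = pd.DataFrame(rows, columns=cols)

print(df)
```

The approach with sort_index and reindex has the advantage of not hard-coding how many levels exist, so it is the better fit when the nesting depth is not known beforehand.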