1

I want to turn the following data into a dataframe, de xml file looks like this:


<?xml version="1.0" encoding="utf-8"?>
<file>
  <SORT_INFO>
    <sort_type>sort order</sort_type>
  </SORT_INFO>
  <ALL_INSTANCES>
    <instance>
      <ID>1</ID>
      <start>11.24</start>
      <end>31.24</end>
      <code>19. Luis Alehandru Paz Mulato</code>
      <label>
        <group>Team</group>
        <text>America de Cali</text>
      </label>
      <label>
        <group>Action</group>
        <text>Passes accurate</text>
      </label>
      <label>
        <group>Half</group>
        <text>1st half</text>
      </label>
      <pos_x>52.6</pos_x>
      <pos_y>34.2</pos_y>
    </instance>
    <instance>
      <ID>2</ID>
      <start>11.24</start>
      <end>31.24</end>
      <code>19. Luis Alehandru Paz Mulato</code>
      <label>
        <group>Team</group>
        <text>America de Cali</text>
      </label>
      <label>
        <group>Action</group>
        <text>Positional attacks</text>
      </label>
      <label>
        <group>Half</group>
        <text>1st half</text>
      </label>
      <pos_x>52.6</pos_x>
      <pos_y>34.2</pos_y>
    </instance>
    <instance>
      <ID>3</ID>
      <start>14.43</start>
      <end>34.43</end>
      <code>6. Felipe Jaramillo Velasquez</code>
      <label>
        <group>Team</group>
        <text>America de Cali</text>
      </label>
      <label>
        <group>Action</group>
        <text>Positional attacks</text>
      </label>
      <label>
        <group>Half</group>
        <text>1st half</text>
      </label>
      <pos_x>38.6</pos_x>
      <pos_y>26.2</pos_y>
    </instance>
  </ALL_INSTANCES>

After </ALL_INSTANCES> there is a little bit more, but I only need the data inside the 'instance' tags as if those were the rows. So my columns would be as follows:

['ID', 'start', 'end', 'code', 'group1', 'text1', 'group2', 'text2', 'group3', 'text3', 'pos_x','pos_y']

and the first row would be:

['1', '11.24', '31.24', '19. Luis Alehandru Paz Mulato', 'Team', 'America de Cali', 'Action', 'Passes accurate', 'Half', '1st half', '52.6','34.2']

It is my firts time working with xml files, thank you so much for your help.

I already tried the following:

import xml.etree.ElementTree as et 

xtree = et.parse("players_vs_nacional_0-3.xml")
xroot = xtree.getroot()
    

for node in xroot:         
    id_action = node.find("ID").text   
    start = node.find("start").text 
    end = node.find("end").text 
    code = node.find("code").text 
    posx = node.find("pos_x").text 
    posy = node.find("pos_y").text 

But I got the error:

AttributeError: 'NoneType' object has no attribute 'text'

I believe it is because the first nodes don't contain what I was looking for ("ID", "start", "end", etc), but I couldn't get passed that and I am also having trouble trying to figure out how to get the info out of the tags because there are three of them for every and I need them all and in the same row.

1
  • Just updated my question trying to be a little bit more specific, hope that helps Commented Dec 5, 2020 at 22:51

2 Answers 2

1

I believe you may be looking for something like this:

from lxml import etree
import pandas as pd
inst = """[your xml above, corrected; it's missing a closing </file> tag]"""
doc = etree.XML(inst.encode())

rows =[]
targets = doc.xpath('//instance')
columns = [elem.tag for elem in targets[0].xpath('.//*') if len(elem.getchildren())==0 ]

for target in targets:
    row = []
    for item in target.xpath('.//*'):            
        if len(item.getchildren())>0:
            continue
        else:
            row.append(item.text.strip())
    rows.append(row)
df = pd.DataFrame(rows,columns=columns)
df

Output:

ID  start   end     code    group   text    group   text    group   text    pos_x   pos_y
0   1   11.24   31.24   19. Luis Alehandru Paz Mulato   Team    America de Cali     Action  Passes accurate     Half    1st half    52.6    34.2
1   2   11.24   31.24   19. Luis Alehandru Paz Mulato   Team    America de Cali     Action  Positional attacks  Half    1st half    52.6    34.2
2   3   14.43   34.43   6. Felipe Jaramillo Velasquez   Team    America de Cali     Action  Positional attacks  Half    1st half    38.6    26.2
Sign up to request clarification or add additional context in comments.

Comments

1

Consider a nested list/dictionary comprehension with enumerate to number the multiple label nodes. Then pass liat of dictionaries into DataFrame constructor. Below merges dictionaries using latest method available in Python 3.5+:

data = [{ **{n.tag:n.text for n in inst},    
          **{l.tag+str(i+1):l.text for i,l in enumerate(inst.findall("label/*"))} }
           for inst in xroot.findall(".//instance") ]
           
df = pd.DataFrame(data)

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.