Parsing XML with Python: How to Make Sibling Tags into Children Tags?

Question

I want to extract the name and d tags for each food item from the XML file.

I thought about making all the d tags children of the preceding name tag and then looping over the contents of each name, but I'm not sure how to go about that or whether there are more efficient ways. I'm open to other solutions. I have some code, but it's not there yet. Thank you!

## XML 

<?xml version="1.0"?>
<breakfast_menu>
    <food>
        <name>Belgian Waffles</name>
        <d>price 5.95</d>
        <d>Two of our famous Belgian Waffles 
with plenty of real maple syrup</d>
        <d>650 cal</d>
        <name>Belgian Waffles Light</name>
        <d>price 5.15</d>
        <d>Two of our famous Belgian Waffles with less calories</d>
        <d>450 cal</d> 
    </food>
    <food>
        <name>Strawberry Belgian Waffles</name>
        <d>price 7.95</d>
        <d>Light Belgian waffles covered 
with strawberries and whipped cream</d>
        <d>900 cal</d>
    </food>
    <food>
        <name>French Toast</name>
        <d>price 4.50</d>
        <d>Thick slices made from our 
homemade sourdough bread</d>
        <d>600 cal</d>
    </food>
</breakfast_menu>

## My code

import xml.etree.ElementTree as ET
import pandas as pd
  
tree = ET.parse('xml_doc_txt.txt')
root = mytree.getroot()

[elem.tag for elem in root.iter()]

for node in root.iter('food'):
    for name in node.findall('name'):
        Name = name.text
    for d in node.findall('d'):
        description = node.findtext('d')       
        action = action.append(pd.DataFrame(data={'Name': Name, 'Description': description}, index = [0]), ignore_index = True)

df = pd.DataFrame(action, columns=['Name', 'Description'])
df

The desired df should have 2 columns like so:


| Name | Description |
| --- | --- |
| Belgian Waffles | price 5.95, Two of our famous..., 650 cal |
| Belgian Waffles Light | price 5.15, Two of our famous..., 450 cal |
| Strawberry Belgian Waffles | price 7.95, Light Belgian waffles..., 900 cal |
...

Are you sure you want 2 and not 3 columns (name, price, description)? Also, it's probably easier with lxml instead of ElementTree, if you have it installed.
– Jack Fleeting, Commented Jun 8, 2021 at 16:38
As long as it gets me to the result, I'm happy to use that. I hadn't heard of that library.
– Adri, Commented Jun 8, 2021 at 16:43

Jack Fleeting · Accepted Answer · 2021-06-08 16:45:40Z

1

Using lxml:

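import pandas as pd
from lxml import etree

menu = """your xml above"""
root = etree.fromstring(menu)

items = root.xpath('//food/name')  # every name tag, even several inside one food
columns = ['name', 'desc']
rows = []
for item in items:
    # join the text of the first two d siblings that follow this name
    rows.append([item.text, item.xpath('./following-sibling::d[1]/text()')[0] + " " + item.xpath('./following-sibling::d[2]/text()')[0]])

pd.DataFrame(rows, columns=columns)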

Output (sorry about the formatting):

    name    desc
0   Belgian Waffles     price 5.95 Two of our famous Belgian Waffles \...
1   Belgian Waffles Light   price 5.15 Two of our famous Belgian Waffles w...
2   Strawberry Belgian Waffles  price 7.95 Light Belgian waffles covered \nwit...
3   French Toast    price 4.50 Thick slices made from our \nhomema...

answered Jun 8, 2021 at 16:45


balderman · Accepted Answer · 2021-06-08 14:55:27Z

0

The code below should work:

import pandas as pd
import xml.etree.ElementTree as ET

xml = '''<breakfast_menu>
    <food>
        <name>Belgian Waffles</name>
        <d>price 5.95</d>
        <d>Two of our famous Belgian Waffles 
with plenty of real maple syrup</d>
        <d>650 cal</d>
    </food>
    <food>
        <name>Strawberry Belgian Waffles</name>
        <d>price 7.95</d>
        <d>Light Belgian waffles covered 
with strawberries and whipped cream</d>
        <d>900 cal</d>
    </food>
    <food>
        <name>French Toast</name>
        <d>price 4.50</d>
        <d>Thick slices made from our 
homemade sourdough bread</d>
        <d>600 cal</d>
    </food>
</breakfast_menu>'''

root = ET.fromstring(xml)
data = []
for food in root.findall('.//food'):
    data.append({'name': food.find('name').text, 'description': ','.join([d.text for d in food.findall('d')])})
df = pd.DataFrame(data)
print(df)

output

                         name                                        description
0             Belgian Waffles  price 5.95,Two of our famous Belgian Waffles \...
1  Strawberry Belgian Waffles  price 7.95,Light Belgian waffles covered \nwit...
2                French Toast  price 4.50,Thick slices made from our \nhomema...
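
The output above has one row per food element, which fits XML where each food holds a single name. For a food that holds several name tags, one possible approach with ElementTree alone is to walk the children of each food in document order and attach every d to the most recent name. This is only a sketch: the shortened sample XML and the names rows and current are assumptions made for illustration, not part of the original answer.

```python
import xml.etree.ElementTree as ET

# Shortened sample (assumption): the first food holds two name tags.
xml = """<breakfast_menu>
    <food>
        <name>Belgian Waffles</name>
        <d>price 5.95</d>
        <d>650 cal</d>
        <name>Belgian Waffles Light</name>
        <d>price 5.15</d>
        <d>450 cal</d>
    </food>
    <food>
        <name>French Toast</name>
        <d>price 4.50</d>
        <d>600 cal</d>
    </food>
</breakfast_menu>"""

root = ET.fromstring(xml)
rows = []
for food in root.iter('food'):
    current = None  # row for the most recent name seen in this food
    for child in food:  # direct children, in document order
        if child.tag == 'name':
            current = {'name': child.text, 'description': []}
            rows.append(current)
        elif child.tag == 'd' and current is not None:
            # collapse line breaks and repeated spaces inside the text
            current['description'].append(' '.join((child.text or '').split()))

# turn each list of d texts into one comma-separated string
for row in rows:
    row['description'] = ', '.join(row['description'])

print(rows)
```

Passing rows to pd.DataFrame(rows) should then give a two-column frame with one row per name rather than one per food.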

answered Jun 8, 2021 at 14:55


4 Comments

Adri Over a year ago

Hi, thank you for your answer. I updated the XML with a more representative sample of the doc I'm working with. In this case, I updated with the case of having 2 different foods under the same food tag.

balderman Over a year ago

Can't see any difference in the xml

Adri Over a year ago

check out the first food tag, you'll see that there are 2 options under the same food tag now. thanks!

balderman Over a year ago

see the lxml based answer

gaurav · Accepted Answer · 2021-06-09 05:49:32Z

0

Your code has some naming errors (for example, tree is assigned but mytree is used). You don't need findall every time, for instance when a tag such as name occurs only once. Also, action is not defined, yet you append to it. The code below generates the df you want:

import xml.etree.ElementTree as ET
import pandas as pd
  
tree = ET.parse('xml_doc_txt.txt')
root = tree.getroot()


breakfast_lst = []
descriptions_lst = []

for node in root.iter('food'):
    breakfasts = node.findall('name')
    descriptions =  node.findall('d')
    d_tag_num = 3  # assumes each name is followed by exactly three d tags

    for i,breakfast_name in enumerate(breakfasts):
        breakfast_lst.append(breakfast_name.text)
        full_description = ', '.join([descrpt.text for descrpt in descriptions[i*d_tag_num:(i*d_tag_num)+d_tag_num]])
        descriptions_lst.append(full_description)


df = pd.DataFrame(data={'Name':breakfast_lst,'Description':descriptions_lst})
print(df)
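
If the number of d tags per name varies, a fixed slice size no longer lines up. One alternative is the restructuring the question title asks about: move each d so that it becomes a child of the name that precedes it, then read the children of each name. The sketch below uses only ElementTree; the helper nest_d_under_name and the shortened sample XML are hypothetical and exist only for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened sample (assumption): names followed by different numbers of d tags.
xml = """<breakfast_menu>
    <food>
        <name>Belgian Waffles</name>
        <d>price 5.95</d>
        <d>650 cal</d>
        <name>Belgian Waffles Light</name>
        <d>price 5.15</d>
    </food>
</breakfast_menu>"""


def nest_d_under_name(root):
    """Move every d so it becomes a child of the name that precedes it."""
    for food in list(root.iter('food')):
        current_name = None
        for child in list(food):  # iterate over a copy, since food is mutated
            if child.tag == 'name':
                current_name = child
            elif child.tag == 'd' and current_name is not None:
                food.remove(child)
                current_name.append(child)
    return root


root = nest_d_under_name(ET.fromstring(xml))

# After restructuring, each name owns its own d children.
rows = [
    {'Name': name.text,
     'Description': ', '.join((d.text or '') for d in name.findall('d'))}
    for name in root.iter('name')
]
print(rows)
```

The name text is unaffected by the move because ElementTree stores it separately from the child elements, so pd.DataFrame(rows) can be built from the result as before.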

edited Jun 9, 2021 at 5:49

answered Jun 8, 2021 at 14:51


1 Comment

Adri Over a year ago

Thank you for your answer. I've updated the XML with a more representative sample of my data. Could you update your response?
