How to parse nested XML and extract both attributes and tag text?

Question

My XML looks like this:

<?xml version="1.0" encoding="UTF-8" ?>
<main_heading timestamp="20220113">
<details>
    <offer id="11" new_id="12">
        <level>1&amp;1</level>
        <typ>Green</typ>
        <name>Alpha</name>
        <visits>
            <name>DONT INCLUDE</name>
        </visits>
    </offer>
    <offer id="12" new_id="31">
        <level>1&amp;1</level>
        <typ>Yellow</typ>
        <name>Beta</name>
        <visits>
            <name>DONT INCLUDE</name>
        </visits>
    </offer>
</details>
</main_heading>

I want to parse certain fields into a dataframe.

Expected Output

timestamp   id     new_id   level      name
20220113    11     12       1&amp;1    Alpha
20220113    12     31       1&amp;1    Beta

The name nested inside the "visits" tag should not be included; I only want the outer "name" tag of each offer.

timestamp = soup.find('main_heading').get('timestamp')
df['timestamp'] = timestamp

This solves one part.

The rest I can do like this:

typ = []
for i in (soup.find_all('typ')):
    typ.append(i.text)

But I don't want to write a separate for loop for every new field.

The expected output is given in the question above: a dataframe. @eike
– x89, Commented Jan 23, 2023 at 13:38
The expected output, yes, but not the constraints on the algorithm. You don't want to use for loops at all?
– eike, Commented Jan 23, 2023 at 13:39
I am open to suggestions, but I am hoping for something where I don't have to write a new long loop for each field (in case I have many fields to extract), if possible. @eike
– x89, Commented Jan 23, 2023 at 13:40
If you are only interested in single subfields of offer, would one loop over all offers be acceptable?
– eike, Commented Jan 23, 2023 at 13:41

HedgeHog · Accepted Answer · 2023-01-23 13:58:03Z

3

Iterate over the offers and look up the preceding main_heading for each one:

data = []

for e in soup.select('offer'):
    data.append({
        'timestamp': e.find_previous('main_heading').get('timestamp'),
        'id': e.get('id'),
        'new_id': e.get('new_id'),  # the attribute in the XML is new_id
        'level': e.level.text,
        'typ': e.typ.text,
        'name': e.select_one('name').text  # first <name> in document order, i.e. the outer one
    })

Alternatively, to be more generic and exclude only certain elements:

data = []

for e in soup.select('offer'):

    d = {
        'timestamp': e.find_previous('main_heading').get('timestamp'),
        'id': e.get('id'),
        'new_id': e.get('new_id'),  # the attribute in the XML is new_id
    }

    # take every direct child tag except <visits>; whitespace text nodes have name None
    d.update({c.name: c.text for c in e.children if c.name is not None and 'visits' not in c.name})

    data.append(d)

Example

from bs4 import BeautifulSoup
import pandas as pd

xml = '''<?xml version="1.0" encoding="UTF-8" ?>
<main_heading timestamp="20220113">
<details>
    <offer id="11" new_id="12">
        <level>1&amp;1</level>
        <typ>Green</typ>
        <name>Alpha</name>
        <visits>
            <name>DONT INCLUDE</name>
        </visits>
    </offer>
    <offer id="12" new_id="31">
        <level>1&amp;1</level>
        <typ>Yellow</typ>
        <name>Beta</name>
        <visits>
            <name>DONT INCLUDE</name>
        </visits>
    </offer>
</details>
</main_heading>
'''
soup = BeautifulSoup(xml,'xml')

data = []

for e in soup.select('offer'):
    data.append({
        'timestamp': e.find_previous('main_heading').get('timestamp'),
        'id':e.get('id'),
        'new_id':e.get('new_id'),
        'level':e.level.text,
        'typ':e.typ.text,
        'name':e.select_one('name').text
    })

pd.DataFrame(data)

Output

	timestamp	id	new_id	level	typ	name
0	20220113	11	12	1&1	Green	Alpha
1	20220113	12	31	1&1	Yellow	Beta

edited Jan 23, 2023 at 13:58

answered Jan 23, 2023 at 13:43

HedgeHog


4 Comments

eike Over a year ago

Just out of interest, in this situation, is there a difference between select("offer") and find_all("offer")?

HedgeHog Over a year ago

Not in this specific case, because both match on the element name, but in general select uses CSS selectors: crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors

HedgeHog Over a year ago

@eike In addition, you may want to check this Q&A: stackoverflow.com/questions/38028384/…

balderman Over a year ago

@HedgeHog I think xml.etree.ElementTree is good enough here.

jqurious · Accepted Answer · 2023-01-23 14:17:53Z

2

pandas has .read_xml()

You can use xpath= to pass custom XPath expressions to specify what to extract.

For example, to select both the <offer> and <main_heading> tags:

>>> pd.read_xml("main.xml", xpath="""//*[name() = "offer" or name() = "main_heading"]""")
    timestamp  details    id  new_id level     typ   name  visits
0  20220113.0      NaN   NaN     NaN  None    None   None     NaN
1         NaN      NaN  11.0    12.0   1&1   Green  Alpha     NaN
2         NaN      NaN  12.0    31.0   1&1  Yellow   Beta     NaN

From there you could .ffill() the timestamp and drop the details/visits columns:

>>> (pd.read_xml("main.xml", xpath="""//*[name() = "offer" or name() = "main_heading"]""")
...    .ffill()
...    .drop(columns=["details", "visits"])
...    .dropna()
... )
    timestamp    id  new_id level     typ   name
1  20220113.0  11.0    12.0   1&1   Green  Alpha
2  20220113.0  12.0    31.0   1&1  Yellow   Beta

answered Jan 23, 2023 at 14:17

jqurious


2 Comments

HedgeHog Over a year ago

Never thought of using the condition in the xpath, top alternative to work directly with pandas

Jack Fleeting Over a year ago

Looks like we posted a few seconds apart - using the same idea! Great minds think alike (if I say so myself)!

balderman · Accepted Answer · 2023-01-23 14:04:07Z

1

No external library is needed for the parsing.

The standard library's xml.etree.ElementTree is enough here; pandas is only used to build the dataframe.

import xml.etree.ElementTree as ET
import pandas as pd

xml = '''<?xml version="1.0" encoding="UTF-8" ?>
<main_heading timestamp="20220113">
<details>
    <offer id="11" new_id="12">
        <level>1&amp;1</level>
        <typ>Green</typ>
        <name>Alpha</name>
        <visits>
            <name>DONT INCLUDE</name>
        </visits>
    </offer>
    <offer id="12" new_id="31">
        <level>1&amp;1</level>
        <typ>Yellow</typ>
        <name>Beta</name>
        <visits>
            <name>DONT INCLUDE</name>
        </visits>
    </offer>
</details>
</main_heading>'''

data = []
root = ET.fromstring(xml)
timestamp = root.attrib.get('timestamp')
for offer in root.findall('.//offer'):
    temp = {'timestamp': timestamp}
    for attr in ['id', 'new_id']:
        temp[attr] = offer.attrib.get(attr)
    for ele in ['level', 'name']:
        temp[ele] = offer.find(ele).text
    data.append(temp)
df = pd.DataFrame(data)
print(df)

output

  timestamp  id new_id level   name
0  20220113  11     12   1&1  Alpha
1  20220113  12     31   1&1   Beta
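
If you would rather not list every field by name, the same standard-library approach can be made generic. The following is only a sketch and not part of the answer above: it copies all attributes of each offer and the text of every direct child, skipping the tags in a hypothetical EXCLUDE set. It relies on the fact that iterating over an Element yields only its direct children, so the name nested inside visits is never reached.

```python
import xml.etree.ElementTree as ET

# Sample data from the question (XML declaration omitted for brevity).
xml = '''<main_heading timestamp="20220113">
<details>
    <offer id="11" new_id="12">
        <level>1&amp;1</level>
        <typ>Green</typ>
        <name>Alpha</name>
        <visits>
            <name>DONT INCLUDE</name>
        </visits>
    </offer>
    <offer id="12" new_id="31">
        <level>1&amp;1</level>
        <typ>Yellow</typ>
        <name>Beta</name>
        <visits>
            <name>DONT INCLUDE</name>
        </visits>
    </offer>
</details>
</main_heading>'''

EXCLUDE = {'visits'}  # hypothetical name: direct child tags to leave out

root = ET.fromstring(xml)
rows = []
for offer in root.iter('offer'):
    row = {'timestamp': root.get('timestamp')}
    row.update(offer.attrib)  # all attributes of the offer: id, new_id
    # Iterating an Element yields only its direct children, not deeper descendants.
    row.update({child.tag: child.text for child in offer if child.tag not in EXCLUDE})
    rows.append(row)

for row in rows:
    print(row)
```

The resulting list of dicts can be passed to pd.DataFrame(rows) as before. Add 'typ' to EXCLUDE if that column is not wanted either.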

answered Jan 23, 2023 at 14:04

balderman


1 Comment

HedgeHog Over a year ago

plausible alternative, will keep it in mind to keep it simpler if necessary.

Jack Fleeting · Accepted Answer · 2023-01-23 14:17:04Z

1

For the sake of completeness (and future visitors) here's another one: since we're dealing with xml and the final output is a dataframe - it's probably best (and simplest) to use pandas.read_xml:

df = pd.read_xml(xml,xpath='//offer')
ts = pd.read_xml(xml,xpath="//main_heading")['timestamp'][0]
df.insert(0, 'timestamp', ts)
print(df.drop(['typ', 'visits'], axis=1))

And that should get you your expected output.
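
A note for readers on newer pandas: as far as I know, passing the XML text itself to read_xml has been deprecated since pandas 2.1, and wrapping it in io.StringIO is the suggested replacement. Below is a minimal sketch of the same idea under that assumption. It also assumes lxml is installed (the default parser for read_xml), and the xml string is simply the sample from the question.

```python
from io import StringIO

import pandas as pd

# Sample data from the question.
xml = '''<?xml version="1.0" encoding="UTF-8" ?>
<main_heading timestamp="20220113">
<details>
    <offer id="11" new_id="12">
        <level>1&amp;1</level>
        <typ>Green</typ>
        <name>Alpha</name>
        <visits>
            <name>DONT INCLUDE</name>
        </visits>
    </offer>
    <offer id="12" new_id="31">
        <level>1&amp;1</level>
        <typ>Yellow</typ>
        <name>Beta</name>
        <visits>
            <name>DONT INCLUDE</name>
        </visits>
    </offer>
</details>
</main_heading>
'''

# One row per <offer>: its attributes and direct children become columns.
df = pd.read_xml(StringIO(xml), xpath="//offer")

# The timestamp lives on the root element, so read it separately.
ts = pd.read_xml(StringIO(xml), xpath="//main_heading")["timestamp"][0]
df.insert(0, "timestamp", ts)

# Drop the columns that are not part of the expected output.
result = df.drop(columns=["typ", "visits"])
print(result)
```

Each call gets its own StringIO object because a buffer is consumed once it has been read.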

answered Jan 23, 2023 at 14:17

Jack Fleeting


3 Comments

HedgeHog Over a year ago

Like the bouquet of possibilities, learned a few things again today - I find the direct use of pandas just as good as with @jqurious

Jack Fleeting Over a year ago

@HedgeHog - "bouquet of possibilities" - you should definitely spend some of your SO time in a poetry site :).

HedgeHog Over a year ago

Isn't there a little poet in all of us? We are source code poets, writing beautiful lettered soup, lookup for cute pandas, wondering about bouquet of possibilities. You're right, more poetry from today ;)
