1

My XML looks like this:

<?xml version="1.0" encoding="UTF-8" ?>
<main_heading timestamp="20220113">
<details>
    <offer id="11" new_id="12">
        <level>1&amp;1</level>
        <typ>Green</typ>
        <name>Alpha</name>
        <visits>
            <name>DONT INCLUDE</name>
        </visits>
    </offer>
    <offer id="12" new_id="31">
        <level>1&amp;1</level>
        <typ>Yellow</typ>
        <name>Beta</name>
        <visits>
            <name>DONT INCLUDE</name>
        </visits>
    </offer>
</details>
</main_heading>

I want to parse certain fields into a dataframe.

Expected Output

timestamp   id     new_id   level      name
20220113    11     12       1&amp;1    Alpha
20220113    12     31       1&amp;1    Beta

The "name" tag nested inside the "visits" tag must not be included; I only want the outer "name" tag, the one that is a direct child of "offer".

timestamp = soup.find('main_heading').get('timestamp')
df['timestamp'] = timestamp

This solves one part.

The rest I can do like this:

typ = []
for i in (soup.find_all('typ')):
    typ.append(i.text)

But I don't want to write a separate for loop for every new field.

5
  • what exactly do you expect? Commented Jan 23, 2023 at 13:35
  • expected output is given in the qs above. A dataframe. @eike Commented Jan 23, 2023 at 13:38
  • expected output, yes, but not the constraints for the algorithm. you don't want to use for loops at all? Commented Jan 23, 2023 at 13:39
  • I am open to suggestions but i am hoping for something where i don't have to create a new long loop for each field (just in case i have too many fields to extract) if possible @eike Commented Jan 23, 2023 at 13:40
  • If you are only interested in single subfields of offer, would one loop over all offers be acceptable? Commented Jan 23, 2023 at 13:41

4 Answers 4

3

Iterate over the offers and, for each one, look up the preceding main_heading to read its timestamp:

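data = []

for e in soup.select('offer'):
    data.append({
        # the enclosing <main_heading> precedes every <offer> in document order
        'timestamp': e.find_previous('main_heading').get('timestamp'),
        'id': e.get('id'),
        'new_id': e.get('new_id'),
        'level': e.level.text,
        'typ': e.typ.text,
        # select_one returns the first match, which here is the outer <name>
        'name': e.select_one('name').text
    })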

Alternatively, to be more generic, take every direct child of the offer and exclude only the elements you do not want:

for e in soup.select('offer'):

    d = {
        'timestamp': e.find_previous('main_heading').get('timestamp'),
        'id': e.get('id'),
        'new_id': e.get('new_id'),
    }

    # e.children yields direct children only; text nodes have name None and are skipped
    d.update({c.name: c.text for c in e.children if c.name is not None and 'visits' not in c.name})

    data.append(d)
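
For readers who want to try the "every direct child except a few" idea without installing BeautifulSoup, here is a minimal sketch that swaps in the standard library's xml.etree.ElementTree. It is not the code from this answer: the helper name offers_to_rows and the EXCLUDED set are hypothetical, and the XML is a trimmed copy of the question's sample.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample XML from the question (declaration omitted).
XML = """<main_heading timestamp="20220113">
<details>
    <offer id="11" new_id="12">
        <level>1&amp;1</level>
        <typ>Green</typ>
        <name>Alpha</name>
        <visits><name>DONT INCLUDE</name></visits>
    </offer>
    <offer id="12" new_id="31">
        <level>1&amp;1</level>
        <typ>Yellow</typ>
        <name>Beta</name>
        <visits><name>DONT INCLUDE</name></visits>
    </offer>
</details>
</main_heading>"""

EXCLUDED = {"visits"}  # hypothetical: child tags to leave out of each row


def offers_to_rows(xml_text, excluded=EXCLUDED):
    """Return one dict per <offer>: timestamp, attributes, and direct children."""
    root = ET.fromstring(xml_text)
    rows = []
    for offer in root.iter("offer"):
        row = {"timestamp": root.get("timestamp")}
        row.update(offer.attrib)  # id and new_id
        # Iterating an Element yields its direct children only, so the
        # <name> nested inside <visits> is never visited.
        row.update({child.tag: child.text for child in offer
                    if child.tag not in excluded})
        rows.append(row)
    return rows


rows = offers_to_rows(XML)
print(rows[0])
```

The resulting list of dicts can be passed to pd.DataFrame(rows) in the same way as the data list above. New child elements are picked up automatically, so no extra loop is needed per field.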

Example

from bs4 import BeautifulSoup
import pandas as pd

xml = '''<?xml version="1.0" encoding="UTF-8" ?>
<main_heading timestamp="20220113">
<details>
    <offer id="11" new_id="12">
        <level>1&amp;1</level>
        <typ>Green</typ>
        <name>Alpha</name>
        <visits>
            <name>DONT INCLUDE</name>
        </visits>
    </offer>
    <offer id="12" new_id="31">
        <level>1&amp;1</level>
        <typ>Yellow</typ>
        <name>Beta</name>
        <visits>
            <name>DONT INCLUDE</name>
        </visits>
    </offer>
</details>
</main_heading>
'''
soup = BeautifulSoup(xml,'xml')

data = []

for e in soup.select('offer'):
    data.append({
        'timestamp': e.find_previous('main_heading').get('timestamp'),
        'id': e.get('id'),
        'new_id': e.get('new_id'),
        'level': e.level.text,
        'typ': e.typ.text,
        'name': e.select_one('name').text
    })

pd.DataFrame(data)

Output

  timestamp  id new_id level     typ   name
0  20220113  11     12   1&1   Green  Alpha
1  20220113  12     31   1&1  Yellow   Beta

4 Comments

Just out of interest, in this situation, is there a difference between select("offer") and find_all("offer")?
Not in this specific case, cause both use the elements name, but in general the fact that select uses css selectors -> crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
@eike In addition, may check this q/a: stackoverflow.com/questions/38028384/…
@HedgeHog I think xml.etree.ElementTree is good enough here.
2

pandas has read_xml() (available since pandas 1.3).

You can use xpath= to pass custom XPath expressions to specify what to extract.

For example, <offer> and <main_heading> tags:

>>> pd.read_xml("main.xml", xpath="""//*[name() = "offer" or name() = "main_heading"]""")
    timestamp  details    id  new_id level     typ   name  visits
0  20220113.0      NaN   NaN     NaN  None    None   None     NaN
1         NaN      NaN  11.0    12.0   1&1   Green  Alpha     NaN
2         NaN      NaN  12.0    31.0   1&1  Yellow   Beta     NaN

From there you could .ffill() the timestamp and drop the details/visits columns:

>>> (pd.read_xml("main.xml", xpath="""//*[name() = "offer" or name() = "main_heading"]""")
...    .ffill()
...    .drop(columns=["details", "visits"])
...    .dropna()
... )
    timestamp    id  new_id level     typ   name
1  20220113.0  11.0    12.0   1&1   Green  Alpha
2  20220113.0  12.0    31.0   1&1  Yellow   Beta

2 Comments

Never thought of using the condition in the xpath, top alternative to work directly with pandas
Looks like we posted a few seconds apart - using the same idea! Great minds think alike (if I say so myself)!
1

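No need for an external parsing library.

The standard library's xml.etree.ElementTree is enough to read the XML; pandas is only used to build the dataframe at the end.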

import xml.etree.ElementTree as ET
import pandas as pd

xml = '''<?xml version="1.0" encoding="UTF-8" ?>
<main_heading timestamp="20220113">
<details>
    <offer id="11" new_id="12">
        <level>1&amp;1</level>
        <typ>Green</typ>
        <name>Alpha</name>
        <visits>
            <name>DONT INCLUDE</name>
        </visits>
    </offer>
    <offer id="12" new_id="31">
        <level>1&amp;1</level>
        <typ>Yellow</typ>
        <name>Beta</name>
        <visits>
            <name>DONT INCLUDE</name>
        </visits>
    </offer>
</details>
</main_heading>'''

data = []
root = ET.fromstring(xml)
timestamp = root.attrib.get('timestamp')
for offer in root.findall('.//offer'):
    temp = {'timestamp': timestamp}
    for attr in ['id', 'new_id']:
        temp[attr] = offer.attrib.get(attr)
    for ele in ['level', 'name']:
        temp[ele] = offer.find(ele).text
    data.append(temp)
df = pd.DataFrame(data)
print(df)

output

  timestamp  id new_id level   name
0  20220113  11     12   1&1  Alpha
1  20220113  12     31   1&1   Beta
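
The reason the nested name is skipped in the loop above is that a bare tag name passed to find or findall matches direct children only, whereas a path starting with .// searches all descendants. The following is a minimal sketch of that difference on a cut-down offer element; the missing tag price is a hypothetical field used only to show findtext with a default.

```python
import xml.etree.ElementTree as ET

# Cut-down <offer> with an outer <name> and one nested inside <visits>.
offer = ET.fromstring(
    "<offer id='11'>"
    "<name>Alpha</name>"
    "<visits><name>DONT INCLUDE</name></visits>"
    "</offer>"
)

# A bare tag name is matched against direct children only.
direct = [e.text for e in offer.findall("name")]

# './/name' searches every descendant, so the nested one shows up too.
nested = [e.text for e in offer.findall(".//name")]

# findtext returns the default instead of failing when a tag is absent;
# offer.find('price').text would raise AttributeError here.
missing = offer.findtext("price", default=None)

print(direct, nested, missing)
```

If some offers may lack a field, replacing offer.find(ele).text with offer.findtext(ele) in the loop above is one way to get None in that cell rather than an exception.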

1 Comment

plausible alternative, will keep it in mind to keep it simpler if necessary.
1

For the sake of completeness (and future visitors) here's another one: since we're dealing with xml and the final output is a dataframe - it's probably best (and simplest) to use pandas.read_xml:

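from io import StringIO
import pandas as pd

# Wrap the XML string in StringIO: passing literal XML to read_xml is deprecated since pandas 2.1
df = pd.read_xml(StringIO(xml), xpath='//offer')
ts = pd.read_xml(StringIO(xml), xpath='//main_heading')['timestamp'][0]
df.insert(0, 'timestamp', ts)
print(df.drop(columns=['typ', 'visits']))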

And that should get you your expected output.

3 Comments

Like the bouquet of possibilities, learned a few things again today - I find the direct use of pandas just as good as with @jqurious
@HedgeHog - "bouquet of possibilities" - you should definitely spend some of your SO time in a poetry site :).
Isn't there a little poet in all of us? We are source code poets, writing beautiful lettered soup, lookup for cute pandas, wondering about bouquet of possibilities. You're right, more poetry from today ;)
