<?xml version="2.0" encoding="UTF-8" ?><timestamp="20220113">
<defintions>
    <defintion id="1" old_id="0">Lang</defintion>
    <defintion id="7" old_id="1">Eng</defintion>

How can I parse an XML file that looks like this? Here, a single tag carries several attribute values. I want to extract values such as "id" and "old_id" into a list or DataFrame.

3 Answers

You could use BeautifulSoup with its xml parser: select the elements you need, then iterate over the ResultSet and read each attribute value via .get().

from bs4 import BeautifulSoup

# The 'xml' parser requires the lxml package to be installed
with open('filename.xml', 'r') as f:
    soup = BeautifulSoup(f.read(), 'xml')

Example

from bs4 import BeautifulSoup
import pandas as pd

xml = '''<?xml version="2.0" encoding="UTF-8" ?><timestamp="20220113">
<defintions>
    <defintion id="1" old_id="0">Lang</defintion>
    <defintion id="7" old_id="1">Eng</defintion>
'''
soup = BeautifulSoup(xml,'xml')


pd.DataFrame(
    [
        (e.get('id'),e.get('old_id'))
        for e in soup.select('defintion')
    ],
    columns = ['id','old_id']
)

Output

  id old_id
0  1      0
1  7      1

3 Comments

Could you also help with a second use case? In this case, I need to extract a combination: the attributes of one tag (i.e. offer, like we did earlier), the contents of some tags themselves (e.g. level, name), and the attribute of the first tag (timestamp), whose value would repeat across all rows. I edited the question.
To keep the original question clean, this would be better asked as a new question with exactly this focus. Simply drop the link in the comments to reference your new question; that would be great.

Using Python's Beautiful Soup, you could parse the .xml file into a BeautifulSoup object and then call .findAll('defintion') on it. Then loop through the tags you find and read the desired attribute values.

# `object` stands for the parsed BeautifulSoup instance
defintions = object.findAll('defintion')

for defintion in defintions:
    old_id = defintion['old_id']
    id = defintion['id']

references: https://www.crummy.com/software/BeautifulSoup/bs4/doc/ https://linuxhint.com/parse_xml_python_beautifulsoup/

3 Comments

how do you define "object" if you are reading the content from a file?
In newer code, avoid the old findAll() syntax; use find_all() or select() with CSS selectors instead. For more, take a minute to check the docs.
with open('teachers.xml', 'r') as f:
    file = f.read()
# 'xml' is the parser used. For HTML files, which BeautifulSoup is typically used for, it would be 'html.parser'.
soup = BeautifulSoup(file, 'xml')
Ref: stackabuse.com/parsing-xml-with-beautifulsoup-in-python

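If you have valid XML like the following (a tag name cannot be assigned a value the way <timestamp="20220113"> tries to, so the timestamp is written as an attribute of a root element here):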

<?xml version='1.0' encoding='utf-8'?>
<root timestamp='20220113'>
<defintions>
    <defintion id="1" old_id="0">Lang</defintion>
    <defintion id="7" old_id="1">Eng</defintion>
</defintions>
</root>

Then you can use pandas:

import pandas as pd

df = pd.read_xml('x89.xml', xpath='.//defintion')
print(df.to_string(index=False))

Output:

 id  old_id defintion
  1       0      Lang
  7       1       Eng
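
If pandas is not available, the same rows can be collected with the standard library. The following is only a minimal sketch, assuming the well-formed XML shown above; the function name extract_rows and the dictionary keys "timestamp" and "text" are hypothetical names chosen for illustration. It also copies the root's timestamp attribute into every row, which is one possible way to approach the follow-up use case raised in the comments.

```python
import xml.etree.ElementTree as ET

# Well-formed sample matching the XML in this answer.
XML = """<?xml version='1.0' encoding='utf-8'?>
<root timestamp='20220113'>
<defintions>
    <defintion id="1" old_id="0">Lang</defintion>
    <defintion id="7" old_id="1">Eng</defintion>
</defintions>
</root>"""


def extract_rows(xml_text):
    """Return one dict per <defintion> element (hypothetical helper)."""
    # Parse from bytes so the encoding declaration is honoured.
    root = ET.fromstring(xml_text.encode("utf-8"))
    # Attribute of the root element, repeated on every row.
    timestamp = root.get("timestamp")
    rows = []
    for elem in root.iter("defintion"):
        rows.append(
            {
                "timestamp": timestamp,
                "id": elem.get("id"),          # attribute values are strings
                "old_id": elem.get("old_id"),
                "text": elem.text,             # element content, e.g. "Lang"
            }
        )
    return rows


rows = extract_rows(XML)
for row in rows:
    print(row)
```

The resulting list of dictionaries can be passed to pd.DataFrame(rows) if a DataFrame is wanted later. Note that ElementTree raises a ParseError on the malformed XML from the question, so this approach only applies once the input is valid.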
