Parsing multiple XML files with Python, finding specific text in each file, and tabulating the output

Question

I am trying to parse multiple XML files for a specific tag and, if a file contains that tag, extract the text associated with it.

I have been learning Python on and off for over a year, and this is my first attempt at dealing with XML.

Here is my code, where changeM is the tag of interest:

import os
import glob
import xml.etree.ElementTree as ET
import pandas as pd

read_files = glob.glob(os.path.join(path, '*.xml'))

for file in read_files:
    
    new_tree = ET.parse(file)
    root = new_tree.getroot()
    
    changes=[]
    for elm in root.findall('.//para[@changeM="1"]'):
        changes.append(elm.text)

The list named 'changes' comes out empty. Alternatively, if I drop the list from the code above and substitute a print statement, it picks up one of the texts but prints the same match repeatedly.

Parfait · Accepted Answer · 2021-01-14 23:19:05Z

1

Consider a list/dict comprehension using a user-defined function:

import re
import xml.etree.ElementTree as ET

def parse_data(xml_file):
    doc = ET.parse(xml_file)

    # LIST COMPREHENSION
    elem_texts = [elem.text for elem in doc.findall(".//para[@changeMark='1']")]

    return elem_texts


# DICT WITH FILE NAMES FOR KEYS AND PARSED TEXT LISTS FOR VALUES
changes_dict = {f:parse_data(f) for f in read_files if re.match(r'.*EN.*', f)}

# FLAT LIST WITH NO FILE INDICATOR
changes_list = [item for f,lst in changes_dict.items() for item in lst]
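To try the approach above without real data, here is a minimal self-contained sketch. The sample XML, the file names, the temporary folder, and the `collect_changes` helper are made up for illustration; only `parse_data` and the two comprehensions follow the answer.

```python
import glob
import os
import re
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical sample documents; real files will look different.
SAMPLES = {
    "manual_EN.xml": "<doc><para changeMark='1'>New text</para><para>Old text</para></doc>",
    "manual_FR.xml": "<doc><para changeMark='1'>Nouveau texte</para></doc>",
    "notes_EN.xml": "<doc><sec><para changeMark='1'>Nested change</para></sec></doc>",
}


def parse_data(xml_file):
    """Return the text of every <para changeMark="1"> element in one file."""
    doc = ET.parse(xml_file)
    return [elem.text for elem in doc.findall(".//para[@changeMark='1']")]


def collect_changes(folder):
    """Map each 'EN' file name in the folder to its list of changed texts."""
    read_files = sorted(glob.glob(os.path.join(folder, "*.xml")))
    return {
        os.path.basename(f): parse_data(f)
        for f in read_files
        if re.match(r".*EN.*", os.path.basename(f))
    }


# Write the samples to a temporary folder so glob has real files to find.
with tempfile.TemporaryDirectory() as folder:
    for name, xml_text in SAMPLES.items():
        with open(os.path.join(folder, name), "w", encoding="utf-8") as fh:
            fh.write(xml_text)
    changes_dict = collect_changes(folder)

# Flat list with no file indicator, as in the answer.
changes_list = [item for texts in changes_dict.values() for item in texts]
```

The one deliberate difference from the answer is that the `EN` filter is applied to the base name rather than the full path, so a folder whose name happens to contain "EN" does not let every file through.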

2 Comments

Prolle Over a year ago

Thanks, that works and is elegant. In reality I am dealing with multiple XML files, each so complicated, with roughly 20+ parent-child relationships, that it becomes very difficult to read even with various XML viewers. I am able to extract data from nodes of interest, but only in isolation; I find it hard to grab two related nodes. Any guidance?

Parfait Over a year ago

Great to hear and glad to help! There are many ways to parse an XML file across elements and attributes. Your follow-up question is a bit broad, and the specific desired result is unknown. There is no standard XML structure, so there is no blanket, one-size-fits-all solution. Specific context is required. Consider researching XPath/XSLT, attempting a solution, and, if needed, asking a new question with a specific XML sample and the full desired result.
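As one illustration of grabbing two related nodes with ElementTree alone, a sketch can treat every element as a potential parent and read a sibling of each matching `para`. The `section` and `title` element names, the sample document, and the `changed_paras_with_titles` helper are assumptions, since the real structure was not posted.

```python
import xml.etree.ElementTree as ET

# Hypothetical structure: each <section> has a <title> and some <para> children.
XML_TEXT = """
<doc>
  <section>
    <title>Installation</title>
    <para changeMark="1">Use version 2 of the installer.</para>
    <para>Unchanged paragraph.</para>
  </section>
  <section>
    <title>Usage</title>
    <para>Nothing changed here.</para>
  </section>
  <section>
    <title>Support</title>
    <para changeMark="1">New contact address.</para>
  </section>
</doc>
"""


def changed_paras_with_titles(root):
    """Pair each changed <para> with the <title> text of its direct parent."""
    pairs = []
    for parent in root.iter():  # every element, in document order
        # A path without './/' matches direct children only.
        for para in parent.findall("para[@changeMark='1']"):
            # findtext returns None when the parent has no <title> child.
            pairs.append((parent.findtext("title"), para.text))
    return pairs


root = ET.fromstring(XML_TEXT)
pairs = changed_paras_with_titles(root)
```

Walking the parents first sidesteps the fact that ElementTree elements keep no reference to their parent. For deeper or less regular relationships, fuller XPath support (for example in a third-party library) may be a better fit.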

Prolle · Accepted Answer · 2021-01-14 22:35:04Z

0

I think I have this worked out.

import re

changes = []
for file in read_files:
    if re.match(r'.*EN.*', file):
        tree = ET.parse(file)  # gives an ElementTree
        root = tree.getroot()  # gives an Element, the root element
        for elm in root.findall('.//para[@changeMark="1"]'):
            changes.append(elm.text)
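The tabulation part of the title is not covered by the code above. One possible sketch, using only the standard library, keeps the file name next to each matched text and renders the rows as CSV. The `tabulate_changes` and `rows_to_csv` helpers, the column names, and the in-memory sample documents are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical in-memory documents standing in for the parsed files.
DOCUMENTS = {
    "a_EN.xml": "<doc><para changeMark='1'>First change</para></doc>",
    "b_EN.xml": "<doc><para>No change</para></doc>",
    "c_EN.xml": "<doc><para changeMark='1'>Second</para><para changeMark='1'>Third</para></doc>",
}


def tabulate_changes(documents):
    """Return one (file name, text) row per changed <para>."""
    rows = []  # created once, before the loop, so results accumulate across files
    for name, xml_text in documents.items():
        root = ET.fromstring(xml_text)
        for elm in root.findall('.//para[@changeMark="1"]'):
            rows.append((name, elm.text))
    return rows


def rows_to_csv(rows):
    """Render the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["file", "text"])
    writer.writerows(rows)
    return buffer.getvalue()


rows = tabulate_changes(DOCUMENTS)
csv_text = rows_to_csv(rows)
```

Since the question already imports pandas, the same `rows` list could instead be passed to `pd.DataFrame(rows, columns=["file", "text"])`.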
