I am trying to parse multiple XML files for a specific tag and, if a file contains that tag, extract the text associated with it.

I have been learning Python on and off for over a year, and this is my first attempt at dealing with XML.

Here is my code, where changeM is the tag of interest:

import os
import glob
import xml.etree.ElementTree as ET
import pandas as pd

read_files = glob.glob(os.path.join(path, '*.xml'))

for file in read_files:
    
    new_tree = ET.parse(file)
    root = new_tree.getroot()
    
    changes=[]
    for elm in root.findall('.//para[@changeM="1"]'):
        changes.append(elm.text)

The list named 'changes' is empty. Alternatively, if I drop the list from the code above and substitute a print statement, it picks up one of the text matches but prints that same match repeatedly.

2 Answers

1

Consider a list/dict comprehension using a user-defined function:

import re
import xml.etree.ElementTree as ET

def parse_data(xml_file):
    doc = ET.parse(xml_file)

    # LIST COMPREHENSION
    elem_texts = [elem.text for elem in doc.findall(".//para[@changeMark='1']")]

    return elem_texts


# DICT WITH FILE NAMES FOR KEYS AND PARSED TEXT LISTS FOR VALUES
# (read_files IS THE glob RESULT FROM THE QUESTION)
changes_dict = {f: parse_data(f) for f in read_files if re.match(r'.*EN.*', f)}

# FLAT LIST WITH NO FILE INDICATOR
changes_list = [item for f, lst in changes_dict.items() for item in lst]
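
To try the approach without touching real files, here is a minimal sketch that runs the same XPath against an in-memory document. The sample XML, the `section`/`para` layout and the `SAMPLE_XML` name are made up for illustration; only the `changeMark` attribute comes from the code above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; real files will have a different layout.
SAMPLE_XML = """
<doc>
  <section>
    <para changeMark="1">first change</para>
    <para>unchanged text</para>
  </section>
  <section>
    <para changeMark="1">second change</para>
    <para changeMark="0">not flagged</para>
  </section>
</doc>
"""

root = ET.fromstring(SAMPLE_XML)

# .// searches all descendants; [@changeMark='1'] keeps only elements
# whose changeMark attribute equals "1"
elem_texts = [elem.text for elem in root.findall(".//para[@changeMark='1']")]
print(elem_texts)

# A misspelled attribute name does not raise; it simply matches nothing
no_match = root.findall(".//para[@changeM='1']")
print(no_match)
```

The attribute name in the predicate has to match the file exactly: `changeM` and `changeMark` are different attributes, and a mismatch returns an empty list rather than an error.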

2 Comments

Thanks, that works and is elegant. In reality I am dealing with multiple XML files, each so complicated (c.20+ parent-child relationships) that it becomes very difficult to read even with various XML viewers. I am able to extract data from nodes of interest, but only in isolation; I find it hard to grab two related nodes. Any guidance?
Great to hear and glad to help! There are many ways to parse an XML file across elements and attributes. Your follow-up question is a bit broad, and the specific desired result is unknown. There is no standard XML structure that allows a blanket, one-size-fits-all solution; specific context is required. Consider researching XPath/XSLT, attempting a solution, and, if needed, asking a new question with a specific XML sample and the full desired result.
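
As one possible starting point for the "two related nodes" case, the sketch below pairs each flagged `para` with a sibling `title` by iterating over their shared parent first. The `section` and `title` element names are assumptions invented for this example, since the real structure is not shown in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical structure: each <section> holds a <title> and several <para>
SAMPLE_XML = """
<doc>
  <section>
    <title>Intro</title>
    <para changeMark="1">first change</para>
  </section>
  <section>
    <title>Details</title>
    <para>unchanged text</para>
    <para changeMark="1">second change</para>
  </section>
</doc>
"""

root = ET.fromstring(SAMPLE_XML)

pairs = []
# Walk the common parent, then look up both related children relative to it
for section in root.iter("section"):
    title = section.findtext("title")
    for para in section.findall("para[@changeMark='1']"):
        pairs.append((title, para.text))

print(pairs)
```

ElementTree elements do not keep a reference to their parent, so working top-down from the shared ancestor tends to be simpler than starting at the child and trying to climb up.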
0

I think I have this worked out.

import re

changes = []  # created once, before the loop, so matches accumulate
for file in read_files:
    if re.match(r'.*EN.*', file):
        tree = ET.parse(file)  # gives an ElementTree
        root = tree.getroot()  # gives an Element, the root element
        for elm in root.findall('.//para[@changeMark="1"]'):
            changes.append(elm.text)
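
A possible explanation for the empty list in the question: `changes = []` sits inside the `for file` loop there, so it is reset for every file and only the last file's matches survive. The sketch below reproduces that with two hypothetical in-memory documents standing in for files (`DOCS` and its contents are made up for illustration).

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-ins for two XML files; the second has no flagged <para>
DOCS = [
    '<doc><para changeMark="1">only change</para></doc>',
    '<doc><para>nothing flagged</para></doc>',
]

# List created inside the loop: reset on every iteration
for xml_text in DOCS:
    reset_each_time = []
    for elm in ET.fromstring(xml_text).findall('.//para[@changeMark="1"]'):
        reset_each_time.append(elm.text)

# List created once, before the loop: matches accumulate across documents
accumulated = []
for xml_text in DOCS:
    for elm in ET.fromstring(xml_text).findall('.//para[@changeMark="1"]'):
        accumulated.append(elm.text)

print(reset_each_time)
print(accumulated)
```

With the list inside the loop, the result depends entirely on whichever document happens to be processed last.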
