Skipping files if xml like part is missing

Question

I am analyzing XML data that is embedded in several files. To work with it, I first need to separate the XML part from the rest of each file.

For this I use the split() method and search for <Data.

Here I run into a problem: some of the files contain no XML data, and I would like to simply skip those files.

import glob

path = r"C:\Users\Nathan\Desktop\Test\*.xml"

for xml in glob.glob(path):
    with open(xml) as data_file:
        file_content = data_file.read()

        xml_part1 = file_content.split("<Data",1)[1] # here I get an error (IndexError) if "<Data" is not in the file
        xml_part2 = xml_part1.split("Data>",1)[0]
        xml_file = "<Data" + xml_part2+"Data>"

I would be very grateful for any help.

Or use a regular expression <Data.*?Data> to get the XML portion of the file.
– Barmar, Commented Sep 1, 2023 at 14:48
This is just horrible. Mixing XML and non-XML syntax in a single file is making your life really difficult. Much better to use XML for the whole thing - or at least embed it cleanly in some other language like JSON that you can parse with a proper parser.
– Michael Kay, Commented Sep 1, 2023 at 20:18

Parfait · Accepted Answer · 2023-09-02 16:33:07Z

For future readers: avoid the OP's approach of treating XML as plain text rather than as a markup document made of elements and attributes. Consider using a standards-compliant XML library such as Python's built-in xml.etree.ElementTree or the third-party lxml to parse the relevant content.

Specifically, the XPath expression ".//Data" can be used to check for this element in a document: find() returns None if the element does not exist, and findtext() returns None if the element does not exist or an empty string if it exists but has no text.

import xml.etree.ElementTree as ET
import glob

path = r"C:\Users\Nathan\Desktop\Test\*.xml"

for xml in glob.glob(path):
    # PARSE DOCUMENT INTO XML TREE
    tree = ET.parse(xml)

    # SEARCH ELEMENT AND TEXT
    data_elem = tree.find(".//Data")
    data_text = tree.findtext(".//Data")

    if data_text is not None and data_text != "":
        # DO STUFF WITH PARSED Data ELEMENT TREE OR TEXT
        data_elem
        data_text
    else:
        print(f"Skipping file {xml} as it does not contain <Data>")
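One caveat worth noting: ET.parse raises xml.etree.ElementTree.ParseError when the input is not well-formed XML, which may well happen with the mixed files described in the question. Also, ".//Data" only searches below the root, so a document whose root element is itself <Data> needs a separate check. The following is a minimal sketch of guarding against both cases, assuming the content is already available as a string; the helper name get_data_text is hypothetical and not part of the answer above.

```python
import xml.etree.ElementTree as ET


def get_data_text(xml_string):
    """Return the text of the first <Data> element, or None if there is none.

    Hypothetical helper for illustration only.
    """
    try:
        root = ET.fromstring(xml_string)
    except ET.ParseError:
        # Not well-formed XML at all, so treat it like a file without <Data>
        return None
    # ".//Data" only searches descendants, so check the root element itself too
    data_elem = root if root.tag == "Data" else root.find(".//Data")
    if data_elem is None:
        return None
    return data_elem.text


print(get_data_text("<root><Data>42</Data></root>"))  # 42
print(get_data_text("<root></root>"))                 # None
print(get_data_text("no xml here"))                   # None
```

In the loop above, the same try/except could wrap ET.parse(xml) so that unparseable files are skipped instead of stopping the run.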

Ada · Accepted Answer · 2023-09-01 14:49:04Z

You can use a try/except block to catch the IndexError that is raised when indexing the result of split() fails because there is no XML data in the file.

import glob

path = r"C:\Users\Nathan\Desktop\Test\*.xml"

for xml in glob.glob(path):
    with open(xml, 'r') as data_file:
        file_content = data_file.read()

        try:
            xml_part1 = file_content.split("<Data", 1)[1]
            xml_part2 = xml_part1.split("Data>", 1)[0]
            xml_data = "<Data" + xml_part2 + "Data>"
            
            # Do stuff with your xml data
            
        except IndexError:
            print(f"Skipping file {xml} as it does not contain <Data>")
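Note that the IndexError only covers a missing opening marker: if "<Data" is present but "Data>" never follows, the second split() returns the whole remainder and no exception is raised. A possible alternative, sketched here with str.find and the same plain-string markers, checks both markers explicitly; the helper name extract_data_block is hypothetical.

```python
def extract_data_block(text):
    """Return the '<Data ... Data>' slice of text, or None if a marker is missing.

    Hypothetical helper; uses the same string markers as the answer above.
    """
    start = text.find("<Data")
    if start == -1:
        # No opening marker in this file
        return None
    # Search for the closing marker only after the opening one
    end = text.find("Data>", start + len("<Data"))
    if end == -1:
        # Opening marker without a closing marker
        return None
    return text[start:end + len("Data>")]


print(extract_data_block("header\n<Data><a/></Data>\ntrailer"))  # <Data><a/></Data>
print(extract_data_block("no markers"))                          # None
```

A caller can then skip the file whenever the helper returns None.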

Barmar · Accepted Answer · 2023-09-01 14:52:08Z

Use a regexp to get the XML part of the file, then test if it matched anything.

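import glob
import re

path = r"C:\Users\Nathan\Desktop\Test\*.xml"

for xml in glob.glob(path):
    with open(xml, 'r') as data_file:
        file_content = data_file.read()
        # re.DOTALL lets "." match newlines, so a multi-line XML block is found too
        match = re.search(r'<Data.*?Data>', file_content, re.DOTALL)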
        if match:
            xml_data = match.group()
            # do stuff with xml_data
        else:
            print(f"Skipping file {xml} as it does not contain <Data>")
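The regular expression and the parser-based approach can also be combined: extract the fragment with a regex, then hand it to ElementTree so the rest of the work happens on real elements. This is only a sketch under a few assumptions: the pattern is slightly stricter than the one above (it requires a closing </Data> tag), the block is not nested or self-closing, and the name parse_data_block is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# re.DOTALL lets "." match newlines, so a multi-line <Data>...</Data> block is found
DATA_BLOCK = re.compile(r"<Data\b.*?</Data>", re.DOTALL)


def parse_data_block(file_content):
    """Return the parsed <Data> element, or None if it is absent or malformed."""
    match = DATA_BLOCK.search(file_content)
    if match is None:
        return None
    try:
        return ET.fromstring(match.group())
    except ET.ParseError:
        # The fragment was found but is not well-formed XML
        return None


elem = parse_data_block("log line\n<Data>\n  <value>1</value>\n</Data>\nmore text")
print(elem.findtext("value"))  # 1
```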

Hermann12 · Accepted Answer · 2023-09-01 20:03:49Z

You can define an XSD and validate against it with the xmlschema package. As an example, I have a valid XML document, an invalid XML document, and a plain text string.

import xmlschema
from io import StringIO

dtat_xsd = """<?xml version="1.0" encoding="utf-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Data">
  <xs:complexType>
  <xs:sequence>
  </xs:sequence>
  </xs:complexType>
  </xs:element>
</xs:schema>"""
xsd = StringIO(dtat_xsd)

vali = """<?xml version="1.0" encoding="utf-8"?>
<Data>
<others />
</Data>"""
f = StringIO(vali)

vali1 = """<?xml version="1.0" encoding="utf-8"?>
<root>
</root>"""
f1 = StringIO(vali1)

tex = """Any text"""

schema = xmlschema.XMLSchema(xsd)
try:
    v = schema.is_valid(f)
    print(v)
    vf = schema.is_valid(f1)
    print(vf)
    te = schema.is_valid(tex)
    print(te)
except Exception:
    print("EXCEPTION: Can't be parsed")

Output:

True
False
EXCEPTION: Can't be parsed
