1

I am analyzing the XML data in several files. To get at the data, I first need to separate the XML part from the rest of each file.

For this I use the split() method and search for <Data.

Here I run into a problem: some of the files contain no XML data, and I would like to simply skip those files.

path = r"C:\Users\Nathan\Desktop\Test\*.xml"

for xml in glob.glob(path):
    with open(xml) as data_file:
        file_content = data_file.read

        xml_part1 = file_content.split("<Data",1)[1]  # here I get an error if "<Data" is not in the file
        xml_part2 = file_content.split("Data>",1)[0]
        xml_file = "<Data" + xml_part2+"Data>"

I would be very grateful for any help.

6
  • stackoverflow.com/a/4592220/9296093 Commented Sep 1, 2023 at 14:45
  • How about simple condition: if '<Data' in file_content: Commented Sep 1, 2023 at 14:46
  • Or use a regular expression <Data.*?Data> to get the XML portion of the file. Commented Sep 1, 2023 at 14:48
  • You're missing the () after .read Commented Sep 1, 2023 at 14:49
  • This is just horrible. Mixing XML and non-XML syntax in a single file is making your life really difficult. Much better to use XML for the whole thing - or at least embed it cleanly in some other language like JSON that you can parse with a proper parser. Commented Sep 1, 2023 at 20:18
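
The existence check suggested in the comments could be sketched as follows. This is only an illustration: `extract_data_block` is a hypothetical helper name, and the sketch assumes the block is closed by a literal `</Data>`. It uses `str.find`, which performs the same test as `'<Data' in file_content` but also gives the position of the match.

```python
from typing import Optional


def extract_data_block(text: str) -> Optional[str]:
    """Return the text from '<Data' through the next '</Data>', or None."""
    start = text.find("<Data")
    if start == -1:
        return None  # no opening tag: the caller can skip this file
    end_tag = "</Data>"  # assumption: the block ends with this literal tag
    end = text.find(end_tag, start)
    if end == -1:
        return None  # opening tag without a matching closing tag
    return text[start:end + len(end_tag)]
```

Inside the loop, a file would be skipped whenever `extract_data_block(file_content)` returns `None`.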

4 Answers

1

For future readers: avoid treating XML as plain text, as the OP does, rather than as a markup document made up of elements and attributes. Consider using a standards-compliant library such as Python's built-in xml.etree.ElementTree or the third-party lxml to parse the relevant content.

Specifically, the XPath expression ".//Data" locates this element in the document: find returns the element (or None if it is absent), and findtext returns its text (or None if the element is absent).

import xml.etree.ElementTree as ET
import glob

path = r"C:\Users\Nathan\Desktop\Test\*.xml"

for xml in glob.glob(path):
    # PARSE DOCUMENT INTO XML TREE
    tree = ET.parse(xml)

    # SEARCH ELEMENT AND TEXT
    data_elem = tree.find(".//Data")
    data_text = tree.findtext(".//Data")

    if data_text is not None and data_text != "":
        # DO STUFF WITH PARSED Data ELEMENT TREE OR TEXT
        data_elem
        data_text
    else:
        print(f"Skipping file {xml} as it does not contain <Data>")
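
Two caveats apply to the snippet above: `ET.parse` raises `ET.ParseError` when a file is not well-formed XML, and `.//Data` matches descendants of the root but not the root element itself. A minimal sketch that covers both cases, assuming each file is either a complete XML document or not XML at all (`find_data` is a hypothetical helper name):

```python
import xml.etree.ElementTree as ET
from typing import Optional


def find_data(xml_text: str) -> Optional[ET.Element]:
    """Return the first Data element in xml_text, or None if absent or unparsable."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return None  # not well-formed XML: treat as "no data"
    if root.tag == "Data":
        return root  # the root itself is the Data element
    return root.find(".//Data")  # first descendant named Data, or None
```

In the loop, each file's content would be read and the file skipped when `find_data(...)` returns `None`.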

0

You can use a try/except block to catch the IndexError raised when "<Data" is not in the file: in that case split() returns a single-element list, so indexing it with [1] fails.

import glob

path = r"C:\Users\Nathan\Desktop\Test\*.xml"

for xml in glob.glob(path):
    with open(xml, 'r') as data_file:
        file_content = data_file.read()

        try:
            xml_part1 = file_content.split("<Data", 1)[1]
            xml_part2 = xml_part1.split("Data>", 1)[0]
            xml_data = "<Data" + xml_part2 + "Data>"
            
            # Do stuff with your xml data
            
        except IndexError:
            print(f"Skipping file {xml} as it does not contain <Data>")

0

Use a regexp to get the XML part of the file, then test if it matched anything.

import glob
import re

path = r"C:\Users\Nathan\Desktop\Test\*.xml"

for xml in glob.glob(path):
    with open(xml, 'r') as data_file:
        file_content = data_file.read()
        # re.DOTALL lets "." match newlines, so multi-line XML blocks are found
        match = re.search(r'<Data.*?Data>', file_content, re.DOTALL)
        if match:
            xml_data = match.group()
            # do stuff with xml_data
        else:
            print(f"Skipping file {xml} as it does not contain <Data>")
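
Once a fragment has been matched, it can be handed to a real parser instead of being processed as a string. The sketch below is one possible way to do that, assuming the block is closed by a literal `</Data>` and is well-formed on its own; `parse_embedded_data` and `DATA_BLOCK` are hypothetical names. `re.DOTALL` lets `.` match newlines so that multi-line blocks are found.

```python
import re
import xml.etree.ElementTree as ET
from typing import Optional

# \b keeps "<Data" from matching longer names such as "<DataSet"
DATA_BLOCK = re.compile(r"<Data\b.*?</Data>", re.DOTALL)


def parse_embedded_data(text: str) -> Optional[ET.Element]:
    """Extract the first <Data>...</Data> block from text and parse it."""
    match = DATA_BLOCK.search(text)
    if match is None:
        return None  # no Data block in this file
    try:
        return ET.fromstring(match.group())
    except ET.ParseError:
        return None  # block found but not well-formed XML
```

A `None` result then signals that the file should be skipped, whether the block is missing or malformed.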

0

You can define an XSD and validate against it with xmlschema. As an example, I test a valid XML document, an invalid XML document, and a plain text string.

import xmlschema
from io import StringIO

dtat_xsd = """<?xml version="1.0" encoding="utf-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Data">
  <xs:complexType>
  <xs:sequence>
  </xs:sequence>
  </xs:complexType>
  </xs:element>
</xs:schema>"""
xsd = StringIO(dtat_xsd)

vali = """<?xml version="1.0" encoding="utf-8"?>
<Data>
<others />
</Data>"""
f = StringIO(vali)

vali1 = """<?xml version="1.0" encoding="utf-8"?>
<root>
</root>"""
f1 = StringIO(vali1)

tex = """Any text"""

schema = xmlschema.XMLSchema(xsd)
try:
    v = schema.is_valid(f)
    print(v)
    vf = schema.is_valid(f1)
    print(vf)
    te = schema.is_valid(tex)
    print(te)
except:
    print("EXCEPTION: Can't be parsed")

Output:

True
False
EXCEPTION: Can't be parsed
