I have a data structure such as the following:

<?xml version='1.0' encoding='UTF-8'?>
<corpus name="corpus">
  <recording audio="audio.wav" name="first audio">
    <segment name="1" start="0" end="2">
        <orth>some text 1</orth>
    </segment>
    <segment name="2" start="2" end="4">
        <orth>some text 2</orth>
    </segment>
    <segment name="3" start="4" end="6">
        <orth>some text 3</orth>
    </segment>
  </recording>
</corpus>

Given an input file containing segment names, one per line, such as

1
3

it should remove the segments that have those names. For example, 1 and 3 were given, so the segments named 1 and 3 have been removed:

<?xml version='1.0' encoding='UTF-8'?>
<corpus name="corpus">
  <recording audio="audio.wav" name="first audio">
    <segment name="2" start="2" end="4">
        <orth>some text 2</orth>
    </segment>
  </recording>
</corpus>

The code I have so far:

# Read the XML file as a list of lines without trailing newlines
with open('file.txt', 'r') as input_file:
    w_file = input_file.read().splitlines()

# Read the segment names to delete, one per line
with open('to_delete_nums.txt', 'r') as delete_file:
    d_file = delete_file.read().splitlines()

for line in w_file:
    if "<segment name" in line:
        for d in d_file:
            # if the segment name is equal to d, then delete that segment
            pass

How do I accomplish this? I also think having 2 might be unnecessary; is that correct?

  • Why not use an XML parsing/manipulation library? Commented Jan 24, 2021 at 4:25
  • What exact output do you want to get? Please provide that data. Commented Jan 24, 2021 at 4:31
  • Better to use lxml or BeautifulSoup to parse the XML and work with the elements in the tree. Commented Jan 24, 2021 at 4:59

1 Answer

Method 1 (with a module):

As @iain-shelvington said, an XML parsing/manipulation library makes this simple and fast.

Try this with the lxml module and XPath:

import lxml.etree as et

xml = """<?xml version='1.0' encoding='UTF-8'?>
<corpus name="corpus">
  <recording audio="audio.wav" name="first audio">
    <segment name="1" start="0" end="2">
        <orth>some text 1</orth>
    </segment>
    <segment name="2" start="2" end="4">
        <orth>some text 2</orth>
    </segment>
    <segment name="3" start="4" end="6">
        <orth>some text 3</orth>
    </segment>
  </recording>
</corpus>"""
tree = et.XML(xml.encode())
find_segments = tree.xpath("*//segment[@name='1' or @name='2']") # you can add more segments here

for each_segment in find_segments:
    each_segment.getparent().remove(each_segment)

# Ask for UTF-8 explicitly so the declaration says UTF-8, then decode the bytes to text
clean_content = et.tostring(tree, pretty_print=True, xml_declaration=True, encoding="UTF-8").decode("utf-8")
print(clean_content)
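
If installing lxml is not an option, roughly the same removal can be sketched with the standard library's xml.etree.ElementTree. This is a swapped-in technique, not the lxml approach: ElementTree elements have no getparent(), so the sketch visits every element and removes the matching children from it. The names remove_segments and SAMPLE are hypothetical and only there for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, shortened from the XML in the question
SAMPLE = """<?xml version='1.0' encoding='UTF-8'?>
<corpus name="corpus">
  <recording audio="audio.wav" name="first audio">
    <segment name="1" start="0" end="2"><orth>some text 1</orth></segment>
    <segment name="2" start="2" end="4"><orth>some text 2</orth></segment>
    <segment name="3" start="4" end="6"><orth>some text 3</orth></segment>
  </recording>
</corpus>"""


def remove_segments(xml_text, names):
    """Return xml_text without the <segment> elements whose name is in names."""
    names = set(names)
    root = ET.fromstring(xml_text.encode("utf-8"))
    # No getparent() here, so treat every element as a potential parent.
    # list(...) takes snapshots because the tree is modified while looping.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "segment" and child.get("name") in names:
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")


print(remove_segments(SAMPLE, ["1", "3"]))
```

Removing an element also drops the whitespace that followed it, so the indentation of the result may differ slightly from the input.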

Some credits to @cédric-julien, @Sheena, @xyz, @josh-allemon and these questions:

  1. how to remove an element in lxml
  2. Using an OR condition in Xpath to identify the same element
  3. lxml.etree.XML ValueError for Unicode string

Method 2 (Hard Code):

xml = """<?xml version='1.0' encoding='UTF-8'?>
<corpus name="corpus">
  <recording audio="audio.wav" name="first audio">
    <segment name="1" start="0" end="2">
        <orth>some text 1</orth>
    </segment>
    <segment name="2" start="2" end="4">
        <orth>some text 2</orth>
    </segment>
    <segment name="3" start="4" end="6">
        <orth>some text 3</orth>
    </segment>
  </recording>
</corpus>"""

lines = []
toggle = True
for each_line in xml.splitlines():
    if each_line.strip().startswith("<segment") and ('name="1"' in each_line or 'name="2"' in each_line):
        toggle = False
    elif each_line.strip().startswith("</segment>") and toggle is False:
        toggle = True
    elif toggle:
        lines.append(each_line)

new_xml = "\n".join(lines)
print(new_xml)
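
The same line-based idea can be generalized to any collection of names. This is a minimal sketch, assuming (as in the sample) that segments are not nested and that each opening segment tag fits on one line. It reads the name attribute with a regular expression instead of hard-coding each value. The names drop_segments and SEGMENT_OPEN are hypothetical.

```python
import re

# Matches an opening <segment ...> tag and captures the value of its name attribute
SEGMENT_OPEN = re.compile(r'<segment\b[^>]*\sname="([^"]*)"')


def drop_segments(xml_text, names):
    """Skip every line from a matching <segment ...> up to its </segment>."""
    names = set(names)
    kept = []
    skipping = False
    for line in xml_text.splitlines():
        if skipping:
            # Inside a segment that is being removed: stop after its closing tag
            if "</segment>" in line:
                skipping = False
            continue
        match = SEGMENT_OPEN.search(line)
        if match and match.group(1) in names:
            # Keep skipping unless the segment also closes on this same line
            skipping = "</segment>" not in line
            continue
        kept.append(line)
    return "\n".join(kept)
```

Calling drop_segments(xml, {"1", "3"}) on the sample string keeps only the segment named 2. Like any text-based approach, this breaks as soon as the layout of the XML changes, so a parser remains the safer choice.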

If you want to read the names from a file, try this:

from lxml import etree

with open("xml.txt", "r") as xml_file:
    xml_data = xml_file.read()

with open('nums.txt', 'r') as file:
    list_of_names = file.read().split("\n")

new_xml = xml_data
for each_name in list_of_names:
    tree = etree.XML(new_xml.encode())
    find_segments = tree.xpath("*//segment[@name='{}']".format(each_name))
    for each_segment in find_segments:
        each_segment.getparent().remove(each_segment)
    new_xml = etree.tostring(tree, pretty_print=True, xml_declaration=True, encoding="UTF-8").decode("utf-8")

print(new_xml)

Much Shorter:

from lxml import etree

with open("xml.txt", "r") as xml_file:
    tree = etree.XML(xml_file.read().encode())

with open('nums.txt', 'r') as file:
    list_of_names = list(set(file.read().split("\n")))

xpath = "*//segment[{}]".format(" or ".join(["@name='{}'".format(each_name) for each_name in list_of_names]))

print(xpath)
for each_segment in tree.xpath(xpath):
    each_segment.getparent().remove(each_segment)
new_xml = etree.tostring(tree, pretty_print=True, xml_declaration=True, encoding="UTF-8").decode("utf-8")

print(new_xml)
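
One detail worth guarding when the names come from a file: a trailing newline or blank line yields an empty name, and if the blanks are filtered out, an empty file would produce the invalid expression `*//segment[]`. A small sketch of the name loading and expression building, using hypothetical helper names load_names and build_xpath, and assuming the names contain no single quotes:

```python
def load_names(path):
    """Read one segment name per line, skipping blank lines and stray whitespace."""
    with open(path, encoding="utf-8") as handle:
        return sorted({line.strip() for line in handle if line.strip()})


def build_xpath(names):
    """Build an '@name=... or @name=...' expression, or None if nothing is to be removed."""
    if not names:
        return None
    # Assumption: names contain no single quotes, which would break the expression
    conditions = " or ".join("@name='{}'".format(name) for name in names)
    return "*//segment[{}]".format(conditions)


# build_xpath(["1", "3"]) returns "*//segment[@name='1' or @name='3']"
```

The returned string can be passed to tree.xpath(...) as in the code above; when it is None, the removal loop can simply be skipped.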

18 Comments

Thank you for your comment! @name='1' or @name='2' seems like it would be a lot of manual input. Is there a way to automatically read those from a file? In the question, I say that there is already a file containing the names one per line.
@JosephKars: Yes, wait, I will write it.
@JosephKars: Check my update and let me know.
@JosephKars: Updated, check again. This one is much shorter than the previous one.
Using your last code, I got TypeError: str() takes at most 1 argument (2 given)