How to loop over a file and delete parts of the file in Python?

Question

I have a data structure such as the following:

<?xml version='1.0' encoding='UTF-8'?>
<corpus name="corpus">
  <recording audio="audio.wav" name="first audio">
    <segment name="1" start="0" end="2">
        <orth>some text 1</orth>
    </segment>
    <segment name="2" start="2" end="4">
        <orth>some text 2</orth>
    </segment>
    <segment name="3" start="4" end="6">
        <orth>some text 3</orth>
    </segment>
  </recording>
</corpus>

Given an input file containing segment names, one per line, such as

1
3

the script should remove the segments that have those names. For example, since 1 and 3 were given, the segments named 1 and 3 have been removed:

<?xml version='1.0' encoding='UTF-8'?>
<corpus name="corpus">
  <recording audio="audio.wav" name="first audio">
    <segment name="2" start="2" end="4">
        <orth>some text 2</orth>
    </segment>
  </recording>
</corpus>

The code I have so far:

with open('file.txt', 'r') as inputFile:
    # read().splitlines() returns the lines without their trailing '\n'
    w_file = inputFile.read().splitlines()

with open('to_delete_nums.txt', 'r') as deleteFile:
    d_file = deleteFile.read().splitlines()

for line in w_file:
    if "<segment name" in line:
        for d in d_file:
            # if segment name is equal to d then delete that segment.
            pass

How do I accomplish this? I also think having two loops might be unnecessary; is that correct?

Better to use lxml or BeautifulSoup to parse the XML and work with the elements in the tree.
– furas, Commented Jan 24, 2021 at 4:59

DRPK · Accepted Answer · 2021-01-24 07:26:17Z

Method 1 (with a module):

As @iain-shelvington said, an XML parsing/manipulation library lets you do this simply and quickly.

Try this with the lxml module and XPath:

import lxml.etree as et

xml = """<?xml version='1.0' encoding='UTF-8'?>
<corpus name="corpus">
  <recording audio="audio.wav" name="first audio">
    <segment name="1" start="0" end="2">
        <orth>some text 1</orth>
    </segment>
    <segment name="2" start="2" end="4">
        <orth>some text 2</orth>
    </segment>
    <segment name="3" start="4" end="6">
        <orth>some text 3</orth>
    </segment>
  </recording>
</corpus>"""
tree = et.XML(xml.encode())
find_segments = tree.xpath("*//segment[@name='1' or @name='2']") # you can add more segments here

for each_segment in find_segments:
    each_segment.getparent().remove(each_segment)

clean_content = str(et.tostring(tree, pretty_print=True, xml_declaration=True), encoding="utf-8")
print(clean_content)
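
One caveat about the path used above: `*//segment` starts from the children of the root, so it only finds segments nested at least two levels down. That fits the sample, where every segment sits inside a recording, but a segment placed directly under corpus would be skipped, whereas `.//segment` covers every depth. A minimal sketch of the difference, using the standard library's xml.etree.ElementTree instead of lxml so it runs without extra installs (the document in `doc` is made up for illustration and is not from the question):

```python
import xml.etree.ElementTree as ET

# Hypothetical document: one segment directly under the root,
# one nested inside a recording as in the question's sample.
doc = """<corpus name="corpus">
  <segment name="top" start="0" end="1"/>
  <recording audio="audio.wav" name="first audio">
    <segment name="nested" start="1" end="2"/>
  </recording>
</corpus>"""

root = ET.fromstring(doc)

# "*//segment": step into each child of the root, then search below that child.
below_children = [s.get("name") for s in root.findall("*//segment")]

# ".//segment": search every level below the root itself.
all_depths = [s.get("name") for s in root.findall(".//segment")]

print(below_children)  # ['nested']
print(all_depths)      # ['top', 'nested']
```

If the real files always keep segments inside recordings, either path gives the same result.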

Some credits to @cédric-julien, @Sheena, @xyz, @josh-allemon and these questions:

Method 2 (Hard Code):

xml = """<?xml version='1.0' encoding='UTF-8'?>
<corpus name="corpus">
  <recording audio="audio.wav" name="first audio">
    <segment name="1" start="0" end="2">
        <orth>some text 1</orth>
    </segment>
    <segment name="2" start="2" end="4">
        <orth>some text 2</orth>
    </segment>
    <segment name="3" start="4" end="6">
        <orth>some text 3</orth>
    </segment>
  </recording>
</corpus>"""

lines = []
toggle = True
for each_line in xml.splitlines():
    if each_line.strip().startswith("<segment") and ('name="1"' in each_line or 'name="2"' in each_line):
        toggle = False
    elif each_line.strip().startswith("</segment>") and toggle is False:
        toggle = True
    elif toggle:
        lines.append(each_line)

new_xml = "\n".join(lines)
print(new_xml)
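
The hard-coded test for name="1" or name="2" can be generalized to any collection of names. The following is only a sketch of that idea, not part of the original answer: the function name `drop_segment_lines` and the regular expression are illustrative, and it assumes, like the loop above, that each `<segment ...>` opening tag and each `</segment>` closing tag sits on its own line with double-quoted attributes.

```python
import re

# Captures the value of the name attribute on a <segment ...> opening tag.
SEGMENT_OPEN = re.compile(r'<segment\b[^>]*\bname="([^"]*)"')


def drop_segment_lines(xml_text, names):
    """Line-based removal of the segments whose name is in `names`."""
    names = set(names)
    kept = []
    skipping = False
    for line in xml_text.splitlines():
        stripped = line.strip()
        if skipping:
            # Inside a segment being dropped: discard lines up to its closing tag.
            if stripped.startswith("</segment>"):
                skipping = False
            continue
        match = SEGMENT_OPEN.match(stripped)
        if match and match.group(1) in names:
            skipping = True
            continue
        kept.append(line)
    return "\n".join(kept)


sample = """<?xml version='1.0' encoding='UTF-8'?>
<corpus name="corpus">
  <recording audio="audio.wav" name="first audio">
    <segment name="1" start="0" end="2">
        <orth>some text 1</orth>
    </segment>
    <segment name="2" start="2" end="4">
        <orth>some text 2</orth>
    </segment>
    <segment name="3" start="4" end="6">
        <orth>some text 3</orth>
    </segment>
  </recording>
</corpus>"""

result = drop_segment_lines(sample, {"1", "3"})
print(result)
```

Because this never parses the XML, it breaks as soon as a tag spans several lines or shares a line with another tag; the parser-based method is the safer choice for real files.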

If you want to read the names from a file, try this:

from lxml import etree

with open("xml.txt", "r") as xml_file:
    xml_data = xml_file.read()

with open('nums.txt', 'r') as file:
    list_of_names = file.read().split("\n")

new_xml = xml_data
for each_name in list_of_names:
    tree = etree.XML(new_xml.encode())
    find_segments = tree.xpath("*//segment[@name='{}']".format(each_name))
    for each_segment in find_segments:
        each_segment.getparent().remove(each_segment)
    new_xml = str(etree.tostring(tree, pretty_print=True, xml_declaration=True), encoding="utf-8")

print(new_xml)
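
A note on reading the names: `file.read().split("\n")` leaves an empty string in the list when the file ends with a newline, and it keeps stray spaces, so a line such as `3 ` would silently match nothing. A small sketch of a more forgiving reader follows; `read_names` is a hypothetical helper, not part of the original answer, and the demo file is a throwaway created only for illustration.

```python
import os
import tempfile


def read_names(path):
    """Return the non-blank lines of the file at `path`, stripped, as a set."""
    with open(path, "r", encoding="utf-8") as names_file:
        return {line.strip() for line in names_file if line.strip()}


# Demo with a temporary file containing a trailing space, a blank line and a repeat.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "nums.txt")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("1\n3 \n\n1\n")
    demo_names = read_names(demo_path)

print(sorted(demo_names))  # ['1', '3']
```

Returning a set also removes duplicates and makes the later "is this name listed?" check a constant-time lookup.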

A much shorter version:

from lxml import etree

with open("xml.txt", "r") as xml_file:
    tree = etree.XML(xml_file.read().encode())

with open('nums.txt', 'r') as file:
    list_of_names = list(set(file.read().split("\n")))

xpath = "*//segment[{}]".format(" or ".join(["@name='{}'".format(each_name) for each_name in list_of_names]))

print(xpath)
for each_segment in tree.xpath(xpath):
    each_segment.getparent().remove(each_segment)
new_xml = str(etree.tostring(tree, pretty_print=True, xml_declaration=True), encoding="utf-8")

print(new_xml)
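
For completeness, here is a file-to-file sketch of the same task that needs only the standard library. It is a swapped-in technique rather than the answer's own: xml.etree.ElementTree has no `getparent()`, so the sketch visits every element and removes matching children from it, and it compares names against a set instead of building an XPath string. The function name `filter_corpus` and the demo file names are assumptions for illustration. Writing with `tree.write()` also avoids the `str(..., encoding="utf-8")` call, which is likely what raised the TypeError quoted in the comments if the code was run under Python 2, where `str` accepts a single argument.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def filter_corpus(xml_path, names_path, out_path):
    """Write a copy of xml_path to out_path without the segments named in names_path."""
    with open(names_path, "r", encoding="utf-8") as names_file:
        names = {line.strip() for line in names_file if line.strip()}

    tree = ET.parse(xml_path)
    # ElementTree elements have no getparent(), so visit each possible parent.
    # list(...) snapshots the elements before the tree is modified.
    for parent in list(tree.iter()):
        for child in list(parent):
            if child.tag == "segment" and child.get("name") in names:
                parent.remove(child)

    # write() takes care of the encoding and the XML declaration.
    tree.write(out_path, encoding="utf-8", xml_declaration=True)


sample = """<?xml version='1.0' encoding='UTF-8'?>
<corpus name="corpus">
  <recording audio="audio.wav" name="first audio">
    <segment name="1" start="0" end="2"><orth>some text 1</orth></segment>
    <segment name="2" start="2" end="4"><orth>some text 2</orth></segment>
    <segment name="3" start="4" end="6"><orth>some text 3</orth></segment>
  </recording>
</corpus>"""

# Demo in a temporary directory so no real files are touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    xml_path = os.path.join(tmp_dir, "xml.txt")
    names_path = os.path.join(tmp_dir, "nums.txt")
    out_path = os.path.join(tmp_dir, "out.xml")
    with open(xml_path, "w", encoding="utf-8") as handle:
        handle.write(sample)
    with open(names_path, "w", encoding="utf-8") as handle:
        handle.write("1\n3\n")
    filter_corpus(xml_path, names_path, out_path)
    remaining = [s.get("name") for s in ET.parse(out_path).iter("segment")]

print(remaining)  # ['2']
```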

edited Jan 24, 2021 at 7:26

answered Jan 24, 2021 at 5:00

18 Comments

Joseph Kars Over a year ago

Thank you for your comment! @name='1' or @name='2' seems like it would be a lot of manual input. Is there a way to automatically read those from a file? In the question, I say that there is already a file containing the names one per line.

DRPK Over a year ago

@JosephKars: Yes, wait, I will write it.

DRPK Over a year ago

@JosephKars: Check my update and let me know.

DRPK Over a year ago

@JosephKars: Updated, check again. This one is much shorter than the previous one.

Joseph Kars Over a year ago

Using your last code, I got TypeError: str() takes at most 1 argument (2 given)
