
I have an XML structure in which some elements are not unique. I managed to sort the subtrees, and I can correctly detect the elements that occur more than once, but the remove function does not seem to take effect.

Simplified, my XML structure looks like this:

<root>
  <page>
    <text>blabla blub unique</text>
    <text>blabla blub not unique</text>
    <text>blabla blub not unique</text><!-- line should be removed -->
    <text>blabla blub not unique</text><!-- line should be removed -->
    <text>blabla blub not unique</text><!-- line should be removed -->
    <text>blabla blub again unique</text>
  </page>
  <page>
    <text>2nd blabla blub unique</text>
    <text>2nd blabla blub not unique</text>
    <text>2nd blabla blub not unique</text><!-- line should be removed -->
    <text>2nd blabla blub again unique</text>
  </page>
</root>

I want to remove the repeated strings on each page, so I iterate over the pages and over the elements of each page in two nested for loops (an extract of the important lines; I hope I didn't forget anything):

import xml.etree.ElementTree as ET
self.tree = ET.parse(path)
self.root = self.tree.getroot()
self.prev = None
# [...]
for page in self.root:                     # iterate over pages
    for elem in page:
        if elements_equal(elem, self.prev):
            print("found duplicate: %s" % elem.text)   # equal function works well
            page.remove(elem) # <---- removes just one line
            continue
        self.prev = elem
# [...]
self.tree.write("out.xml") # 2 duplicate lines still there....

Update: the code does seem to work, but it removes just one duplicate, not all of them.

  • Did you try root.remove(elem) instead of page.remove(elem)? Commented Dec 18, 2014 at 15:14
  • I think it's a list; if so, try making it a set and see whether the duplicates are removed. I guess it boils down to how the eq method is implemented for a node, if at all. Commented Dec 18, 2014 at 15:14
  • It should be an element of the xmlTree object, but I have no clue how it is implemented. When I try to remove from root, it says: ValueError: list.remove(x): x not in list Commented Dec 18, 2014 at 15:23
  • How do I make it a set? What do you mean by eq method? @omu_negru Commented Dec 18, 2014 at 15:25
  • Well, just calling set(your_list), or passing any iterator for that matter, should do the trick. To check whether the eq method is properly implemented, get the second and third nodes and see whether second == third returns True (it should). Commented Dec 18, 2014 at 15:29
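
A quick check of the set / == suggestion from the comments, as a minimal sketch on a made-up two-element page: xml.etree.ElementTree.Element does not define its own equality, so == falls back to object identity, and a set built from the children keeps both nodes even when their content matches. The variable names here are only illustrative.

```python
import xml.etree.ElementTree as ET

# Two children with identical tag, text and attributes (made-up sample data).
page = ET.fromstring("<page><text>same</text><text>same</text></page>")
first, second = page[0], page[1]

# Element objects compare by identity, so two distinct nodes are never ==.
nodes_equal = first == second          # False
# For the same reason a set does not collapse them.
distinct_nodes = len(set(page))        # 2
# Comparing the content explicitly is what actually detects the duplicate.
content_equal = (first.tag, first.text, first.attrib) == (
    second.tag, second.text, second.attrib)  # True

print(nodes_equal, distinct_nodes, content_equal)
```

This is why an explicit comparison function such as elements_equal is needed rather than set() or ==.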

1 Answer

I don't know how you've defined elements_equal, but this version (shamelessly adapted from "Testing Equivalence of xml.etree.ElementTree") works for me:

EDIT: Store the elements to be removed in a list while iterating over page, and remove them afterwards, rather than removing them inside the same loop.

EDIT: Noticed a small typo in the comparison of the element tags and corrected it.

import xml.etree.ElementTree as ET

path = 'in.xml'

tree = ET.parse(path)
root = tree.getroot()
prev = None

def elements_equal(e1, e2):
    if type(e1) != type(e2):
        return False
    if e1.tag != e2.tag: return False
    if e1.text != e2.text: return False
    if e1.tail != e2.tail: return False
    if e1.attrib != e2.attrib: return False
    if len(e1) != len(e2): return False
    return all([elements_equal(c1, c2) for c1, c2 in zip(e1, e2)])

for page in root:                     # iterate over pages
    prev = None                       # compare only within the current page
    elems_to_remove = []
    for elem in page:
        if elements_equal(elem, prev):
            print("found duplicate: %s" % elem.text)
            elems_to_remove.append(elem)   # only record it here, remove later
            continue
        prev = elem
    # remove after the loop, so the iteration over page is not disturbed
    for elem_to_remove in elems_to_remove:
        page.remove(elem_to_remove)
# [...]
tree.write("out.xml")

Gives:

$ python undupe.py
found duplicate: blabla blub not unique
found duplicate: 2nd blabla blub not unique
$ cat out.xml
<root>
  <page>
    <text>blabla blub unique</text>
    <text>blabla blub not unique</text>
    <text>blabla blub again unique</text>
  </page>
  <page>
    <text>2nd blabla blub unique</text>
    <text>2nd blabla blub not unique</text>
    <text>2nd blabla blub again unique</text>
  </page>
</root>
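
As an alternative to collecting the elements first, the same idea can be sketched in a single pass by iterating over a snapshot of the children, list(page), while removing from the live page. This is only a sketch under assumptions: the helper names same_content and drop_adjacent_duplicates are hypothetical, the comparison is deliberately simpler than elements_equal (tag, text and attributes only, no tail or children), and the XML is the sample from the question without its comments.

```python
import xml.etree.ElementTree as ET

XML = """<root>
  <page>
    <text>blabla blub unique</text>
    <text>blabla blub not unique</text>
    <text>blabla blub not unique</text>
    <text>blabla blub not unique</text>
    <text>blabla blub not unique</text>
    <text>blabla blub again unique</text>
  </page>
  <page>
    <text>2nd blabla blub unique</text>
    <text>2nd blabla blub not unique</text>
    <text>2nd blabla blub not unique</text>
    <text>2nd blabla blub again unique</text>
  </page>
</root>"""


def same_content(a, b):
    """Hypothetical, simplified comparison: tag, text and attributes only."""
    return a.tag == b.tag and a.text == b.text and a.attrib == b.attrib


def drop_adjacent_duplicates(root):
    """Remove children that repeat the previously kept sibling, page by page."""
    for page in root:
        prev = None                    # never compare across page boundaries
        for elem in list(page):        # snapshot: safe to remove from page
            if prev is not None and same_content(elem, prev):
                page.remove(elem)
            else:
                prev = elem


root = ET.fromstring(XML)
drop_adjacent_duplicates(root)
result = [[t.text for t in page] for page in root]
print(result)
```

Because the loop walks the copy, removing from page never shifts the positions the loop still has to visit.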

6 Comments

The equals function works properly, that's not the problem. I have a similar one, thanks.
@abimelex: then there is something else I don't understand about your problem. I've posted my full working code in case it helps.
Haha, OK, I was thinking in the wrong direction... the program works, and so does your code with the example. The problem shows up with my data, where an element is repeated more than twice, like 4 times the same row. Both versions somehow fail to delete all duplicates... don't know why... @xnx I updated my question.
Ah yes - it's not working because you're (we're) removing the text elements within the iteration over page's children. This won't work because removing an element shifts the remaining children, so the iteration skips elements and comes to a premature halt.
Oh, okay... I don't understand the reason for this behavior, but I figured out a solution... I will just edit your answer and then accept it, if that's okay with you.
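
The effect discussed in these comments can be reproduced in isolation. A minimal sketch on made-up data (four identical children followed by a different one): iteration over an Element proceeds by position, so each removal moves the next child into the slot that was just visited and the loop steps over it.

```python
import xml.etree.ElementTree as ET

# Made-up sample: four identical children followed by a different one.
page = ET.fromstring(
    "<page><text>a</text><text>a</text><text>a</text><text>a</text>"
    "<text>b</text></page>"
)

prev = None
for elem in page:                      # iterating the live element (the bug)
    if prev is not None and elem.text == prev.text:
        page.remove(elem)              # shifts the later children one slot left
        continue
    prev = elem

remaining = [e.text for e in page]
print(remaining)                       # ['a', 'a', 'b']: one duplicate survives
```

Only every second duplicate is removed, and the loop ends before it ever reaches the last child, which matches the "not all duplicates are deleted" symptom.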
