
I want to parse a UTF-8 XML file and sort it by some field. Sorting uses a custom alphabet (s1 in the source code). The history of this question is here: sorting of list containing utf-8 characters. I found how to sort XML here. The sorting itself works correctly; the problem is with elementtree, which I must admit doesn't work on Python 3.

Here is the source code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
#import xml.etree.ElementTree as ET   # Python 2.5
import elementtree.ElementTree as ET
s1='aáàAâÂbBcCçÇdDeéEfFgGğĞhHiİîÎíīıIjJkKlLmMnNóoOöÖpPqQrRsSşŞtTuUûúÛüÜvVwWxXyYzZ'
s2='11111122334455666aabbccddeeeeeeffgghhiijjkklllllmmnnooppqqrrsssssttuuvvwwxxyy'
trans = str.maketrans(s1, s2)
def unikey(seq):
    return seq[0].translate(trans)
tree = ET.parse("tosort.xml")
container = tree.find("entries")
data = []
for elem in container:
    keyd = elem.findtext("k")
    data.append((keyd, elem))
print (data)
data.sort(key=unikey)
print (data)
container[:] = [item[-1] for item in data]
tree.write("sorted.xml", encoding="utf-8")

Here are the instructions for importing the elementtree module. When I import the module this way: import xml.etree.ElementTree as ET, I get this message:

Traceback (most recent call last):
File "pcs.py", line 19, in <module>
container[:] = [item[-1] for item in data]
File "/usr/lib/python3.1/xml/etree/ElementTree.py", line 210, in __setitem__
assert iselement(element)
AssertionError

When I use this method to import: import elementtree.ElementTree as ET, I get this message:

Traceback (most recent call last):
File "pcs.py", line 4, in <module>
import elementtree.ElementTree as ET
File "/usr/local/lib/python3.1/dist-packages/elementtree/ElementTree.py", line 794, in <module>
_escape = re.compile(eval(r'u"[&<>\"\u0080-\uffff]+"'))
File "<string>", line 1
u"[&<>\"\u0080-\uffff]+"
                       ^
SyntaxError: invalid syntax

I use Python 3.1.3 (r313:86834, Nov 28 2010, 11:28:10). In Python 2.6, elementtree works without a problem.

Content of tosort.xml:

<xdxf>
<entries>
<ar><k>zaaaa</k>definition1</ar>
<ar><k>şaaaa</k>definition2</ar>
...
...
</entries>
</xdxf>
  • The first code block has indentation problems inside for, could you fix that to match the actual code you run? Commented Jun 22, 2012 at 18:13
  • Also, I think the problem could be that s2 still contains non-ASCII characters, and those mess up the sorting. Commented Jun 22, 2012 at 18:42
  • Oh, sorry. I've fixed that. The second code with non-ASCII characters works well. I think that there is something wrong with the input file encoding, but I can't figure it out. Commented Jun 22, 2012 at 20:05
  • I've managed to solve the sorting problem. Thank you @Lev Levitsky. I've removed all non-ASCII characters from the s2 string. Commented Jun 24, 2012 at 9:06
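
For readers following the comments: one way to avoid maintaining s2 by hand is to rank each character by its position in s1. This is only a minimal sketch of that alternative, not the code used in the question. The names ALPHABET, RANK and alphabet_key are hypothetical, and it assumes that characters missing from the alphabet should sort after all known ones. Unlike s2, it gives every character its own rank instead of grouping variants such as a/á/à together.

```python
# Custom alphabet, copied from s1 in the question.
ALPHABET = 'aáàAâÂbBcCçÇdDeéEfFgGğĞhHiİîÎíīıIjJkKlLmMnNóoOöÖpPqQrRsSşŞtTuUûúÛüÜvVwWxXyYzZ'

# Hypothetical helper table: a character's rank is its position in ALPHABET.
RANK = {ch: i for i, ch in enumerate(ALPHABET)}

def alphabet_key(word):
    # Assumption: characters outside ALPHABET sort after all known ones,
    # ordered among themselves by code point.
    return [(0, RANK[ch]) if ch in RANK else (1, ord(ch)) for ch in word]

words = ['zaaaa', 'şaaaa', 'saaaa', 'çaaaa', 'caaaa']
sorted_words = sorted(words, key=alphabet_key)
print(sorted_words)
```

A plain sorted(words) would compare code points and place 'ş' and 'ç' after 'z'; the key above follows the order of the alphabet string instead.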

2 Answers


It looks like you are importing two different modules: one in /usr/lib/python3.1 called xml.etree, and the other in /usr/local/lib/python3.1/dist-packages called elementtree. The latter seems broken to me. As for the former, try removing [:] in the line

 container[:] = [item[-1] for item in data]

8 Comments

Removing [:] didn't help. This line of code is from an example. The module that raises the AssertionError seems to work on Python 2.6. Maybe someone could tell me how to do my string translation in Python 2.6? Thank you!
@microspace If it didn't help, can you show how the traceback looks without [:]?
I've edited the question. I produced the traceback with the following command: print (traceback.format_exc()). Is that correct? I've never printed a traceback before... Without [:] the sorted data just wasn't written to the file...
@microspace You said removing [:] didn't help. With [:] there was an error on that line, and the interpreter printed the traceback (ending with AssertionError). What happens when you remove [:]? You don't need to print the traceback manually.
Without [:] the program ended without errors and the sorting was done, but unsorted data was written to the output file.
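
The behaviour described in the last comment is consistent with how name binding works in Python: without [:] the assignment only points the name container at a new list and never touches the tree. The sketch below illustrates the difference on a tiny in-memory document. It assumes an interpreter whose xml.etree accepts slice assignment (the traceback in the question shows that 3.1.3 rejects it), and it uses plain string comparison as the sort key for brevity.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<entries><ar><k>b</k></ar><ar><k>a</k></ar></entries>")

# Rebinding: 'container' now names a plain list; the tree under 'root' is untouched.
container = root
container = sorted(container, key=lambda e: e.findtext("k"))
order_after_rebinding = [e.findtext("k") for e in root]

# Slice assignment: replaces the children of the element itself.
container = root
container[:] = sorted(container, key=lambda e: e.findtext("k"))
order_after_slice = [e.findtext("k") for e in root]

print(order_after_rebinding, order_after_slice)
```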

Don't punch me too hard, but here is my variant of a solution:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import xml.etree.ElementTree as ET # Python 2.5
from xml.etree.ElementTree import Element
s1="áàaAâÂbBcCçÇdDeéEfFgGğĞhHiİîÎíīıIjJkKlLmMnNóoOöÖpPqQrRsSşŞtTuUûúÛüÜvVwWxXyYzZ"
s2="AAAAAABBCCCCDDEEEFFGGHHddeeeeeeffgghhiijjkklllllmmnnooppqqrrsssssttuuvvwwxxyy"
trans = str.maketrans(s1, s2)
def unikey(seq):
    return seq[0].translate(trans)
tree = ET.parse("tosort.xml")
container = tree.find("entries")
data = []
for elem in container:
    keyd = elem.findtext("k")
    data.append([keyd, elem])
data.sort(key=unikey)
root = tree.getroot()
for _key, elem in data:
    root.append(elem) # appends sorted Element objects to tree
#container = [item[-1] for item in data]
root.remove(tree.find("entries")) # removes unsorted Element objects
tree.write("sorted.xml", encoding="utf-8")

The solution is a bit ugly, but it works... I don't know how long it will take to sort ~50 MB of XML data, but time is not important in my case. I've also changed the sorting pattern a bit, because it sorted wrongly if there were digits in the words. On an Acer Extensa 5210 it took no more than 2 minutes to sort.
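
Note that the code above moves the sorted ar elements directly under the root and drops the entries wrapper. If the wrapper should be kept, one possible variation is to detach the children of entries and re-attach them in sorted order, which avoids the slice assignment that failed on 3.1. This is a sketch under assumptions, not a tested replacement for the script: it works on a small in-memory document instead of tosort.xml, and it uses plain string comparison where the real script would use unikey.

```python
import xml.etree.ElementTree as ET

xml_text = (
    "<xdxf><entries>"
    "<ar><k>zaaaa</k>definition1</ar>"
    "<ar><k>saaaa</k>definition2</ar>"
    "</entries></xdxf>"
)
root = ET.fromstring(xml_text)
container = root.find("entries")

# Plain string comparison stands in for the custom-alphabet key here.
ordered = sorted(container, key=lambda elem: elem.findtext("k"))

# Detach every child, then re-attach in sorted order; no slice assignment needed.
for elem in list(container):
    container.remove(elem)
for elem in ordered:
    container.append(elem)

result = ET.tostring(root, encoding="unicode")
print(result)
```

Because the elements are re-attached to the same container, the definitions that follow each k element travel with their ar element.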
