python-xml: Not well-formed (invalid token) - xml.etree utf mode

Question

I have the following XML file structure:

<doc id="4611827073121129112">
<class name="tag:September_11" val="-0.079590" />
<class name="tag:Theater" val="-0.134223" />
<class name="tag:Adaptation" val="-0.106678" />
<class name="tag:Paranormal" val="-0.183504" />
<class name="tag:Magic" val="-0.179214" />
<class name="tag:Comedy_Drama" val="-0.044658" />
<class name="tag:Fashion" val="-0.280695" />
<class name="tag:Running" val="0.160316" />
<class name="tag:Construction" val="-0.072044" />
<class name="tag:Suspense_Thriller" val="-0.370302" />
<class name="tag:Space" val="-0.239723" />
<class name="tag:Police" val="-0.139019" />
<class name="tag:Hip-Hop_&_Rap_Music" val="-0.290353" />
<class name="tag:Surfing" val="-0.027105" />
<class name="tag:Halloween" val="-0.236606" />
<class name="tag:Mystery_&_Suspense" val="0.005384" />
<class name="tag:Educational" val="-0.166370" />
<class name="tag:Biography" val="-0.126149" />
<class name="tag:Religion" val="-0.034275" />
<class name="tag:Cartoon" val="-0.487169" />
<class name="tag:Auto_Racing" val="-0.047648" />
<class name="tag:Pets" val="-0.118809" />
</doc>

The file doesn't have an .xml extension; for example, the file name is test.values.

To try it out, I first used the Python shell. I am using the Anaconda distribution of Python:

Python 2.7.9 |Anaconda 2.1.0 (x86_64)| (default, Dec 15 2014, 10:37:34)
[GCC 4.2.1 (Apple Inc. build 5577)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://binstar.org

>>> import xml.etree.ElementTree as ET
>>> tree = ET.parse('test.values')

I am getting the following error:

xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 14, column 43

After a close look at that line:

<class name="tag:Hip-Hop_&_Rap_Music" val="-0.290353" />

I realized it contains an &, and my guess was that the xml.etree package doesn't open the file in UTF mode.

If I manually delete the &, things work fine. The problem is that I have to read and parse a large number of files. In my Google searches I couldn't find any example showing the etree package opening files in UTF-8 mode. How do I resolve this issue?

TextGeek · Accepted Answer · 2015-03-02 23:25:23Z

You're right that it's the &, but not that it has to do with Unicode (though perhaps Unicode issues could come up after you solve this one).

You can't have an ampersand or less-than sign inside an attribute value in XML unless you escape it (as &amp; or &lt; respectively). So whatever program wrote the XML should be fixed to detect and re-code those characters.
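
To see the rule in action, here is a minimal sketch that feeds ElementTree one line from the question twice: once with the bare & and once with it escaped as &amp;. The variable names (bad, good, bad_parsed, name) are only for illustration, and the surrounding <doc> element is trimmed down to a single child.

```python
import xml.etree.ElementTree as ET

# The same element as in the question, with and without the escape.
bad = '<doc><class name="tag:Hip-Hop_&_Rap_Music" val="-0.290353" /></doc>'
good = '<doc><class name="tag:Hip-Hop_&amp;_Rap_Music" val="-0.290353" /></doc>'

# A bare ampersand makes the document not well-formed.
try:
    ET.fromstring(bad)
    bad_parsed = True
except ET.ParseError:
    bad_parsed = False

# The escaped version parses, and the parser turns &amp; back into &.
root = ET.fromstring(good)
name = root.find('class').get('name')

print(bad_parsed)  # False
print(name)        # tag:Hip-Hop_&_Rap_Music
```

Note that the attribute value you get back contains a plain &, so nothing downstream has to know the file was escaped.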


2 Comments

add-semi-colons Over a year ago

These are data dumps from a 3rd party, so it's pretty much impossible to change the originating code. I guess I have to read line by line and strip those characters.

TextGeek Over a year ago

re.sub(r'&', '&amp;', s)
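
One caveat with a blanket replacement is that it would also rewrite ampersands that are already part of a valid entity (for example, &amp; would become &amp;amp;). A possible variation, sketched below, only escapes ampersands that do not already start a predefined or numeric entity. The helper names escape_bare_ampersands and parse_lenient are hypothetical, and the sketch assumes the dump files are UTF-8 encoded and small enough to read into memory.

```python
import io
import re
import xml.etree.ElementTree as ET

# Match an ampersand that is NOT the start of a predefined XML entity
# (&amp; &lt; &gt; &quot; &apos;) or a numeric character reference.
BARE_AMP = re.compile(r'&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)')


def escape_bare_ampersands(text):
    """Replace only unescaped ampersands with &amp;."""
    return BARE_AMP.sub('&amp;', text)


def parse_lenient(path):
    """Read a file, repair bare ampersands, and return the root element.

    Assumption: the file is UTF-8 encoded.
    """
    with io.open(path, encoding='utf-8') as handle:
        text = handle.read()
    # Hand ElementTree bytes so the same code works on Python 2 and 3.
    return ET.fromstring(escape_bare_ampersands(text).encode('utf-8'))
```

With this in place, something like root = parse_lenient('test.values') should stand in for the ET.parse call from the question. It returns the root element rather than an ElementTree object, and it only repairs ampersands; a stray < inside an attribute value would still fail to parse.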
