I have the following XML file structure:

<doc id="4611827073121129112">
<class name="tag:September_11" val="-0.079590" />
<class name="tag:Theater" val="-0.134223" />
<class name="tag:Adaptation" val="-0.106678" />
<class name="tag:Paranormal" val="-0.183504" />
<class name="tag:Magic" val="-0.179214" />
<class name="tag:Comedy_Drama" val="-0.044658" />
<class name="tag:Fashion" val="-0.280695" />
<class name="tag:Running" val="0.160316" />
<class name="tag:Construction" val="-0.072044" />
<class name="tag:Suspense_Thriller" val="-0.370302" />
<class name="tag:Space" val="-0.239723" />
<class name="tag:Police" val="-0.139019" />
<class name="tag:Hip-Hop_&_Rap_Music" val="-0.290353" />
<class name="tag:Surfing" val="-0.027105" />
<class name="tag:Halloween" val="-0.236606" />
<class name="tag:Mystery_&_Suspense" val="0.005384" />
<class name="tag:Educational" val="-0.166370" />
<class name="tag:Biography" val="-0.126149" />
<class name="tag:Religion" val="-0.034275" />
<class name="tag:Cartoon" val="-0.487169" />
<class name="tag:Auto_Racing" val="-0.047648" />
<class name="tag:Pets" val="-0.118809" />
</doc>

The file doesn't have an .xml extension; for example, the file name is test.values.

To try it out first, I decided to use the Python shell. I am using the Anaconda distribution of Python:

Python 2.7.9 |Anaconda 2.1.0 (x86_64)| (default, Dec 15 2014, 10:37:34)
[GCC 4.2.1 (Apple Inc. build 5577)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://binstar.org

>>> import xml.etree.ElementTree as ET
>>> tree = ET.parse('test.values')

I am getting the following error:

xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 14, column 43

After a close look at that line:

<class name="tag:Hip-Hop_&_Rap_Music" val="-0.290353" />

I realized it contains an &, and my guess was that the xml.etree package doesn't open the file in UTF-8 mode.

If I manually delete the &, things work fine. The problem is that I have to read and parse a large number of files. Searching Google, I couldn't find any examples that show the etree package opening files in UTF-8 mode. How do I resolve this issue?

1 Answer

You're right that it's the &, but not that it has to do with Unicode (though perhaps Unicode issues could come up after you solve this one).

You can't have ampersand or less-than inside an attribute value in XML, unless you escape it (as &amp; or &lt; respectively). So whatever program wrote the XML should be fixed to detect and re-code those characters.
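
To illustrate, here is a minimal sketch using one line from the question. The two sample strings are made up for the demonstration; the only difference between them is that the second escapes the ampersand. The parser should reject the first and accept the second, decoding &amp; back to a plain & in the attribute value.

```python
import xml.etree.ElementTree as ET

# Raw ampersand inside an attribute value: not well-formed XML.
bad = '<class name="tag:Hip-Hop_&_Rap_Music" val="-0.290353" />'
# Same element with the ampersand escaped as &amp;.
good = '<class name="tag:Hip-Hop_&amp;_Rap_Music" val="-0.290353" />'

try:
    ET.fromstring(bad)
    bad_parsed = True
except ET.ParseError:
    bad_parsed = False

elem = ET.fromstring(good)
name = elem.get('name')  # the parser turns &amp; back into &
```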

2 Comments

These are data dumps from a third party, so it's pretty much impossible to change the originating code. I guess I have to read line by line and strip those characters.
re.sub(r'&', '&amp;', s)
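
One caveat with a blanket substitution is that it would also rewrite ampersands that are already part of an entity, turning &amp; into &amp;amp;. A possible workaround, sketched below, escapes only the ampersands that do not already start an entity or character reference and then parses the repaired text. The names BARE_AMP, escape_bare_ampersands, and parse_lenient are hypothetical helpers, not part of xml.etree. The sketch works on bytes and assumes the files use an ASCII-compatible encoding such as UTF-8.

```python
import re
import xml.etree.ElementTree as ET

# An & that is NOT followed by "name;", "#123;" or "#x1F;".
# Hypothetical helper pattern; works on bytes so no decoding is needed.
BARE_AMP = re.compile(
    br'&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)'
)

def escape_bare_ampersands(data):
    """Replace stray ampersands with &amp;, leaving existing entities alone."""
    return BARE_AMP.sub(b'&amp;', data)

def parse_lenient(path):
    """Read a file as bytes, repair stray ampersands, return the root element."""
    with open(path, 'rb') as f:
        raw = f.read()
    return ET.fromstring(escape_bare_ampersands(raw))

# Small in-memory check using a line from the question.
sample = b'<doc id="1"><class name="tag:Hip-Hop_&_Rap_Music" val="-0.290353" /></doc>'
root = ET.fromstring(escape_bare_ampersands(sample))
```

With the sample file from the question, `parse_lenient('test.values')` would be the call to try. This only repairs ampersands; a raw less-than sign inside an attribute value, or a reference to an undefined entity, would still make the parser fail.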
