I have a 1 GB XML file that contains some invalid characters, such as an unescaped '&'. I want to parse it in Python. To do this, I used ElementTree as below:

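import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9

def main():
    # Stream the file instead of loading all 1 GB into memory.
    context = ET.iterparse('newscor.xml', events=("start", "end"))
    # The first event is the "start" of the root element.
    event, root = next(context)

    for event, elem in context:
        # Read text on "end": at "start" the element's text may not be parsed yet.
        if event == "end" and elem.tag == 'group':
            print(elem.text)
            # Drop already-processed children to keep memory bounded.
            root.clear()

main()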

But it raised the following error on the line `for event, elem in context:`

xml.etree.ElementTree.ParseError: not well-formed (invalid token)

To handle this error, I tried to use lxml with a parser created with recover=True, as described in this link. However, lxml's iterparse() does not take a parser argument.

Therefore, I also tried to use SAX, as in this solution, but I don't know where to call the escape method.

How can I skip or repair the invalid characters and still parse this large file?

  • Try to use lxml with the HTML parser instead of the standard XML parser. The HTML parser is more forgiving with invalid input. Alternatively you can try to use HTML tidy in XML mode to repair the file. There even is a Python package for it. Commented Nov 22, 2017 at 14:39
  • Or you can use Perl's or Python's regex support to pre-process your XML file and get rid of the & sign. Commented Nov 23, 2017 at 0:21
  • I solved this problem with the tidy tool (thanks Tomalak for your comment). Tidy converts the special character & to &amp;. Commented Nov 27, 2017 at 18:20
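
For reference, the regex pre-processing idea from the comments could look roughly like the sketch below. It is not the tidy-based fix that was actually used, only an illustration built on the standard library. The helper names (`escape_bare_ampersands`, `iter_group_text`) are made up for this example. It assumes the only problem in the file is bare '&' characters, that the file has line breaks so it can be read line by line, and that no CDATA sections contain ampersands that must stay untouched. Undefined named entities such as `&foo;` would still fail to parse.

```python
import re
import xml.etree.ElementTree as ET

# Matches an '&' that does NOT start a character or entity reference
# such as &amp; &#38; or &#x26; (hypothetical helper pattern).
BARE_AMP = re.compile(r"&(?!(?:#\d+|#x[0-9A-Fa-f]+|[A-Za-z_][\w.-]*);)")


def escape_bare_ampersands(line):
    """Replace every bare '&' in one line of text with '&amp;'."""
    return BARE_AMP.sub("&amp;", line)


def iter_group_text(lines, tag="group"):
    """Yield the text of each <tag> element, repairing lines on the fly.

    `lines` is any iterable of str, for example an open text file.
    """
    parser = ET.XMLPullParser(events=("start", "end"))
    root = None
    for line in lines:
        # Repair the line, then hand it to the incremental parser.
        parser.feed(escape_bare_ampersands(line))
        for event, elem in parser.read_events():
            if root is None:
                root = elem  # the first "start" event is the root element
            elif event == "end" and elem.tag == tag:
                yield elem.text
                root.clear()  # drop processed children to bound memory
    parser.close()
```

Usage would be along these lines, with the encoding being a guess about the file:

    with open('newscor.xml', encoding='utf-8') as f:
        for text in iter_group_text(f):
            print(text)

Because each line is repaired just before it is fed to the parser, no cleaned copy of the 1 GB file has to be written to disk.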
