Parsing a huge XML file with invalid characters

I have a 1 GB XML file that contains some invalid characters, such as an unescaped '&'. I want to parse it in Python. To do this, I used ElementTree as shown below:

import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9

def main():
    context = ET.iterparse('newscor.xml', events=("start", "end"))
    # The first event is the start of the root element; keep a reference to it.
    event, root = next(context)

    for event, elem in context:
        # Read text on "end": at "start" the element's text may not be parsed yet.
        if event == "end" and elem.tag == 'group':
            print(elem.text)
            # Drop already-processed children so memory use stays bounded.
            root.clear()

main()

But it gave me the following error on the line for event, elem in context:

xml.etree.ElementTree.ParseError: not well-formed (invalid token)

To handle this error, I tried to use lxml with a parser created with recover=True, as described in this link. However, iterparse() in lxml does not have a parameter for passing a parser.

Therefore, I also tried to use SAX, as in this solution, but I don't know where to use the escape method.
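
For context on the escape method, here is a minimal sketch, assuming the method in question is xml.sax.saxutils.escape from the standard library. It suggests why there is no obvious place for it in a parsing pipeline: escape is intended for text that is about to be written into XML, and it rewrites '<' and '>' as well as '&', so applying it to markup that already contains tags destroys the tags. The sample strings are illustrative only.

```python
from xml.sax.saxutils import escape

# escape() is meant for plain text content that will be placed inside XML.
text_only = escape("AT&T")

# Applied to markup, it also rewrites the angle brackets of the tags,
# so the result is no longer an element at all.
whole_fragment = escape("<group>a & b</group>")

# Applied to text that is already escaped, it escapes the '&' a second time.
double_escaped = escape("a &amp; b")

print(text_only)
print(whole_fragment)
print(double_escaped)
```

So a repair step for an existing file has to target only the stray '&' characters rather than run escape over whole lines.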

What can I use to avoid invalid characters and parse this large file?

edited Nov 22, 2017 at 15:14

Adam J

asked Nov 22, 2017 at 14:35

Arife Kübra

Try using lxml with the HTML parser instead of the standard XML parser. The HTML parser is more forgiving with invalid input. Alternatively, you can try HTML Tidy in XML mode to repair the file. There is even a Python package for it.

– Tomalak
Commented Nov 22, 2017 at 14:39

Or you can use Perl's or Python's regex support to pre-process your XML file and get rid of the & sign.

– vtd-xml-author
Commented Nov 23, 2017 at 0:21
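
The regex pre-processing idea can be combined with incremental parsing so that the 1 GB file never has to be loaded or rewritten in full. The following is a rough sketch under several assumptions, not a definitive fix: the names BARE_AMP, escape_bare_ampersands and iter_group_texts are hypothetical helpers, the only defect in the file is a stray '&', and no '&' occurs inside CDATA sections or comments (those would be altered too). Undefined named entities such as &nbsp; would still raise a ParseError.

```python
import io
import re
import xml.etree.ElementTree as ET

# An '&' that does NOT start a named, decimal, or hexadecimal reference.
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z_][\w.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def escape_bare_ampersands(text):
    """Replace stray '&' with '&amp;', leaving existing references alone."""
    return BARE_AMP.sub("&amp;", text)


def iter_group_texts(stream, tag="group"):
    """Yield the text of each <tag> element, repairing lines as they are read."""
    parser = ET.XMLPullParser(events=("end",))

    def drain():
        # Hand out every element that has been completely parsed so far.
        for _event, elem in parser.read_events():
            if elem.tag == tag:
                yield elem.text
                # Frees the element's text and children; the emptied element
                # itself stays attached to its parent.
                elem.clear()

    for line in stream:
        parser.feed(escape_bare_ampersands(line))
        yield from drain()
    parser.close()
    yield from drain()


# Small in-memory demonstration; for a real file, pass an open text file
# object instead, e.g. open('newscor.xml', encoding='utf-8').
sample = io.StringIO("<root>\n<group>AT&T</group>\n</root>\n")
print(list(iter_group_texts(sample)))
```

Because the repair happens line by line, memory use depends on the size of the parsed tree rather than on the raw file. For very large inputs it may also be worth clearing the root element periodically, as in the question's code, since emptied elements otherwise accumulate under it.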

I solved this problem with the tidy tool (thanks, Tomalak, for your comment). Tidy converts the special character & to &amp;.

– Arife Kübra
Commented Nov 27, 2017 at 18:20
