I'm trying to parse a large XML document and extract the <Text> tag content only.

XML document:

<?xml version="1.0" encoding="UTF-8"?>
<EchoroukonlineData>
<Echoroukonline>
 <ID>SHG_ARB_0000001</ID>
 <URL>http://www.echoroukonline.com/ara/articles/1.html</URL>
 <Headline>title</Headline>
 <Dateline>2008/02/22</Dateline>
 <Text>Text that should be parsed <!--><li><p><--></Text>
</Echoroukonline>
</EchoroukonlineData>

I'm using a SAX parser for this task, as follows:

import xml.sax
import pandas as pd
from xml.sax.saxutils import escape
articles = []

class articlesHandler(xml.sax.ContentHandler):
    def startElement(self, name, attrs):
        self.current = name
      
    def characters(self, content):
        if self.current == "Text":
            self.Text = content
            
    def endElement(self, name):
        if self.current == "Text":
            text=self.Text
            articles.append(text)
            
handler = articlesHandler()
parser = xml.sax.make_parser()
parser.setContentHandler(handler)
parser.parse('dataset.xml')

The problem is that the <Text> tag contains XML special characters such as < and >, and I want to ignore them. The function xml.sax.saxutils.escape(data) escapes special characters, so I used it in characters() as follows:

def characters(self, content):
    if self.current == "Text":
        self.Text = escape(content)

but it still doesn't work.
The error message: xml.sax._exceptions.SAXParseException: dataset.xml:8:1756: not well-formed (invalid token)

  • The XML in the question is well-formed, so the error message must be caused by a different XML document. Commented Dec 10, 2022 at 14:53
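
One way to check which input actually triggers the error is to catch the exception and read the position it reports. The sketch below is only an illustration: the helper name find_parse_error and the two sample strings are assumptions, not part of the original post. It relies on xml.sax.parseString and the getLineNumber(), getColumnNumber() and getMessage() methods of SAXParseException.

```python
import xml.sax


def find_parse_error(xml_bytes):
    """Return (line, column, message) for the first parse error, or None.

    Hypothetical helper for illustration; it parses with a do-nothing handler.
    """
    try:
        xml.sax.parseString(xml_bytes, xml.sax.ContentHandler())
    except xml.sax.SAXParseException as exc:
        return exc.getLineNumber(), exc.getColumnNumber(), exc.getMessage()
    return None


# The <Text> element from the question: the markup sits inside an XML comment.
sample_ok = b"<Text>Text that should be parsed <!--><li><p><--></Text>"

# An assumed example of input that is not well-formed: a bare "<" in text.
sample_bad = b"<Text>a < b</Text>"

print(find_parse_error(sample_ok))
print(find_parse_error(sample_bad))
```

Running the same check against the real dataset.xml (for example by reading the file as bytes) should point at the line and column that the parser rejects.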

1 Answer

import re

def characters(self, content):
    if self.current == "Text":
        # Drop "<", ">" and "," from the chunk before storing it
        self.Text = re.sub('[<>,]', '', content)

This removes any "<", ">" or "," characters from the content passed to characters() before storing it in self.Text.
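
Note that characters() only runs on text the parser has already accepted, so a "not well-formed" error is raised before any filtering in the handler can take effect. Separately, the parser may deliver one element's text in several characters() calls, and whitespace between elements also arrives there. A minimal sketch of a handler that accounts for both, assuming a hypothetical class name TextCollector and a shortened copy of the sample document:

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Collect the text of every <Text> element (hypothetical name)."""

    def __init__(self):
        super().__init__()
        self.in_text = False   # True only while inside a <Text> element
        self.parts = []        # chunks received for the current <Text>
        self.articles = []     # one joined string per <Text> element

    def startElement(self, name, attrs):
        if name == "Text":
            self.in_text = True
            self.parts = []

    def characters(self, content):
        # May be called several times per element, so accumulate chunks.
        if self.in_text:
            self.parts.append(content)

    def endElement(self, name):
        # Test the element being closed, not the last one that was opened.
        if name == "Text":
            self.articles.append("".join(self.parts))
            self.in_text = False


sample = b"""<EchoroukonlineData>
<Echoroukonline>
 <ID>SHG_ARB_0000001</ID>
 <Text>Text that should be parsed <!--><li><p><--></Text>
</Echoroukonline>
</EchoroukonlineData>"""

handler = TextCollector()
xml.sax.parseString(sample, handler)
print(handler.articles)
```

Because the markup in the sample sits inside an XML comment, it is not passed to characters() at all, so no stripping is needed for that input.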
