I'm trying to port some code to Python 3 that passes a parser created by the xml.sax.make_parser function as a second argument to xml.dom.minidom.parseString to parse an XML document.

In Python 3, the SAX parser seems unable to parse an XML document passed as bytes, but I cannot know the document's encoding before parsing it. To demonstrate:

import xml.sax
import xml.dom.minidom

def try_parse(input, parser=None):
    try:
        xml.dom.minidom.parseString(input, parser)
    except Exception as ex:
        print(ex)
    else:
        print("OK")

euro = u"\u20AC" # U+20AC EURO SIGN
xml_utf8 = b"<?xml version=\"1.0\" encoding=\"utf-8\"?>"
xml_cp1252 = b"<?xml version=\"1.0\" encoding=\"windows-1252\"?>"

test_cases = [
    b"<a>" + euro.encode("utf-8") + b"</a>",
    u"<a>" + euro + u"</a>",
    xml_utf8 + b"<a>" + euro.encode("utf-8") + b"</a>",
    xml_cp1252 + b"<a>" + euro.encode("cp1252") + b"</a>",
]

for i, case in enumerate(test_cases, 1):
    print("%d: %r" % (i, case))
    try_parse(case)
    try_parse(case, xml.sax.make_parser())

Python 2:

1: '<a>\xe2\x82\xac</a>'
OK
OK
2: u'<a>\u20ac</a>'
'ascii' codec can't encode character u'\u20ac' in position 3: ordinal not in range(128)
'ascii' codec can't encode character u'\u20ac' in position 3: ordinal not in range(128)
3: '<?xml version="1.0" encoding="utf-8"?><a>\xe2\x82\xac</a>'
OK
OK
4: '<?xml version="1.0" encoding="windows-1252"?><a>\x80</a>'
OK
OK

Python 3:

1: b'<a>\xe2\x82\xac</a>'
OK
initial_value must be str or None, not bytes
2: '<a>€</a>'
OK
OK
3: b'<?xml version="1.0" encoding="utf-8"?><a>\xe2\x82\xac</a>'
OK
initial_value must be str or None, not bytes
4: b'<?xml version="1.0" encoding="windows-1252"?><a>\x80</a>'
OK
initial_value must be str or None, not bytes

As you can see, the default parser is able to handle the bytes just fine, but I need the SAX parser to handle parameter entities. Is there any solution to this problem (other than trying to guess the encoding of the bytes before parsing)?

1 Answer

I seem to have found the cause of the problem. When a parser is supplied, xml.dom.minidom.parseString delegates (via _do_pulldom_parse) to xml.dom.pulldom.parseString, which wraps the XML document in a StringIO before parsing. StringIO accepts only str, hence the "initial_value must be str or None, not bytes" error. Swapping that StringIO for a BytesIO when the input is bytes solves the problem, so I guess I will use the following as a workaround:

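import xml.dom.minidom
import xml.dom.pulldom  # needed for DOMEventStream below
from io import StringIO, BytesIO

def parseMaybeBytes(string, parser):
    """Like xml.dom.pulldom.parseString, but also accepts bytes."""
    bufsize = len(string)
    # StringIO rejects bytes, so pick the stream class that matches the input.
    stream_class = BytesIO if isinstance(string, bytes) else StringIO
    buf = stream_class(string)
    return xml.dom.pulldom.DOMEventStream(buf, parser, bufsize)

def parseString(string, parser=None):
    """Parse a str or bytes object into a DOM, optionally with a SAX parser."""
    if parser is None:
        return xml.dom.minidom.parseString(string)

    # _do_pulldom_parse is a private minidom helper, so it may change between versions.
    return xml.dom.minidom._do_pulldom_parse(parseMaybeBytes, (string,),
                                             {'parser': parser})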
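
To check the workaround against the four test cases from the question, something like the following sketch should do. It restates the two helpers under hypothetical snake_case names (parse_maybe_bytes, parse_string) so that it runs on its own, and it still relies on the private _do_pulldom_parse helper, so treat it as an illustration under that assumption rather than a supported API. It also shows the root cause directly: StringIO refuses bytes.

```python
import io
import xml.sax
import xml.dom.minidom
import xml.dom.pulldom


def parse_maybe_bytes(data, parser):
    # Hypothetical name; same idea as parseMaybeBytes above:
    # pick the in-memory stream class that matches the input type.
    stream_class = io.BytesIO if isinstance(data, bytes) else io.StringIO
    return xml.dom.pulldom.DOMEventStream(stream_class(data), parser, len(data))


def parse_string(data, parser=None):
    # Hypothetical name; mirrors parseString above.
    if parser is None:
        return xml.dom.minidom.parseString(data)
    # Assumption: the private helper _do_pulldom_parse keeps its
    # (func, args, kwargs) signature in the Python version in use.
    return xml.dom.minidom._do_pulldom_parse(
        parse_maybe_bytes, (data,), {"parser": parser}
    )


# Root cause: StringIO only accepts str, never bytes.
try:
    io.StringIO(b"<a/>")
    stringio_rejects_bytes = False
except TypeError:
    stringio_rejects_bytes = True

euro = "\u20ac"  # U+20AC EURO SIGN
cases = [
    b"<a>" + euro.encode("utf-8") + b"</a>",
    "<a>" + euro + "</a>",
    b'<?xml version="1.0" encoding="utf-8"?><a>' + euro.encode("utf-8") + b"</a>",
    b'<?xml version="1.0" encoding="windows-1252"?><a>'
    + euro.encode("cp1252") + b"</a>",
]

results = []
for case in cases:
    doc = parse_string(case, xml.sax.make_parser())
    # Join all text children in case the parser splits the character data.
    text = "".join(node.data for node in doc.documentElement.childNodes)
    results.append(text)

print(stringio_rejects_bytes)
print(results)
```

Each case is parsed with a fresh SAX parser, and the text content of the root element is collected so the decoded character can be compared across encodings.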