I am trying to parse an XML document that contains the hexadecimal character reference &#x1d4c5;, which represents the mathematical symbol 𝓅 (U+1D4C5). The output I am getting is &#55349;&#56517;.

What am I doing wrong?

Example input XML:

<?xml version="1.0" encoding="UTF-8"?>
<root>
    <data>&#x1d4c5;</data>
</root>

Output:

<?xml version="1.0" encoding="UTF-8"?>
<root>
    <data>&#55349;&#56517;</data>
</root>

Code to obtain the XML reader:

factory = org.apache.xerces.jaxp.SAXParserFactoryImpl.newInstance();
final XMLReader xmlReader;
        xmlReader = factory.newSAXParser().getXMLReader();

The input is decoded as UTF-8 during parsing.

This is the method I use to read and write the XML:

public void readAndWriteXml(InputSource inputSource, OutputStream out) throws IOException, SAXException, ParserConfigurationException {

            XMLReader xmlReader = getXmlReader();
            Serializer serializer = SerializerFactory.getSerializer(configProps);
            serializer.setOutputStream(out);
            xmlReader.setContentHandler(serializer.asContentHandler());

            if(logger != null){
                getLogger().debug("starting xml parsing" + LocalTime.now());
            }
            xmlReader.parse(inputSource);
            if(logger != null){
                getLogger().debug("end xml parsing" + LocalTime.now());
            }

        }

getXmlReader() is this:

final XMLReader xmlReader;
        xmlReader = factory.newSAXParser().getXMLReader();
        xmlReader.setFeature("http://xml.org/sax/features/namespace-prefixes", true);
        xmlReader.setFeature("http://xml.org/sax/features" +
                "/namespaces", true);
        xmlReader.setFeature("http://xml.org/sax/features/external-parameter-entities", true);
//        xmlReader.setFeature("http://xml.org/sax/features/validation", true);
        xmlReader.setEntityResolver(wrappedEntityResolver);
        xmlReader.setErrorHandler(new SaxErrorHandler());
        return xmlReader;

This is where I initialise the class:

public XmlNormalizer(String catalogPath) throws IOException {
        // We want the Apache XML parser, not the embedded Oracle Java version.
        factory = org.apache.xerces.jaxp.SAXParserFactoryImpl.newInstance();
        factory.setNamespaceAware(true);
        List<Path> catalogFiles = this.findByFileName(new File(catalogPath).toPath(), CATALOG_FILENAME_PATTERN);
        String[] catalogArray = catalogFiles.stream().map(Path::toString).toArray(String[]::new);
        configProps = OutputPropertiesFactory.getDefaultMethodProperties("xml");
        XMLCatalogResolver xmlCatalogResolver = new XMLCatalogResolver(catalogArray, true);
        wrappedEntityResolver = new WrappedEntityResolver(xmlCatalogResolver);
    }

WrappedEntityResolver is just a wrapper around org.apache.xerces.util.XMLCatalogResolver.

  • Java's chars are 16 bits long. Characters with a Unicode code point above 65535 must be represented by two Java chars (a surrogate pair). Commented Sep 18, 2023 at 10:33
  • Is there a way to get the mathematical symbol in the output in java? @MauricePerry Commented Sep 18, 2023 at 10:36
  • Have a look at that: stackoverflow.com/a/36394647/7036419 Commented Sep 18, 2023 at 10:37
  • Can you post a little more of your code to demonstrate how the content is read and written? Commented Sep 18, 2023 at 11:13
  • What is in configProps? What is wrappedEntityResolver? Commented Sep 18, 2023 at 13:23
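The first comment's point about 16-bit chars can be checked directly. The following is a minimal sketch using only java.lang.Character; the class name SurrogateDemo and the helper utf16Units are made up for illustration. It shows that the two numbers in the faulty output are exactly the UTF-16 code units Java uses internally for U+1D4C5.

```java
public class SurrogateDemo {

    /** Returns the UTF-16 code units Java uses to store the given code point. */
    static int[] utf16Units(int codePoint) {
        char[] units = Character.toChars(codePoint);
        int[] out = new int[units.length];
        for (int i = 0; i < units.length; i++) {
            out[i] = units[i];
        }
        return out;
    }

    public static void main(String[] args) {
        int codePoint = 0x1D4C5; // the character referenced by &#x1d4c5;
        String s = new String(Character.toChars(codePoint));

        System.out.println(s.length());                      // 2 (chars, i.e. UTF-16 code units)
        System.out.println(s.codePointCount(0, s.length())); // 1 (actual character)

        int[] units = utf16Units(codePoint);
        System.out.println(units[0] + " " + units[1]);       // 55349 56517

        // Recombining the two surrogates gives back the original code point.
        int recombined = Character.toCodePoint((char) units[0], (char) units[1]);
        System.out.println(Integer.toHexString(recombined)); // 1d4c5
    }
}
```

So the serializer appears to be escaping each char separately instead of first combining the pair into one code point.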

1 Answer

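That output is most definitely wrong: &#55349; and &#56517; are the two UTF-16 surrogates of U+1D4C5, and XML does not allow character references to surrogates. It's hard to tell why the serializer produces them, though.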
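As a rough way to confirm that the output is not just unusual but not well-formed, the sketch below applies the Char production from the XML 1.0 specification and then feeds both documents to the JDK's default SAX parser. The class name XmlCharCheck and its two helpers are hypothetical, and the parser used is whatever SAXParserFactory.newInstance() returns in your environment.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class XmlCharCheck {

    /** XML 1.0 Char: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns true if the default SAX parser accepts the document without a fatal error. */
    static boolean isWellFormed(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                    new DefaultHandler()); // DefaultHandler rethrows fatal errors
            return true;
        } catch (SAXException e) {
            return false; // not well-formed
        } catch (Exception e) {
            throw new IllegalStateException(e); // configuration or I/O problem
        }
    }

    public static void main(String[] args) {
        System.out.println(isXmlChar(0x1D4C5)); // true
        System.out.println(isXmlChar(55349));   // false (high surrogate)
        System.out.println(isXmlChar(56517));   // false (low surrogate)

        System.out.println(isWellFormed("<root><data>&#x1d4c5;</data></root>"));        // true
        System.out.println(isWellFormed("<root><data>&#55349;&#56517;</data></root>")); // false
    }
}
```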

What are the properties passed to the serializer?

If you serialize with Saxon, then with the default encoding (UTF-8) the output is:

<?xml version="1.0" encoding="UTF-8"?><root>
   <data>𝓅</data>
</root>

while with encoding=us-ascii the output is:

<?xml version="1.0" encoding="UTF-8"?>
<root>
    <data>&#x1d4c5;</data>
</root>
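The encoding switch described above can also be tried through the standard JAXP API, which Saxon implements. This is only a sketch under assumptions: the class IdentitySerialize and its helpers are invented for illustration, and TransformerFactory.newInstance() returns whichever implementation is on the classpath (Saxon if its jar is present, otherwise the JDK's built-in one). The exact form of the output may differ between implementations, so the sketch prints the result and re-parses it rather than assuming a particular form.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class IdentitySerialize {

    /** Runs an identity transformation and serializes the result in the given encoding. */
    static String serialize(String xml, String encoding) {
        try {
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.ENCODING, encoding);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            identity.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString(encoding);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Parses the document and returns the first code point of the root element's text. */
    static int firstTextCodePoint(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent().trim().codePointAt(0);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String input = "<root><data>&#x1d4c5;</data></root>";
        for (String encoding : new String[] {"UTF-8", "US-ASCII"}) {
            String output = serialize(input, encoding);
            System.out.println(encoding + ": " + output);
            // A correct serializer must round-trip to the same code point.
            System.out.println(Integer.toHexString(firstTextCodePoint(output)));
        }
    }
}
```

Whether the character is written literally or as a single numeric reference, re-parsing should yield 1d4c5. A pair of surrogate references would instead fail to parse.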

2 Comments

Thanks a lot @Michael, it worked with Saxon. Is there a way to pass some properties to the serializer I am using above? I wonder why it resolves the character like that.
configProps contains the relevant properties and it would be useful to know what values you are setting, but this doesn't explain the bug.
