I have a Java program that was working perfectly in Corretto 17, but is now having character set encoding issues in Corretto 25.
I am reading UTF-8 encoded XML from an external API. The code is quite simple: I open an HttpURLConnection and I have a class that extends DefaultHandler:
URI uri = new URI(url);
URL authenticatedURL = uri.toURL();
HttpURLConnection connection = (HttpURLConnection) authenticatedURL.openConnection();
connection.setRequestMethod("GET");
connection.setRequestProperty("Authorization", "Bearer " + BEARER_TOKEN);
// connection.setRequestProperty("Accept-Charset", "UTF-8"); // This line of code seems to have no impact.
[...]
InputStream connectionInputStream = connection.getInputStream();
InputSource connectionInputSource = new InputSource(connectionInputStream);
connectionInputSource.setEncoding(StandardCharsets.UTF_8.name()); // name() is the canonical charset name; displayName() may be localized
parser.parse(connectionInputSource, dh);
// parser.parse(connection.getInputStream(), dh); // This 1 line seems to work the same as the above 4 lines.
My understanding is that Java uses UTF-16 for all Strings, but also it assumes some inputs (e.g. XML) will be in UTF-8, for instance the Attributes class used by DefaultHandler. I'm assuming this UTF-8 assumption is why explicitly setting the charset / encoding in the code above makes no difference. Is this correct?
The issue I'm having is I don't understand when / how the UTF-8 I read in is converted to UTF-16. For instance, in my extension of DefaultHandler, attributes.getValue() seems to return a UTF-8 encoded String, but Float.parseFloat() works perfectly:
public void startElement(String uri, String localName, String qName, Attributes attributes) {
    if ("name".equals(qName)) {
        if ("primary".equals(attributes.getValue("type"))) {
            if (attributeHasValue(attributes, "value")) {
                primaryName = attributes.getValue("value"); // Seems to store UTF-8 string.
            }
        }
    } else if ("averageweight".equals(qName)) {
        if (attributeHasValue(attributes, "value")) {
            averageWeight = Float.parseFloat(attributes.getValue("value"));
        }
[...]
Output (when I print the values with System.out.println):
Primary name = Orl�ans // NOT GOOD!
Average weight = 3.0137 // But Floats and Integers are parsed just fine.
I suppose I have to explicitly convert the value returned by attributes.getValue() from UTF-8 to UTF-16, is that correct?
But, if so, why are numeric values being parsed correctly?
And, I currently assume the qName and localName parameters are provided in UTF-16, is that correct?
I'm just confused in general / don't have the right mental model of what the SAXParser + DefaultHandler are doing encoding-wise, because I don't understand how most of my code is working if the encoding is wrong everywhere.
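One possible reason the numbers look fine while the accented letter does not, offered as a sketch rather than a diagnosis: the digits and the decimal point are plain ASCII, and ASCII characters have the same byte values in UTF-8, ISO-8859-1 and most other common charsets, so they survive an encoding mismatch that mangles "é". The class and method names below (AsciiSurvivesDemo, misdecode, codeUnits) are made up for illustration, and the choice of ISO-8859-1 as the "wrong" charset is an assumption; the actual mismatch in this program, if there is one, may sit elsewhere (for example between the JVM and the console).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class AsciiSurvivesDemo {

    // Encodes text with one charset and decodes the bytes with another,
    // simulating a writer and a reader that disagree about the encoding.
    static String misdecode(String text, Charset writtenAs, Charset readAs) {
        byte[] bytes = text.getBytes(writtenAs);
        return new String(bytes, readAs);
    }

    // Renders each UTF-16 code unit as U+XXXX so the output is pure ASCII
    // and cannot itself be mangled by the console.
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(String.format("U+%04X ", (int) c));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String number = "3.0137";
        String name = "Orl\u00E9ans";

        // ASCII-only text is byte-identical in both charsets, so it round-trips.
        String numberAfter = misdecode(number, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println("number unchanged: " + number.equals(numberAfter));
        System.out.println("parsed: " + Float.parseFloat(numberAfter));

        // The single ISO-8859-1 byte for U+00E9 is not valid UTF-8 on its own,
        // so the decoder substitutes the replacement character U+FFFD.
        String nameAfter = misdecode(name, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println("name unchanged: " + name.equals(nameAfter));
        System.out.println("before: " + codeUnits(name));
        System.out.println("after:  " + codeUnits(nameAfter));
    }
}
```

If this is what is happening, no manual UTF-8 to UTF-16 conversion of the value returned by attributes.getValue() would be needed; the String would already be correct and only the step that turns it back into bytes would be at fault.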
What do you see if you run System.out.println("Orl\u00E9ans");? That's definitely the right string value... so if that appears the same way in your console, it's probably nothing to do with the XML side of things...
Check System.getProperty("file.encoding") to make sure it's utf-8 and not something else.
Does passing -Dstdout.encoding=UTF-8 at launch solve the problem?
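The checks suggested in these comments can be combined into one small diagnostic. This is a minimal sketch, not the asker's code: the class name SaxEncodingCheck, the helper parsePrimaryName and the inline sample XML are all hypothetical, and the XML is built in memory instead of being fetched from the API. It parses UTF-8 bytes with SAX, dumps the attribute value as ASCII-only code points so the console cannot distort the evidence, and then prints the same String through the default System.out and through a PrintStream that is explicitly UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class SaxEncodingCheck {

    // Parses the given XML bytes and returns the "value" attribute of the
    // first <name> element, or null if there is none.
    static String parsePrimaryName(byte[] xmlBytes) {
        final String[] result = new String[1];
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attributes) {
                if ("name".equals(qName) && result[0] == null) {
                    result[0] = attributes.getValue("value");
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new ByteArrayInputStream(xmlBytes), handler);
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
        return result[0];
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<item><name type=\"primary\" value=\"Orl\u00E9ans\"/></item>";
        String name = parsePrimaryName(xml.getBytes(StandardCharsets.UTF_8));

        // ASCII-only evidence of what the parser produced.
        System.out.println("equals expected: " + "Orl\u00E9ans".equals(name));
        StringBuilder codePoints = new StringBuilder();
        name.codePoints().forEach(cp -> codePoints.append(String.format("U+%04X ", cp)));
        System.out.println("code points: " + codePoints.toString().trim());

        // Encoding-related system properties; a property that the running
        // JDK does not define prints as null.
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("native.encoding = " + System.getProperty("native.encoding"));
        System.out.println("stdout.encoding = " + System.getProperty("stdout.encoding"));

        // Same String, two output encodings: the JVM default for System.out
        // versus an explicit UTF-8 wrapper around the same stream.
        System.out.println("default stdout: " + name);
        PrintStream utf8Out = new PrintStream(System.out, true, StandardCharsets.UTF_8);
        utf8Out.println("utf-8 stdout:   " + name);
    }
}
```

If the first line reports true and only the "default stdout" line is garbled, the parser and the String are fine and the mismatch is between the encoding System.out uses and the one the console expects, which is the case the -Dstdout.encoding=UTF-8 suggestion targets.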