
I have a Java program that was working perfectly in Corretto 17, but is now having character set encoding issues in Corretto 25.

I am reading UTF-8 encoded XML from an external API. The code is quite simple: I open an HttpURLConnection and parse the response with a class that extends DefaultHandler:

URI uri = new URI(url);
URL authenticatedURL = uri.toURL();
HttpURLConnection connection = (HttpURLConnection) authenticatedURL.openConnection();
connection.setRequestMethod("GET");
connection.setRequestProperty("Authorization", "Bearer " + BEARER_TOKEN);
// connection.setRequestProperty("Accept-Charset", "UTF-8");  // This line of code seems to have no impact.
[...]
InputStream connectionInputStream = connection.getInputStream();
InputSource connectionInputSource = new InputSource(connectionInputStream);
connectionInputSource.setEncoding(StandardCharsets.UTF_8.name());  // name() is the canonical charset name; displayName() may be localized.
parser.parse(connectionInputSource, dh);
// parser.parse(connection.getInputStream(), dh);  // This 1 line seems to work the same as the above 4 lines.

My understanding is that Java uses UTF-16 for all Strings, but also it assumes some inputs (e.g. XML) will be in UTF-8, for instance the Attributes class used by DefaultHandler. I'm assuming this UTF-8 assumption is why explicitly setting the charset / encoding in the code above makes no difference. Is this correct?

The issue I'm having is I don't understand when / how the UTF-8 I read in is converted to UTF-16. For instance, in my extension of DefaultHandler, attributes.getValue() seems to return a UTF-8 encoded String, but Float.parseFloat() works perfectly:

public void startElement(String uri, String localName, String qName, Attributes attributes) {
  if ("name".equals(qName)) {
    if ("primary".equals(attributes.getValue("type"))) {
      if (attributeHasValue(attributes, "value")) {
        primaryName = attributes.getValue("value");  // Seems to store UTF-8 string.
      }
    }
  } else if ("averageweight".equals(qName)) {
    if (attributeHasValue(attributes, "value")) {
      averageWeight = Float.parseFloat(attributes.getValue("value"));
    }
  [...]

Outputs (when I print the values to System.out.println):

Primary name = Orl�ans   // NOT GOOD!
Average weight = 3.0137  // But Floats and Integers are parsed just fine.

I suppose I have to explicitly convert the value returned by attributes.getValue() from UTF-8 to UTF-16, is that correct?

But, if so, why are numeric values being parsed correctly?

And, I currently assume the qName and localName parameters are provided in UTF-16, is that correct?

I'm just confused in general / don't have the right mental model of what the SAXParser + DefaultHandler are doing encoding-wise, because I don't understand how most of my code is working if the encoding is wrong everywhere.

  • "Outputs (when I print the values to System.out.println)" - are you sure the problem isn't just that your console doesn't support the value properly? What happens if you write System.out.println("Orl\u00E9ans");? That's definitely the right string value... so if that appears the same way in your console, it's probably nothing to do with the XML side of things... Commented Oct 9 at 17:33
  • Also, check the value of System.getProperty("file.encoding") to make sure it's utf-8 and not something else. Commented Oct 9 at 17:38
  • @JonSkeet Good catch, it did output "Orl�ans". But now I'm even more confused, because it is the same IDE (Eclipse), code, tests, etc. The only thing I did was upgrade the Java (Corretto) version from 17 to 25. Why would that adversely affect System.out.println when printing \u00E9? (I guess I need to start a new question...) Commented Oct 9 at 17:54
  • @PhilipH Does setting -Dstdout.encoding=UTF-8 at launch solve the problem? Commented Oct 9 at 21:41
  • @Slaw This was indeed the issue (see my posted answer for more details). Thanks! Commented Oct 11 at 4:06

2 Answers


My understanding is that Java uses UTF-16 for all Strings

The API docs do say "A String represents a string in the UTF-16 format". Strings are sequences of chars. chars are 16-bit unsigned integers, and in most contexts they are interpreted as storing UTF-16 code units. However, it is easy to produce Strings that contain invalid UTF-16 code sequences, and such strings are valid as far as Java is concerned.

[...] but also it assumes some inputs (e.g. XML) will be in UTF-8, for instance the Attributes class used by DefaultHandler

The runtime representation of Strings is a characteristic of the Java language (java.lang.String being among the very few classes addressed by the language and VM specs). On the other hand, the behavior and assumptions of org.xml.sax.helpers.DefaultHandler are characteristics specifically of that class, not of "Java". Comparing these is a category error.

Moreover, no, DefaultHandler does not assume any particular implementation of the org.xml.sax.Attributes interface, and it is not sensitive to the implementation details of Attributes objects passed to it (its startElement() method being the only one that receives these). DefaultHandler's own methods do nothing with the SAX objects passed to them.

If you initialize an InputSource with an InputStream (byte stream) instead of with a Reader (character stream) then it is up to the XMLReader you use to parse it to arrange for decoding the bytes to characters. The API docs don't say how it will do this, but you should expect a quality implementation to

  • use the encoding set on the InputSource if there is one, otherwise
  • use the encoding named in the XML declaration (<?xml ... encoding="..."?>) at the start of the input if there is one and the reader can successfully read it, otherwise
  • use UTF-8, at least if that seems to be working, because that's the default defined by the XML specifications.
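To make the second and third bullets concrete, here is a minimal sketch that hands the parser raw bytes and lets it work out the decoding by itself. The class name DeclaredEncodingParse and the tiny sample documents are made up for illustration, and the behavior shown is what I would expect from the parser bundled with the JDK, not something the SAX API docs guarantee.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of the original question's code.
final class DeclaredEncodingParse {

    // Encodes the XML text with wireCharset, then gives the parser only the
    // bytes. Returns the "value" attribute of the first element it sees.
    static String parseRootValue(String xml, Charset wireCharset) throws Exception {
        final String[] value = new String[1];
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                if (value[0] == null) {
                    value[0] = attributes.getValue("value");
                }
            }
        };
        byte[] bytes = xml.getBytes(wireCharset);
        // No encoding is set on the InputSource, so the parser must decide.
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new ByteArrayInputStream(bytes)), handler);
        return value[0];
    }

    public static void main(String[] args) throws Exception {
        // Document that declares ISO-8859-1 and really is ISO-8859-1 on the wire.
        String latin1Doc = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                + "<name value=\"Orl\u00E9ans\"/>";
        // Document with no declaration at all, sent as UTF-8 (the XML default).
        String utf8Doc = "<name value=\"Orl\u00E9ans\"/>";

        System.out.println("Orl\u00E9ans".equals(
                parseRootValue(latin1Doc, StandardCharsets.ISO_8859_1)));
        System.out.println("Orl\u00E9ans".equals(
                parseRootValue(utf8Doc, StandardCharsets.UTF_8)));
    }
}
```

In both cases the handler receives an ordinary String; the byte-level encoding has already been dealt with by the time startElement() runs.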

I don't understand when / how the UTF-8 I read in is converted to UTF-16.

Most likely, the XMLReader will use the specified or default encoding to construct an InputStreamReader around the InputStream, which it will use to read the data, so that character decoding is performed on the fly as it reads. You could do much the same yourself to initialize the InputSource with a Reader instead of an InputStream.

Definitely the decoding will be performed by or on behalf of the XMLReader, before it constructs the various SAX objects that it passes to your handler's callbacks.
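If you would rather not rely on the parser's choice, a sketch of the Reader-based approach looks like the following. The class and method names (ReaderBackedParse, firstNameValue) are invented for this example, and the in-memory byte array stands in for connection.getInputStream(); the only assumption about your data is that the response body really is UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of the original question's code.
final class ReaderBackedParse {

    // Decodes the stream as UTF-8 ourselves and returns the "value"
    // attribute of the first <name> element.
    static String firstNameValue(InputStream in) throws Exception {
        final String[] result = new String[1];
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                if (result[0] == null && "name".equals(qName)) {
                    result[0] = attributes.getValue("value");
                }
            }
        };
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        // The InputStreamReader performs the byte-to-char decoding, so the
        // parser only ever sees characters.
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            parser.parse(new InputSource(reader), handler);
        }
        return result[0];
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<item><name type=\"primary\" value=\"Orl\u00E9ans\"/></item>";
        // Stand-in for the HTTP response body.
        InputStream body = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        String value = firstNameValue(body);
        System.out.println("Orl\u00E9ans".equals(value));
        System.out.println(value.length());
    }
}
```

The trade-off is that a Reader takes the decision away from the parser: if the server ever sends a different encoding, this version will decode it as UTF-8 regardless of what the XML declaration says.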

For instance, in my extension of DefaultHandler, attributes.getValue seems to return a UTF-8 encoded String, but Float.parseFloat works perfectly

There is no such thing as a UTF-8 encoded java.lang.String object. The character you see is the Unicode replacement character (which will have appeared in the String in its UTF-16 form). Its appearance in this context almost surely corresponds to an invalid code sequence in the input. It follows that the document in fact is not (wholly) encoded in UTF-8.

It may be that the document is generally in UTF-8, but this character is encoded erroneously. Or it may be that the document is encoded in some other character set (Windows 1252 or ISO-8859-1 being strong possibilities). Those alternatives are not necessarily well distinguished, but I would call it an erroneous code sequence if the document specifies or defaults to UTF-8. I would call it a different encoding only if the document contains an XML declaration specifying an encoding different from UTF-8.

I suppose I have to explicitly convert the value returned by attributes.getValue from UTF-8 to UTF-16. Is that correct?

No, it is not correct. You already have UTF-16. And at that point, you no longer have the original bytes so as to reinterpret them.
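A small sketch may help with the mental model here. It assumes nothing beyond the standard library; the class name StringUnitsDemo is made up. It shows that a String is measured in chars, that bytes only appear when you explicitly encode, and that once a replacement character is in a String the original byte cannot be recovered from it.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration only.
final class StringUnitsDemo {

    // Number of UTF-16 code units (chars) in the string.
    static int charCount(String s) {
        return s.length();
    }

    // Number of bytes produced when the string is encoded as UTF-8.
    static int utf8ByteCount(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // First byte produced for the char at the given index when the string
    // is encoded as ISO-8859-1, as an unsigned value.
    static int latin1ByteAt(String s, int index) {
        return s.getBytes(StandardCharsets.ISO_8859_1)[index] & 0xFF;
    }

    public static void main(String[] args) {
        String good = "Orl\u00E9ans";
        String damaged = "Orl\uFFFDans"; // already contains U+FFFD

        System.out.println(charCount(good));      // 7 chars
        System.out.println(utf8ByteCount(good));  // 8 bytes, since U+00E9 needs two in UTF-8

        // Encoding the good string yields the expected Latin-1 byte 0xE9 ...
        System.out.println(Integer.toHexString(latin1ByteAt(good, 3)));
        // ... but U+FFFD has no Latin-1 form, so the encoder substitutes '?'.
        System.out.println(Integer.toHexString(latin1ByteAt(damaged, 3)));
    }
}
```

So any repair has to happen where the bytes are decoded, not afterwards on the String.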

And I currently assume the qName and localName parameters are provided in UTF-16. Is that correct?

You're dealing with Strings. These are sequences of (16-bit) char, not of bytes. You have no character decoding to perform on them.

I don't understand how most of my code is working if the encoding is wrong everywhere.

Most likely, the document is (slightly) wrong, not your code. Character decoding must already have been handled by the time you get strings. There are numerous character encoding schemes that are congruent with UTF-8 (and US-ASCII) for code points 0 - 127. I mentioned two of them earlier. Chances are good that one of these is in use instead of UTF-8, at least for the one bad character you observed. That mis-encoding could be happening in the web server software, or below it, at the underlying source of the document.
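The congruence below code point 128 is also why the numeric values never show a problem. As a rough illustration (the class name AsciiCongruence is invented, and ISO-8859-1 is used as one representative of the ASCII-compatible encodings mentioned above):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class for illustration only.
final class AsciiCongruence {

    // True if the text encodes to exactly the same bytes in UTF-8 and ISO-8859-1.
    static boolean sameBytesInBoth(String s) {
        return Arrays.equals(
                s.getBytes(StandardCharsets.UTF_8),
                s.getBytes(StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) {
        // Digits and '.' are ASCII, so the two encodings agree byte for byte.
        System.out.println(sameBytesInBoth("3.0137"));       // true
        // U+00E9 is outside ASCII, so the encodings diverge.
        System.out.println(sameBytesInBoth("Orl\u00E9ans")); // false
    }
}
```

Any mix-up between such encodings is therefore invisible for "3.0137" and only surfaces on characters like é.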

Sign up to request clarification or add additional context in comments.

2 Comments

Note that you can confirm that the error is in the document, and maybe even figure out what encoding is actually being used, by capturing and examining the raw bytes of the HTTP response body. You might even find that the web server provides a content-type header that tells you that encoding (or alternatively, one that is incorrect in that respect). Note well that accept-charset on the request is advisory only, in the sense that it doesn't guarantee that you will get a response encoded with one of the charsets you designate. That's outside your control.
Thanks John, the above helped improve my understanding. In the end though, it wasn't the document (or my code) that was wrong, but rather an Eclipse standard output bug, easily solved by adding the flag "-Dstdout.encoding=UTF-8".
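For anyone who wants to run the raw-byte check suggested in the first comment above, here is one possible sketch. The class name RawBytesCheck and its helpers are hypothetical, and the hard-coded arrays stand in for a captured response body (which you could obtain with something like connection.getInputStream().readAllBytes() before any parsing).

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration only.
final class RawBytesCheck {

    // Strict check: returns false on the first malformed UTF-8 sequence
    // instead of silently substituting U+FFFD.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Space-separated upper-case hex dump, e.g. "4F 72 6C".
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] asUtf8 = "Orl\u00E9ans".getBytes(StandardCharsets.UTF_8);
        byte[] asLatin1 = "Orl\u00E9ans".getBytes(StandardCharsets.ISO_8859_1);

        // In UTF-8 the é shows up as the two bytes C3 A9.
        System.out.println(toHex(asUtf8) + "  valid UTF-8: " + isValidUtf8(asUtf8));
        // In ISO-8859-1 (and Cp1252) it is the single byte E9, which is not valid UTF-8 here.
        System.out.println(toHex(asLatin1) + "  valid UTF-8: " + isValidUtf8(asLatin1));
    }
}
```

If the captured body passes the strict check, the document is fine and the problem lies further downstream, which is what turned out to be the case here.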

Thanks to Slaw for putting me on the right path. It turns out adding -Dstdout.encoding=UTF-8 to the Run configuration (VM arguments) in Eclipse fixed the issue.

When I didn't set that flag, the standard output encoding was Cp1252:

System.out.println("Standard Output Encoding: " + System.getProperty("stdout.encoding"));

Outputs "Standard Output Encoding: Cp1252" before adding "-Dstdout.encoding=UTF-8"
Outputs "Standard Output Encoding: UTF-8" after adding "-Dstdout.encoding=UTF-8"
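To see why the mismatch produces exactly the "Orl�ans" symptom, here is a small simulation. It is only a sketch: the class name StdoutMismatchDemo is made up, a ByteArrayOutputStream plays the role of the console pipe, and ISO-8859-1 stands in for Cp1252 (both encode é as the single byte 0xE9).

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration only.
final class StdoutMismatchDemo {

    // Bytes a PrintStream emits for the text when it encodes with streamCharset.
    static byte[] printedBytes(String text, Charset streamCharset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(sink, true, streamCharset);
        out.print(text);
        out.flush();
        return sink.toByteArray();
    }

    // What a console would display if it decoded those bytes with consoleCharset.
    static String asSeenBy(byte[] bytes, Charset consoleCharset) {
        return new String(bytes, consoleCharset);
    }

    public static void main(String[] args) {
        String name = "Orl\u00E9ans";

        // JVM writes single-byte Latin-1 style output, console expects UTF-8.
        String mismatched = asSeenBy(
                printedBytes(name, StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
        System.out.println(mismatched.equals("Orl\uFFFDans")); // true: é became U+FFFD

        // Both sides agree on UTF-8: the text survives intact.
        String matched = asSeenBy(
                printedBytes(name, StandardCharsets.UTF_8), StandardCharsets.UTF_8);
        System.out.println(matched.equals(name)); // true
    }
}
```

The String itself was correct all along; only the last step, turning chars back into bytes for the console, went wrong. If you cannot change the VM arguments, wrapping standard output yourself with new PrintStream(new FileOutputStream(FileDescriptor.out), true, StandardCharsets.UTF_8) should have a similar effect, though I only tested the flag.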

Searching further on the internet, I found the following related resources:

  1. Eclipse bug report #530 in 2023.

  2. JEP 400. The bug only applies to Java 18+ because of the changes that JEP introduced, which is why my Java (Corretto) update from 17 to 25 triggered the issue.

So, with the flag, all my code works perfectly fine once more. No issues with the character encodings, logic, output, etc.

[Side note: The above bug was closed in July 2023, so this only happened because I've been neglecting my Eclipse updates for far longer than I'd thought.]
