
I have a UTF-8 encoded string "Château", and it gets converted to US-ASCII format as "Ch??teau" (in the underlying library of my app).

Now I want to get the original string "Château" back from the US-ASCII converted string "Ch??teau", but I am not able to do that using the code below.

StringBuilder masterBuffer = new StringBuilder();
byte[] rawDataBuffer = ...; // Read from InputStream; say here it is "Château"
String rawString = new String(rawDataBuffer, "UTF-8");
masterBuffer.append(rawString);
onMessageReceived(masterBuffer.toString().getBytes()); // getBytes() uses the platform's default charset, which here is US-ASCII

My application receives the US-ASCII encoded byte array. On the application side, even if I try to get a UTF-8 string out of it, it is of no use; the conversion attempt still gives "Ch??teau".

String asciiString = "Ch??teau";
String originalString = new String(asciiString.getBytes("UTF-8"), "UTF-8");
System.out.println("orinalString: " + originalString);

The value of originalString is still "Ch??teau".

Is this the right way to do this?

Thanks,

  • OK, so, first things first: there. is. no. such. thing. as. a. UTF-8. encoded. String. Java Strings store text data regardless of the character coding, and this means that your problem lies beyond the code you posted. Please paste the full code. Commented Dec 2, 2015 at 14:35
  • @fge Java's String (like C#'s, JavaScript's, …) is a counted sequence of UTF-16 code units, one or two of which encode a Unicode codepoint. (And there are apparently characters in the computer world that aren't in the Unicode character set.) Commented Dec 3, 2015 at 0:14
  • 1
    @TomBlodget: in the upcoming Java 9 next year, strings will not always store UTF-16 internally anymore. They will use ISO-8859-1 instead to compact memory usage when possible. Of course, public interfaces would still expect char and String methods to act on UTF-16 data, so there would have to be additional conversions performed at runtime to facilitate ISO-8859-1 based strings in UTF-16 based code logic. Commented Dec 3, 2015 at 2:20
  • @TomBlodget but that's an implementation detail. For all it's worth, elements of a String could be carrier pigeons; a String has no encoding. Commented Dec 3, 2015 at 5:56
  • @fge Not quite. It would be ideal if code that uses String were written that way, but as soon as you take the length, use an index, or do any other char-related operation, you have to deal with how many UTF-16 code units are in each individual Unicode codepoint. Commented Dec 3, 2015 at 12:29
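
A concrete illustration of that last point (this snippet is mine, not from the comment thread): a codepoint outside the Basic Multilingual Plane takes two UTF-16 code units in a Java String, so length() and codePointCount() disagree.

String clef = "\uD834\uDD1E"; // U+1D11E MUSICAL SYMBOL G CLEF, stored as a surrogate pair
System.out.println(clef.length());                         // 2 -- UTF-16 code units
System.out.println(clef.codePointCount(0, clef.length())); // 1 -- Unicode codepoint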

2 Answers

You can't. You lost information by converting to US-ASCII. You can't get back what was lost.
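
A quick sketch of what that loss looks like (my own illustration, not part of the original answer): the US-ASCII encoder replaces every character it cannot represent with '?' (byte 0x3F), so two different inputs can end up as exactly the same bytes. The two '?' in the question suggest the library saw the two UTF-8 bytes of 'â' as two separate characters before encoding, but the outcome is the same either way.

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

byte[] a = "Château".getBytes(StandardCharsets.US_ASCII); // unmappable 'â' becomes '?'
byte[] b = "Chäteau".getBytes(StandardCharsets.US_ASCII); // unmappable 'ä' also becomes '?'
System.out.println(new String(a, StandardCharsets.US_ASCII)); // Ch?teau
System.out.println(Arrays.equals(a, b)); // true -- the information that distinguished them is gone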

Your code is receiving a UTF-8 encoded byte array, correctly converting it to a Java String, but is then converting that string to an ASCII encoded byte array. ASCII does not support the Ã and ¢ characters, which is why they are being converted to ?. Once that conversion has been done, there is no going back. ASCII is a subset of UTF-8, and ? in ASCII is also ? in UTF-8.
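
To see why re-decoding the damaged bytes cannot help (a small sketch of my own, using the string from the question): the replacement byte 0x3F means '?' in US-ASCII, UTF-8 and ISO-8859-1 alike, so every decoder hands back the same '?'.

import java.nio.charset.StandardCharsets;

byte[] damaged = "Ch??teau".getBytes(StandardCharsets.US_ASCII); // what the app receives
System.out.println(new String(damaged, StandardCharsets.UTF_8));      // Ch??teau
System.out.println(new String(damaged, StandardCharsets.ISO_8859_1)); // Ch??teau
// 0x3F carries no trace of the characters it replaced, so no charset can restore them.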

The solution is to stop converting to ASCII to begin with. You should convert back to UTF-8 instead:

StringBuilder masterBuffer = new StringBuilder();
byte[] rawDataBuffer = ...; // Read from InputStream
String rawString = new String(rawDataBuffer, "UTF-8");
masterBuffer.append(rawString);
onMessageReceived(masterBuffer.toString().getBytes("UTF-8"));

At least that way, for true ASCII characters, the receiver will never know the difference (since ASCII is a subset of UTF-8), and non-ASCII characters will no longer be lost. The receiver just needs to know to expect UTF-8 rather than ASCII. And your code will be more portable, since you will no longer be dependent on a platform-specific default charset (not all platforms use ASCII by default).
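
If you are on Java 7 or later, you could also write the same thing with the StandardCharsets constants instead of charset-name strings; this is just an alternative spelling, but it cannot throw UnsupportedEncodingException and makes the intended encoding explicit:

import java.nio.charset.StandardCharsets;

byte[] rawDataBuffer = ...; // Read from InputStream
String rawString = new String(rawDataBuffer, StandardCharsets.UTF_8);
onMessageReceived(rawString.getBytes(StandardCharsets.UTF_8));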

Of course, in your example, your StringBuilder is redundant since you are not adding anything else to it, so you could just remove it:

byte[] rawDataBuffer = ...; // Read from InputStream
String rawString = new String(rawDataBuffer, "UTF-8");
onMessageReceived(rawString.getBytes("UTF-8"));

And then the String becomes redundant, too:

byte[] rawDataBuffer = ...; // Read from InputStream
onMessageReceived(rawDataBuffer);

If onMessageReceived() expects bytes as input, why incur the overhead of converting bytes to a String and back to bytes again?
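
For completeness, a sketch of passing the stream's bytes straight through with no charset in the picture at all (readAll and inputStream are placeholders for however your code actually reads the stream):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Placeholder helper: read the whole stream into a byte array.
static byte[] readAll(InputStream in) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] chunk = new byte[4096];
    int n;
    while ((n = in.read(chunk)) != -1) {
        out.write(chunk, 0, n);
    }
    return out.toByteArray();
}

byte[] rawDataBuffer = readAll(inputStream); // no String, no charset, no loss
onMessageReceived(rawDataBuffer);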

1 Comment

"The solution is to stop converting to ASCII to begin with" => It happens in underlying third-party library on which I don't have control. That's the problem. That's why I wanted to get UTF-8 string out of US-ASCII encoded byte array(@app level). Looks like, that's not possible.
