
I have a UTF-8 encoded string "Château", and it gets converted to US-ASCII format as "Ch??teau" (in the underlying library of my app).

Now I want to get the original string "Château" back from the US-ASCII converted string "Ch??teau", but I am not able to do that using the code below.

StringBuilder masterBuffer = new StringBuilder();
byte[] rawDataBuffer = ...; // Read from InputStream; say here it is "Château"
String rawString = new String(rawDataBuffer, "UTF-8");
masterBuffer.append(rawString);
onMessageReceived(masterBuffer.toString().getBytes()); // getBytes() uses the platform's default charset, which here is US-ASCII

My application receives the US-ASCII encoded byte array. On the application side, even if I try to get a UTF-8 string out of it, it is of no use; the conversion attempt still gives "Ch??teau".

String asciiString = "Ch??teau";
String originalString = new String(asciiString.getBytes("UTF-8"), "UTF-8");
System.out.println("orinalString: " + originalString);

The value of originalString is still "Ch??teau".

Is this the right way to do this?

Thanks,

  • OK, so, first things first: there. is. no. such. thing. as. a. UTF-8. encoded. String. Java Strings store text data regardless of the character coding, and this means that your problem lies beyond the code you posted. Please paste the full code. Commented Dec 2, 2015 at 14:35
  • @fge Java's String (like C#'s, JavaScript's, …) is a counted sequence of UTF-16 code units, one or two of which encode a Unicode codepoint. (And there are apparently characters in the computer world that aren't in the Unicode character set.) Commented Dec 3, 2015 at 0:14
  • 1
    @TomBlodget: in the upcoming Java 9 next year, strings will not always store UTF-16 internally anymore. They will use ISO-8859-1 instead to compact memory usage when possible. Of course, public interfaces would still expect char and String methods to act on UTF-16 data, so there would have to be additional conversions performed at runtime to facilitate ISO-8859-1 based strings in UTF-16 based code logic. Commented Dec 3, 2015 at 2:20
  • @TomBlodget but that's an implementation detail. For all it's worth, elements of a String could be carrier pigeons; a String has no encoding. Commented Dec 3, 2015 at 5:56
  • @fge Not quite. It would be ideal if code that uses String were written that way, but as soon as you take the length, use an index, or do any other char-related operation, you have to deal with how many UTF-16 code units are in each individual Unicode codepoint. Commented Dec 3, 2015 at 12:29
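
A concrete illustration of that last point (this snippet is mine, not from the comment thread): a codepoint outside the Basic Multilingual Plane takes two UTF-16 code units in a Java String, so length() and codePointCount() disagree.

String clef = "\uD834\uDD1E"; // U+1D11E MUSICAL SYMBOL G CLEF, stored as a surrogate pair
System.out.println(clef.length());                         // 2 -- UTF-16 code units
System.out.println(clef.codePointCount(0, clef.length())); // 1 -- Unicode codepoint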

2 Answers

You can't. You lost information by converting to US-ASCII. You can't get back what was lost.
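
A quick sketch of what that loss looks like (my own illustration, not part of the original answer): the US-ASCII encoder replaces every character it cannot represent with '?' (byte 0x3F), so two different inputs can end up as exactly the same bytes. The two '?' in the question suggest the library saw the two UTF-8 bytes of 'â' as two separate characters before encoding, but the outcome is the same either way.

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

byte[] a = "Château".getBytes(StandardCharsets.US_ASCII); // unmappable 'â' becomes '?'
byte[] b = "Chäteau".getBytes(StandardCharsets.US_ASCII); // unmappable 'ä' also becomes '?'
System.out.println(new String(a, StandardCharsets.US_ASCII)); // Ch?teau
System.out.println(Arrays.equals(a, b)); // true -- the information that distinguished them is gone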

Your code is receiving a UTF-8 encoded byte array, correctly converting it to a Java String, but is then converting that string to an ASCII encoded byte array. ASCII does not support the Ã and ¢ characters, which is why they are being converted to ?. Once that conversion has been done, there is no going back. ASCII is a subset of UTF-8, and ? in ASCII is also ? in UTF-8.
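
To see why re-decoding the damaged bytes cannot help (a small sketch of my own, using the string from the question): the replacement byte 0x3F means '?' in US-ASCII, UTF-8 and ISO-8859-1 alike, so every decoder hands back the same '?'.

import java.nio.charset.StandardCharsets;

byte[] damaged = "Ch??teau".getBytes(StandardCharsets.US_ASCII); // what the app receives
System.out.println(new String(damaged, StandardCharsets.UTF_8));      // Ch??teau
System.out.println(new String(damaged, StandardCharsets.ISO_8859_1)); // Ch??teau
// 0x3F carries no trace of the characters it replaced, so no charset can restore them.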

The solution is to stop converting to ASCII to begin with. You should convert back to UTF-8 instead:

StringBuilder masterBuffer = new StringBuilder();
byte[] rawDataBuffer = ...; // Read from InputStream
String rawString = new String(rawDataBuffer, "UTF-8");
masterBuffer.append(rawString);
onMessageReceived(masterBuffer.toString().getBytes("UTF-8"));

At least that way, for true ASCII characters, the receiver will never know the difference (since ASCII is a subset of UTF-8), and non-ASCII characters will no longer be lost. The receiver just needs to know to expect UTF-8 rather than ASCII. And your code will be more portable, since you will no longer be dependent on a platform-specific default charset (not all platforms use ASCII by default).
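
If you are on Java 7 or later, you could also write the same thing with the StandardCharsets constants instead of charset-name strings; this is just an alternative spelling, but it cannot throw UnsupportedEncodingException and makes the intended encoding explicit:

import java.nio.charset.StandardCharsets;

byte[] rawDataBuffer = ...; // Read from InputStream
String rawString = new String(rawDataBuffer, StandardCharsets.UTF_8);
onMessageReceived(rawString.getBytes(StandardCharsets.UTF_8));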

Of course, in your example, your StringBuilder is redundant since you are not adding anything else to it, so you could just remove it:

byte[] rawDataBuffer = ...; // Read from InputStream
String rawString = new String(rawDataBuffer, "UTF-8");
onMessageReceived(rawString.getBytes("UTF-8"));

And then the String becomes redundant, too:

byte[] rawDataBuffer = ...; // Read from InputStream
onMessageReceived(rawDataBuffer);

If onMessageReceived() expects bytes as input, why incur the overhead of converting bytes to a String and back to bytes again?
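
For completeness, a sketch of passing the stream's bytes straight through with no charset in the picture at all (readAll and inputStream are placeholders for however your code actually reads the stream):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Placeholder helper: read the whole stream into a byte array.
static byte[] readAll(InputStream in) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] chunk = new byte[4096];
    int n;
    while ((n = in.read(chunk)) != -1) {
        out.write(chunk, 0, n);
    }
    return out.toByteArray();
}

byte[] rawDataBuffer = readAll(inputStream); // no String, no charset, no loss
onMessageReceived(rawDataBuffer);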

1 Comment

"The solution is to stop converting to ASCII to begin with" => It happens in underlying third-party library on which I don't have control. That's the problem. That's why I wanted to get UTF-8 string out of US-ASCII encoded byte array(@app level). Looks like, that's not possible.
