Conversion between character encodings in Java

Question

I cannot find out how to do the conversion below:

    String s = "HÃ¤r har du!  â\u0080\u0093 Hur vÃ¤l kan du snacka?";
    String t = convert(s);
    // t should be "Här har du! â Hur väl kan du snacka?"

Does anybody know how to translate s into t in Java?

Use UTF-8. Seriously, why does anyone not use Unicode these days?
– DaoWen, Commented Dec 4, 2014 at 14:08
This is a strange one. The Ã¤ characters are obviously UTF-8 bytes coerced to characters, but the â is correct, and I have no idea what \u0080\u0093 are supposed to be, as they are not a valid UTF-8 byte sequence, and they wouldn't even make sense in the windows-1252 charset. In summary, this string doesn't seem to be derived from any one charset.
– VGR, Commented Dec 4, 2014 at 14:32
After further research, it seems to be intended to be an en dash; see someone else's similar problem.
– errantlinguist, Commented Dec 4, 2014 at 14:46
This basically looks like an already corrupted string value. Your problem lies before you got the String s. While you may be able to patch things together after the fact, fixing the actual cause is the correct solution. Where are you getting this string from in the first place?
– jtahlborn, Commented Dec 4, 2014 at 15:34
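
The comments above disagree about what â\u0080\u0093 stands for. One way to check is to look at the bytes those three characters become when written out as ISO-8859-1, and then decode those bytes as UTF-8. The following is only a sketch; the class and method names (DashBytes, latin1Hex) are made up for illustration, and the characters are written as \u escapes so the result does not depend on the source file encoding.

```java
import java.nio.charset.StandardCharsets;

public class DashBytes {
    // Hex dump of the bytes produced by encoding text as ISO-8859-1.
    static String latin1Hex(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.ISO_8859_1)) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // The three suspicious characters from s: U+00E2, U+0080, U+0093.
        String suspicious = "\u00E2\u0080\u0093";
        System.out.println(latin1Hex(suspicious)); // E2 80 93

        // Decoding those three bytes as UTF-8 yields a single character.
        String decoded = new String(
                suspicious.getBytes(StandardCharsets.ISO_8859_1),
                StandardCharsets.UTF_8);
        System.out.println(decoded.length());                       // 1
        System.out.println(Integer.toHexString(decoded.charAt(0))); // 2013
    }
}
```

The bytes E2 80 93 are the UTF-8 encoding of U+2013 (EN DASH), so these three characters fit the same pattern as Ã¤: UTF-8 bytes that were decoded one byte per character.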

Semih Eker · Accepted Answer · 2014-12-04 19:47:21Z

3

Try something like this:

     String s = "HÃ¤r har du!  â\u0080\u0093 Hur vÃ¤l kan du snacka?";        
     byte[] bytes = s.getBytes("ISO-8859-1");
     String str  = new String(bytes, "UTF-8");

Output:

    Här har du!  – Hur väl kan du snacka?

That output comes from the following program:

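    public class Main {
        public static void main(String[] args) throws java.lang.Exception {
            String s = "HÃ¤r har du!  â\u0080\u0093 Hur vÃ¤l kan du snacka?";
            // Recover the raw bytes: ISO-8859-1 maps each char U+0000..U+00FF to one byte.
            byte[] bytes = s.getBytes("ISO-8859-1");
            // Decode those bytes as the UTF-8 they originally were.
            String str = new String(bytes, "UTF-8");
            System.out.println(str);
        }
    }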

edited Dec 4, 2014 at 19:47

answered Dec 4, 2014 at 14:13


3 Comments

VGR Over a year ago

Your first two lines of code convert the string to bytes using UTF-8 and then back to a String using UTF-8, which means they are useless and can be removed. Your final line, new String(latin1), will use your platform's default charset, which is a very bad idea. It happened to work for you, but it's hardly reliable.

VGR Over a year ago

That looks correct, although it's better to use StandardCharsets.ISO_8859_1 and StandardCharsets.UTF_8 instead of String literals, both because Strings are subject to typos and because using standard charsets removes the need to catch an exception.

Hurve Over a year ago

Thx very much! This answered my question. The code is executed on an app server. It works perfectly, but I'll see if I can set the default encoding in the app server configuration, because of your warning.
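
VGR's suggestion to use StandardCharsets could look roughly like the sketch below. The class and method names (EncodingRepair, repair) are hypothetical, and the garbled text is written with \u escapes so the example does not depend on how the source file itself is encoded.

```java
import java.nio.charset.StandardCharsets;

public class EncodingRepair {
    // Re-encode as ISO-8859-1 to get the original bytes back,
    // then decode those bytes as UTF-8.
    static String repair(String garbled) {
        byte[] bytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Same text as s in the question, with the non-ASCII chars escaped.
        String s = "H\u00C3\u00A4r har du!  \u00E2\u0080\u0093 Hur v\u00C3\u00A4l kan du snacka?";
        System.out.println(repair(s));
    }
}
```

The Charset overloads of getBytes and new String declare no checked exception, so no throws clause is needed. Both charsets are named explicitly, so the conversion itself does not depend on the platform default charset; only the final println does, because it encodes the result for the console.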

jtahlborn · Accepted Answer · 2014-12-04 15:41:13Z

1

As I already mentioned in my comment, it looks like your String s is already corrupted. The correct solution is to fix wherever you got s from in the first place. It seems like you are interpreting what is really a UTF-8 encoded string using some single-byte encoding ("ISO-8859-1" seems to work on your test string).
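
As a rough illustration of fixing it at the source: the sketch below decodes the same UTF-8 bytes twice, once with the wrong charset and once with the right one. The in-memory byte array is only a stand-in for the real source (file, socket, request body), which the question does not state, and the names ReadWithCharset and firstLine are hypothetical.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReadWithCharset {
    // Reads the first line of the data, decoding it with the given charset.
    static String firstLine(byte[] data, Charset charset) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(data), charset))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for the real input: the UTF-8 bytes of "Här" (48 C3 A4 72).
        byte[] utf8 = "H\u00E4r".getBytes(StandardCharsets.UTF_8);

        String wrong = firstLine(utf8, StandardCharsets.ISO_8859_1);
        String right = firstLine(utf8, StandardCharsets.UTF_8);

        System.out.println(wrong.equals("H\u00C3\u00A4r")); // true: the garbled form
        System.out.println(right.equals("H\u00E4r"));       // true: the intended text
    }
}
```

Passing the charset explicitly wherever bytes become characters means the string is never corrupted, so no repair step is needed afterwards.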

Provided you haven't already lost data in the original string corruption, you can somewhat patch your current string using:

    String s = "HÃ¤r har du!  â\u0080\u0093 Hur vÃ¤l kan du snacka?";        
    byte[] b = s.getBytes("ISO-8859-1");
    String t = new String(b, "UTF-8");
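
To guard against the case where data was already lost, one option is to make the patch strict, so that it refuses input that cannot be mis-decoded UTF-8 rather than silently producing replacement characters. This is a sketch under that assumption; StrictRepair and tryRepair are hypothetical names, not part of the original answer.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

public class StrictRepair {
    // Returns the repaired text, or empty if the input cannot be a
    // UTF-8 string that was mis-decoded as ISO-8859-1.
    static Optional<String> tryRepair(String garbled) {
        // A char above U+00FF cannot have come out of an ISO-8859-1 decode.
        if (!StandardCharsets.ISO_8859_1.newEncoder().canEncode(garbled)) {
            return Optional.empty();
        }
        byte[] bytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        try {
            // Strict decoder: malformed UTF-8 throws instead of yielding U+FFFD.
            String repaired = StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
            return Optional.of(repaired);
        } catch (CharacterCodingException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryRepair("H\u00C3\u00A4r").isPresent()); // true
        System.out.println(tryRepair("H\u00E4r").isPresent());       // false
        System.out.println(tryRepair("\u2013").isPresent());         // false
    }
}
```

A string that is already correct and contains non-ASCII characters, such as "Här", is rejected because its ISO-8859-1 bytes are not valid UTF-8, so the caller can keep the original value in that case. Pure ASCII input passes through unchanged.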

answered Dec 4, 2014 at 15:41

