Unicode to String in Java, but tricky

Question

I was fetching data from a website using its API, which returned the data in JSON format. The issue arose when there were umlaut characters in the JSON: the API returned them as Unicode escapes, e.g. Münich would be Mu\u0308nich.

When I passed this JSON string to the constructor of org.codehaus.jettison.json.JSONObject, Mu\u0308nich was converted to Munich (where the n has an umlaut), which is wrong.

I realized this very late (after fetching the entire data set). Now I use the method shown below to convert it back to the escaped Unicode form, i.e. I pass Munich (n has an umlaut) to the method and it returns Mu\u0308nich.

I want to somehow convert this Mu\u0308nich to Münich. Any ideas?

Please note that the conversion is needed only for u\u0308 to ü, o\u0308 to ö, a\u0308 to ä, and so on.

Method used to convert back:

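import java.util.Formatter;

// Escapes every non-ASCII char as a backslash-u sequence with four hex digits.
public static String escapeUnicode(String input) {
    StringBuilder b = new StringBuilder(input.length());
    Formatter f = new Formatter(b); // writes straight into b
    for (char c : input.toCharArray()) {
        if (c < 128) {
            b.append(c); // plain ASCII passes through unchanged
        } else {
            f.format("\\u%04x", (int) c); // anything else is written as an escape
        }
    }
    return b.toString();
}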

I think the umlaut is dependent on how the program displays it (the umlaut is a separate character). Some programs will fail to display the character correctly.
– nhahtdh, Commented Feb 12, 2013 at 14:34
@nhahtdh - The problem is that in some cases I already had the "correct" form, i.e. Münich. Now when I try to match it against this n-umlaut data, it does not work.
– JHS, Commented Feb 12, 2013 at 14:41

Neet · Accepted Answer · 2013-02-12 14:43:08Z

3

These are called combining diacritics (here U+0308, COMBINING DIAERESIS), and you can use java.text.Normalizer to combine a base letter and its diacritic into a single Unicode character.

Use the normalize method with Normalizer.Form.NFKC. This first decomposes the full string into base letters and combining diacritics, and then recomposes them to return 'real' precomposed Unicode umlauts.

So: 'München' stays 'München', and 'Mu\u0308nchen' becomes 'München'.

You will then have the string in a single format that no longer uses separate combining diacritics and is easily portable and displayable.

If you work with texts from different platforms, some normalization is crucial or you will end up with the problems you described.
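As a minimal sketch of the above (the class and method names UmlautNormalizer and compose are made up for illustration; only java.text.Normalizer is the real API), normalizing a decomposed string makes it compare equal to the precomposed one. Note that NFKC also applies compatibility mappings to other characters; if you only want base letters and diacritics combined, Normalizer.Form.NFC should be enough.

```java
import java.text.Normalizer;

// Hypothetical helper class; the name is illustrative and not from the question.
class UmlautNormalizer {

    // Combines a base letter and a following combining mark
    // (e.g. 'u' + U+0308) into one precomposed code point where one exists.
    static String compose(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String decomposed = "Mu\u0308nchen";  // 'u' followed by COMBINING DIAERESIS
        String precomposed = "M\u00fcnchen";  // single code point U+00FC

        System.out.println(decomposed.equals(precomposed));          // false
        System.out.println(compose(decomposed).equals(precomposed)); // true
        System.out.println(compose(decomposed).length());            // 7
    }
}
```

If you normalize both the data you already fetched and the data you match it against, the two forms compare equal and you do not need to re-fetch anything.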

edited Feb 12, 2013 at 14:43

answered Feb 12, 2013 at 14:37


4 Comments

nhahtdh Over a year ago

He doesn't want to get rid of them. He wants it to be displayed correctly.

Neet Over a year ago

He said: I want to somehow convert this Mu\u0308nich to Münich. Any ideas? and the above method does exactly this.

nhahtdh Over a year ago

Sorry for jumping the gun, but I think you should say "combine them into 1 character" rather than "get rid of".

Neet Over a year ago

Edited, you're right. Combine sounds better than 'get rid of'.
