
I was fetching data from a website through its API, which returned the data in JSON format. The issue arose when the JSON contained umlaut characters: they came back as Unicode escapes, e.g. Münich would be Mu\u0308nich.

When I passed this JSON string to the constructor of org.codehaus.jettison.json.JSONObject, Mu\u0308nich was converted to Munich with the umlaut displayed on the n instead of the u. Wrong.

I realized this very late (after fetching all the data). Now I use the following method to convert it back to the escaped Unicode form, i.e. I pass Munich (n has an umlaut) to the method and it returns Mu\u0308nich.

I want to somehow convert this Mu\u0308nich to Münich. Any ideas?

Please note that the conversion is needed only for u\u0308 to ü, o\u0308 to ö, a\u0308 to ä, and so on.

Method used to convert back:

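import java.util.Formatter;

// Escapes every non-ASCII char as a backslash-u sequence with four hex digits.
public static String escapeUnicode(String input) {
    StringBuilder b = new StringBuilder(input.length());
    Formatter f = new Formatter(b); // writes directly into b
    for (char c : input.toCharArray()) {
        if (c < 128) {
            b.append(c); // plain ASCII is copied unchanged
        } else {
            f.format("\\u%04x", (int) c);
        }
    }
    return b.toString();
}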
  • How can chen be changed to nich, in Munich and Muchen? Commented Feb 12, 2013 at 14:29
  • @nhahtdh Typo. Sorry. Corrected. Commented Feb 12, 2013 at 14:30
  • I think the umlaut is dependent on how the program displays it (the umlaut is a separate character). Some programs will fail to display the character correctly. Commented Feb 12, 2013 at 14:34
  • @nhahtdh - The problem is I already had the "correct" form, i.e. Münich, in some cases. Now when I try to match it with this n-umlaut data, it does not work. Commented Feb 12, 2013 at 14:41

1 Answer

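These are called diacritics (here stored as separate combining marks), and you can use java.text.Normalizer to combine them with their base letters into single Unicode characters.

Use the normalize method with Normalizer.Form.NFKC. This first decomposes the full string into base letters and combining marks (a compatibility decomposition) and then composes them again, returning 'real' precomposed Unicode umlauts. Form NFC performs the same composition without the compatibility mappings.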

So: 'München' stays 'München' and 'Mu\u0308nchen' will become 'München'
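
As a minimal sketch of that call (the class name ComposeUmlauts and the helper compose are made up for illustration; only java.text.Normalizer comes from the JDK), assuming the input already contains the real combining character U+0308 and not a literal backslash-u escape sequence:

```java
import java.text.Normalizer;

public class ComposeUmlauts {
    // Hypothetical helper: merges a base letter and a following combining
    // mark (e.g. 'u' + U+0308) into one precomposed character where one exists.
    static String compose(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String decomposed = "Mu\u0308nchen"; // 'u' followed by U+0308, 8 chars in total
        String composed = compose(decomposed);

        System.out.println(composed.length());                // prints 7
        System.out.println(composed.equals("M\u00fcnchen"));  // prints true
        System.out.println(compose(composed).equals(composed)); // prints true: already composed text is unchanged
    }
}
```

If the data still holds the escaped text produced by escapeUnicode, it would have to be turned back into real characters before normalizing.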

You will then have the string in a single form that no longer uses separate combining marks and is easy to port and display.

If you work with texts from different platforms, some normalization is crucial or you will end up with the problems you described.
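
The matching problem mentioned in the comments could be handled along these lines. This is only a sketch: NormalizedCompare and sameText are hypothetical names, and it assumes both inputs contain real Unicode characters. Both sides are normalized to the same form before comparing, so it no longer matters which platform produced which representation:

```java
import java.text.Normalizer;

public class NormalizedCompare {
    // Hypothetical helper: compares two strings after bringing both
    // into the same normalization form (NFKC, as suggested above).
    static boolean sameText(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFKC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFKC);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        String precomposed = "M\u00fcnchen"; // single character U+00FC
        String decomposed = "Mu\u0308nchen"; // 'u' followed by U+0308

        System.out.println(precomposed.equals(decomposed));   // prints false
        System.out.println(sameText(precomposed, decomposed)); // prints true
    }
}
```

Normalizing once when the data is stored, rather than on every comparison, would likely be cheaper for large data sets.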


4 Comments

He doesn't want to get rid of them. He wants it to be displayed correctly.
He said: I want to somehow convert this Mu\u0308nich to Münich. Any ideas? and the above method does exactly this.
Sorry for jumping the gun, but I think you should say "combine them into 1 character" rather than "get rid of".
Edited, you're right. Combine sounds better than 'get rid of'.
