How to decode the Unicode encoding in Java?

Question

Our site has a search feature: we build the query, send it in a request, and the vendor returns the response as JSON. The vendor crawls our site, captures the data, and sends it back in the response. In our design we convert the JSON into Java objects using GSON. We use UTF-8 as the charset in the meta tag.

Depending on the request, the response sometimes contains Unicode escapes for special characters. The browser renders these escaped characters in a strange way. How should I decode this Unicode encoding?

For example, for the special character 'ndash' I see that the response encodes it as '\u2013'.

Frankly, I don't know the difference. In the response I have seen that for the special character 'ndash' they send \u2013. I assumed it was Unicode encoding.
– pushya, Commented Feb 23, 2012 at 14:58
@pushya, it isn't. If you write "\u2013" in Java, you get the UTF-16 encoded form of that Unicode character.
– Johan Sjöberg, Commented Feb 23, 2012 at 15:05
@JohanSjöberg How should I make the browser understand this '\u2013'?
– pushya, Commented Feb 23, 2012 at 15:13
@pushya, the browser won't understand that. Try printing the string at the point where that statement is written and see what it produces.
– Johan Sjöberg, Commented Feb 23, 2012 at 15:16
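
A minimal sketch of the case discussed in the comments above, assuming the response text really contains the six literal characters backslash, u, 2, 0, 1, 3 rather than the en dash itself. A JSON parser such as GSON normally decodes these escapes while parsing string values, so a helper like this should only matter for text that bypassed the parser or was escaped twice. The class and method names are hypothetical, and only the basic four-hex-digit form is handled (an escaped backslash in front of the u is not treated specially).

```java
class UnicodeUnescape {
    // Replace each literal backslash-u sequence followed by four hex digits
    // with the single UTF-16 code unit it names. Other text is copied as-is.
    static String unescape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            if (isEscapeAt(text, i)) {
                int codeUnit = Integer.parseInt(text.substring(i + 2, i + 6), 16);
                out.append((char) codeUnit);
                i += 6;
            } else {
                out.append(text.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    // True if a backslash, a 'u', and four hex digits start at index i.
    private static boolean isEscapeAt(String text, int i) {
        if (i + 6 > text.length() || text.charAt(i) != '\\' || text.charAt(i + 1) != 'u') {
            return false;
        }
        for (int k = i + 2; k < i + 6; k++) {
            if (Character.digit(text.charAt(k), 16) < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // "\\u2013" in Java source is the six-character literal text, not the en dash.
        String raw = "2010\\u20132012";
        System.out.println(raw.length());           // 14
        System.out.println(unescape(raw).length()); // 9
    }
}
```

After this step the string holds the real en dash character; whether the browser shows it correctly still depends on the encoding the page is served with.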

Johan Sjöberg · Accepted Answer · 2012-02-23 14:57:44Z

5

To clarify the difference between Unicode and a character encoding:

Unicode

is an abstract standard that aims to identify every character by a number, its code point (currently > 110 000).

Character encoding

defines how a character can be represented by a sequence of bytes
one such encoding is UTF-8, which uses 1-4 bytes to represent a Unicode character
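
To make the 1-4 byte range concrete, here is a small sketch that counts the UTF-8 bytes of a few characters. The class and method names are made up for illustration; `String.getBytes(Charset)` and `StandardCharsets.UTF_8` are standard JDK API.

```java
import java.nio.charset.StandardCharsets;

class Utf8Lengths {
    // Number of bytes UTF-8 needs to encode the given text.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));            // U+0041, prints 1
        System.out.println(utf8Length("\u00e9"));       // U+00E9, prints 2
        System.out.println(utf8Length("\u2013"));       // U+2013 en dash, prints 3
        System.out.println(utf8Length("\uD83D\uDE00")); // U+1F600, prints 4
    }
}
```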

A Java String is always UTF-16. Hence, when you construct a String from raw bytes, you can use the following String constructor:

new String(byte[], encoding)

The second argument should be the encoding the client used when sending the characters. If you don't explicitly specify an encoding, you get the default system encoding, which you can examine using Charset.defaultCharset().
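
As a rough illustration of why the second argument matters, the sketch below decodes the same three bytes (the UTF-8 form of the en dash) once with the correct charset and once with a wrong one. The class and method names are hypothetical; the sketch uses the `String(byte[], Charset)` overload so that no checked exception is involved.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeBytes {
    // Decode raw bytes into a String using an explicitly chosen charset.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // UTF-8 encoding of U+2013 (en dash).
        byte[] enDashUtf8 = {(byte) 0xE2, (byte) 0x80, (byte) 0x93};

        String right = decode(enDashUtf8, StandardCharsets.UTF_8);
        String wrong = decode(enDashUtf8, StandardCharsets.ISO_8859_1);

        System.out.println(right.length()); // 1: one en dash
        System.out.println(wrong.length()); // 3: each byte became its own character
        System.out.println("default: " + Charset.defaultCharset());
    }
}
```

The second result is the kind of "weird characters" output that appears when bytes are decoded with an encoding other than the one they were written in.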

You can manually set the default encoding as an argument when starting the JVM

-Dfile.encoding="utf-8"

Although rarely needed, you can also employ CharsetDecoder/CharsetEncoder.
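
One case where a CharsetDecoder can be useful is strict decoding: `new String(bytes, charset)` silently substitutes a replacement character for invalid input, while a decoder configured with `CodingErrorAction.REPORT` throws instead. A minimal sketch, with a hypothetical class and method name:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictDecode {
    // Returns the decoded text, or null if the bytes are not valid UTF-8.
    static String decodeStrictUtf8(byte[] raw) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return decoder.decode(ByteBuffer.wrap(raw)).toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        byte[] valid = {(byte) 0xE2, (byte) 0x80, (byte) 0x93}; // en dash in UTF-8
        byte[] invalid = {(byte) 0xFF};                         // never valid in UTF-8

        System.out.println("\u2013".equals(decodeStrictUtf8(valid))); // true
        System.out.println(decodeStrictUtf8(invalid) == null);        // true
    }
}
```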

edited Feb 23, 2012 at 14:57

answered Feb 23, 2012 at 14:51

Johan Sjöberg


5 Comments

pushya Over a year ago

I did as suggested, and it works fine in the Eclipse console, but the browser still renders it as weird characters. I see that '\u2013' is converted to '-' by the piece of code above, but the HTML still does not understand it. How should I encode this back to 'ndash'?

Johan Sjöberg Over a year ago

@pushya, the browser does not know Unicode as such. You need to send a String in a concrete encoding (e.g., UTF-8). You may then also need to HTML-encode it for the browser to display it correctly.

pushya Over a year ago

How should I encode this in HTML? Do I need to use a regular expression?

pushya Over a year ago

Can this HTML encoding be handled in Java? I see it can be done in JavaScript.

Johan Sjöberg Over a year ago

@pushya, you can use either. In Java you could use StringEscapeUtils. In JavaScript you could probably use the escape/unescape methods.
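
For readers without a library at hand, here is a hand-rolled sketch of the HTML-encoding idea. It is not StringEscapeUtils and does not reproduce its behavior: it only rewrites every non-ASCII code point as a numeric character reference, so the en dash becomes `&#8211;`, and it leaves markup characters such as `<` and `&` untouched. The class and method names are hypothetical.

```java
class HtmlNumericEscape {
    // Replace every non-ASCII code point with a decimal numeric character
    // reference of the form &#N; and copy ASCII characters unchanged.
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            int codePoint = text.codePointAt(i);
            if (codePoint < 0x80) {
                out.append((char) codePoint);
            } else {
                out.append("&#").append(codePoint).append(';');
            }
            // Advance by 2 for characters stored as a surrogate pair, else 1.
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeNonAscii("2010\u20132012")); // 2010&#8211;2012
    }
}
```

Because the output is pure ASCII, the browser can display it regardless of the charset the page declares.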
