I have a search feature on my site. We build the query and send it in the request, and the response comes back from the vendor as JSON. The vendor crawls our site, captures the data, and sends it back in the response. In our design we convert the JSON into Java objects using GSON. We declare UTF-8 as the charset in the page's meta tag.

Depending on the request, the response sometimes contains Unicode escapes for special characters. The browser renders these escaped special characters strangely. How should I decode these Unicode escapes?

For example, the special character 'ndash' appears in the response encoded as '\u2013'.

  • Are you aware of the difference between Unicode and UTF-8? Commented Feb 23, 2012 at 14:47
  • Frankly, I don't know the difference. In the response I have seen that for the special character 'ndash' they send \u2013. I assumed it was a Unicode encoding. Commented Feb 23, 2012 at 14:58
  • @pushya, it isn't. If you write "\u2013" in Java, you get the UTF-16 encoded form of that Unicode character. Commented Feb 23, 2012 at 15:05
  • @JohanSjöberg How should I make the browser understand this '\u2013'? Commented Feb 23, 2012 at 15:13
  • @pushya, the browser won't understand that. Try printing the string at the point where that statement is written and see what it produces. Commented Feb 23, 2012 at 15:16

1 Answer

To clarify the difference between Unicode and a character encoding:

Unicode

  • is a standard that assigns a unique number (a code point, such as U+2013) to every character (currently > 110 000).

Character encoding

  • defines how a character is represented by a sequence of bytes
  • one such encoding is UTF-8, which uses 1-4 bytes to represent a Unicode character
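To make the distinction concrete, here is a small sketch that prints the bytes of the en dash (U+2013) in two different encodings. The class name `EnDashBytes` and the `hex` helper are made up for this illustration; only JDK classes are used.

```java
import java.nio.charset.StandardCharsets;

final class EnDashBytes {
    // Formats bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String enDash = "\u2013"; // a single Unicode code point, U+2013

        // Same character, two encodings, two different byte sequences.
        System.out.println(hex(enDash.getBytes(StandardCharsets.UTF_8)));    // E2 80 93
        System.out.println(hex(enDash.getBytes(StandardCharsets.UTF_16BE))); // 20 13
    }
}
```

The code point stays the same in both cases; only the byte representation changes with the encoding.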

A Java String always represents its characters as UTF-16. Hence, when you construct a String from raw bytes, you can use the following String constructor:

new String(byte[], encoding)

The second argument should be the encoding the bytes are in when the client sends them. If you don't explicitly define an encoding, you will get the default system encoding, which you can examine using Charset.defaultCharset().
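As a rough illustration of why the second argument matters, the sketch below decodes the same three bytes once as UTF-8 and once as ISO-8859-1. It assumes the vendor's bytes are UTF-8, which the question does not confirm; the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DecodeWithCharset {
    // Decodes raw bytes with an explicit charset instead of the platform default.
    static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The UTF-8 encoding of the en dash (assumed input for this sketch).
        byte[] raw = {(byte) 0xE2, (byte) 0x80, (byte) 0x93};

        String right = decodeUtf8(raw);
        String wrong = new String(raw, StandardCharsets.ISO_8859_1);

        System.out.println(right.equals("\u2013")); // true
        System.out.println(right.length());         // 1
        System.out.println(wrong.length());         // 3, each byte became its own character

        // The charset used when none is given explicitly.
        System.out.println(Charset.defaultCharset());
    }
}
```

Decoding with the wrong charset does not fail; it silently produces different characters, which is a common source of "weird characters" in the browser.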

You can manually set the default encoding with an argument when starting the JVM:

-Dfile.encoding="utf-8"

Although rarely needed, you can also employ CharsetDecoder/CharsetEncoder.
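One case where a CharsetDecoder can be useful is strict decoding: `new String(bytes, charset)` silently replaces malformed input, whereas a decoder can be configured to report it. A minimal sketch, with a hypothetical class name and helper:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictUtf8 {
    // Decodes UTF-8 and fails loudly on malformed input instead of substituting U+FFFD.
    static String decodeStrict(byte[] raw) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return decoder.decode(ByteBuffer.wrap(raw)).toString();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("Input is not valid UTF-8", e);
        }
    }

    public static void main(String[] args) {
        byte[] valid = {(byte) 0xE2, (byte) 0x80, (byte) 0x93};
        System.out.println(decodeStrict(valid).equals("\u2013")); // true

        byte[] truncated = {(byte) 0xE2, (byte) 0x80}; // last byte of the sequence is missing
        try {
            decodeStrict(truncated);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```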

5 Comments

I did as suggested and it works fine in the Eclipse console, but the browser still renders it as weird characters. I see that '\u2013' is converted to a dash by the code above, but the HTML still does not understand it. How should I encode this back to 'ndash'?
@pushya, Unicode as such is not known to the browser. You need to send the String in a concrete encoding (e.g., UTF-8). Then you may also need to HTML-encode it for the browser to display it correctly.
How should I encode this in HTML? Do I need to use a regular expression?
Can this HTML encoding be handled in Java? I see it can be done in JavaScript.
@pushya, you can use either. In Java you could use StringEscapeUtils. In JavaScript you could probably use the escape/unescape methods.
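For the HTML-encoding step discussed in these comments, a hand-rolled sketch using only the JDK might look like the following. It is not the StringEscapeUtils implementation, just an illustration of the idea: markup characters become named entities and every non-ASCII code point becomes a numeric character reference. The class and method names are hypothetical.

```java
final class HtmlNumericEscaper {
    // Escapes HTML markup characters and replaces non-ASCII code points
    // with decimal numeric character references such as &#8211;.
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            switch (cp) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default:
                    if (cp > 0x7F) {
                        out.append("&#").append(cp).append(';');
                    } else {
                        out.appendCodePoint(cp);
                    }
            }
            // Advance by one or two chars so surrogate pairs stay together.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("2010\u20132012")); // 2010&#8211;2012
    }
}
```

If the page is actually served and declared as UTF-8 end to end, the character can usually be written out as-is and this step is only needed for the markup characters.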