
I am trying to deal with an encoding problem (I want to transform the special characters from a string into correct UTF-8 characters...):

When I execute this simple code:

System.out.println(new String("&eacute;".getBytes("UTF-8"), "UTF-8"));

In the console I expect 'é', but I get:

&eacute;
  • What? How do you think those characters should produce é? Commented Jan 7, 2015 at 21:05
  • Are you thinking of HTML encoding? Commented Jan 7, 2015 at 21:06
  • The String instance creation expression you've used is effectively a no-op. Commented Jan 7, 2015 at 21:06
  • First you get bytes from a Unicode string &eacute; and then you convert it back to String in UTF-8 encoding... no wonder you get the output you get. Commented Jan 7, 2015 at 21:06
  • @Tyvain, maybe you should change the title of this question, since it's nothing to do with UTF-8 encoding, and it's actually about unescaping an HTML entity reference. Commented Jan 7, 2015 at 21:27

2 Answers

7

&eacute; is the HTML entity reference for the é character, not the UTF-8 encoded string. To decode it, you can use Commons Lang's org.apache.commons.lang.StringEscapeUtils:

String decodedStr = StringEscapeUtils.unescapeHtml("&eacute;");
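
To make concrete what "unescaping" does, here is a minimal sketch that uses only the standard library. The class name TinyEntityDecoder and its five-entry table are hypothetical and exist only for illustration; the real HTML entity table is far larger, so the library call above is likely the better choice in real code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; NOT part of Commons Lang.
class TinyEntityDecoder {
    // Assumption: only a handful of named entities are listed here.
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("eacute", "\u00e9");
        NAMED.put("egrave", "\u00e8");
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
    }

    // Matches named (&eacute;), decimal (&#233;) and hex (&#xE9;) references.
    private static final Pattern ENTITY = Pattern.compile(
            "&(#[xX][0-9a-fA-F]{1,6}|#[0-9]{1,7}|[A-Za-z][A-Za-z0-9]*);");

    static String unescape(String input) {
        Matcher m = ENTITY.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String body = m.group(1);
            String replacement = m.group(); // default: leave the text untouched
            if (body.charAt(0) == '#') {
                boolean hex = body.charAt(1) == 'x' || body.charAt(1) == 'X';
                int codePoint = hex
                        ? Integer.parseInt(body.substring(2), 16)
                        : Integer.parseInt(body.substring(1));
                if (Character.isValidCodePoint(codePoint)) {
                    replacement = new String(Character.toChars(codePoint));
                }
            } else if (NAMED.containsKey(body)) {
                replacement = NAMED.get(body);
            }
            // quoteReplacement keeps '$' and '\' in the replacement literal.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(unescape("&eacute;").equals("\u00e9")); // true
    }
}
```

Unknown names such as &foo; pass through unchanged in this sketch, which keeps it from silently corrupting text it does not understand.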

1 Comment

Perfect! PS: took commons-lang3-3.3.2 and used unescapeHtml4
1

Java Strings know nothing of SGML / XML / HTML5 entities. &eacute; is such an entity. It works in web browsers inside HTML because one of the DTDs, or the HTML5 spec, defines &eacute; as the letter e with acute accent by mapping it to the corresponding Unicode character é (U+00E9).

new String(someString.getBytes("UTF-8"), "UTF-8"); is a meaningless operation: it converts a String into bytes, using an encoding that can represent all meaningful characters, and then converts those bytes back into a String. It's the same thing as using someString directly, except that you get a new object.
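
A quick way to check this claim, as a small sketch (the class and method names are made up for the example): the round trip yields an equal String held in a different object, and the entity text is still there.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
class RoundTripDemo {
    // Encode to UTF-8 bytes and decode them straight back.
    static String roundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String entity = "&eacute;";
        String copy = roundTrip(entity);
        System.out.println(copy.equals(entity)); // true: same characters
        System.out.println(copy == entity);      // false: a new object
        System.out.println(copy.length());       // 8: still the eight-character entity text
    }
}
```

Using StandardCharsets.UTF_8 instead of the "UTF-8" string avoids the checked UnsupportedEncodingException, but it does not change the result.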

In order to get e with accent acute, you can do one of the following things:

  • Type it directly, like System.out.println("é");. This requires that your text editor and your Java compiler agree on the encoding of the source code file. If you're working in a project, it requires that everybody understands and agrees on a particular encoding. The recommended encoding these days is certainly UTF-8.
  • Use the Unicode character number. In the case of e acute it would be \u00e9.
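
The second option can be sketched like this. The class name, the utf8Hex helper and the explicitly UTF-8 PrintStream are assumptions made for the example; whether the character then displays correctly still depends on the console.

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
class EAcuteDemo {
    // U+00E9 written as a Unicode escape, so the source file can stay pure ASCII.
    static final String E_ACUTE = "\u00e9";

    // Returns the UTF-8 bytes of s as lowercase hex, separated by spaces.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Assumption: the console expects UTF-8; otherwise the first line may look wrong.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println(E_ACUTE);
        out.println(utf8Hex(E_ACUTE)); // c3 a9
    }
}
```

The String itself holds a single character; the two bytes c3 a9 only appear once it is encoded as UTF-8 for output.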

P.S.: SGML / XML / HTML5 entities have nothing to do with UTF-8.

2 Comments

I find this "new String(someString.getBytes("UTF-8"), "UTF-8");" in many answers referring to the same problem... For example here: stackoverflow.com/questions/12253322/…
@Tyvain That was just to show how to correctly encode and decode using the same character set. It isn't meant to be a piece of code that should be used in production software.
