Java getBytes UTF-8 encoding

Question

I am trying to solve an encoding problem: I want to convert the special characters in a string into the correct UTF-8 characters.

When I execute this simple code:

System.out.println(new String("&eacute;".getBytes("UTF-8"), "UTF-8"));

In the console I expect 'é', but I get:

&eacute;

What? How do you think those characters should produce é?
– Sotirios Delimanolis, Commented Jan 7, 2015 at 21:05
The String instance creation expression you've used is effectively a no-op.
– Sotirios Delimanolis, Commented Jan 7, 2015 at 21:06
First you get bytes from a Unicode string é and then you convert it back to a String in UTF-8 encoding... no wonder you get the output you get.
– Jagger, Commented Jan 7, 2015 at 21:06
@Tyvain, maybe you should change the title of this question, since it has nothing to do with UTF-8 encoding; it's actually about unescaping an HTML entity reference.
– Dawood ibn Kareem, Commented Jan 7, 2015 at 21:27

M A · Accepted Answer · 2015-01-07 21:09:13Z

7

&eacute; is the HTML entity reference for the é character, not a UTF-8 encoded string. To decode it, you can use org.apache.commons.lang.StringEscapeUtils from Commons Lang:

String decodedStr = StringEscapeUtils.unescapeHtml("&eacute;");
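
As a rough illustration of what unescaping does, here is a dependency-free sketch. The EntityDemo class, its unescape method, and the small entity table are hypothetical names invented for this example and cover only a handful of entities; the code assumes Java 9 or later. In real code, the Commons Lang call above is the better choice.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Commons Lang.
class EntityDemo {
    // A tiny subset of the named entities defined by HTML.
    private static final Map<String, String> NAMED = Map.of(
            "eacute", "\u00e9",
            "egrave", "\u00e8",
            "amp", "&",
            "lt", "<",
            "gt", ">",
            "quot", "\"");

    // Matches named, decimal numeric, and hexadecimal numeric references.
    private static final Pattern ENTITY =
            Pattern.compile("&(#[0-9]{1,7}|#[xX][0-9a-fA-F]{1,6}|[A-Za-z][A-Za-z0-9]*);");

    static String unescape(String input) {
        Matcher m = ENTITY.matcher(input);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String decoded = decode(m.group(1), m.group());
            m.appendReplacement(out, Matcher.quoteReplacement(decoded));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Returns the decoded text, or the original entity if it is not recognized.
    private static String decode(String body, String original) {
        if (!body.startsWith("#")) {
            return NAMED.getOrDefault(body, original);
        }
        boolean hex = body.startsWith("#x") || body.startsWith("#X");
        int codePoint = hex
                ? Integer.parseInt(body.substring(2), 16)
                : Integer.parseInt(body.substring(1));
        return Character.isValidCodePoint(codePoint)
                ? new String(Character.toChars(codePoint))
                : original;
    }

    public static void main(String[] args) {
        System.out.println(unescape("&eacute;").equals("\u00e9"));     // true
        System.out.println(unescape("caf&#233;").equals("caf\u00e9")); // true
        System.out.println(unescape("&unknown;"));                     // &unknown;
    }
}
```

Unknown entities are left untouched rather than dropped, so text that merely contains an ampersand is not silently altered.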


1 Comment

Tyvain Over a year ago

Perfect! PS: I used commons-lang3-3.3.2 and unescapeHtml4.

Christian Hujer · Answer · 2015-01-07 21:11:38Z

1

Java Strings know nothing of SGML / XML / HTML5 entities. &eacute; is such an entity. It works in web browsers inside HTML because one of the DTDs, or the HTML5 spec, defines &eacute; as the letter e with acute accent by mapping it to the corresponding Unicode character, é (U+00E9).

new String(someString.getBytes("UTF-8"), "UTF-8"); is a meaningless operation: it converts a String into bytes, using an encoding that can represent all meaningful characters, and then converts those bytes back into a String. The result is the same as using someString directly, except that you get a new object.
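
A small check of this claim, as a sketch: RoundTripDemo and roundTrip are illustrative names, and StandardCharsets.UTF_8 is used in place of the "UTF-8" string so that no checked exception has to be handled.

```java
import java.nio.charset.StandardCharsets;

class RoundTripDemo {
    // Encodes the string to UTF-8 bytes and immediately decodes it again.
    static String roundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String entity = "&eacute;";
        String copy = roundTrip(entity);
        System.out.println(copy.equals(entity)); // true: same eight characters
        System.out.println(copy == entity);      // false: a different object
    }
}
```

The entity text comes back unchanged because every character in it is plain ASCII, which UTF-8 encodes and decodes without loss.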

In order to get e with accent acute, you can do one of the following things:

Type it directly, as in System.out.println("é"); This requires that your text editor and your Java compiler agree on the encoding of the source file. If you're working on a project, it requires that everybody understands and agrees on a particular encoding. The recommended encoding these days is certainly UTF-8.
Use the Unicode escape sequence. For e acute it is \u00e9.
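
The second option can be sketched as follows. EAcuteDemo and its hex helper are illustrative names; the example prints lengths and byte values rather than the character itself, so the result does not depend on the console encoding.

```java
import java.nio.charset.StandardCharsets;

class EAcuteDemo {
    // Formats bytes as lowercase hexadecimal, two digits per byte.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The escape is resolved by the compiler, whatever the source file encoding.
        String eAcute = "\u00e9";
        System.out.println(eAcute.length());                                  // 1
        System.out.println(hex(eAcute.getBytes(StandardCharsets.UTF_8)));      // c3a9
        System.out.println(hex(eAcute.getBytes(StandardCharsets.ISO_8859_1))); // e9
    }
}
```

The single character becomes two bytes in UTF-8 and one byte in ISO-8859-1, which shows that the encoding only matters once a String is turned into bytes.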

P.S.: SGML / XML / HTML5 entities have nothing to do with UTF-8.


2 Comments

Tyvain Over a year ago

I find this "new String(someString.getBytes("UTF-8"), "UTF-8");" in many answers referring to the same problem, for example here: stackoverflow.com/questions/12253322/…

Maarten Bodewes Over a year ago

@Tyvain That was just to show how to correctly encode and decode using the same character set. It isn't meant to be a piece of code that should be used in production software.
