Convert UTF-8 Unicode string to ASCII Unicode escaped String

Question

I need to convert a Unicode string into a string in which the non-ASCII characters are written as Unicode escapes. For example, the string "漢字 Max" should be presented as "\u6F22\u5B57 Max".

What I have tried:

Different combinations of

new String(sourceString.getBytes(encoding1), encoding2)

Apache StringEscapeUtils, which also escapes ASCII characters such as the double quote:

StringEscapeUtils.escapeJava(source)

Is there an easy way to encode such a string? Ideally, only Java SE 6 or Apache Commons should be used to achieve the desired result.

Any reason not to just implement it yourself? It wouldn't take terribly long. How performance-critical is this? Do you need to worry about surrogate pairs? (Are you happy for them to be encoded as a pair of \u escape sequences?)
– Jon Skeet, Commented Jan 27, 2015 at 17:42
Using the right terminology should improve your chances of finding a solution: what you want is not "encoded in Unicode"; it uses the Java-specific Unicode escape form.
– Marko Topolnik, Commented Jan 27, 2015 at 17:42
@Jon Skeet, I just didn't want to reinvent the wheel. I see from the answer how easy it is.
– Taras Velykyy, Commented Jan 27, 2015 at 17:53
There are multiple string literal formats that use \u escapes, but they handle aspects such as surrogates and ASCII escapes differently. If you are only generating user-readable text, maybe you don't care and “any old format with \u in” is good enough, but if you're creating JSON, for example, you'll need to use the exact rules for JSON escaping.
– bobince, Commented Jan 28, 2015 at 11:31

Marko Topolnik · Accepted Answer · 2015-01-27 17:49:27Z

7

This is the kind of simple code Jon Skeet had in mind in his comment:

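final String in = "šđčćasdf";
final StringBuilder out = new StringBuilder();
for (int i = 0; i < in.length(); i++) {
  final char ch = in.charAt(i); // one UTF-16 code unit
  // ASCII (including control characters) is copied through unchanged
  if (ch <= 127) out.append(ch);
  // everything else becomes a backslash-u escape with four uppercase hex digits
  else out.append("\\u").append(String.format("%04X", (int)ch));
}
System.out.println(out.toString());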

As Jon said, surrogate pairs will be represented as a pair of \u escapes.
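
To illustrate the surrogate-pair behaviour, here is a minimal sketch that wraps the same loop in a method and feeds it a character outside the Basic Multilingual Plane. The class name `SurrogateEscapeDemo` and the method name `escapeNonAscii` are made up for this example, and U+1F600 is just an arbitrary supplementary code point.

```java
final class SurrogateEscapeDemo {

    // Same per-char loop as above: it walks UTF-16 code units, not code points.
    static String escapeNonAscii(String in) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < in.length(); i++) {
            char ch = in.charAt(i);
            if (ch <= 127) {
                out.append(ch);
            } else {
                out.append(String.format("\\u%04X", (int) ch));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+1F600 does not fit in one char, so Java stores it as a
        // high surrogate followed by a low surrogate.
        String supplementary = new String(Character.toChars(0x1F600));
        System.out.println(supplementary.length());            // 2 code units
        System.out.println(escapeNonAscii(supplementary + " Max"));

        // The example string from the question, written with escapes so the
        // source file itself stays pure ASCII.
        System.out.println(escapeNonAscii("\u6F22\u5B57 Max"));
    }
}
```

The supplementary character comes out as the two escapes `\uD83D\uDE00`, one per surrogate, and the string from the question comes out as `\u6F22\u5B57 Max`.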

answered Jan 27, 2015 at 17:49

Marko Topolnik

5 Comments

Marko Topolnik Over a year ago

That was exactly my first thought; however, it depends on the context: sometimes you actually want newlines to stay newlines.

bobince Over a year ago

Note that since backslashes are not escaped, this encoding scheme is ambiguous and does not round-trip. For example, for the input é \u00E9 the output is \u00E9 \u00E9.

Marko Topolnik Over a year ago

@bobince I would not call that ambiguous since é and \u00E9 are synonyms under this system.

bobince Over a year ago

bug:feature :: ambiguous:synonyms. If you need to know the original data from the escaped form then bug, if you don't care then feature :-)

Marko Topolnik Over a year ago

@bobince In any case, it's not this code's fault: it's what Java specifies. Round-tripping string literals is certainly a non-goal for Java.
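
If round-tripping does matter, one possible variant (not part of the answer above) is to escape the backslash itself as well, so that every backslash in the output is guaranteed to start an escape. The sketch below assumes that writing the backslash as `\u005C` is acceptable for the consumer; the class name `RoundTripEscaper` and both method names are hypothetical.

```java
final class RoundTripEscaper {

    // Escapes non-ASCII chars and the backslash itself, so every backslash
    // in the output begins a six-character escape sequence.
    static String escape(String in) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < in.length(); i++) {
            char ch = in.charAt(i);
            if (ch > 127 || ch == '\\') {
                out.append(String.format("\\u%04X", (int) ch));
            } else {
                out.append(ch);
            }
        }
        return out.toString();
    }

    // Inverse of escape(); assumes its input was produced by escape().
    static String unescape(String in) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < in.length()) {
            char ch = in.charAt(i);
            if (ch == '\\') {
                // Skip the backslash and the 'u', then read four hex digits.
                out.append((char) Integer.parseInt(in.substring(i + 2, i + 6), 16));
                i += 6;
            } else {
                out.append(ch);
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // e-acute, a space, then a literal backslash followed by "u00E9"
        String original = "\u00E9 \\u00E9";
        String escaped = escape(original);
        System.out.println(escaped);
        System.out.println(unescape(escaped).equals(original));
    }
}
```

With this scheme the input from the comment above becomes `\u00E9 \u005Cu00E9`, so the two halves stay distinguishable and `unescape` recovers the original string. The trade-off is that a literal backslash no longer passes through unchanged.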

Community · Answer · 2020-06-20 09:12:55Z

0

Guava Escaper Based Solution:

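This escapes every character outside the range 32 to 127 into a Unicode escape sequence, so ASCII control characters such as newlines are escaped along with the non-ASCII characters.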

import static java.lang.String.format;
import com.google.common.escape.CharEscaper;

public class NonAsciiUnicodeEscaper extends CharEscaper
{
    @Override
    protected char[] escape(final char c)
    {
        // Characters 32 through 127 pass through unchanged
        if (c >= 32 && c <= 127) { return new char[]{c}; }
        // Control characters and everything above 127 become a four-digit hex escape
        else { return format("\\u%04X", (int) c).toCharArray(); }
    }
}

edited Jun 20, 2020 at 9:12

CommunityBot

answered Feb 8, 2016 at 22:49

user177800
