
I need to convert a Unicode string into a string in which the non-ASCII characters are written as Unicode escapes. For example, the string "漢字 Max" should be presented as "\u6F22\u5B57 Max".

What I have tried:

  1. Different combinations of

    new String(sourceString.getBytes(encoding1), encoding2)

  2. Apache StringEscapeUtils, which also escapes ASCII characters such as the double quote

    StringEscapeUtils.escapeJava(source)

Is there an easy way to encode such a string? Ideally, only Java SE 6 or Apache Commons should be used to achieve the desired result.

  • Any reason not to just implement it yourself? It wouldn't take terribly long. How performance-critical is this? Do you need to worry about surrogate pairs? (Are you happy for them to be encoded as a pair of \u escape sequences?) Commented Jan 27, 2015 at 17:42
  • Using the right terminology should improve your chances of finding a solution: what you want is not encoded in Unicode; it uses Java-specific Unicode escape form. Commented Jan 27, 2015 at 17:42
  • @Jon Skeet, I just didn't want to reinvent the wheel. I see from the answer how easy it is. Commented Jan 27, 2015 at 17:53
  • There are multiple string literal formats that use \u escapes, but they handle aspects such as surrogates and ASCII escapes differently. If you are only generating user-readable text, maybe you don't care and “any old format with \u in” is good enough, but if you're creating JSON, for example, you'll need to use the exact rules for JSON escaping. Commented Jan 28, 2015 at 11:31

2 Answers

7

This is the kind of simple code Jon Skeet had in mind in his comment:

final String in = "šđčćasdf";
final StringBuilder out = new StringBuilder();
for (int i = 0; i < in.length(); i++) {
  final char ch = in.charAt(i);
  // ASCII (0-127) is copied as-is; any other char becomes a 4-digit uppercase hex escape,
  // matching the form shown in the question.
  if (ch <= 127) out.append(ch);
  else out.append("\\u").append(String.format("%04X", (int)ch));
}
System.out.println(out.toString());

As Jon said, surrogate pairs will be represented as a pair of \u escapes.
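
To illustrate the surrogate-pair behaviour, here is a minimal sketch that wraps the same loop in a method. The class name UnicodeEscapeSketch and the method name escapeNonAscii are made up for this example, and the input strings are written with escapes in the source only to keep the file encoding-independent.

```java
final class UnicodeEscapeSketch {

    // Hypothetical helper: the loop from the answer, wrapped in a method.
    static String escapeNonAscii(String in) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < in.length(); i++) {
            char ch = in.charAt(i);
            if (ch <= 127) {
                // ASCII is copied unchanged.
                out.append(ch);
            } else {
                // Each UTF-16 code unit gets its own 4-digit hex escape.
                out.append(String.format("\\u%04X", (int) ch));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The example string from the question (two CJK characters, then " Max").
        System.out.println(escapeNonAscii("\u6F22\u5B57 Max"));

        // U+1F600 lies outside the BMP, so Java stores it as two chars.
        String supplementary = new String(Character.toChars(0x1F600));
        System.out.println(escapeNonAscii(supplementary));
    }
}
```

The first call prints \u6F22\u5B57 Max. The second prints \uD83D\uDE00, that is, one escape per surrogate half rather than a single escape for the code point.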


5 Comments

That was exactly my first thought; however, it depends on the context: sometimes you actually want newlines to stay newlines.
Note that since backslashes are not escaped, this encoding scheme is ambiguous and does not round-trip. For example, for the input "é \u00E9" the output is "\u00E9 \u00E9".
@bobince I would not call that ambiguous since é and \u00E9 are synonyms under this system.
bug:feature :: ambiguous:synonyms. If you need to know the original data from the escaped form then bug, if you don't care then feature :-)
@bobince In any case, it's not this code's fault: it's what Java specifies. Round-tripping string literals is certainly a non-goal for Java.
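
If round-tripping does matter, one possible adjustment (not part of the answer above) is to escape the backslash itself as well. A minimal sketch, assuming a doubled backslash is an acceptable escape for a literal backslash; the class and method names are hypothetical:

```java
final class RoundTripEscapeSketch {

    // Hypothetical variant: also doubles each backslash so that a literal
    // backslash in the input cannot be mistaken for the start of an escape.
    static String escape(String in) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < in.length(); i++) {
            char ch = in.charAt(i);
            if (ch == '\\') {
                out.append("\\\\");
            } else if (ch <= 127) {
                out.append(ch);
            } else {
                out.append(String.format("\\u%04X", (int) ch));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Input: e-acute, a space, then the six literal characters
        // backslash, u, 0, 0, E, 9.
        System.out.println(escape("\u00E9 \\u00E9"));
    }
}
```

For the input from the comment, this prints \u00E9 \\u00E9, so the real é and the literal backslash sequence stay distinguishable in the output.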
0

Guava Escaper-based solution:

This escapes any non-ASCII characters, as well as control characters below U+0020, into Unicode escape sequences.

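import static java.lang.String.format;
import com.google.common.escape.CharEscaper;

public class NonAsciiUnicodeEscaper extends CharEscaper
{
    @Override
    protected char[] escape(final char c)
    {
        // Characters from space (32) through DEL (127) are returned unchanged.
        if (c >= 32 && c <= 127) { return new char[]{c}; }
        // Control characters and everything above 127 become a 4-digit hex Unicode escape.
        else { return format("\\u%04x", (int) c).toCharArray(); }
    }
}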
