I was able to figure out how to convert a Unicode string to an ASCII string using the following code. (Credits are in the code)

    // Requires: import java.text.Normalizer; import java.nio.charset.StandardCharsets;

    // Create a string from Unicode escapes; it prints "hello" to the console
    String unicode = "\u0068" + "\u0065" + "\u006c" + "\u006c" + "\u006f";
    System.out.println(unicode);
    System.out.println();

    /* Test code for converting Unicode to ASCII
     * Taken from http://stackoverflow.com/questions/15356716/how-can-i-convert-unicode-string-to-ascii-in-java
     * Will be commented out later after tested and implemented.
     */
    //String s = "口水雞 hello Ä";

    // Replace String s with String unicode for conversion.
    // NFKD splits accented characters into a base letter plus combining marks.
    String s1 = Normalizer.normalize(unicode, Normalizer.Form.NFKD);
    // Not wrapped in Pattern.quote(): quoting would match the pattern text literally and strip nothing.
    String regex = "[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+";

    // Characters with no ASCII equivalent are encoded as '?'
    String s2 = new String(s1.replaceAll(regex, "").getBytes(StandardCharsets.US_ASCII), StandardCharsets.US_ASCII);

    System.out.println(s2);
    System.out.println(unicode.length() == s2.length());
    // End of test code that was implemented

Now my curiosity has gotten the better of me. I've tried googling, but I don't have much experience with Java.

My question is: is it possible to convert an ASCII string to a UTF format, especially UTF-16? (I say UTF-16 because ASCII text is already valid UTF-8, so converting from ASCII to UTF-8 would not be necessary.)

Thanks in advance!

1 Answer

Java strings use UTF-16 as their internal format, but that is rarely relevant, because the String class takes care of it. You will see the difference in only two cases:

  1. When examining the String as an array of bytes (see below). This is what happens in C all the time, but it's not the case in more modern languages that properly distinguish between a string and an array of bytes (e.g. Java or Python 3.x).
  2. When converting to a more restrictive encoding (which is what you did, Unicode to ASCII), as some characters will need to be replaced.
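As a rough illustration of both cases, the sketch below encodes one string into several byte sequences. The class name `EncodingDemo`, the helper `toHex`, and the sample text are made up for this example; only `String.getBytes(Charset)` and `StandardCharsets` come from the JDK.

```java
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Hypothetical helper: render bytes as space-separated two-digit hex values.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so negative bytes print as two hex digits.
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 'h' followed by U+00E9 (e with acute accent), which is not an ASCII character.
        String text = "h\u00e9";

        // Case 1: the same String looks different once viewed as bytes.
        System.out.println(toHex(text.getBytes(StandardCharsets.UTF_8)));    // 68 c3 a9
        System.out.println(toHex(text.getBytes(StandardCharsets.UTF_16BE))); // 00 68 00 e9

        // Case 2: a more restrictive encoding replaces what it cannot represent with '?' (0x3f).
        System.out.println(toHex(text.getBytes(StandardCharsets.US_ASCII))); // 68 3f
    }
}
```

The String itself never changes here; only the byte representation chosen at encoding time does.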

If you want to encode the content to UTF-16 before writing to a file (or equivalent), you can do it with:

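    // Requires: import java.io.FileOutputStream; import java.io.OutputStream; import java.nio.charset.StandardCharsets;
    String data = "TEST";
    // try-with-resources closes the stream even if write() throws
    try (OutputStream output = new FileOutputStream("filename.txt")) {
        output.write(data.getBytes(StandardCharsets.UTF_16));
    }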

And the resulting file will contain:

    0000000: feff 0054 0045 0053 0054                 ...T.E.S.T

This is big-endian UTF-16 with the byte order mark (BOM) bytes FE FF at the beginning.
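If the starting point is raw ASCII bytes rather than a String, the conversion can be sketched as decode first, then re-encode. This is a minimal example under assumptions: the class `Utf16Demo` and the methods `transcodeFromAscii` and `toHex` are hypothetical names, not part of any library. It also shows that the `UTF-16BE` and `UTF-16LE` charsets produce output without a BOM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Utf16Demo {
    // Hypothetical helper: decode ASCII bytes into a String, then re-encode in the target charset.
    static byte[] transcodeFromAscii(byte[] asciiBytes, Charset target) {
        String decoded = new String(asciiBytes, StandardCharsets.US_ASCII);
        return decoded.getBytes(target);
    }

    // Hypothetical helper: render bytes as space-separated two-digit hex values.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] ascii = {0x54, 0x45, 0x53, 0x54}; // "TEST" in ASCII

        // "UTF-16" encodes as big-endian and prepends the BOM FE FF.
        System.out.println(toHex(transcodeFromAscii(ascii, StandardCharsets.UTF_16)));
        // fe ff 00 54 00 45 00 53 00 54

        // The explicit-endianness charsets write no BOM.
        System.out.println(toHex(transcodeFromAscii(ascii, StandardCharsets.UTF_16BE)));
        // 00 54 00 45 00 53 00 54
        System.out.println(toHex(transcodeFromAscii(ascii, StandardCharsets.UTF_16LE)));
        // 54 00 45 00 53 00 54 00

        // Decoding with "UTF-16" consumes the BOM and returns the original text.
        byte[] withBom = transcodeFromAscii(ascii, StandardCharsets.UTF_16);
        System.out.println(new String(withBom, StandardCharsets.UTF_16)); // TEST
    }
}
```

Whether to include the BOM depends on what will read the bytes; a reader that is told the byte order in advance does not need it.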

3 Comments

Java Strings use UTF-16 internally, not UTF-8.
Perfect, that gives me a good understanding of what's happening with the encoding.
Yes, UTF-16. Corrected.
