Check if a String is valid UTF-8 encoded in Java

Question

How can I check if a string is in valid UTF-8 format?

The simplest thing to do might be to decode it and encode it again. Check you get the same thing. This will be correct in almost every case.
– Peter Lawrey, Commented Jul 8, 2011 at 9:08
@Peter that will not always work, because some characters can be encoded with different sequences of bytes. Both sequences of bytes would be correct, and encode the same characters, but the bytes are different.
– Jesper, Commented Jul 8, 2011 at 9:27
@Jesper, If the data has been encoded with Java, it will be the same. It depends on what the OP is really trying to test. BTW in Java the \0 character is encoded as two bytes. ;)
– Peter Lawrey, Commented Jul 8, 2011 at 9:58
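
The round-trip idea from the comments above could be sketched roughly as follows. The class and method names (RoundTripCheck, survivesRoundTrip) are made up for illustration. The sketch relies on the documented behavior that new String(byte[], Charset) replaces malformed input with U+FFFD, so invalid bytes re-encode to something different. It uses the standard UTF-8 charset, which rejects overlong forms; the two-byte encoding of \0 mentioned above belongs to the modified UTF-8 written by DataOutputStream.writeUTF, not to String.getBytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, not part of any library.
final class RoundTripCheck {

    // Decode the bytes as UTF-8, encode the result again, and compare.
    // Malformed input is replaced with U+FFFD during decoding, so the
    // re-encoded bytes differ from the input whenever the input is invalid.
    static boolean survivesRoundTrip(byte[] bytes) {
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        return Arrays.equals(bytes, decoded.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] valid = "A\u00ea\u00f1\u00fcC".getBytes(StandardCharsets.UTF_8);
        byte[] truncated = {(byte) 0xC3};              // lead byte without its continuation byte
        byte[] overlong = {(byte) 0xC0, (byte) 0x80};  // overlong two-byte form of U+0000

        System.out.println(survivesRoundTrip(valid));     // true
        System.out.println(survivesRoundTrip(truncated)); // false
        System.out.println(survivesRoundTrip(overlong));  // false
    }
}
```

This allocates a String and a second byte array, so for large inputs a decoder that reports errors without building a result may be preferable.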

james.garriss · Accepted Answer · 2015-10-02 14:32:28Z

41

Only byte data can be checked. Once you have constructed a String, it is already held as UTF-16 code units internally.

Likewise, UTF-8 is an encoding into bytes, so only a byte array can be UTF-8 encoded.
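
As a minimal sketch of checking byte data, a CharsetDecoder configured to report errors can be run over the array. The class and method names (Utf8Validator, isValidUtf8) are hypothetical; the java.nio.charset calls are from the standard library.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
final class Utf8Validator {

    // Returns true if the whole array decodes as well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] hello = "Hello World".getBytes(StandardCharsets.UTF_8);
        byte[] badContinuation = {(byte) 0xC3, 0x28};                   // 0x28 is not a continuation byte
        byte[] latin1 = {0x41, (byte) 0xEA, (byte) 0xF1, (byte) 0xFC, 0x43}; // "AêñüC" in ISO-8859-1

        System.out.println(isValidUtf8(hello));           // true
        System.out.println(isValidUtf8(badContinuation)); // false
        System.out.println(isValidUtf8(latin1));          // false
    }
}
```

Note that a true result only means the bytes are well-formed UTF-8. Pure ASCII data, for example, is also valid in many other encodings, so this does not prove the data was intended as UTF-8.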

Here is a common case of UTF-8 conversions.

import java.nio.charset.StandardCharsets;

String myString = "\u0048\u0065\u006C\u006C\u006F World"; // "Hello World"
System.out.println(myString);

// StandardCharsets.UTF_8 (Java 7+) avoids the checked
// UnsupportedEncodingException that getBytes("UTF-8") declares.
byte[] myBytes = myString.getBytes(StandardCharsets.UTF_8);

// Print each encoded byte as a signed decimal value.
for (byte b : myBytes) {
    System.out.println(b);
}

If you don't know the encoding of your byte array, juniversalchardet is a library to help you detect it.

edited Oct 2, 2015 at 14:32

james.garriss

answered Jul 8, 2011 at 9:09

DArkO

5 Comments

Nathan Ryan Over a year ago

Just as a clarification, an instance of String is not in UTF-16 encoding form, strictly speaking, since it permits ill-formed code unit sequences (in the form of isolated surrogate code units). It is, however, a Unicode 16-bit string.

rustyx Over a year ago

Even more strictly speaking, a Java String is also not a true Unicode 16-bit string, as it can contain surrogates for UCS4 (3- and 4-byte) characters.

james.garriss Over a year ago

ICU4J is another Java library that can help you detect the encoding of a byte array: site.icu-project.org

Remy Lebeau Over a year ago

Java strings use a UTF-16 based interface. It says so right in the documentation: "A String represents a string in the UTF-16 format". Surrogates are part of UTF-16, not UCS-2 (the predecessor of UTF-16). So yes, Java strings are 16bit Unicode strings, they are just using UTF-16 and not UCS-2 as the 16bit encoding.

Jarl Over a year ago

UTF-16 is not a 16-bit per character unicode string representation. UTF-16 is a variable-byte representation of a unicode string... Just like UTF-8 is a variable-byte representation of a unicode string. UCS2 on the other hand is a fixed 2-byte representation of a string, but does not cover all unicode code points.
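
The point about isolated surrogates in the comments above can be checked directly. This is a rough sketch with hypothetical names (SurrogateCheck, isEncodableAsUtf8): a String that contains an unpaired surrogate is a legal Java String but has no UTF-8 encoding, and CharsetEncoder.canEncode reports this.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any library.
final class SurrogateCheck {

    // Returns true if the String can be encoded as UTF-8 without loss,
    // which fails when it contains an unpaired surrogate code unit.
    static boolean isEncodableAsUtf8(String s) {
        // A fresh encoder per call, since CharsetEncoder is not thread-safe.
        return StandardCharsets.UTF_8.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        String paired = "\uD83D\uDE00";   // a valid surrogate pair (one supplementary character)
        String loneHigh = "\uD83D";       // high surrogate with no low surrogate
        String loneLow = "a\uDE00b";      // low surrogate with no preceding high surrogate

        System.out.println(isEncodableAsUtf8("Hello"));  // true
        System.out.println(isEncodableAsUtf8(paired));   // true
        System.out.println(isEncodableAsUtf8(loneHigh)); // false
        System.out.println(isEncodableAsUtf8(loneLow));  // false

        // getBytes does not fail on such input; it substitutes '?' (0x3F).
        System.out.println(loneHigh.getBytes(StandardCharsets.UTF_8)[0]); // 63
    }
}
```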

Roman Vottner · Accepted Answer · 2017-07-05 17:05:30Z

The following post is taken from the official Java tutorials available at: https://docs.oracle.com/javase/tutorial/i18n/text/string.html.

The StringConverter program starts by creating a String containing Unicode characters:

String original = new String("A" + "\u00ea" + "\u00f1" + "\u00fc" + "C");

When printed, the String named original appears as:

AêñüC

To convert the String object to UTF-8, invoke the getBytes method and specify the appropriate encoding identifier as a parameter. The getBytes method returns an array of bytes in UTF-8 format. To create a String object from an array of non-Unicode bytes, invoke the String constructor with the encoding parameter. The code that makes these calls is enclosed in a try block, in case the specified encoding is unsupported:

try {
    byte[] utf8Bytes = original.getBytes("UTF8");
    byte[] defaultBytes = original.getBytes();

    String roundTrip = new String(utf8Bytes, "UTF8");
    System.out.println("roundTrip = " + roundTrip);
    System.out.println();
    printBytes(utf8Bytes, "utf8Bytes");
    System.out.println();
    printBytes(defaultBytes, "defaultBytes");
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}

The StringConverter program prints out the values in the utf8Bytes and defaultBytes arrays to demonstrate an important point: The length of the converted text might not be the same as the length of the source text. Some Unicode characters translate into single bytes, others into pairs or triplets of bytes. The printBytes method displays the byte arrays by invoking the byteToHex method, which is defined in the source file, UnicodeFormatter.java. Here is the printBytes method:

public static void printBytes(byte[] array, String name) {
    for (int k = 0; k < array.length; k++) {
        System.out.println(name + "[" + k + "] = " + "0x" +
            UnicodeFormatter.byteToHex(array[k]));
    }
}

The output of the printBytes method follows. Note that only the first and last bytes, the A and C characters, are the same in both arrays:

utf8Bytes[0] = 0x41
utf8Bytes[1] = 0xc3
utf8Bytes[2] = 0xaa
utf8Bytes[3] = 0xc3
utf8Bytes[4] = 0xb1
utf8Bytes[5] = 0xc3
utf8Bytes[6] = 0xbc
utf8Bytes[7] = 0x43

defaultBytes[0] = 0x41
defaultBytes[1] = 0xea
defaultBytes[2] = 0xf1
defaultBytes[3] = 0xfc
defaultBytes[4] = 0x43
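
The tutorial's UnicodeFormatter.java is not reproduced here, so the following is a self-contained sketch with a stand-in byteToHex (the HexDump class and its toHex helper are hypothetical). It names ISO-8859-1 explicitly rather than calling getBytes() with no argument, because the defaultBytes output above depends on the platform default charset. That output matches a single-byte Latin-1 style default; on JDK 18 and later, where the default charset is UTF-8, getBytes() would produce the same bytes as utf8Bytes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class standing in for the tutorial's UnicodeFormatter.
final class HexDump {

    // Formats one byte as two lowercase hex digits; the mask avoids sign extension.
    static String byteToHex(byte b) {
        return String.format("%02x", b & 0xFF);
    }

    // Joins the hex form of every byte with single spaces.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(byteToHex(b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "A\u00ea\u00f1\u00fcC";

        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);
        byte[] latin1Bytes = original.getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(toHex(utf8Bytes));   // 41 c3 aa c3 b1 c3 bc 43
        System.out.println(toHex(latin1Bytes)); // 41 ea f1 fc 43
    }
}
```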
