Verifying a string is UTF-8 encoded in Java

Question

There are plenty of examples of how to check if a string is UTF-8 encoded, for example:

public static boolean isUTF8(String s){
    try{
        byte[]bytes = s.getBytes("UTF-8");
    }catch(UnsupportedEncodingException e){
        e.printStackTrace();
        System.exit(-1);
    }
    return true;
}

The doc of java.lang.String#getBytes(java.nio.charset.Charset) says:

This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement byte array.

Is it correct that it always returns correct UTF-8 bytes?
Does it make sense to perform such checks on String objects at all? Won't it always be returning true as a String object is already encoded?
As far as I understand such checks should be performed on bytes, not on String objects:

public static final boolean isUTF8(final byte[] inputBytes) {
    final String converted = new String(inputBytes, StandardCharsets.UTF_8);
    final byte[] outputBytes = converted.getBytes(StandardCharsets.UTF_8);
    return Arrays.equals(inputBytes, outputBytes);
}

But in this case I'm not sure where I should take those bytes from, as getting them straight from the String object will not be correct.

A String is a sequence of characters. Encoding starts to matter when you want to transform a string to bytes, or bytes to a String, because the encoding defines which characters can be encoded, and how they are transformed to bytes (and vice-versa). So "checking if a String is UTF-8 encoded" doesn't make sense.
– JB Nizet, Commented Nov 28, 2019 at 17:55

Andreas · Accepted Answer · 2019-11-28 18:38:20Z

4

Is it correct that it always returns correct UTF-8 bytes?

Yes.

Does it make sense to perform such checks on String objects at all? Won't it always be returning true as a String object is already encoded?

Java strings use Unicode characters encoded in UTF-16. Since UTF-16 uses surrogate pairs, any unpaired surrogate is invalid, so Java strings can contain invalid char sequences.

Java strings can also contain characters that are unassigned in Unicode.

Which means that performing validation on a Java String makes sense, though it is very rarely done.
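As a rough illustration of what such validation could look like, here is a minimal sketch that scans a String for unpaired surrogates. The class name SurrogateCheck and the method isWellFormedUtf16 are made up for this example; only the Character methods are part of the standard library. It does not check for unassigned code points.

```java
final class SurrogateCheck {

    // Hypothetical helper: true if every surrogate in s is part of a valid pair.
    static boolean isWellFormedUtf16(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                // A high surrogate must be immediately followed by a low surrogate.
                if (i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1))) {
                    return false;
                }
                i++; // skip the low half of the valid pair
            } else if (Character.isLowSurrogate(c)) {
                // A low surrogate that was not preceded by a high surrogate.
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isWellFormedUtf16("abc"));          // true
        System.out.println(isWellFormedUtf16("\uD83D\uDE00")); // true (valid pair)
        System.out.println(isWellFormedUtf16("\uD800"));       // false (unpaired)
    }
}
```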

As far as I understand such checks should be performed on bytes, not on String objects.

Depending on the character set of the bytes, there is nothing to validate, e.g. character set CP437 maps all 256 byte values, so it cannot be invalid.
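To see the "nothing to validate" case in practice, the sketch below strictly decodes each of the 256 byte values. Note the assumption: it uses ISO-8859-1 instead of CP437, because ISO-8859-1 is guaranteed to exist on every Java platform while CP437 (IBM437) may not be installed. The class and method names are invented for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class SingleByteDemo {

    // Counts how many single-byte values fail strict ISO-8859-1 decoding.
    static int countUndecodableBytes() {
        int failures = 0;
        for (int b = 0; b < 256; b++) {
            try {
                StandardCharsets.ISO_8859_1.newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .decode(ByteBuffer.wrap(new byte[] { (byte) b }));
            } catch (CharacterCodingException e) {
                failures++; // this byte value has no mapping
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        // Every byte value maps to a character, so no failures are expected.
        System.out.println(countUndecodableBytes()); // 0
    }
}
```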

UTF-8 can be invalid, so you're correct that validating bytes is useful.

As the javadoc says, getBytes(Charset) always replaces malformed-input and unmappable-character sequences with the charset's default replacement byte array.

That is because it does this:

CharsetEncoder encoder = charset.newEncoder()
        .onMalformedInput(CodingErrorAction.REPLACE)
        .onUnmappableCharacter(CodingErrorAction.REPLACE);

If you want to get the bytes, but fail on malformed-input and unmappable-character sequences, use CodingErrorAction.REPORT instead. Since that's actually the default, simply don't call the two onXxx() methods.

Example

String s = "\uD800"; // unpaired surrogate
System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_8)));

That prints [63], which is the byte for ?. The unpaired surrogate is malformed input, so it was replaced with the replacement byte.

String s = "\uD800"; // unpaired surrogate

CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
ByteBuffer encoded = encoder.encode(CharBuffer.wrap(s.toCharArray()));
byte[] bytes = new byte[encoded.remaining()];
encoded.get(bytes);

System.out.println(Arrays.toString(bytes));

That causes MalformedInputException: Input length = 1 since the default malformed-input action is REPORT.


4 Comments

void Over a year ago

Thanks for your answer. However, the last example doesn't throw an exception.

void Over a year ago

It looks like the unpaired surrogate \uD800 is automatically converted into a question mark, which is why it doesn't throw an exception.

Andreas Over a year ago

@void I just ran the last block of code (the one with 6 statements), and it threw the exception, on both Java 7 and Java 13, on Windows. The two-statement block of code right below the Example text doesn't throw an exception, as already stated in the answer.

void Over a year ago

For me (OS X, Java 8) it throws an exception only if the array of chars is encoded without putting it into a String first. So this fails: encoder.encode(CharBuffer.wrap(new char[]{'\uD800'})) while this succeeds: encoder.encode(CharBuffer.wrap("\uD800".toCharArray()));

Remy Lebeau · Accepted Answer · 2019-12-04 00:28:01Z

Your function as shown makes no sense. As the documentation says:

A String represents a string in the UTF-16 format in which supplementary characters are represented by surrogate pairs (see the section Unicode Character Representations in the Character class for more information). Index values refer to char code units, so a supplementary character uses two positions in a String.

A String is comprised of UTF-16 encoded characters, not UTF-8. A String will NEVER be encoded in UTF-8, but it can ALWAYS be converted to UTF-8, so your function will ALWAYS return true. "UTF-8" is a standard encoding supported by all Java implementations, so getBytes("UTF-8") will NEVER throw UnsupportedEncodingException, which is raised only when an unsupported charset is used.

Your function would make more sense only if it took a byte[] as input instead. But even then, decoding the bytes, re-encoding them, and comparing the results is not efficient. As the documentation says:

The behavior of this constructor when the given bytes are not valid in the given charset is unspecified. The CharsetDecoder class should be used when more control over the decoding process is required.

For example:

public static boolean isUTF8(byte[] bytes) {
    try {
        // REPORT makes the decoder throw instead of substituting U+FFFD.
        StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
    } catch (CharacterCodingException e) {
        return false; // malformed or unmappable byte sequence
    }
    return true;
}
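For anyone who wants to try the decoder-based check, here is a self-contained sketch with the imports it needs and a few sample inputs. The class name Utf8BytesCheck, the method name isValidUtf8, and the sample byte arrays are chosen for illustration only.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8BytesCheck {

    // True if the bytes decode as UTF-8 without any malformed sequence.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] valid = "h\u00E9llo".getBytes(StandardCharsets.UTF_8);
        byte[] truncated = { (byte) 0xC3 };    // lead byte with no continuation byte
        byte[] illegal = { 'a', (byte) 0xFF }; // 0xFF never appears in UTF-8

        System.out.println(isValidUtf8(valid));     // true
        System.out.println(isValidUtf8(truncated)); // false
        System.out.println(isValidUtf8(illegal));   // false
    }
}
```

A new decoder is created per call because CharsetDecoder instances are stateful and not safe to share between threads.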

extreme_logic · Accepted Answer · 2023-03-23 08:48:19Z

Just use this

    public static boolean isUTF8(String input) {
        return StandardCharsets.UTF_8.newEncoder().canEncode(input);
    }

Internally, canEncode already sets both error actions to REPORT and catches the resulting exception:

    private boolean canEncode(CharBuffer cb) {
        if (state == ST_FLUSHED)
            reset();
        else if (state != ST_RESET)
            throwIllegalStateException(state, ST_CODING);
        CodingErrorAction ma = malformedInputAction();
        CodingErrorAction ua = unmappableCharacterAction();
        try {
            onMalformedInput(CodingErrorAction.REPORT);
            onUnmappableCharacter(CodingErrorAction.REPORT);
            encode(cb);
        } catch (CharacterCodingException x) {
            return false;
        } finally {
            onMalformedInput(ma);
            onUnmappableCharacter(ua);
            reset();
        }
        return true;
    }
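A quick way to see what canEncode reports for different strings is the sketch below. The class name CanEncodeDemo and the method name isEncodableAsUtf8 are invented for this example. Keep in mind that this tells you whether the String can be encoded to UTF-8 without replacement (in practice, that it has no unpaired surrogates), not that it "is" UTF-8.

```java
import java.nio.charset.StandardCharsets;

final class CanEncodeDemo {

    // True if the whole string can be encoded as UTF-8 without replacement.
    static boolean isEncodableAsUtf8(String input) {
        // A fresh encoder per call, since CharsetEncoder is stateful.
        return StandardCharsets.UTF_8.newEncoder().canEncode(input);
    }

    public static void main(String[] args) {
        System.out.println(isEncodableAsUtf8("abc"));          // true
        System.out.println(isEncodableAsUtf8("\uD83D\uDE00")); // true (valid surrogate pair)
        System.out.println(isEncodableAsUtf8("\uD800"));       // false (unpaired surrogate)
    }
}
```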
