There are plenty of answers on how to check whether a string is UTF-8 encoded, for example:
```java
public static boolean isUTF8(String s) {
    try {
        byte[] bytes = s.getBytes("UTF-8");
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
        System.exit(-1);
    }
    return true;
}
```
The doc of `java.lang.String#getBytes(java.nio.charset.Charset)` says:

> This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement byte array.
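The replacement behavior the javadoc describes can be observed with an unpaired surrogate, which is malformed input for the UTF-8 encoder (a minimal sketch; the class name is just for illustration):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ReplacementDemo {
    public static void main(String[] args) {
        // An unpaired high surrogate cannot be encoded as UTF-8.
        String malformed = "\uD800";
        byte[] bytes = malformed.getBytes(StandardCharsets.UTF_8);
        // The encoder substitutes its default replacement byte array, '?' (0x3F),
        // so the result is still well-formed UTF-8 -- just not the original content.
        System.out.println(Arrays.toString(bytes));
    }
}
```

So the output is always valid UTF-8 bytes, but malformed input is silently replaced rather than reported.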
- Is it correct that it always returns valid UTF-8 bytes?
- Does it make sense to perform such checks on `String` objects at all? Won't it always return `true`, since a `String` object already holds decoded characters?
- As far as I understand, such checks should be performed on bytes, not on `String` objects:
```java
public static final boolean isUTF8(final byte[] inputBytes) {
    final String converted = new String(inputBytes, StandardCharsets.UTF_8);
    final byte[] outputBytes = converted.getBytes(StandardCharsets.UTF_8);
    return Arrays.equals(inputBytes, outputBytes);
}
```
But in this case I'm not sure I understand where I should take those bytes from, as getting them straight from the `String` object will not be correct.
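One possible answer is that the bytes have to come from the raw input source (file, socket, etc.) before any charset decoding happens. A sketch assuming the data sits in a file; the path `"input.txt"` is a placeholder:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class ValidateFile {
    static boolean isUTF8(byte[] inputBytes) {
        byte[] outputBytes = new String(inputBytes, StandardCharsets.UTF_8)
                .getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(inputBytes, outputBytes);
    }

    public static void main(String[] args) throws IOException {
        // Read the raw bytes BEFORE any charset decoding happens;
        // "input.txt" is a hypothetical path.
        byte[] raw = Files.readAllBytes(Path.of("input.txt"));
        System.out.println(isUTF8(raw) ? "valid UTF-8" : "not valid UTF-8");
    }
}
```

Once the bytes have been turned into a `String`, the original encoding information is gone, so the validation has to run on the byte stream itself.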