I need a function that can check if a file or blob object is valid UTF-8. I can get the text and check for � characters, but if the string has that character to begin with, the function would mark it as invalid.

async function isUTF8(blob) {
  const text = await blob.text();
  // Flawed check: U+FFFD can also appear in perfectly valid input.
  return !text.includes("�");
}

// "�" is valid utf-8 but the function returns false
isUTF8(new Blob(["�"])).then(console.log);

// returns true
isUTF8(new Blob(["example"])).then(console.log);

1 Answer

You can use the TextDecoder API:

async function isUTF8(blob) {
  const decoder = new TextDecoder('utf-8', { fatal: true });
  const buffer = await blob.arrayBuffer();
  try {
    decoder.decode(buffer);
  } catch (e) {
    if (e instanceof TypeError)
      return false;
    throw e;
  }
  return true;
}

(async () => {

console.log(await isUTF8(new Blob(
  [new Uint8Array([0x80])]))); // false
console.log(await isUTF8(new Blob(
  [new Uint8Array([0xef, 0xbf, 0xbd])]))); // true
console.log(await isUTF8(new Blob(
  ["\ufffd"]))); // true
console.log(await isUTF8(new Blob(
  ["example"]))); // true

})().catch(e => console.warn(e));

The above loads the entire Blob into an ArrayBuffer for simplicity. If memory-efficiency becomes an issue, you may look into using the .stream() method to process the Blob in parts, without holding it in memory in its entirety.
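As a rough sketch of that streaming approach (not tested against very large files), the Blob can be read chunk by chunk and each chunk passed to the same fatal decoder with `{ stream: true }`. The name `isUTF8Streaming` is made up for this illustration. It assumes an environment where `Blob.prototype.stream()` returns a `ReadableStream`, which should be the case in current browsers and recent Node.js versions.

```javascript
// Hypothetical streaming variant of isUTF8: validates chunk by chunk
// instead of loading the whole Blob into one ArrayBuffer.
async function isUTF8Streaming(blob) {
  const decoder = new TextDecoder('utf-8', { fatal: true });
  const reader = blob.stream().getReader();
  try {
    for (;;) {
      const { done, value } = await reader.read();
      if (done) break;
      // stream: true keeps an incomplete trailing sequence buffered
      // so that a character split across two chunks is not rejected.
      decoder.decode(value, { stream: true });
    }
    // Final call without the stream option flushes the decoder;
    // leftover bytes from a truncated sequence make it throw here.
    decoder.decode();
  } catch (e) {
    if (e instanceof TypeError)
      return false;
    throw e;
  } finally {
    reader.releaseLock();
  }
  return true;
}
```

The decoded strings are discarded right away, so only one chunk needs to be held in memory at a time. The chunk size is chosen by the stream implementation, not by this code.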

1 Comment

Just a note regarding the last sentence: when using the .decode(buf, { stream }) option, it becomes very important to set the stream option to false when processing the last chunk. Usually it makes no difference, but otherwise an incomplete multi-byte sequence at the very end of the input stays buffered and is never reported, even with fatal set.
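To make the point about the last chunk concrete, here is a small sketch. The helper name `checkTruncatedTail` is hypothetical. It feeds a fatal decoder the first two bytes of a three-byte sequence and records which of the two calls throws. Going by the Encoding Standard, the streamed call should accept the bytes silently and only the flushing call should fail.

```javascript
// Hypothetical helper: shows where a truncated tail is reported.
function checkTruncatedTail() {
  const decoder = new TextDecoder('utf-8', { fatal: true });
  // First two bytes of "€" (E2 82 AC); the third byte is missing.
  const truncated = new Uint8Array([0xe2, 0x82]);
  const result = { streamedThrew: false, flushThrew: false };

  try {
    // With stream: true the incomplete sequence is only buffered.
    decoder.decode(truncated, { stream: true });
  } catch (e) {
    result.streamedThrew = true;
  }

  try {
    // stream defaults to false, so this call flushes the buffer.
    decoder.decode();
  } catch (e) {
    result.flushThrew = e instanceof TypeError;
  }

  return result;
}

console.log(checkTruncatedTail());
```

If the final flush were skipped, a file cut off in the middle of a character would pass the check.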
