17

How can I detect which encoding a file uses?

I want something like this:

fs.getFileEncoding('C:/path/to/file.txt') // it returns 'UTF-8', 'CP-1252', ...

Is there a simple way to do this using a native Node.js function?

1
  • 1
    fs is a native module of node Commented Apr 26, 2018 at 14:44

6 Answers

7

I don't think there is a "native Node.js function" that can do this.

The simplest solution I know is using an npm module like detect-file-encoding-and-language. As long as the input file is not too small it should work fine.

// Install the package using npm

$ npm install detect-file-encoding-and-language

// Sample code

const languageEncoding = require("detect-file-encoding-and-language");

const pathToFile = "/home/username/documents/my-text-file.txt"

languageEncoding(pathToFile).then(fileInfo => console.log(fileInfo));
// Possible result: { language: japanese, encoding: Shift-JIS, confidence: { language: 0.97, encoding: 0.97 } }
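
Although Node.js has no built-in encoding detector, the built-in TextDecoder can at least tell you whether a buffer is valid UTF-8, which is often enough to decide how to read a file. The following is a minimal sketch, not a full detector; the function name isValidUtf8 is made up for illustration, and it assumes a Node.js build with the default ICU support so that the fatal option is available.

```javascript
// Returns true when the bytes decode as strict UTF-8.
// Pure ASCII input also passes, since ASCII is a subset of UTF-8.
function isValidUtf8(buf) {
  try {
    // With { fatal: true } the decoder throws a TypeError on malformed
    // input instead of silently inserting U+FFFD replacement characters.
    new TextDecoder('utf-8', { fatal: true }).decode(buf);
    return true;
  } catch (err) {
    return false;
  }
}
```

You could call it as isValidUtf8(fs.readFileSync(pathToFile)). A false result only says the file is not UTF-8; it does not say which encoding it actually is.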

6

You can use an npm module that does exactly this: https://www.npmjs.com/package/detect-character-encoding

You can use it like this:

const fs = require('fs');
const detectCharacterEncoding = require('detect-character-encoding');

const fileBuffer = fs.readFileSync('file.txt');
const charsetMatch = detectCharacterEncoding(fileBuffer);

console.log(charsetMatch);
// {
//   encoding: 'UTF-8',
//   confidence: 60
// }

2 Comments

The project you mention does not work on Windows. Is there another tool that works well?
It doesn't appear to install well in Linux either. That bug was reported in 2017. I would consider this a waste of time.
5

This is what I've been using, for a while now. YMMV. Hope it helps.


var fs = require('fs');
// ...
function getFileEncoding(f) {

    // Read the first 5 bytes of the file; Buffer.alloc zero-fills by default.
    var d = Buffer.alloc(5);
    var fd = fs.openSync(f, 'r');
    fs.readSync(fd, d, 0, 5, 0);
    fs.closeSync(fd);

    // https://en.wikipedia.org/wiki/Byte_order_mark
    var e = false;
    if (!e && d[0] === 0xEF && d[1] === 0xBB && d[2] === 0xBF)
        e = 'utf8';
    if (!e && d[0] === 0xFE && d[1] === 0xFF)
        e = 'utf16be';
    if (!e && d[0] === 0xFF && d[1] === 0xFE)
        e = 'utf16le';
    // No BOM found: fall back to 'ascii'.
    if (!e)
        e = 'ascii';

    return e;

}

2 Comments

Depending on use case & how sure I need to be -- BOM sniffing suggests not very -- I'd probably start with e = 'utf8', remove utf8 check, then run the rest of the ladder without the !e && preamble (adding some elses/ternaries). Duck typing by BOM is a very practical idea for, say, reading files! @Falaen's answer, when no BOM or obvious tipoff, sniffs the whole file looking for telltale signs, which is clever, but perhaps overkill.
Yeah, since UTF-8 is essentially a superset of at least 7-bit ASCII, if you're just looking for a practical "how should I read this?", you don't lose any utility with return d[0] === 0xfe && d[1] === 0xff ? "utf16be" : d[0] === 0xff && d[1] === 0xfe ? "utf16le" : "utf8";, I don't think.
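
The ladder suggested in these two comments (default to UTF-8, only check the UTF-16 BOMs) can be sketched as follows, together with one way to actually decode the buffer afterwards. The names sniffBom and decodeSniffed are made up for this example. The byte swap is there because Node.js has no built-in 'utf16be' decoder; treat this as a sketch under those assumptions rather than a complete solution.

```javascript
// Guess an encoding from the BOM only, defaulting to UTF-8.
function sniffBom(buf) {
  if (buf.length >= 2 && buf[0] === 0xfe && buf[1] === 0xff) return 'utf16be';
  if (buf.length >= 2 && buf[0] === 0xff && buf[1] === 0xfe) return 'utf16le';
  return 'utf8';
}

// Decode a buffer using the sniffed encoding and drop a leading BOM.
function decodeSniffed(buf) {
  const encoding = sniffBom(buf);
  if (encoding === 'utf16be') {
    // swap16() needs an even number of bytes, so ignore a trailing odd byte.
    const even = buf.subarray(0, buf.length - (buf.length % 2));
    // Copy before swapping so the caller's buffer is left untouched.
    return Buffer.from(even).swap16().toString('utf16le').replace(/^\uFEFF/, '');
  }
  return buf.toString(encoding).replace(/^\uFEFF/, '');
}
```

For a file, you would pass in fs.readFileSync(f). Files without a BOM are simply read as UTF-8, which also covers plain ASCII.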
5

What about jschardet? At the time of writing, it has hundreds of thousands of downloads per week, so it should work.

var jschardet = require("jschardet")

// "àíàçã" in UTF-8
jschardet.detect("\xc3\xa0\xc3\xad\xc3\xa0\xc3\xa7\xc3\xa3")
// { encoding: "UTF-8", confidence: 0.9690625 }

// "次常用國字標準字體表" in Big5
jschardet.detect("\xa6\xb8\xb1\x60\xa5\xce\xb0\xea\xa6\x72\xbc\xd0\xb7\xc7\xa6\x72\xc5\xe9\xaa\xed")
// { encoding: "Big5", confidence: 0.99 }

// Martin Kühl
// jschardet.detectAll("\x3c\x73\x74\x72\x69\x6e\x67\x3e\x4d\x61\x72\x74\x69\x6e\x20\x4b\xfc\x68\x6c\x3c\x2f\x73\x74\x72\x69\x6e\x67\x3e")
// [
//   {encoding: "windows-1252", confidence: 0.95},
//   {encoding: "ISO-8859-2", confidence: 0.8796300205763055},
//   {encoding: "SHIFT_JIS", confidence: 0.01}
// ]

1 Comment

For Angular users, here is a useful example: stackblitz.com/edit/angular-detect-encoding
4

I used the encoding-japanese package, and it worked well.

Example:

var fs = require('fs');
var encoding = require('encoding-japanese');
var fileBuffer = fs.readFileSync('file.txt');
console.log(encoding.detect(fileBuffer))

Available Encodings:

  • 'UTF32' (detect only)
  • 'UTF16'
  • 'UTF16BE'
  • 'UTF16LE'
  • 'BINARY' (detect only)
  • 'ASCII' (detect only)
  • 'JIS'
  • 'UTF8'
  • 'EUCJP'
  • 'SJIS'
  • 'UNICODE' (JavaScript Unicode Array)

It can be used in both Node.js and browsers. It also has zero dependencies.

1 Comment

I tested this with some simple test strings containing German umlaut characters, and it did not work. I would recommend jschardet, which seems to work much better.
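
To illustrate why umlauts are a hard case, here is a small sketch using only Node.js built-ins. It shows that the same text produces different bytes in Latin-1 and UTF-8, and that decoding with the wrong guess garbles only the non-ASCII character. The sample word is taken from the jschardet example above; the variable names are just for illustration.

```javascript
// "Kühl" written with an escape so the source file's own encoding does not matter.
const text = 'K\u00fchl';

const latin1Bytes = Buffer.from(text, 'latin1'); // 4 bytes: 4b fc 68 6c
const utf8Bytes = Buffer.from(text, 'utf8');     // 5 bytes: 4b c3 bc 68 6c

// Decoding with the wrong encoding only damages the umlaut.
const utf8ReadAsLatin1 = utf8Bytes.toString('latin1'); // 'K\u00c3\u00bchl'
const latin1ReadAsUtf8 = latin1Bytes.toString('utf8'); // 'K\ufffdhl'

console.log(latin1Bytes.toString('hex'), utf8Bytes.toString('hex'));
console.log(utf8ReadAsLatin1, latin1ReadAsUtf8);
```

With so few distinguishing bytes in short strings, a detector that does not model single-byte Western encodings has very little to go on.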
1

To add to @Mark Longmire's answer, here is a more recent (TypeScript) version. Note that it only works if the file starts with a BOM, which is optional; this seems to be a limitation of many solutions on this topic:

import { closeSync, openSync, readSync } from 'fs';

/**
 * Get the encoding of a file from an optional BOM character.
 *
 * This only works if the file starts with a BOM, which is rarely present since it is optional.
 *
 * @see https://en.wikipedia.org/wiki/Byte_order_mark
 *
 * @param filePath - The path of a file on which to check encoding.
 *
 * @returns The file encoding if found, otherwise "unknown".
 */
function getFileEncoding(filePath: string): string {
  // Read up to the first 5 bytes; positions that are not read stay zero-filled.
  const byteOrderMark = Buffer.alloc(5, 0);
  const fileDescriptor = openSync(filePath, 'r');
  readSync(fileDescriptor, byteOrderMark, 0, 5, 0);
  closeSync(fileDescriptor);

  if (
    byteOrderMark[0] === 0xef &&
    byteOrderMark[1] === 0xbb &&
    byteOrderMark[2] === 0xbf
  )
    return 'utf8';
  if (byteOrderMark[0] === 0xfe && byteOrderMark[1] === 0xff) return 'utf16be';
  if (byteOrderMark[0] === 0xff && byteOrderMark[1] === 0xfe) return 'utf16le';

  // No recognized BOM at the start of the file.
  return 'unknown';
}
