How to detect file encoding in NodeJS?

Question

How can I detect which encoding a file uses?

I want something like this:

fs.getFileEncoding('C:/path/to/file.txt') // it returns 'UTF-8', 'CP-1252', ...

Is there a simple way to do it using a native Node.js function?

fs is a native module of node

– Muhammad Usman, Commented Apr 26, 2018 at 14:44

Falaen · Accepted Answer · 2021-04-21 11:10:48Z

7

I don't think there is a "native Node.js function" that can do this.

The simplest solution I know is using an npm module like detect-file-encoding-and-language. As long as the input file is not too small it should work fine.

// Install plugin using npm

$ npm install detect-file-encoding-and-language

// Sample code

const languageEncoding = require("detect-file-encoding-and-language");

const pathToFile = "/home/username/documents/my-text-file.txt"

languageEncoding(pathToFile).then(fileInfo => console.log(fileInfo));
// Possible result: { language: 'japanese', encoding: 'Shift-JIS', confidence: { language: 0.97, encoding: 0.97 } }
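
Although Node.js has no built-in detector, its built-in TextDecoder can at least confirm whether a file's bytes are valid UTF-8. The following is a minimal sketch, not a full detector: the helper name looksLikeUtf8 is hypothetical, and it assumes a standard Node.js build with ICU enabled, since the fatal option may not be available otherwise.

```javascript
const { TextDecoder } = require('util');

// Hypothetical helper: returns true when `bytes` decodes cleanly as UTF-8.
function looksLikeUtf8(bytes) {
  try {
    // With fatal: true, decode() throws a TypeError on malformed input
    // instead of silently inserting replacement characters.
    new TextDecoder('utf-8', { fatal: true }).decode(bytes);
    return true;
  } catch (err) {
    return false;
  }
}
```

It could be called as looksLikeUtf8(fs.readFileSync('file.txt')). Pure ASCII input also passes, because ASCII is a subset of UTF-8, and a false result only says the file is not UTF-8; it does not say which encoding it is.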

edited Apr 21, 2021 at 11:10

answered Mar 24, 2021 at 13:44


Eric Ly · Accepted Answer · 2018-04-26 14:45:29Z

6

You can use an npm module that does exactly this: https://www.npmjs.com/package/detect-character-encoding

You can use it like this:

const fs = require('fs');
const detectCharacterEncoding = require('detect-character-encoding');

const fileBuffer = fs.readFileSync('file.txt');
const charsetMatch = detectCharacterEncoding(fileBuffer);

console.log(charsetMatch);
// {
//   encoding: 'UTF-8',
//   confidence: 60
// }

answered Apr 26, 2018 at 14:45


2 Comments

Joshua Dannemann Over a year ago

The project you mention does not work on Windows. Is there another tool that works well?

LUser Over a year ago

It doesn't appear to install well in Linux either. That bug was reported in 2017. I would consider this a waste of time.

Mark Longmire · Accepted Answer · 2019-05-17 14:17:30Z

5

This is what I've been using, for a while now. YMMV. Hope it helps.


var fs = require('fs');
// ...
function getFileEncoding(f) {
    // Read up to the first 5 bytes of the file into a zero-filled buffer.
    var d = Buffer.alloc(5);
    var fd = fs.openSync(f, 'r');
    fs.readSync(fd, d, 0, 5, 0);
    fs.closeSync(fd);

    // https://en.wikipedia.org/wiki/Byte_order_mark
    var e = false;
    if (!e && d[0] === 0xEF && d[1] === 0xBB && d[2] === 0xBF)
        e = 'utf8';
    if (!e && d[0] === 0xFE && d[1] === 0xFF)
        e = 'utf16be';
    if (!e && d[0] === 0xFF && d[1] === 0xFE)
        e = 'utf16le';
    if (!e)
        e = 'ascii'; // no BOM found: this is a fallback, not a detection

    return e;
}

answered May 17, 2019 at 14:17


2 Comments

ruffin Over a year ago

Depending on use case & how sure I need to be -- BOM sniffing suggests not very -- I'd probably start with e = 'utf8', remove utf8 check, then run the rest of the ladder without the !e && preamble (adding some elses/ternaries). Duck typing by BOM is a very practical idea for, say, reading files! @Falaen's answer, when no BOM or obvious tipoff, sniffs the whole file looking for telltale signs, which is clever, but perhaps overkill.

ruffin Over a year ago

Yeah, since UTF-8 is essentially a superset of at least 7-bit ASCII, if you're just looking for a practical "how should I read this?", you don't lose any utility with return d[0] === 0xfe && d[1] === 0xff ? "utf16be" : d[0] === 0xff && d[1] === 0xfe ? "utf16le" : "utf8";, I don't think.
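
The simplification ruffin describes in these comments could be sketched as follows. The function names are hypothetical, and the returned labels are the ones used in the answer above rather than names guaranteed to be accepted by Node.js APIs.

```javascript
const fs = require('fs');

// Hypothetical helper: pick a read encoding from the first bytes of a file.
// Only the UTF-16 BOMs are checked; everything else is treated as UTF-8.
function encodingFromBom(firstBytes) {
  if (firstBytes[0] === 0xfe && firstBytes[1] === 0xff) return 'utf16be';
  if (firstBytes[0] === 0xff && firstBytes[1] === 0xfe) return 'utf16le';
  return 'utf8';
}

// Reads only the first two bytes of the file rather than the whole file.
function getReadEncoding(filePath) {
  const firstBytes = Buffer.alloc(2);
  const fd = fs.openSync(filePath, 'r');
  try {
    fs.readSync(fd, firstBytes, 0, 2, 0);
  } finally {
    // Close the descriptor even if the read throws.
    fs.closeSync(fd);
  }
  return encodingFromBom(firstBytes);
}
```

One caveat when using the result: Node's Buffer accepts 'utf8' and 'utf16le' as encoding names but has no 'utf16be', so a big-endian file needs separate handling, for example decoding with TextDecoder('utf-16be').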

João Pimentel Ferreira · Accepted Answer · 2022-01-09 22:56:33Z

5

What about jschardet? At the time of writing, it has hundreds of thousands of downloads per week, so it should be a reasonably safe choice.

var jschardet = require("jschardet")

// "àíàçã" in UTF-8
jschardet.detect("\xc3\xa0\xc3\xad\xc3\xa0\xc3\xa7\xc3\xa3")
// { encoding: "UTF-8", confidence: 0.9690625 }

// "次常用國字標準字體表" in Big5
jschardet.detect("\xa6\xb8\xb1\x60\xa5\xce\xb0\xea\xa6\x72\xbc\xd0\xb7\xc7\xa6\x72\xc5\xe9\xaa\xed")
// { encoding: "Big5", confidence: 0.99 }

// "Martin Kühl" wrapped in <string> tags, in windows-1252
jschardet.detectAll("\x3c\x73\x74\x72\x69\x6e\x67\x3e\x4d\x61\x72\x74\x69\x6e\x20\x4b\xfc\x68\x6c\x3c\x2f\x73\x74\x72\x69\x6e\x67\x3e")
// [
//   {encoding: "windows-1252", confidence: 0.95},
//   {encoding: "ISO-8859-2", confidence: 0.8796300205763055},
//   {encoding: "SHIFT_JIS", confidence: 0.01}
// ]

answered Jan 9, 2022 at 22:56


1 Comment

KLMN Over a year ago

for Angular user, here is a useful example stackblitz.com/edit/angular-detect-encoding

Donovan P · Accepted Answer · 2020-05-25 23:28:08Z

4

I used the encoding-japanese package, and it worked well.

Example :

var fs = require('fs');
var encoding = require('encoding-japanese');

// Read the raw bytes and let the library guess the encoding.
var fileBuffer = fs.readFileSync('file.txt');
console.log(encoding.detect(fileBuffer));

Available Encodings:

'UTF32' (detect only)
'UTF16'
'UTF16BE'
'UTF16LE'
'BINARY' (detect only)
'ASCII' (detect only)
'JIS'
'UTF8'
'EUCJP'
'SJIS'
'UNICODE' (JavaScript Unicode Array)

It can be used in both Node.js and browsers. It also has zero dependencies.

answered May 25, 2020 at 23:28


1 Comment

simon Over a year ago

I tested this with some simple test strings containing German umlaut characters, and it did not work. I would recommend jschardet, which seems to work much better.

Nicolas Bouvrette · Accepted Answer · 2021-12-24 19:33:07Z

To add to @Mark Longmire's answer, here is a more recent (TypeScript) version. Note that it only works if the file starts with a BOM, which is optional; this seems to be a limitation of many solutions on this topic:

import { closeSync, openSync, readSync } from 'fs';

/**
 * Get the encoding of a file from an optional BOM character.
 *
 * This only works if the file starts with a BOM, and BOMs are rarely used since they are optional.
 *
 * @see https://en.wikipedia.org/wiki/Byte_order_mark
 *
 * @param filePath - The path of a file on which to check encoding.
 *
 * @returns The file encoding if found, otherwise "unknown".
 */
function getFileEncoding(filePath: string): string {
  const byteOrderMark = Buffer.alloc(5, 0); // Generate an empty BOM.
  const fileDescriptor = openSync(filePath, 'r');
  readSync(fileDescriptor, byteOrderMark, 0, 5, 0);
  closeSync(fileDescriptor);

  let encoding: string | undefined; // undefined until a BOM matches

  if (
    !encoding &&
    byteOrderMark[0] === 0xef &&
    byteOrderMark[1] === 0xbb &&
    byteOrderMark[2] === 0xbf
  )
    encoding = 'utf8';
  if (!encoding && byteOrderMark[0] === 0xfe && byteOrderMark[1] === 0xff) encoding = 'utf16be';
  if (!encoding && byteOrderMark[0] === 0xff && byteOrderMark[1] === 0xfe) encoding = 'utf16le';
  if (!encoding) encoding = 'unknown';

  return encoding;
}
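
One limitation of the BOM checks above is worth spelling out: the UTF-32LE BOM (FF FE 00 00) begins with the same two bytes as the UTF-16LE BOM (FF FE), so a UTF-32LE file would be reported as utf16le. The sketch below, in plain JavaScript, shows one way to handle this by testing the longer signatures first. The helper name and the 'utf32be' and 'utf32le' labels are illustrative assumptions, not part of the original answers.

```javascript
// BOM signatures, longest first, so that UTF-32LE (FF FE 00 00) is tested
// before UTF-16LE (FF FE), which is a prefix of it.
const BOMS = [
  { encoding: 'utf32be', bytes: [0x00, 0x00, 0xfe, 0xff] },
  { encoding: 'utf32le', bytes: [0xff, 0xfe, 0x00, 0x00] },
  { encoding: 'utf8', bytes: [0xef, 0xbb, 0xbf] },
  { encoding: 'utf16be', bytes: [0xfe, 0xff] },
  { encoding: 'utf16le', bytes: [0xff, 0xfe] },
];

// Hypothetical helper: returns the encoding label for a leading BOM,
// or 'unknown' when the buffer does not start with one.
function encodingFromBomBytes(buffer) {
  for (const { encoding, bytes } of BOMS) {
    if (buffer.length >= bytes.length && bytes.every((b, i) => buffer[i] === b)) {
      return encoding;
    }
  }
  return 'unknown';
}
```

Pass only the bytes that were actually read, for example buffer.subarray(0, bytesRead) using the count returned by readSync. Otherwise the zero padding of a short buffer could make a two-byte UTF-16LE file look like UTF-32LE. A UTF-16LE file whose first character is U+0000 remains ambiguous with this approach.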
