How to detect which encoding a file uses?
I want something like this:
fs.getFileEncoding('C:/path/to/file.txt') // it returns 'UTF-8', 'CP-1252', ...
Is there a simple way to do it using a native Node.js function?
I don't think there is a "native Node.js function" that can do this.
The simplest solution I know is using an npm module like detect-file-encoding-and-language. As long as the input file is not too small it should work fine.
// Install plugin using npm
$ npm install detect-file-encoding-and-language
// Sample code
const languageEncoding = require("detect-file-encoding-and-language");
const pathToFile = "/home/username/documents/my-text-file.txt"
languageEncoding(pathToFile).then(fileInfo => console.log(fileInfo));
// Possible result: { language: japanese, encoding: Shift-JIS, confidence: { language: 0.97, encoding: 0.97 } }
You can use an npm module that does exactly this: https://www.npmjs.com/package/detect-character-encoding
You can use it like this:
const fs = require('fs');
const detectCharacterEncoding = require('detect-character-encoding');
const fileBuffer = fs.readFileSync('file.txt');
const charsetMatch = detectCharacterEncoding(fileBuffer);
console.log(charsetMatch);
// {
// encoding: 'UTF-8',
// confidence: 60
// }
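Since the result is a guess with a confidence score, it may be worth guarding against weak matches before trusting it. Below is a minimal sketch, not part of detect-character-encoding: the helper name `encodingOrFallback`, the threshold of 50 and the 'UTF-8' fallback are all assumptions chosen for illustration, and the input is assumed to have the `{ encoding, confidence }` shape shown in the sample output above.

```javascript
// Hypothetical helper (not part of any library).
// `match` is assumed to look like { encoding: 'UTF-8', confidence: 60 }.
// The default threshold and fallback are arbitrary choices for this sketch.
function encodingOrFallback(match, minConfidence = 50, fallback = 'UTF-8') {
  // Treat a missing or malformed result as "no usable guess".
  if (!match || typeof match.confidence !== 'number') {
    return fallback;
  }
  // Only trust the detected encoding when the score is high enough.
  return match.confidence >= minConfidence ? match.encoding : fallback;
}
```

It could be called as `encodingOrFallback(detectCharacterEncoding(fileBuffer))`. Note that other detectors may report confidence on a different scale (for example 0 to 1), so the threshold would need adjusting.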
This is what I've been using, for a while now. YMMV. Hope it helps.
var fs = require('fs');
...
function getFileEncoding( f ) {
var d = Buffer.alloc(5); // Buffer.alloc zero-fills by default; no "new" needed
var fd = fs.openSync(f, 'r');
fs.readSync(fd, d, 0, 5, 0);
fs.closeSync(fd);
// https://en.wikipedia.org/wiki/Byte_order_mark
var e = false;
if ( !e && d[0] === 0xEF && d[1] === 0xBB && d[2] === 0xBF)
e = 'utf8';
if (!e && d[0] === 0xFE && d[1] === 0xFF)
e = 'utf16be';
if (!e && d[0] === 0xFF && d[1] === 0xFE)
e = 'utf16le';
if (!e)
e = 'ascii';
return e;
}
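One caveat with the BOM ladder above is that a file without a BOM is not necessarily ASCII. As a rough sketch of how the fallback could be made less optimistic using only what ships with Node.js, the hypothetical helper below (the name `sniffEncoding` is an assumption, not a Node API) works on a Buffer, checks for a BOM first, then tests whether the bytes are pure 7-bit ASCII or at least valid UTF-8 by decoding strictly with `TextDecoder`. It cannot tell legacy single-byte encodings apart, so those end up as 'unknown'.

```javascript
// Hypothetical helper, not part of Node's API: a sketch, not a full detector.
function sniffEncoding(buf) {
  // 1. Byte order marks (UTF-32 is ignored here; its LE BOM FF FE 00 00
  //    would be reported as utf16le by this sketch).
  if (buf.length >= 3 && buf[0] === 0xef && buf[1] === 0xbb && buf[2] === 0xbf) {
    return 'utf8';
  }
  if (buf.length >= 2 && buf[0] === 0xfe && buf[1] === 0xff) return 'utf16be';
  if (buf.length >= 2 && buf[0] === 0xff && buf[1] === 0xfe) return 'utf16le';

  // 2. No BOM: only 7-bit bytes means the content is plain ASCII.
  if (buf.every((byte) => byte < 0x80)) return 'ascii';

  // 3. Otherwise see whether the bytes form valid UTF-8.
  //    With { fatal: true }, decode() throws on malformed input.
  try {
    new TextDecoder('utf-8', { fatal: true }).decode(buf);
    return 'utf8';
  } catch (err) {
    return 'unknown';
  }
}
```

For a file, this could be used as `sniffEncoding(fs.readFileSync('file.txt'))`. Unlike the five-byte read above, it inspects the whole buffer, which is the price of the extra checks.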
Set e = 'utf8' up front, remove the utf8 check, then run the rest of the ladder without the !e && preamble (adding some elses/ternaries). Duck typing by BOM is a very practical idea for, say, reading files! @Falaen's answer, when there is no BOM or obvious tipoff, sniffs the whole file looking for telltale signs, which is clever, but perhaps overkill.
You need nothing more than return d[0] === 0xfe && d[1] === 0xff ? "utf16be" : d[0] === 0xff && d[1] === 0xfe ? "utf16le" : "utf8";, I don't think.
What about jschardet? At the time of writing it has hundreds of thousands of downloads per week, so it should work.
var jschardet = require("jschardet")
// "àíàçã" in UTF-8
jschardet.detect("\xc3\xa0\xc3\xad\xc3\xa0\xc3\xa7\xc3\xa3")
// { encoding: "UTF-8", confidence: 0.9690625 }
// "次常用國字標準字體表" in Big5
jschardet.detect("\xa6\xb8\xb1\x60\xa5\xce\xb0\xea\xa6\x72\xbc\xd0\xb7\xc7\xa6\x72\xc5\xe9\xaa\xed")
// { encoding: "Big5", confidence: 0.99 }
// "<string>Martin Kühl</string>" in windows-1252
jschardet.detectAll("\x3c\x73\x74\x72\x69\x6e\x67\x3e\x4d\x61\x72\x74\x69\x6e\x20\x4b\xfc\x68\x6c\x3c\x2f\x73\x74\x72\x69\x6e\x67\x3e")
// [
// {encoding: "windows-1252", confidence: 0.95},
// {encoding: "ISO-8859-2", confidence: 0.8796300205763055},
// {encoding: "SHIFT_JIS", confidence: 0.01}
// ]
I used the encoding-japanese package, and it worked well.
Example:
var fs = require('fs');
var encoding = require('encoding-japanese');
var fileBuffer = fs.readFileSync('file.txt');
console.log(encoding.detect(fileBuffer))
Available Encodings:
It can be used in both Node.js and browsers. Oh... and it has zero dependencies.
To add to @Mark Longmire's answer, here is a more recent (TypeScript) version. Note that it only works if the file starts with the optional BOM, which seems to be a limitation of a lot of solutions on this topic:
import { closeSync, openSync, readSync } from 'fs';
/**
* Get the encoding of a file from an optional BOM character.
*
* This only works if the file starts with a BOM, which is optional and rarely used.
*
* @see https://en.wikipedia.org/wiki/Byte_order_mark
*
* @param filePath - The path of a file on which to check encoding.
*
* @returns The file encoding if found, otherwise "unknown".
*/
function getFileEncoding(filePath: string): string {
const byteOrderMark = Buffer.alloc(5, 0); // Zero-filled buffer for the first bytes of the file.
const fileDescriptor = openSync(filePath, 'r');
readSync(fileDescriptor, byteOrderMark, 0, 5, 0);
closeSync(fileDescriptor);
let encoding: string | undefined; // Undefined until a BOM matches (avoids "used before being assigned").
if (
!encoding &&
byteOrderMark[0] === 0xef &&
byteOrderMark[1] === 0xbb &&
byteOrderMark[2] === 0xbf
)
encoding = 'utf8';
if (!encoding && byteOrderMark[0] === 0xfe && byteOrderMark[1] === 0xff) encoding = 'utf16be';
if (!encoding && byteOrderMark[0] === 0xff && byteOrderMark[1] === 0xfe) encoding = 'utf16le';
if (!encoding) encoding = 'unknown';
return encoding;
}
fs is a native module of node
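Once a BOM check like the ones above has named an encoding, the file contents still need to be decoded with the BOM skipped. The following is a minimal sketch of that step using only Node's Buffer; the helper name `decodeByBom` is hypothetical, and it assumes the encoding names used in the answers above ('utf8', 'utf16le', 'utf16be'). Node's `Buffer` has no big-endian UTF-16 encoding, so that case swaps byte pairs on a copy and decodes as 'utf16le'.

```javascript
// Hypothetical helper: decode a Buffer given the encoding name returned
// by a BOM check. A sketch under the assumptions stated above.
function decodeByBom(buf, encoding) {
  if (encoding === 'utf8') {
    const hasBom = buf[0] === 0xef && buf[1] === 0xbb && buf[2] === 0xbf;
    return buf.subarray(hasBom ? 3 : 0).toString('utf8');
  }
  if (encoding === 'utf16le') {
    const hasBom = buf[0] === 0xff && buf[1] === 0xfe;
    return buf.subarray(hasBom ? 2 : 0).toString('utf16le');
  }
  if (encoding === 'utf16be') {
    const hasBom = buf[0] === 0xfe && buf[1] === 0xff;
    // Copy before swapping so the caller's buffer is left untouched.
    const body = Buffer.from(buf.subarray(hasBom ? 2 : 0));
    body.swap16(); // throws if the byte length is odd
    return body.toString('utf16le');
  }
  throw new Error('Unsupported encoding: ' + encoding);
}
```

With the function from the answer above, usage might look like `decodeByBom(fs.readFileSync(path), getFileEncoding(path))`. An 'unknown' or 'ascii' result would need its own handling, since this sketch throws for anything outside the three BOM-based names.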