I'm making my way through the Building Git book, but attempting to build my implementation in JavaScript. I'm stumped at the part of reading data in this file format that apparently only Ruby uses. Here is the excerpt from the book about this:

Note that we set the string’s encoding to ASCII_8BIT, which is Ruby’s way of saying that the string represents arbitrary binary data rather than text per se. Although the blobs we’ll be storing will all be ASCII-compatible source code, Git does allow blobs to be any kind of file, and certainly other kinds of objects — especially trees — will contain non-textual data. Setting the encoding this way means we don’t get surprising errors when the string is concatenated with others; Ruby sees that it’s binary data and just concatenates the bytes and won’t try to perform any character conversions.

Is there a way to emulate this encoding in JS?

or

Is there an alternative encoding I can use that JS and Ruby share that won't break anything?

Additionally, I've tried using Buffer.from(< text input >, 'binary'), but it doesn't produce the same number of bytes that Ruby's ASCII-8BIT string has, because in Node.js 'binary' maps to ISO-8859-1.

1 Answer

Node certainly supports binary data, that's kind of what Buffer is for. However, it is crucial to know what you are converting into what. For example, the emoji "☺️" is encoded as six bytes in UTF-8:

// UTF-16 (JS) string to UTF-8 representation
Buffer.from('☺️', 'utf-8')
// => <Buffer e2 98 ba ef b8 8f>

If you happen to have a string that is not a native JS string (i.e. its encoding is different), you can use the encoding parameter to make Buffer interpret each character in a different manner (though only several different conversions are supported). For example, if we have a string of six characters that correspond to the six numbers above, it is not a smiley face for JavaScript, but Buffer.from can help us repackage it:

Buffer.from('\u00e2\u0098\u00ba\u00ef\u00b8\u008f', 'binary')
// => <Buffer e2 98 ba ef b8 8f>

JavaScript itself has only one encoding for its strings; thus, the parameter 'binary' is not really a binary encoding, but a mode of operation for Buffer.from, telling it that the string would have been a binary string if each character were one byte (however, since JavaScript strings are sequences of 16-bit code units, i.e. UCS-2/UTF-16, each character always occupies two bytes). Thus, if you use it on something that is not a string of characters in the range U+0000 to U+00FF, it will not do the correct thing, because there is no correct thing to do (GIGO principle). What it will actually do is take the lower byte of each character, which is probably not what you want:

Buffer.from('STUFF', 'binary')    // 8BIT range: U+0000 to U+00FF
// => <Buffer 53 54 55 46 46> ("STUFF")

Buffer.from('ＳＴＵＦＦ', 'binary')  // fullwidth: U+FF33 U+FF34 U+FF35 U+FF26 U+FF26
// => <Buffer 33 34 35 26 26> (garbage)
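
To make the "lower byte" behaviour concrete, here is a minimal sketch. The helper lowBytes is hypothetical (it is not part of Node) and only imitates what Buffer.from(str, 'binary') appears to do; the second half shows that the 'latin1' round trip is lossless as long as you start from bytes rather than from arbitrary text.

```javascript
// Hypothetical helper (not a Node API): keep only the low byte
// of each UTF-16 code unit, which is what 'binary'/'latin1' does.
function lowBytes(str) {
  const out = Buffer.alloc(str.length);
  for (let i = 0; i < str.length; i++) {
    out[i] = str.charCodeAt(i) & 0xff;
  }
  return out;
}

const narrow = 'STUFF';                          // all code units <= U+00FF
const wide = '\uff33\uff34\uff35\uff26\uff26';   // fullwidth ＳＴＵＦＦ

console.log(Buffer.from(narrow, 'binary'));      // <Buffer 53 54 55 46 46>
console.log(Buffer.from(wide, 'binary'));        // <Buffer 33 34 35 26 26>
console.log(lowBytes(wide).equals(Buffer.from(wide, 'binary'))); // true

// Going bytes -> 'latin1' string -> bytes loses nothing,
// because every byte maps to exactly one code unit in U+0000..U+00FF.
const bytes = Buffer.from([0x00, 0x80, 0xff]);
const roundTrip = Buffer.from(bytes.toString('latin1'), 'latin1');
console.log(roundTrip.equals(bytes));            // true
```

The round trip only works in that direction: a string that was decoded as UTF-8 (or typed as ordinary text) is not guaranteed to stay within U+0000..U+00FF, so converting it with 'binary' can silently drop information.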

So, Node's Buffer structure exactly corresponds to Ruby's ASCII-8BIT "encoding" (binary is an encoding like "bald" is a hair style — it simply means no interpretation is attached to bytes; e.g. in ASCII, 65 means "A"; but in binary "encoding", 65 is just 65). Buffer.from with 'binary' lets you convert weird strings where one character corresponds to one byte into a Buffer. It is not the normal way of handling binary data; its function is to un-mess-up binary data when it has been read incorrectly into a string.
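
Since the question is about a Git implementation, here is a rough sketch of how the "just concatenate the bytes" behaviour from the book excerpt might look with Buffers. It assumes the standard Git blob framing ("blob <size>\0" followed by the content); the function names serializeBlob and objectId are made up for illustration and are not taken from the book.

```javascript
const crypto = require('node:crypto');

// Hypothetical helper: frame raw content as a Git blob object.
// The header is plain ASCII; the content is left as untouched bytes.
function serializeBlob(content) {
  const header = Buffer.from(`blob ${content.length}\0`, 'ascii');
  return Buffer.concat([header, content]);
}

// Hypothetical helper: SHA-1 of the serialized object, as a hex string.
function objectId(serialized) {
  return crypto.createHash('sha1').update(serialized).digest('hex');
}

// Arbitrary, non-textual bytes survive concatenation unchanged.
const content = Buffer.from([0x00, 0xff, 0xe2, 0x98, 0xba]);
const blob = serializeBlob(content);

console.log(blob.length);                    // 12 (7 header bytes + 5 content bytes)
console.log(blob.subarray(7).equals(content)); // true
console.log(objectId(serializeBlob(Buffer.alloc(0))));
```

Buffer.concat plays the role that string concatenation plays in the Ruby version: no character conversion happens at any point, because no strings are involved after the header is built.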

I assume you are reading a file as a string, then trying to convert it to a Buffer, but your string is not actually in what Node considers to be the "binary" form (a sequence of characters in the range U+0000 to U+00FF). That is also why "in Node.js binary maps to ISO-8859-1" is not really true: ISO-8859-1 is a single-byte encoding, i.e. a sequence of bytes in the range 0x00 to 0xFF, whereas a JS string is always a sequence of 16-bit code units.

Ideally, to have a binary representation of file contents, you would want to read the file as a Buffer in the first place (by using fs.readFile without an encoding), without ever touching a string.
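
As an illustration of the difference, the following sketch writes three bytes to a scratch file and reads them back both ways. It uses fs.readFileSync rather than fs.readFile purely to keep the example short; the temporary directory and file name are arbitrary choices for the demo.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Scratch file containing the three UTF-8 bytes of U+263A.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'binary-demo-'));
const file = path.join(dir, 'sample.bin');
fs.writeFileSync(file, Buffer.from([0xe2, 0x98, 0xba]));

// No encoding argument: we get the bytes exactly as stored.
const raw = fs.readFileSync(file);

// With an encoding: the bytes are decoded into a one-character string...
const text = fs.readFileSync(file, 'utf8');

// ...and 'binary' then keeps only the low byte of that character.
const mangled = Buffer.from(text, 'binary');

console.log(raw.length, text.length, mangled.length); // 3 1 1
console.log(mangled);                                 // <Buffer 3a>

fs.rmSync(dir, { recursive: true });
```

The byte count mismatch here (3 versus 1) is the kind of discrepancy described in the question, and it disappears once the data never passes through a string.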

(If my guess here is incorrect, please specify what the contents of your < text input > is, and how you obtain it, and in which case "it doesn't result in the same amount of bytes".)

EDIT: I seem to like typing Array.from too much. It's Buffer.from, of course.

8 Comments

My issue is with "without ever touching a string". My code reads a file's contents, stores them as a string, then transforms that back into a buffer. I see I need to keep it as a buffer now. I'll give this a try and write back if I'm successful. Thank you!
Indeed. It makes much more sense to read as buffer (i.e. holding judgment on encoding), then transform to string if needed, than to read transforming into a string (possibly with incorrect encoding), then undo the damage if you wanted binary.
Also, my "binary maps to ISO-8859-1" comment comes straight from the Node.js docs: nodejs.org/api/… Unless I'm misunderstanding something there, it states that 'binary' is an alias for 'latin1'.
'binary' is an alias for 'latin1' and 'iso-8859-1' in the context of Buffer and "encoding". In general though (i.e. outside JavaScript), Latin1 and ISO-8859-1 are truly aliases of each other: an encoding that maps a single byte onto 256 specific characters used in Western Europe. They also happen to be equivalent to the first 256 codepoints of Unicode, though no encoding of Unicode represents them in the same way (UTF-8 will use two bytes for U+0080 ~ U+00FF, UTF-16 will use two bytes for all of them). Binary is, as I said, not an encoding, but a lack of encoding:
a sequence of bytes, not characters. A JS string like '\u00e2\u0098\u00ba\u00ef\u00b8\u008f' is still in UCS-2, not in Latin1, because no JavaScript string is ever in anything other than UCS-2 (which can be, with ES6 functions, treated as UTF-16). But in Node, Buffer.from(s, 'binary') will produce the same result as Buffer.from(s, 'latin1'), which is the sense in which the article calls them aliases, though they don't really handle Latin1 nor Binary. (EDIT: Oops, 'iso-8859-1' is not a valid "encoding" value for Buffer.from, ignore that.)