NodeJS UTF8 Encoding A Buffer Then Decoding That UTF8 String Produces A Buffer With Different Content

Question

I typed this into the Node.js console:

new Buffer(new Buffer([0xde]).toString('utf8'), 'utf8')

and it prints out

<Buffer ef bf bd>

After looking at the docs, I expected this to produce an identical buffer. I'm creating a UTF-8 string from a buffer whose contents are a single byte (0xde), then using that string to create a new buffer. Am I missing something here?

mscdex · Accepted Answer · 2015-02-11 18:59:00Z


For multi-byte encodings such as UTF-8, you cannot expect to get back the same bytes you started with in all cases. In UTF-8, some characters need more than one byte to be represented, and not every sequence of bytes is a valid character.
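To make the multi-byte point concrete, here is a small sketch that counts how many bytes a few characters take in UTF-8. The sample characters are my own choice for illustration, and it uses Buffer.from, since new Buffer(...) is deprecated in current Node.js releases.

```javascript
// Sample characters chosen for illustration: A, e-acute, euro sign, an emoji.
const samples = ['A', '\u00e9', '\u20ac', '\u{1F600}'];

// Encode each character as UTF-8 and record how many bytes it occupies.
const utf8Lengths = samples.map((ch) => Buffer.from(ch, 'utf8').length);

console.log(utf8Lengths); // [ 1, 2, 3, 4 ]
```

Only code points up to 0x7f fit in a single byte; everything above that needs two to four bytes.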

In your example, 0xde is greater than 0x7f, the largest value that UTF-8 encodes as a single byte. In UTF-8, 0xde is the lead byte of a two-byte sequence, so it has to be followed by a continuation byte. When you call .toString('utf8'), node sees that it only has that one byte, cannot decode it, and instead returns the replacement character \uFFFD (0xef, 0xbf, 0xbd in UTF-8), which is used to denote an unknown or undecodable character. Reading this replacement character back into a new Buffer is no problem, because it is a valid UTF-8 character, and that is why you get <Buffer ef bf bd>.
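A minimal sketch of that round trip, written with Buffer.from rather than the deprecated new Buffer(...). The second half uses 0xde 0x80 as an example of a complete two-byte sequence; that byte pair is my own pick for illustration and is not from the question.

```javascript
// A lone 0xde is an incomplete two-byte sequence, so decoding substitutes U+FFFD.
const original = Buffer.from([0xde]);
const decoded = original.toString('utf8');          // '\uFFFD'
const reencoded = Buffer.from(decoded, 'utf8');     // <Buffer ef bf bd>
const roundTripOk = original.equals(reencoded);     // false

// With a continuation byte the sequence is valid (0xde 0x80 is U+0780),
// and the bytes survive the round trip.
const valid = Buffer.from([0xde, 0x80]);
const validText = valid.toString('utf8');
const validOk = Buffer.from(validText, 'utf8').equals(valid); // true

console.log(reencoded, roundTripOk, validOk);
```

The original byte is lost at the toString step, so nothing done to the string afterwards can recover it.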

edited Feb 11, 2015 at 18:59

answered Feb 11, 2015 at 18:50

1 Comment

jtrfe Over a year ago

Thanks for the answer. For my purposes it sounds like I need to use another type of string encoding option like hex or base64.
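For reference, a short sketch of what that could look like. Unlike 'utf8', the 'hex' and 'base64' encodings map every byte sequence to a string and back without loss. The sample bytes here are arbitrary and only for illustration.

```javascript
// Arbitrary bytes, including values that are not valid UTF-8 on their own.
const bytes = Buffer.from([0xde, 0x00, 0xff]);

// Encode the raw bytes as text.
const asHex = bytes.toString('hex');       // 'de00ff'
const asBase64 = bytes.toString('base64'); // '3gD/'

// Decode the text back into bytes and compare with the original.
const fromHex = Buffer.from(asHex, 'hex');
const fromBase64 = Buffer.from(asBase64, 'base64');

console.log(fromHex.equals(bytes), fromBase64.equals(bytes)); // true true
```

Hex doubles the size of the data, while base64 grows it by about a third, so base64 is usually the more compact choice for transport.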
