How to convert unicode characters to HTML numeric entities using plain Javascript

Question

I'm trying to convert innerHTML with special characters into their original &#...; entity values but can't seem to get it working for unicode values. Where am I going wrong?

The code takes "Orig", encodes it, and places the result into "Copy"...

Orig: 1:🙂__2:𝌆__3:ß__4:Ü__5:X__6:Y__7:팆__8:Z__9:⚠️__10:⚠️__11:⚠__12:🙂

Copy: 1:🙂�__2:𝌆�__3:ß__4:Ü__5:X__6:Y__7:팆__8:Z__9:⚠️__10:⚠️__11:⚠__12:🙂�

... but obviously the dreaded black diamonds are appearing!

if (!String.prototype.codePointAt) {
  String.prototype.codePointAt = function(pos) {
    pos = isNaN(pos) ? 0 : pos;
    var str = String(this),
      code = str.charCodeAt(pos),
      next = str.charCodeAt(pos + 1);
    // If a surrogate pair
    if (0xD800 <= code && code <= 0xDBFF && 0xDC00 <= next && next <= 0xDFFF) {
      return ((code - 0xD800) * 0x400) + (next - 0xDC00) + 0x10000;
    }
    return code;
  };
}

/**
 * Encodes special html characters
 * @param string
 * @return {*}
 */
function html_encode(s) {
  var ret_val = '';
  for (var i = 0; i < s.length; i++) {
    if (s.codePointAt(i) > 127) {
      ret_val += '&#' + s.codePointAt(i) + ';';
    } else {
      ret_val += s.charAt(i);
    }
  }
  return ret_val;

}

var v = html_encode(document.getElementById('orig').innerHTML);
document.getElementById('copy').innerHTML = v;
document.getElementById('values').value = v;
//console.log(v);

div {
    padding:10px;
    border:solid 1px grey;
}
textarea {
    width:calc(100% - 30px);
    height:50px;
    padding:10px;
}

Orig:<div id='orig'>1:🙂__2:𝌆__3:ß__4:Ü__5:X__6:Y__7:팆__8:Z__9:⚠️__10:&#9888;&#65039;__11:&#9888;__12:&#128578;</div>
Copy:<div id='copy'></div>
Values:<textarea id='values'></textarea>

(A jsfiddle is available at https://jsfiddle.net/Abeeee/k6e4svqa/24/)

I've been through the various suggestions on How to convert characters to HTML entities using plain JavaScript, including the he.js which looks the most favourable, but when I downloaded that script it doesn't compile (Unexpected Token around line 32: .. var encodeMap = <%= encodeMap %>;).

I'm not sure where to go with this.

But why would you need to do this? Just make sure your HTML file is saved as a UTF-8 document (which will almost certainly already be the case if you use any even mildly popular modern text/code editor), and make sure it contains <meta charset="utf-8"> so the browser renders it correctly.
– Mike 'Pomax' Kamermans, Commented Oct 15, 2021 at 16:43
developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/…
– Teemu, Commented Oct 15, 2021 at 16:45
Not sure how these comments help. Mike - if you try jsfiddle.net/Abeeee/k6e4svqa/28 and simply copy and paste a smiley face onto the end of the "Orig" field then the problem continues - with the meta tag in place. Teemu ... why are you showing this link?
– abe1432181, Commented Oct 15, 2021 at 16:59

u936293 · Accepted Answer · 2021-10-16 01:41:08Z

2

JavaScript strings are UTF-16. A character outside the Basic Multilingual Plane (a code point above U+FFFF) is stored as a surrogate pair, which takes up two 16-bit code units. The length property of a string counts 16-bit code units, not characters. Thus "🙂".length returns 2.

codePointAt(i) does not address the ith character but the ith 16-bit code unit. Hence, a character stored as a surrogate pair is visited by two consecutive codePointAt invocations. From the specs, if the code unit at position i is a high surrogate followed by a low surrogate, the function returns the full code point value, i.e. "🙂".codePointAt(0) returns 128578, but "🙂".codePointAt(1) returns only the low surrogate 56898, and the resulting entity renders as that black diamond.
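These numbers can be checked in a browser console or in Node. The snippet below is only a small illustration; the variable name smiley is made up for the example, and the emoji is written as an escape so the file encoding does not matter.

```javascript
// U+1F642 (the smiley from the question), stored as the surrogate pair D83D DE42
const smiley = "\u{1F642}";

console.log(smiley.length);          // 2      (two 16-bit code units)
console.log(smiley.codePointAt(0));  // 128578 (full code point, read from the high surrogate)
console.log(smiley.codePointAt(1));  // 56898  (the low surrogate on its own)
console.log(smiley.charCodeAt(0));   // 55357  (the high surrogate on its own)
console.log([...smiley].length);     // 1      (string iteration yields whole code points)
```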

Thus, in an index-based loop, you need to skip one position whenever codePointAt returns a code point above 0xFFFF, because that character occupies two code units.
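A minimal sketch of that skip, keeping the index-based loop from the question. The function name htmlEncodeByIndex is hypothetical, and the sketch assumes a native (or correctly polyfilled) codePointAt.

```javascript
// Index-based variant: encode every code point above 127 as a numeric entity
// and step over the low surrogate of any character stored as a surrogate pair.
function htmlEncodeByIndex(s) {
  let out = '';
  for (let i = 0; i < s.length; i++) {
    const code = s.codePointAt(i);
    if (code > 127) {
      out += '&#' + code + ';';
      // Code points above 0xFFFF occupy two code units, so skip the second one.
      if (code > 0xFFFF) i++;
    } else {
      out += s.charAt(i);
    }
  }
  return out;
}

console.log(htmlEncodeByIndex("1:\u{1F642}__3:\u00DF")); // 1:&#128578;__3:&#223;
```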

Following the example in the specs, instead of iterating through each 16-bit code unit in the string, use a construct that loops through each character. for (let char of aString) {} does just that, because string iteration yields whole code points.

function html_encode(s) {
  var ret_val = '';
  for (let char of s) {
    const code = char.codePointAt(0);
    if (code > 127) {
      ret_val += '&#' + code + ';';
    } else {
      ret_val += char;
    }
  }
  return ret_val;
}

let v = html_encode(document.getElementById('orig').innerHTML);
document.getElementById('copy').innerHTML = v;
document.getElementById('values').value = v;

div {
    padding:10px;
    border:solid 1px grey;
}
textarea {
    width:calc(100% - 30px);
    height:50px;
    padding:10px;
}

Orig:<div id='orig'>1:🙂__2:𝌆__3:ß__4:Ü__5:X__6:Y__7:팆__8:Z__9:⚠️__10:&#9888;&#65039;__11:&#9888;__12:&#128578;</div>
Copy:<div id='copy'></div>
Values:<textarea id='values'></textarea>
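As an alternative sketch (not part of the answer above), the same encoding can be written with a regular expression that uses the u flag, which makes the character class match whole code points rather than 16-bit code units. The function name htmlEncodeWithRegex is hypothetical, and the sketch assumes an ES2015-capable engine.

```javascript
// Replace every code point from U+0080 upward with its decimal numeric entity.
// With the u flag, a surrogate pair is matched as a single character.
function htmlEncodeWithRegex(s) {
  return s.replace(/[\u0080-\u{10FFFF}]/gu, function (ch) {
    return '&#' + ch.codePointAt(0) + ';';
  });
}

console.log(htmlEncodeWithRegex("2:\u{1D306}__9:\u26A0\uFE0F"));
// 2:&#119558;__9:&#9888;&#65039;
```

This variant does no DOM work, so it can be tried in Node as well as in the browser.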

edited Oct 16, 2021 at 1:41

answered Oct 15, 2021 at 17:15

u936293


2 Comments

abe1432181 Over a year ago

Thanks Old Geezer - the "if (code > 65535) i++;" did the trick.🙂

u936293 Over a year ago

@user1432181 I have modified the code to iterate through each character in the string instead of each 16-bit element. I am not sure if codePointAt handles endianness to ensure that the high surrogate always comes before the low surrogate. I think it does.
