
I'm trying to convert innerHTML with special characters into their original &#...; entity values but can't seem to get it working for unicode values. Where am I going wrong?

The code is trying to take "Orig" - encode it and place it into "Copy"....

Orig: 1:🙂__2:𝌆__3:ß__4:Ü__5:X__6:Y__7:팆__8:Z__9:⚠️__10:⚠️__11:⚠__12:🙂

Copy: 1:🙂�__2:𝌆�__3:ß__4:Ü__5:X__6:Y__7:팆__8:Z__9:⚠️__10:⚠️__11:⚠__12:🙂�

... but obviously the dreaded black diamonds are appearing!

if (!String.prototype.codePointAt) {
  String.prototype.codePointAt = function(pos) {
    pos = isNaN(pos) ? 0 : pos;
    var str = String(this),
      code = str.charCodeAt(pos),
      next = str.charCodeAt(pos + 1);
    // If a surrogate pair
    if (0xD800 <= code && code <= 0xDBFF && 0xDC00 <= next && next <= 0xDFFF) {
      return ((code - 0xD800) * 0x400) + (next - 0xDC00) + 0x10000;
    }
    return code;
  };
}

/**
 * Encodes special html characters
 * @param string
 * @return {*}
 */
function html_encode(s) {
  var ret_val = '';
  for (var i = 0; i < s.length; i++) {
    if (s.codePointAt(i) > 127) {
      ret_val += '&#' + s.codePointAt(i) + ';';
    } else {
      ret_val += s.charAt(i);
    }
  }
  return ret_val;

}

var v = html_encode(document.getElementById('orig').innerHTML);
document.getElementById('copy').innerHTML = v;
document.getElementById('values').value = v;
//console.log(v);
div {
    padding:10px;
    border:solid 1px grey;
}
textarea {
    width:calc(100% - 30px);
    height:50px;
    padding:10px;
}
Orig:<div id='orig'>1:🙂__2:𝌆__3:ß__4:Ü__5:X__6:Y__7:팆__8:Z__9:⚠️__10:&#9888;&#65039;__11:&#9888;__12:&#128578;</div>
Copy:<div id='copy'></div>
Values:<textarea id='values'></textarea>

(A jsfiddle is available at https://jsfiddle.net/Abeeee/k6e4svqa/24/)

I've been through the various suggestions on "How to convert characters to HTML entities using plain JavaScript", including he.js, which looks the most promising, but the script I downloaded doesn't compile (Unexpected Token around line 32: .. var encodeMap = <%= encodeMap %>;).

I'm not sure where to go with this.

  • But why would you need to do this? Just make sure your HTML file is saved as a UTF-8 document (which will almost certainly already be the case if you use any even mildly popular modern text/code editor), and make sure it contains <meta charset="utf-8"> so the browser renders it correctly. Commented Oct 15, 2021 at 16:43
  • developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/… Commented Oct 15, 2021 at 16:45
  • Not sure how these comments help. Mike - if you try jsfiddle.net/Abeeee/k6e4svqa/28 and simply copy and paste a smiley face onto the end of the "Orig" field then the problem continues - with the meta tag in place. Teemu ... why are you showing this link? Commented Oct 15, 2021 at 16:59

1 Answer


JavaScript strings are UTF-16. A character outside the Basic Multilingual Plane (above U+FFFF) is encoded as a surrogate pair and takes up two 16-bit code units. The length property of a string counts 16-bit code units, not characters. Thus "🙂".length will return 2.
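To make the difference concrete, here is a small check that can be pasted into a console. The sample string is an assumption chosen for illustration: one character inside the BMP (ß) followed by one outside it (🙂).

```javascript
// Assumed sample: "ß🙂", written with escapes so the file encoding cannot interfere.
const sample = "\u00DF\u{1F642}";

// length counts 16-bit code units: 1 for ß, 2 for the surrogate pair of 🙂.
const codeUnits = sample.length;

// Spreading uses the string iterator, which walks whole code points.
const codePoints = [...sample].length;

console.log(codeUnits);  // 3
console.log(codePoints); // 2
```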

codePointAt(i) does not address the ith character, but the ith 16-bit code unit. Hence, a character encoded as a surrogate pair spans two consecutive index positions. Per the spec, because position 0 of "🙂" holds the high surrogate, "🙂".codePointAt(0) returns the full code point value, i.e. 128578, but "🙂".codePointAt(1) returns only the low surrogate, 56898, which is rendered as that black diamond.
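The values quoted above can be verified directly. This is only a console sketch; the variable names are made up for the illustration.

```javascript
const smiley = "\u{1F642}"; // 🙂, stored as the surrogate pair 0xD83D 0xDE42

// Index 0 holds the high surrogate, so codePointAt combines the pair.
const atZero = smiley.codePointAt(0);

// Index 1 holds the low surrogate on its own, so only that unit comes back.
const atOne = smiley.codePointAt(1);

// charCodeAt never combines; it always returns the single 16-bit unit.
const highUnit = smiley.charCodeAt(0);

console.log(atZero);   // 128578
console.log(atOne);    // 56898
console.log(highUnit); // 55357
```

Writing `&#56898;` into innerHTML asks the browser for a lone surrogate, which is not a valid character reference, hence the replacement character in "Copy".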

Thus you need to skip one position whenever position i holds the high surrogate of a pair, that is, whenever codePointAt(i) returns a value above 0xFFFF.
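If you prefer to keep the index-based loop from the question, the skip can be sketched roughly as follows. The function name `htmlEncodeByIndex` is hypothetical, and the `code > 0xFFFF` test is the same check as the `if (code > 65535) i++;` mentioned in the comments.

```javascript
// Hypothetical variant of the question's html_encode that keeps the index loop.
function htmlEncodeByIndex(s) {
  let out = '';
  for (let i = 0; i < s.length; i++) {
    const code = s.codePointAt(i);
    if (code > 127) {
      out += '&#' + code + ';';
    } else {
      out += s.charAt(i);
    }
    // A code point above 0xFFFF occupied two code units,
    // so step over the low surrogate that follows.
    if (code > 0xFFFF) i++;
  }
  return out;
}

const encoded = htmlEncodeByIndex("1:\u{1F642}__3:\u00DF");
console.log(encoded); // 1:&#128578;__3:&#223;
```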

Following the example in the specs, instead of iterating through each 16-bit code unit in the string, use a construct that loops through each character (strictly, each code point). for (let char of aString) {} does just that.

function html_encode(s) {
  var ret_val = '';
  for (let char of s) {
    const code = char.codePointAt(0);
    if (code > 127) {
      ret_val += '&#' + code + ';';
    } else {
      ret_val += char;
    }
  }
  return ret_val;
}

let v = html_encode(document.getElementById('orig').innerHTML);
document.getElementById('copy').innerHTML = v;
document.getElementById('values').value = v;
div {
    padding:10px;
    border:solid 1px grey;
}
textarea {
    width:calc(100% - 30px);
    height:50px;
    padding:10px;
}
Orig:<div id='orig'>1:🙂__2:𝌆__3:ß__4:Ü__5:X__6:Y__7:팆__8:Z__9:⚠️__10:&#9888;&#65039;__11:&#9888;__12:&#128578;</div>
Copy:<div id='copy'></div>
Values:<textarea id='values'></textarea>


2 Comments

Thanks Old Geezer - the "if (code > 65535) i++;" did the trick.🙂
@user1432181 I have modified the code to iterate through each character in the string instead of each 16-bit element. I am not sure if codePointAt handles endianness to ensure that the high surrogate always comes before the low surrogate. I think it does.
