I'm using this PHP function to wrap emojis in arbitrary HTML tags so that I can style them on web pages. CSS3 does not (yet?) directly support styling multi-byte characters; at least, I haven't found a CSS selector for that purpose:

function wrap_emojis($s, $str_before, $str_after) {
    $default_encoding = mb_regex_encoding();
    mb_regex_encoding('UTF-8');
    $s = mb_ereg_replace('([^\x{0000}-\x{FFFF}])', $str_before . '\\1' . $str_after, $s);
    mb_regex_encoding($default_encoding);
    return $s;
}

The issue is that it works for lower range emojis such as 😎 (U+1F60E), but it does not work for higher range emojis such as ☀️ (U+2600 U+FE0F).

Any ideas on how to fix the PHP function so that it works with the 4-byte range as well?

For example, if I call wrap_emojis("zzz☀️zzz", "A", "B"), the expected result is "zzzA☀️Bzzz", but the actual result is "zzz☀️zzz". It does work with lower range emojis, as noted in the question: wrap_emojis("zzz😎zzz", "A", "B") returns "zzzA😎Bzzz".

  • Can you give an example call to wrap_emojis with the parameters and what to expect as result? Commented Aug 22, 2023 at 14:42
  • wrap_emojis("XYZ☀️XYZ", "A", "B"); Expected result: "XYZA☀️BXYZ". Actual result: "XYZ☀️XYZ". But it works with lower range emojis as noted in the question. Commented Aug 22, 2023 at 14:44
  • This seems to work: 3v4l.org/gR5o1 Note the little eye icon in right top corner of the output to see the processed HTML. Commented Aug 22, 2023 at 14:45
  • FWIW that's not a "high range" anything, that's a 2-codepoint sequence. It's U+2600, which is ☀, followed by U+FE0F, a "variation selector", to produce the glyph that's actually rendered. Commented Aug 22, 2023 at 19:21
  • As much as you might dislike @bobblebubble's linked regex, that's the reality of emoji. It is a vast mishmash of ranges scattered across Unicode, a portal to "Combining Mark Hell", and it only gets more complicated every time the Unicode Consortium updates the spec. Commented Aug 22, 2023 at 20:27
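
The point made in the comments can be checked directly by listing the code points of each emoji. This is a minimal sketch; the helper name `codepoints` is hypothetical and not part of the original question. It assumes the mbstring extension is loaded and PHP 7.4 or later.

```php
<?php
// Hypothetical helper: list the Unicode code points of a UTF-8 string.
function codepoints(string $s): array {
    // Split into individual code points; fall back to an empty list on invalid UTF-8.
    $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    return array_map(fn(string $c): string => sprintf('U+%04X', mb_ord($c, 'UTF-8')), $chars);
}

// The sun emoji is two code points, both at or below U+FFFF (3 UTF-8 bytes each).
echo implode(' ', codepoints("\u{2600}\u{FE0F}")), PHP_EOL; // U+2600 U+FE0F
echo strlen("\u{2600}\u{FE0F}"), PHP_EOL;                   // 6

// The sunglasses emoji is a single code point above U+FFFF (4 UTF-8 bytes).
echo implode(' ', codepoints("\u{1F60E}")), PHP_EOL;        // U+1F60E
echo strlen("\u{1F60E}"), PHP_EOL;                          // 4
```

So the class [^\x{0000}-\x{FFFF}] in the question matches 😎 but cannot match either code point of ☀️, because both lie inside the excluded range.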

1 Answer

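Alright, so it wasn't that hard. I just had to write a regex that matches either two consecutive code points in the U+0000 to U+FFFF range, the first of them at U+0100 or above (this covers a base character followed by a "variation selector"), OR, when that is not found, any single character outside that range. Pretty sure it will cause issues in foreign languages, because the first alternative also wraps any character at or above U+0100 together with whatever character follows it, but in English, it works great!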

$s = mb_ereg_replace('([\x{0100}-\x{FFFF}][\x{0000}-\x{FFFF}]|[^\x{0000}-\x{FFFF}])', $str_before . '\\1' . $str_after, $s);

Hope it enlightens other people on here. Cheers 🤣
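
For anyone worried about the non-English issue, here is a narrower sketch, not a complete solution. It swaps in preg_replace_callback with the /u modifier instead of mb_ereg_replace, and it assumes only two blocks count as emoji (U+2600 to U+27BF and U+1F300 to U+1FAFF), optionally followed by the U+FE0F variation selector. Real emoji coverage is much wider, as the comments point out, and the function name `wrap_emojis_sketch` is hypothetical.

```php
<?php
// Hypothetical variant of wrap_emojis() with an explicit, deliberately limited emoji range.
function wrap_emojis_sketch(string $s, string $str_before, string $str_after): string {
    // Assumption: these two ranges stand in for "emoji"; extend them as needed.
    $base = '[\x{2600}-\x{27BF}\x{1F300}-\x{1FAFF}]';
    // A base character, optionally followed by the variation selector U+FE0F.
    $pattern = '/' . $base . '\x{FE0F}?/u';

    // A callback avoids backreference syntax inside $str_before / $str_after being interpreted.
    $result = preg_replace_callback(
        $pattern,
        fn(array $m): string => $str_before . $m[0] . $str_after,
        $s
    );

    // preg_replace_callback() returns null on error (e.g. invalid UTF-8); return the input unchanged.
    return $result ?? $s;
}

echo wrap_emojis_sketch("zzz\u{2600}\u{FE0F}zzz", "A", "B"), PHP_EOL; // zzzA☀️Bzzz
echo wrap_emojis_sketch("zzz\u{1F60E}zzz", "A", "B"), PHP_EOL;        // zzzA😎Bzzz
```

Text such as "中文" is left untouched here, since neither character falls in the listed ranges. Sequences joined with zero-width joiners, skin-tone modifiers, and flags are not handled as single units by this sketch.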
