Telegram Bot: a set of characters break out of HTML escape

Question

I have a Telegram game bot that uses first name / last name pairs to render a top chart of users in a chat by score (screenshot omitted).

Each user's name in the chart links to that user. This is the actual code that generates the link:

import typing
from html import escape as html_escape

# Assumed: User is python-telegram-bot's telegram.User
from telegram import User

EscapeType = typing.Literal['html']


def escape_string(s: str, escape: EscapeType | None = None) -> str:
    if escape == 'html':
        s = html_escape(s)
    elif escape is None:
        pass
    else:
        raise NotImplementedError(escape)
    return s


def getter(d):
    if isinstance(d, User):
        return lambda attr: getattr(d, attr, None)
    elif hasattr(d, '__getitem__') and hasattr(d, 'get'):
        return lambda attr: d.get(attr, None)
    else:
        return lambda attr: getattr(d, attr, None)


def personal_appeal(user: User | dict, escape: EscapeType | None = 'html') -> str:
    get = getter(user)

    if full_name := get("full_name"):
        appeal = full_name
    elif name := get("name"):
        appeal = name
    elif first_name := get("first_name"):
        if last_name := get("last_name"):
            appeal = f"{first_name} {last_name}"
        else:
            appeal = first_name
    elif username := get('username'):
        appeal = username
    else:
        raise ValueError(user)

    return escape_string(appeal, escape)


def user_mention(id: int | User | None, name: str | None = None, escape: EscapeType | None = 'html') -> str:
    if isinstance(id, User):
        user = id
        id = user.id
        # Take the raw name here; it is escaped exactly once below.
        name = personal_appeal(user, escape=None)

    # Substitute the placeholder before escaping, since escaping None would fail.
    if name is None:
        name = "N/A"

    name = escape_string(name, escape=escape)

    if id is not None:
        return f'<a href="tg://user?id={id}">{name}</a>'
    else:
        return name

Basically, this code generates a link from a user name - user ID pair. As you can see, the name is HTML escaped by default.

There is, however, one user whose unusual first name somehow breaks this code. Here is the actual sequence of characters they use:

'$̴̢̛̙͈͚̎̓͆͑.̸̱̖͑͒ ̧̡͉̺̬͎̯.̸̧̢̠̺̮̬͙͛̓̀̐́.̵̦͑̉͌͌̎͘ ̞ ̷̡͈̤̓̀͋͗͊̈́̑̽͝'

Running the same code against this first name breaks the rendered message (screenshot omitted). Telegram seems to get lost in the markup: the link spills onto unrelated characters, and the <b> tag is broken too.

This is the actual string which is being sent to the telegram servers (except for the ids, those I redacted out):

🔝🏆 <u>Рейтинг игроков чата</u>:

🥇 1. <a href="tg://user?id=1">andy alexanderson</a> (<b>40</b>)
🥈 2. <a href="tg://user?id=2">$̴̢̛̙͈͚̎̓͆͑.̸̱̖͑͒ ̧̡͉̺̬͎̯.̸̧̢̠̺̮̬͙͛̓̀̐́.̵̦͑̉͌͌̎͘ ̞ ̷̡͈̤̓̀͋͗͊̈́̑̽͝</a> (<b>40</b>)
🤡 3. <a href="tg://user?id=3">: )</a> (<b>0</b>)

⏱️ <i>Рейтинг составлен 1 минуту назад</i>.
⏭️ <i>Следующее обновление через 28 минут</i>.

Seems like the only odd thing in this markup is the nickname, though.
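
One way to see what is odd about such a nickname is to dump its code points with the standard `unicodedata` module. This is only a diagnostic sketch: the helper name `describe_code_points` and the short sample string are made up for illustration, and the sample is not the real user's name.

```python
import unicodedata


def describe_code_points(text: str) -> list[tuple[str, str, int]]:
    """Return (code point, general category, combining class) per character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.combining(ch))
        for ch in text
    ]


# Hypothetical sample: '$' followed by two combining marks, then '.'
sample = "$\u0334\u0322."
for row in describe_code_points(sample):
    print(row)
```

Characters whose category is `Mn` (nonspacing mark) attach to the preceding base character, so a long run of them after a single `$` or `.` is what gives the name its stacked appearance.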

Is this a Telegram bug?

Can something be done to mitigate this, so that my users cannot break out of the HTML markup? I am willing to sacrifice the correct rendering of their names (such users obfuscate their names on purpose), but I need some way to detect names that would break the markup.

Or maybe there is some UTF-16 <-> UTF-8 encoding stuff going on that I'm missing out on?
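
On the encoding question: the Telegram Bot API documents entity offsets and lengths in UTF-16 code units, so it can be worth checking how a string measures in each encoding. The sketch below (the helper name `lengths` is hypothetical) shows that a combining mark in the Basic Multilingual Plane counts as one UTF-16 unit, while an emoji outside it counts as two. It does not by itself explain the broken markup.

```python
def lengths(text: str) -> dict[str, int]:
    """Measure a string in code points, UTF-16 code units, and UTF-8 bytes."""
    return {
        "code_points": len(text),
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        "utf8_bytes": len(text.encode("utf-8")),
    }


print(lengths("$\u0334"))     # '$' plus one combining mark (inside the BMP)
print(lengths("\U0001F3C6"))  # trophy emoji (outside the BMP)
```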

Framework used: python-telegram-bot. Python version: 3.10.12.

I'm surprised they're not called tony the pony. This is zalgo text (or maybe some other name, but that's one way to generate it). I'm afraid I can't help with the problem itself, though.
– roganjosh, Commented Dec 7, 2023 at 19:28
@roganjosh thanks, at least you pointed me toward stripping those characters out, now that I know what that thing is called, lol
– winwin, Commented Dec 7, 2023 at 20:23
@roganjosh, hey, I found a way to strip those characters out, if you're interested: I posted the answer.
– winwin, Commented Dec 7, 2023 at 20:49
@roganjosh you're right, I didn't think of it from this perspective. Changed it back.
– winwin, Commented Dec 7, 2023 at 21:04

winwin · Accepted Answer · 2023-12-07 20:49:24Z

1

As @roganjosh pointed out, this turns out to be a so-called "zalgo" sequence of characters. To remove the zalgo characters, I first found a decode function in an old JS library called lunicode.js. I found it by reverse-engineering a zalgo-text encoder/decoder website.

It turned out to be a very simple function, so here it is written in Python:

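def remove_zalgo(txt: str) -> str:
    # Drop code points 768..865 (U+0300..U+0361), i.e. most of the
    # Combining Diacritical Marks block (which ends at U+036F).
    return ''.join(
        char
        for char in txt
        if ord(char) < 768 or ord(char) > 865
    )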

Now my markup doesn't break, and there are no zalgo characters in my users' names. I think it's a win :)
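
A variation on the same idea, sketched under assumptions and not tested against Telegram: instead of a hard-coded numeric range, normalize to NFC first so that common accented letters become single precomposed characters, then drop whatever nonspacing or enclosing marks remain. The function name `strip_combining_marks` is made up for this sketch.

```python
import unicodedata


def strip_combining_marks(text: str) -> str:
    # NFC composes base + mark pairs such as 'e' + U+0301 into a single
    # precomposed character where one exists, so those survive the filter.
    composed = unicodedata.normalize("NFC", text)
    # Mn = nonspacing mark, Me = enclosing mark.
    return "".join(
        ch for ch in composed
        if unicodedata.category(ch) not in ("Mn", "Me")
    )


print(strip_combining_marks("e\u0301"))         # accented letter is kept
print(strip_combining_marks("$\u0334\u0322."))  # marks stacked on '$' are dropped
```

Two caveats: a heavily decorated letter may keep one composed accent after normalization, and scripts that rely on standalone marks (for example Hebrew or Arabic vowel points) would still lose them.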

4 Comments

roganjosh Over a year ago

Have you checked that this doesn't strip all diacritics, even the ones that are sensible?

winwin Over a year ago

@roganjosh unfortunately, it does strip all of them. It's an easy way out, and it seems sufficient for my app. I don't think distinguishing between sensible and nonsensical diacritics is a solvable task, though, if you work outside a language context.

roganjosh Over a year ago

I don't like it overall. For your implementation, I bet you could have a heuristic to judge in some way on the number of diacritics you will accept before you strip them all. roganjosh is just my handle but I take it quite personally when it's misused (people calling me Rogan etc.). I shouldn't, but I do anyway

winwin Over a year ago

@roganjosh you're absolutely right. I agree 100% that people's handles shouldn't be misused, and I should respect that in my app. However, so far my app's users are mostly Russian speakers, and the Russian language doesn't use diacritics at all, so I'll be using this fix until the bug is fixed in the Telegram Bot API (I reported it here). I wouldn't use this fix myself if I had a way to get rid of the root cause, which seems to be that Telegram servers fail to parse HTML tags when diacritics are involved.
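
One reading of the heuristic suggested in this thread, as a rough sketch: keep up to a fixed number of combining marks after each base character and drop only the excess. The name `limit_combining_marks` and the default threshold of 2 are assumptions for illustration, not something taken from the thread, and whether Telegram renders the result correctly is untested.

```python
import unicodedata


def limit_combining_marks(text: str, max_marks: int = 2) -> str:
    """Keep at most `max_marks` consecutive combining marks per base character."""
    result = []
    run = 0  # number of marks seen since the last base character
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Me"):
            run += 1
            if run > max_marks:
                continue  # excess mark: drop it
        else:
            run = 0
        result.append(ch)
    return "".join(result)
```

With the default threshold, ordinary accented names pass through unchanged, while a base character carrying a long stack of marks is cut down to its first two. Setting `max_marks=0` strips every such mark.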

secemp9 · Accepted Answer · 2023-12-07 21:50:02Z

1

You can use Unidecode:

from unidecode import unidecode
print(unidecode('$̴̢̛̙͈͚̎̓͆͑.̸̱̖͑͒ ̧̡͉̺̬͎̯.̸̧̢̠̺̮̬͙͛̓̀̐́.̵̦͑̉͌͌̎͘ ̞ ̷̡͈̤̓̀͋͗͊̈́̑̽͝<'))
# output:
# $. ..  <

And with a more meaningful input:

from unidecode import unidecode
print(unidecode('ᴮᴵᴳᴮᴵᴿᴰ'))
# output:
# BIGBIRD

3 Comments

winwin Over a year ago

It's a pretty good lib for transliteration, but it turns everything into Latin ASCII characters, afaik. So it causes too much collateral damage for users who use anything but Latin. But thanks for a good lib!

secemp9 Over a year ago

There are a lot more options, but I admit it's hard to find something that works. It works fine for names though, I think? By the way, if you want to look up more solutions on this topic, this is related to "unicode normalization". There are a lot more related keywords, such as "unicode vector attack", which are also related to some degree @winwin

winwin Over a year ago

Those aren't names, unfortunately. People mostly set whatever they can come up with as their names in Telegram. And even if they were, I wouldn't want to transliterate them, because my app is mostly Cyrillic.
