I have a Telegram game bot that uses first name / last name pairs to render a leaderboard of the users in a chat, ranked by score. Example screenshot below:

[screenshot: normal markup]

Every user name is a link to that user's profile. This is the code that generates the link:

import typing
from html import escape as html_escape

from telegram import User  # python-telegram-bot

EscapeType = typing.Literal['html']


def escape_string(s: str, escape: EscapeType | None = None) -> str:
    if escape == 'html':
        s = html_escape(s)
    elif escape is None:
        pass
    else:
        raise NotImplementedError(escape)
    return s


def getter(d):
    if isinstance(d, User):
        return lambda attr: getattr(d, attr, None)
    elif hasattr(d, '__getitem__') and hasattr(d, 'get'):
        return lambda attr: d.get(attr, None)
    else:
        return lambda attr: getattr(d, attr, None)


def personal_appeal(user: User | dict, escape: EscapeType | None = 'html') -> str:
    get = getter(user)

    if full_name := get("full_name"):
        appeal = full_name
    elif name := get("name"):
        appeal = name
    elif first_name := get("first_name"):
        if last_name := get("last_name"):
            appeal = f"{first_name} {last_name}"
        else:
            appeal = first_name
    elif username := get('username'):
        appeal = username
    else:
        raise ValueError(user)

    return escape_string(appeal, escape)


def user_mention(id: int | User | None, name: str | None = None, escape: EscapeType | None = 'html') -> str:
    if isinstance(id, User):
        user = id
        id = user.id
        # Take the raw name here; it is escaped exactly once below.
        name = personal_appeal(user, escape=None)

    if name is None:
        name = "N/A"

    # Escape after the None check: html_escape(None) would raise.
    name = escape_string(name, escape=escape)

    if id is not None:
        return f'<a href="tg://user?id={id}">{name}</a>'
    else:
        return name

In short, this code builds a link from a (user name, user ID) pair. As you can see, the name is HTML-escaped by default.
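As a quick sanity check on the escaping step, here is a minimal sketch showing that `html.escape` only rewrites `&`, `<`, `>` and quotes, so combining characters pass through untouched. The sample name is made up for illustration and is not the real user's name.

```python
from html import escape

# "e" followed by U+0301 (combining acute accent), then markup-like text.
name = "e\u0301 <b>"
escaped = escape(name)

# The angle brackets are escaped; the combining mark is left as-is.
print(escaped == "e\u0301 &lt;b&gt;")
```

So HTML escaping on its own does nothing to a name made of combining marks.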

However, one user's unusual first name somehow breaks this code. Here is the actual sequence of characters they use:

'$̴̢̛̙͈͚̎̓͆͑.̸̱̖͑͒ ̧̡͉̺̬͎̯.̸̧̢̠̺̮̬͙͛̓̀̐́.̵̦͑̉͌͌̎͘ ̞ ̷̡͈̤̓̀͋͗͊̈́̑̽͝'

Screenshot of the result of the same code run against this first name:

[screenshot: bad markup]

As you can see, Telegram seems to lose track of the markup: the link spills over onto unrelated characters, and the <b> tag is broken too.
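To see what such a name is made of, one option is to count how many of its characters are combining marks. The sketch below uses the standard `unicodedata` module; the helper name `count_marks` and the shortened sample string are assumptions for illustration, not the real name.

```python
import unicodedata


def count_marks(text: str) -> tuple[int, int]:
    """Return (base_characters, combining_marks) for text."""
    # Mn = nonspacing mark, Me = enclosing mark.
    marks = sum(1 for ch in text if unicodedata.category(ch) in ("Mn", "Me"))
    return len(text) - marks, marks


# Hypothetical stand-in: "$" with three combining marks, then ".".
sample = "$\u0334\u0322\u031b."
base, marks = count_marks(sample)
print(base, marks)
```

A name with far more marks than base characters is a strong hint that it was obfuscated on purpose.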

This is the actual string sent to the Telegram servers (except for the IDs, which I redacted):

🔝🏆 <u>Рейтинг игроков чата</u>:

🥇 1. <a href="tg://user?id=1">andy alexanderson</a> (<b>40</b>)
🥈 2. <a href="tg://user?id=2">$̴̢̛̙͈͚̎̓͆͑.̸̱̖͑͒ ̧̡͉̺̬͎̯.̸̧̢̠̺̮̬͙͛̓̀̐́.̵̦͑̉͌͌̎͘ ̞ ̷̡͈̤̓̀͋͗͊̈́̑̽͝</a> (<b>40</b>)
🤡 3. <a href="tg://user?id=3">: )</a> (<b>0</b>)

⏱️ <i>Рейтинг составлен 1 минуту назад</i>.
⏭️ <i>Следующее обновление через 28 минут</i>.

Seems like the only odd thing in this markup is the nickname, though.

Is this a Telegram bug?

Can something be done to mitigate this, so that my users can't break out of the HTML markup? I am willing to sacrifice the correct rendering of their names (such users obfuscate their names deliberately), but I need some way to detect input that would break the markup.

Or maybe there is some UTF-16 <-> UTF-8 encoding issue going on that I'm missing?
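One fact worth checking on the encoding side: the Telegram Bot API measures message entity offsets and lengths in UTF-16 code units, not in code points or UTF-8 bytes. The sketch below compares the three counts for a string. The helper name `lengths` is made up for illustration, and this does not by itself show that encoding is the cause here.

```python
def lengths(text: str) -> dict[str, int]:
    """Measure a string in code points, UTF-16 code units and UTF-8 bytes."""
    return {
        "code_points": len(text),
        # Each UTF-16 code unit is two bytes; "-le" avoids the BOM.
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        "utf8_bytes": len(text.encode("utf-8")),
    }


# A combining mark from U+0300..U+036F is one UTF-16 unit but two UTF-8 bytes.
print(lengths("$\u0334"))
# A character outside the Basic Multilingual Plane takes two UTF-16 units.
print(lengths("\U0001F947"))
```

The combining marks in the name above appear to sit in the Basic Multilingual Plane, so each counts as a single UTF-16 unit; emoji such as 🥇 are the characters that take two.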

Framework used: python-telegram-bot. Python version: 3.10.12.

  • I'm surprised they're not called tony the pony. This is zalgo text (or maybe some other name, but that's one way to generate it). I'm afraid I can't help with this problem itself, though. Commented Dec 7, 2023 at 19:28
  • @roganjosh thanks, at least you pointed me in the direction of stripping those characters out, now that I know what that thing is called, lol Commented Dec 7, 2023 at 20:23
  • @roganjosh, hey, I found a way to strip those characters out, if you're interested: I posted the answer. Commented Dec 7, 2023 at 20:49
  • @roganjosh you're right, didn't think from this perspective. Changed it back. Commented Dec 7, 2023 at 21:04

2 Answers

As @roganjosh pointed out, this turns out to be a so-called "zalgo" sequence of characters. To remove the zalgo characters, I used the decode function from an old JS library called lunicode.js, which I found by reverse-engineering a zalgo-text encoder-decoder website.

It turned out to be a very simple function, so here it is written in Python:

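def remove_zalgo(txt: str) -> str:
    # Drop code points 768..865 (U+0300..U+0361), i.e. most of the
    # "Combining Diacritical Marks" block that zalgo text is built from.
    return ''.join(
        char
        for char in txt
        if ord(char) < 768 or ord(char) > 865
    )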

Now my markup doesn't break, and there are no zalgo characters in my users' names. I think it's a win :)
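A gentler variant, along the lines of the heuristic suggested in the comments, would be to keep a few combining marks per base character and drop only the excess. This is a rough sketch and not something I have verified against Telegram: the function name `limit_marks` and the default threshold of 2 are arbitrary assumptions.

```python
import unicodedata


def limit_marks(text: str, max_marks: int = 2) -> str:
    """Keep at most max_marks combining marks after each base character."""
    out = []
    run = 0  # combining marks seen since the last base character
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Me"):
            run += 1
            if run > max_marks:
                continue  # drop the excess mark
        else:
            run = 0
        out.append(ch)
    return "".join(out)


# An ordinary accented letter survives; a stack of four marks is trimmed to two.
print(limit_marks("e\u0301") == "e\u0301")
print(limit_marks("$\u0334\u0322\u031b\u0319.") == "$\u0334\u0322.")
```

With `max_marks=0` this strips every nonspacing or enclosing mark, which is broader than the fixed 768..865 range above.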


4 Comments

Have you checked that this doesn't strip all diacritics, even the ones that are sensible?
@roganjosh unfortunately, it does strip all of them. It's an easy way out, and it seems sufficient for my app. I don't think distinguishing between sensible and nonsensical diacritics is a solvable task, though, if you work outside language context.
I don't like it overall. For your implementation, I bet you could have a heuristic to judge in some way on the number of diacritics you will accept before you strip them all. roganjosh is just my handle but I take it quite personally when it's misused (people calling me Rogan etc.). I shouldn't, but I do anyway
@roganjosh you're absolutely right, and I agree 100% that people's handles shouldn't be misused, and that I should respect that in my app. However, so far my app's users are mostly Russian speakers, and the Russian language doesn't use diacritics at all, so I'll be using this fix until the bug is fixed in the Telegram Bot API (I reported it here). I wouldn't use this fix myself if I had a way to get rid of the root cause, which seems to be that Telegram servers fail to parse HTML tags when diacritics are involved.

You can use Unidecode:

from unidecode import unidecode
print(unidecode('$̴̢̛̙͈͚̎̓͆͑.̸̱̖͑͒ ̧̡͉̺̬͎̯.̸̧̢̠̺̮̬͙͛̓̀̐́.̵̦͑̉͌͌̎͘ ̞ ̷̡͈̤̓̀͋͗͊̈́̑̽͝<'))
# output:
# $. ..  <

And with a more meaningful input:

from unidecode import unidecode
print(unidecode('ᴮᴵᴳᴮᴵᴿᴰ'))
# output:
# BIGBIRD
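If transliteration to ASCII is too aggressive, a standard-library alternative is to compose what can be composed with NFC normalization and then drop only the marks left over. This is a sketch under that assumption, not tested against Telegram, and `strip_loose_marks` is a made-up name.

```python
import unicodedata


def strip_loose_marks(text: str) -> str:
    """Compose with NFC, then drop the combining marks that remain."""
    composed = unicodedata.normalize("NFC", text)
    return "".join(
        ch for ch in composed
        if unicodedata.category(ch) not in ("Mn", "Me")
    )


# Cyrillic text is untouched, and a decomposed "й" is composed, not stripped.
print(strip_loose_marks("Привет") == "Привет")
print(strip_loose_marks("\u0438\u0306") == "\u0439")
# A mark with no precomposed form is removed.
print(strip_loose_marks("$\u0334.") == "$.")
```

Unlike transliteration, this leaves non-Latin scripts in place and only removes marks that have no precomposed form.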

3 Comments

It's a pretty good lib for transliteration, but it turns everything into Latin ASCII characters, afaik. So it's too much collateral damage for users who use anything but Latin. But thanks for a good lib!
There are a lot more options, but I admit it's hard to find something that works. It works fine for names though, I think? By the way, if you want to look up more solutions on this topic, this is related to "unicode normalization". There are a lot more related keywords, such as "unicode vector attack", etc., which are also related to some degree @winwin
Those aren't names, unfortunately. People mostly set as their names whatever they can come up with in Telegram. And even if they were, I wouldn't want to transliterate them, because my app is mostly Cyrillic.
