I've been trying to write some Python to escape 'invalid' Markdown strings.
This is for use with a Python library (python-telegram-bot), which requires unused Markdown characters to be escaped with a \.
My aim is to match lone *, _ and ` characters, as well as invalid hyperlinks (e.g. where no link is provided), and escape them.
An example of what I'm looking for is:
*hello* is fine and should not be changed, whereas hello* would become hello\*. On top of that, nested values should not be escaped: e.g. _hello*_ should remain unchanged.
My thought was to match all the doubles first, and then replace any leftover lonely characters. I managed a rough version of this using re.finditer():
import re

def parser(txt):
    match_md = r'(\*)(.+?)(\*)|(\_)(.+?)(\_)|(`)(.+?)(`)|(\[.+?\])(\(.+?\))|(?P<astx>\*)|(?P<bctck>`)|(?P<undes>_)|(?P<sqbrkt>\[)'
    for e in re.finditer(match_md, txt):
        # Only the named groups (lone characters) get a backslash prefix.
        if e.group('astx') or e.group('bctck') or e.group('undes') or e.group('sqbrkt'):
            txt = txt[:e.start()] + '\\' + txt[e.start():]
    return txt
note: regex was written to match *text*, _text_, `text`, [text](url), and then single *, _, `, [, knowing the last groups
The issue here, of course, is that the offsets change as you insert more characters, so every later match position is shifted. Surely there's a better way to do this than adding an offset counter?
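One direction that seems to avoid the counter, sketched here only as an idea, is to apply the insertions from the end of the string backwards: the match offsets come from the original string, and inserting at a later position never moves an earlier one. The names `MATCH_MD`, `LONE_GROUPS` and `parser_reversed` are made up for this sketch; the pattern is the same one as in parser().

```python
import re

# Same pattern as in parser(): paired markers first, then named lone markers.
MATCH_MD = re.compile(
    r'(\*)(.+?)(\*)|(\_)(.+?)(\_)|(`)(.+?)(`)|(\[.+?\])(\(.+?\))'
    r'|(?P<astx>\*)|(?P<bctck>`)|(?P<undes>_)|(?P<sqbrkt>\[)'
)
LONE_GROUPS = ('astx', 'bctck', 'undes', 'sqbrkt')  # hypothetical helper name


def parser_reversed(txt):
    # Collect all matches against the original string, then walk them from
    # last to first so each insertion leaves earlier offsets untouched.
    for e in reversed(list(MATCH_MD.finditer(txt))):
        if any(e.group(name) for name in LONE_GROUPS):
            txt = txt[:e.start()] + '\\' + txt[e.start():]
    return txt
```

With this, a string containing several lone characters should get each backslash in the right place, since no insertion happens before a position that is still to be processed.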
I tried to use re.sub(), but I haven't been able to find how to replace a specific group, or had any luck with (?:) to 'not match' the valid markdown.
This was my re.sub attempt:
def test(txt):
    match_md = (
        r'(?:(\*)(.+?)(\*))|'
        r'(?:(\_)(.+?)(\_))|'
        r'(?:(`)(.+?)(`))|'
        r'(?:(\[.+?\])(\(.+?\)))|'
        r'(\*)|'
        r'(`)|'
        r'(_)|'
        r'(\[)'
    )
    return re.sub(match_md, r'\\\g<0>', txt)
This just prefixed every match with a backslash (which was expected, but I'd hoped the ?: would stop them being matched.)
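As far as I can tell, (?:...) only makes a group non-capturing; the text is still matched and still ends up in \g<0>. A possible way around this, sketched below, is to pass a function to re.sub() as the replacement, so the decision to escape can depend on which alternative matched. The names `MATCH_MD`, `lone` and `escape_lone` are invented for the sketch, and the pattern is a simplified version of the one above.

```python
import re

# Valid constructs come first so they consume their delimiters as a unit;
# the named group only matches a marker that no valid construct claimed.
MATCH_MD = re.compile(
    r'\*.+?\*|_.+?_|`.+?`|\[.+?\]\(.+?\)'
    r'|(?P<lone>[*_`\[])'
)


def escape_lone(txt):
    def repl(m):
        # Prefix a backslash only when the lone-marker alternative matched;
        # valid Markdown is returned exactly as it was found.
        if m.group('lone'):
            return '\\' + m.group(0)
        return m.group(0)

    return MATCH_MD.sub(repl, txt)
```

Because re.sub() builds the result itself, there are no offsets to track, and valid pairs such as *hello* or [text](url) pass through unchanged while a trailing hello* gets its backslash.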
A bonus would be if any \'s already in the string were escaped too, so that they don't interfere with the Markdown present. This could be a source of errors: the library would treat the following character as escaped, causing it to see the rest as invalid.
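For the backslash part, the simplest thing I can think of is to double every existing backslash before doing any Markdown escaping. This is only a sketch: `escape_backslashes` is a made-up name, and it assumes the target parser treats \\ as a literal backslash, which I haven't confirmed for the library.

```python
def escape_backslashes(txt):
    # Assumption: the downstream Markdown parser reads "\\" as one literal
    # backslash. Run this before escaping any lone Markdown characters so
    # the backslashes added later are not doubled as well.
    return txt.replace('\\', '\\\\')
```

Run first, this would turn \*hello* into \\*hello*, so the pair of asterisks is still seen as valid Markdown by a later escaping pass.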
Thanks in advance!