Escaping invalid markdown using python regex

Question

I've been trying to write some python to escape 'invalid' markdown strings.

This is for use with a python library (python-telegram-bot) which requires unused markdown characters to be escaped with a \.

My aim is to match lone *, _ and ` characters, as well as invalid hyperlinks (e.g. when no link is provided), and escape them.

An example of what I'm looking for is:

*hello* is fine and should not be changed, whereas hello* would become hello\*. On top of that, nested values should not be escaped: e.g. _hello*_ should remain unchanged.

My thought was to match all the doubles first, and then replace any leftover lonely characters. I managed a rough version of this using re.finditer():

import re

def parser(txt):
  match_md = r'(\*)(.+?)(\*)|(\_)(.+?)(\_)|(`)(.+?)(`)|(\[.+?\])(\(.+?\))|(?P<astx>\*)|(?P<bctck>`)|(?P<undes>_)|(?P<sqbrkt>\[)'
  for e in re.finditer(match_md, txt):
    if e.group('astx') or e.group('bctck') or e.group('undes') or e.group('sqbrkt'):
      # e.start() is an offset into the original string, but txt grows with every insert
      txt = txt[:e.start()] + '\\' + txt[e.start():]
  return txt

note: regex was written to match *text*, _text_, `text`, [text](url), and then single *, _, `, [, knowing the last groups

But the issue here is, of course, that the offsets change as you insert more characters, so everything shifts. Surely there's a better way to do this than adding an offset counter?
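One possible way around the shifting offsets, sketched here under the assumption that the alternatives from parser() above are kept as they are: finditer() computes its matches on the original string, so applying the insertions from the last match to the first leaves every earlier offset valid. The names MATCH_MD, LONE and parser_reversed are made up for this sketch.

```python
import re

# Same alternatives as in parser(): valid pairs first, lone specials in named groups.
MATCH_MD = re.compile(
    r'(\*)(.+?)(\*)|(_)(.+?)(_)|(`)(.+?)(`)|(\[.+?\])(\(.+?\))'
    r'|(?P<astx>\*)|(?P<bctck>`)|(?P<undes>_)|(?P<sqbrkt>\[)'
)
LONE = ('astx', 'bctck', 'undes', 'sqbrkt')

def parser_reversed(txt):
    # All matches refer to the original string. Walking them back to front
    # means an insertion never moves a position that is still to be processed.
    for m in reversed(list(MATCH_MD.finditer(txt))):
        if any(m.group(name) for name in LONE):
            txt = txt[:m.start()] + '\\' + txt[m.start():]
    return txt

print(parser_reversed('a* b_ c`'))
print(parser_reversed('*hello*'))
```

This keeps the pairing logic of the original pattern unchanged, so it inherits whatever that pattern gets right or wrong about what counts as valid markdown.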

I tried to use re.sub(), but I haven't been able to find how to replace a specific group, or had any luck with (?:) to 'not match' the valid markdown.

This was my re.sub attempt:

import re

def test(txt):
  # raw strings throughout, so the regex escapes are not read as string escapes
  match_md = (r'(?:(\*)(.+?)(\*))|'
              r'(?:(\_)(.+?)(\_))|'
              r'(?:(`)(.+?)(`))|'
              r'(?:(\[.+?\])(\(.+?\)))|'
              r'(\*)|'
              r'(`)|'
              r'(_)|'
              r'(\[)')
  return re.sub(match_md, r'\\\g<0>', txt)

This just prefixed every match with a backslash (which was expected, but I'd hoped the ?: would stop them being matched.)
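A small illustration of why the (?:...) version behaves this way: a non-capturing group only drops the group number, while the text it matches is still consumed and still part of \g<0>. The helper name whole_match is made up for this sketch.

```python
import re

def whole_match(pattern, text):
    """Return the full matched text (group 0) of the first match, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

# Capturing and non-capturing groups consume exactly the same text ...
print(whole_match(r'(\*)(.+?)(\*)', 'say *hello* now'))
print(whole_match(r'(?:\*.+?\*)', 'say *hello* now'))

# ... so \g<0> in the replacement still covers the whole valid pair.
print(re.sub(r'(?:\*.+?\*)', r'\\\g<0>', 'say *hello* now'))
```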

Bonus would be if \'s already in the string were escaped too, so that they wouldn't interfere with the markdown present. This could be a source of error, as the library would see the following character as escaped, causing it to see the rest as invalid.

Thanks in advance!

yacc · Accepted Answer · 2017-09-03 03:11:52Z

You are probably looking for a regular expression like this:

import re

def test(txt):
  match_md = r'((([_*]).+?\3[^_*]*)*)([_*])'
  # keep group 1 as is, then put a backslash in front of the lone special (group 4)
  return re.sub(match_md, r'\g<1>\\\g<4>', txt)

Note that for clarity I just made up a sample for * and _. You can expand the list in the [] brackets easily. Now let's take a look at this thing.

The idea is to crunch through strings that look like *foo_* or _bar*_ followed by text that doesn't contain any specials. The regex that matches such a string is ([_*]).+?\1[^_*]*: We match an opening delimiter, save it in \1, and go further along the line until we see the same delimiter (now closing). Then we consume anything after it that doesn't contain any delimiters.
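The building block described above can be tried on its own. This is only a sketch for experimenting; the names pair and first_pair are made up here.

```python
import re

# Opening delimiter, lazy run up to the same delimiter,
# then trailing text that contains no specials.
pair = re.compile(r'([_*]).+?\1[^_*]*')

def first_pair(text):
    """Return the first delimited chunk plus its trailing plain text, or None."""
    m = pair.search(text)
    return m.group(0) if m else None

print(first_pair('*foo_* and more'))   # the whole string
print(first_pair('_bar*_ tail*'))      # stops before the trailing lone *
print(first_pair('no specials here'))  # None
```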

Now we want to repeat that until no more delimited strings remain; that's done with (([_*]).+?\2[^_*]*)*. What's left on the right side now, if anything, is an isolated special, and that's what we need to mask. After the match we have the following sub matches:

g<0> : the whole match
g<1> : submatch of ((([_*]).+?\3[^_*]*)*)
g<2> : submatch of (([_*]).+?\3[^_*]*)
g<3> : submatch of ([_*]) (hence the \3 above)
g<4> : submatch of ([_*]) (the one to mask)
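To see these sub matches on a concrete input, something like the following can be used. It is a sketch only; show_groups and the sample string are made up here.

```python
import re

MATCH_MD = re.compile(r'((([_*]).+?\3[^_*]*)*)([_*])')

def show_groups(text):
    """Return groups 0 to 4 of the first match as a dict, or None without a match."""
    m = MATCH_MD.search(text)
    if m is None:
        return None
    return {n: m.group(n) for n in range(5)}

# '*b*' is a valid pair, the trailing '_' is the lone special to mask.
for number, value in show_groups('*b* c_').items():
    print(f'g<{number}>: {value!r}')
```

For this input g<1> holds the valid part ('*b* c') and g<4> the lone underscore, which is why the replacement re-emits g<1> and puts the backslash in front of g<4>.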

What's left for you now is to find a way to handle the invalid hyperlinks; that's another topic.

Update:
Unfortunately this solution masks out valid markdown such as *hello* (=> \*hello\*). A workaround would be to add a special char to the end of the line and remove the masked special char once the substitution is done. OP might be looking for a better solution.
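One alternative that might be worth trying, sketched here rather than taken from the answer itself: re.sub() also accepts a function as the replacement, so valid pairs can be matched first and returned untouched, while only a leftover special gets a backslash. The pattern reuses the alternatives from the question; the names MATCH_MD, escape_lone and the group name lone are assumptions made for this sketch.

```python
import re

MATCH_MD = re.compile(
    r'\*.+?\*|_.+?_|`.+?`|\[.+?\]\(.+?\)'  # valid pairs and links: keep as they are
    r'|(?P<lone>[*_`\[])'                   # anything left over: escape
)

def escape_lone(txt):
    def repl(m):
        # Only the last alternative sets the 'lone' group.
        if m.group('lone'):
            return '\\' + m.group('lone')
        return m.group(0)
    return MATCH_MD.sub(repl, txt)

print(escape_lone('*hello*'))    # unchanged
print(escape_lone('hello*'))     # hello\*
print(escape_lone('_hello*_'))   # unchanged
```

Because the substitution builds a new string in one pass, no offset bookkeeping is needed. Backslashes already present in the input are not handled by this sketch.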

edited Sep 3, 2017 at 3:11

answered Sep 2, 2017 at 23:26

yacc


6 Comments

SonOfLars Over a year ago

Wow, this is great! thank you so much. My only issue with it is that when I ran it through timeit(), I found it was nearly twice as slow as my awkward counter-based function - any ideas how it could be optimised?

yacc Over a year ago

No idea, I'm not so proficient in Python. You could put it into a new question since this could be interesting to know for others.

SonOfLars Over a year ago

I see, thanks anyway! A side note though: after having played with it a little, I notice that if the string consists of valid markdown, your solution escapes it. For example, passing *hello* will return \*hello\*, instead of leaving it untouched. Any ideas?

yacc Over a year ago

Gotcha. I suggest adding a special to the line and removing it afterwards (I suppose you're working line by line). It's not the best solution; maybe you should consider parsing the whole thing.

SonOfLars Over a year ago

I do parse the entire text in one go, so that isn't the issue. And adding a special would cause issues as to which 'pairs' of characters to be matching - it might not escape a character, given that it was matched with the added special.
