Python: regex substitution

Question

Okay, usually I don't ask this sort of question.

Using re.sub to find and replace normal strings is straightforward, but how do regular expressions in the replacement part (rather than the matching part) work?

In particular, I am referring to Brian Okken's web page, which purports to explain exactly this. It provides code that replicates the sort of functionality he was used to in Perl but had struggled to reproduce in Python:

import fileinput
import re

for line in fileinput.input():
    line = re.sub(r'\* \[(.*)\]\(#(.*)\)', r'<h2 id="\2">\1</h2>', line.rstrip())
    print(line)

This sub is meant to match

* [the label](#the_anchor)

and replace it with

<h2 id="the_anchor">the label</h2>

It works, but how does the script know exactly what the label and anchor are? Presumably \1 and \2 are meant to stand for the desired text, but how does the script know this and not think, perhaps, that \1 refers to the leading *?

Because of parentheses. \1 in the substitution refers to whatever had matched the first pair of parens (i.e. the first (.*)) in the regex.
– drdaeman, Commented May 13, 2016 at 14:15
\1, \2 are the first and second matched group from the pattern to be replaced. Groups are the parts of the pattern in parentheses.
– user2390182, Commented May 13, 2016 at 14:16
\(GroupReference) is meant to reference groups that were in the matching text. If you do not know what groups are, I suggest looking into those. In this case, \1 and \2 are references to groups 1 and 2, in other words the things inside of the first and second pair of () brackets, respectively.
– R Nar, Commented May 13, 2016 at 14:17

ErikR · Accepted Answer · 2016-05-13 14:15:24Z

The \1 and \2 in the replacement string refer to the first and second "captures". Captures are the parts of the pattern that are enclosed in parentheses.

For instance, here are the captures in the example regex:

r'\* \[(.*)\]\(#(.*)\)'
       ^^^^     ^^^^

So \1 refers to whatever was matched by the first capture, and \2 refers to whatever was matched by the second capture.
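To see the captures directly, here is a minimal sketch using the pattern and sample line from the question. The variable names (pattern, line, groups, result) are only illustrative; re.match, groups() and re.sub are standard library calls.

```python
import re

# Pattern from the question: two parenthesised groups, numbered left to right.
pattern = r'\* \[(.*)\]\(#(.*)\)'
line = '* [the label](#the_anchor)'

# groups() returns what each pair of parentheses captured, in order.
m = re.match(pattern, line)
groups = m.groups()

# In the replacement, \1 and \2 are filled in from those same captures.
result = re.sub(pattern, r'<h2 id="\2">\1</h2>', line)

print(groups)   # ('the label', 'the_anchor')
print(result)   # <h2 id="the_anchor">the label</h2>
```

The escaped \* \[ \] \( \) in the pattern match literal characters and do not create groups, which is why the leading * is never numbered.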


3 Comments

Stumbler Over a year ago

Why did I not think a regular expression would count from zero? smh

ErikR Over a year ago

Actually, in many regex implementations \0 refers to the entire matched string. This is probably true of Python regexes.

Wiktor Stribiżew Over a year ago

In Python, to access the whole match in the replacement, one needs r'\g<0>'. Just read the docs.
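A short sketch of the whole-match reference mentioned in the comment above. The sample text and pattern are made up for illustration; \g<0> in a replacement string and m.group(0) inside a replacement function are both documented ways to get the entire match.

```python
import re

# Hypothetical sample input, not from the original question.
text = 'see the_anchor here'

# \g<0> in the replacement string stands for the entire match.
whole = re.sub(r'\w+_\w+', r'[\g<0>]', text)

# Equivalent form using a replacement function and group(0).
via_function = re.sub(r'\w+_\w+', lambda m: '[' + m.group(0) + ']', text)

print(whole)         # see [the_anchor] here
print(via_function)  # see [the_anchor] here
```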
