Okay, usually I don't ask this sort of question.

Using re.sub to find and replace normal strings is straightforward, but how do regular expressions in the replacement part (rather than the matching part) work?

In particular, I'm referring to Brian Okken's web page, which purports to explain exactly this. It provides code that replicates the sort of functionality he was used to in Perl but had struggled to reproduce in Python.

import fileinput
import re

for line in fileinput.input():
    line = re.sub(r'\* \[(.*)\]\(#(.*)\)', r'<h2 id="\2">\1</h2>', line.rstrip())
    print(line)

This sub is meant to match

* [the label](#the_anchor)

and replace it with

<h2 id="the_anchor">the label</h2>

It works, but how does the script know exactly what the label and anchor are? Presumably \1 and \2 are meant to stand for the desired text, but how does the script know this and not think, perhaps, that \1 refers to the leading *?

  • Because of parentheses. \1 in the substitution refers to whatever had matched the first pair of parens (i.e. the first (.*)) in the regex. Commented May 13, 2016 at 14:15
  • \1, \2 are the first and second matched group from the pattern to be replaced. Groups are the parts of the pattern in parentheses. Commented May 13, 2016 at 14:16
  • \(GroupReference) is meant to reference groups that were in the matching text. If you do not know what groups are, I suggest looking into those. In this case, \1 and \2 are references to groups 1 and 2, in other words the things inside of the first and second pair of () brackets, respectively. Commented May 13, 2016 at 14:17

1 Answer

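The \1 and \2 in the replacement string refer to the first and second "captures". Captures (capturing groups) are the parts of the pattern regex enclosed in unescaped parentheses; the escaped \( and \) only match literal parenthesis characters and do not create a capture.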

For instance, here are the captures in the example regex:

r'\* \[(.*)\]\(#(.*)\)'
       ^^^^     ^^^^

So \1 refers to whatever was matched by the first capture, and \2 refers to whatever was matched by the second capture.
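
To see this concretely, here is a minimal sketch that runs the question's pattern against the sample line from the question and prints what each capture picked up. The names PATTERN, line, groups and result are only for illustration; they are not from the original script.

```python
import re

# Pattern from the question: two unescaped (.*) groups are the captures.
PATTERN = r'\* \[(.*)\]\(#(.*)\)'
line = '* [the label](#the_anchor)'

# groups() returns the text matched by each capture, in order.
match = re.search(PATTERN, line)
groups = match.groups()

# In the replacement, \1 and \2 are filled in from those same captures.
result = re.sub(PATTERN, r'<h2 id="\2">\1</h2>', line)

print(groups)   # ('the label', 'the_anchor')
print(result)   # <h2 id="the_anchor">the label</h2>
```

The leading \* is never a candidate for \1 because it is not inside parentheses; only the two (.*) groups are numbered.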


3 Comments

Why did I not think a regular expression would count from zero? smh
Actually, in many regex implementations \0 refers to the entire matched string. This is probably true of Python regexes.
In Python, to access the whole match in the replacement, one needs r'\g<0>'. Just read the docs.
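
As a quick check of the last comment, here is a small sketch of using \g<0> in a replacement, alongside what a bare \0 does in Python's re module (it is read as an octal escape, i.e. a NUL character, not as the whole match). The sample strings and variable names are made up for illustration.

```python
import re

text = 'see the_anchor here'

# \g<0> stands for the entire match in the replacement template.
wrapped = re.sub(r'\w+_\w+', r'<code>\g<0></code>', text)
print(wrapped)      # see <code>the_anchor</code> here

# A bare \0 is treated as an octal escape, so it inserts '\x00'
# instead of the matched text.
zero = re.sub(r'b', r'\0', 'abc')
print(repr(zero))   # 'a\x00c'
```

If the template syntax feels awkward, passing a function as the replacement and calling m.group(0) inside it gives the same whole-match text.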
