Some questions about Regex in Python

Question

I would like to do some text conversion, such as reading in from a text file:

CONTENTS
1. INTRODUCTION
1.1 The Linear Programming Problem 2
1.2 Examples of Linear Problems 7

and writing to another text file:

("CONTENTS" "#") 
("1. INTRODUCTION" "#") 
("1.1 The Linear Programming Problem 2" "#11")  
("1.2 Examples of Linear Problems 7" "#16")

The current Python code I use for such conversion is:

infile = open(infilename)
outfile = open(outfilename, "w")

pat = re.compile('^(.+?(\d+)) *$',re.M)
def zaa(mat):
    return '("%s" "#%s")' % (mat.group(1),str(int(mat.group(2))+9))

outfile.write('(bookmarks \n')
for line in infile:
    outfile.write(pat.sub(zaa,line))
outfile.write(')')

It will convert the original text to
```
CONTENTS
1. INTRODUCTION
("1.1 The Linear Programming Problem 2" "#11")
("1.2 Examples of Linear Problems 7" "#16")
```
The last two lines are correct, but the first two lines are not. How can I handle the first two lines as well, either by modifying the current code or by using different code?
I did not write the code myself, but I would like to understand how re.sub() is used here. As I found on a Python website,

re.sub(regex, replacement, subject) performs a search-and-replace across subject, replacing all matches of regex in subject with replacement. The result is returned by the sub() function. The subject string you pass is not modified.

But in my code, the usage is `pat.sub(zaa,line)`, which seems inconsistent with the quoted description. How should I understand the usage in my code?

Thanks!

Is this the real code? You are adding 11, but 2+11 = 13 not 11.
– Mikel, Commented Apr 3, 2011 at 2:55
@Mikel: Thanks for pointing it out. My typo. Just corrected.
– Tim, Commented Apr 3, 2011 at 2:56
I got confused about the re.sub() thing too. Turns out there are two sub functions: re.sub(pattern, repl, string[, count]) and another to be used with a compiled regex object: RegexObject.sub(repl, string[, count=0]). This function is using the latter syntax.
– ridgerunner, Commented Apr 3, 2011 at 3:17

BudgieInWA · Accepted Answer · 2011-04-03 03:31:27Z

3

With your regex you are searching for a line that ends with a number (and maybe trailing whitespace). You could make the number optional: ^(.+?(\d+)?) *$ and make sure your group 2 reference inside zaa can handle an empty string.

def zaa(mat):
    return '("%s" "#%s")' % (mat.group(1), (str(int(mat.group(2))+9) if mat.group(2) else "") )

With this, you should get "#" when mat.group(2) is empty, and what you currently get when it's not empty.
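As a quick check of the optional-group idea, here is a minimal sketch that runs the modified pattern over an in-memory string instead of a file. The sample text and the `page` variable are assumptions made for illustration; the pattern and the offset of 9 come from the question.

```python
import re

# Same pattern as in the question, but with the page number made optional.
# A raw string keeps \d from being treated as a string escape.
pat = re.compile(r'^(.+?(\d+)?) *$', re.M)

def zaa(mat):
    # group(2) is None when the line has no trailing number.
    page = str(int(mat.group(2)) + 9) if mat.group(2) else ""
    return '("%s" "#%s")' % (mat.group(1), page)

# Hypothetical sample input: one line without a page number, one with.
sample = "CONTENTS\n1.1 The Linear Programming Problem 2\n"
result = pat.sub(zaa, sample)
print(result)
```

Lines without a trailing number should come out with a bare "#", and numbered lines with the page number plus 9.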

edited Apr 3, 2011 at 3:31

answered Apr 3, 2011 at 3:03

BudgieInWA


4 Comments

Tim Over a year ago

Thanks! I was wondering how to make sure my group 2 reference inside zaa can handle an empty string?

BudgieInWA Over a year ago

@Tim, I've edited my answer to make it a little clearer. My copy of zaa should gracefully handle mat.group(2) being empty.

Tim Over a year ago

Thanks! I got a "SyntaxError: invalid syntax" at the "?".

BudgieInWA Over a year ago

D'oh. Ok, this one compiles now :)

ridgerunner · Accepted Answer · 2011-04-03 03:24:56Z

2

This tested script generates the desired output:

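import re

infilename = "infile.txt"
outfilename = "outfile.txt"

# Raw string so that \d reaches the regex engine unchanged.
pat = re.compile(r'^(.+?(\d*)) *$', re.M)

def zaa(mat):
    # group(2) is the trailing page number, or '' when the line has none.
    if mat.group(2):
        return '("%s" "#%s")' % (mat.group(1), str(int(mat.group(2)) + 9))
    else:
        return '("%s" "#")' % (mat.group(1))

# The with block closes both files, even if an error occurs.
with open(infilename) as infile, open(outfilename, "w") as outfile:
    outfile.write('(bookmarks \n')
    for line in infile:
        outfile.write(pat.sub(zaa, line))
    outfile.write(')')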
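
One detail that may be worth knowing when comparing this pattern with the `(\d+)?` variant from the other answer: the two differ in what group 2 holds when a line has no page number. The sketch below illustrates this; the variable names are made up for the example and are not part of the script above.

```python
import re

star = re.compile(r'^(.+?(\d*)) *$', re.M)       # pattern used in this answer
optional = re.compile(r'^(.+?(\d+)?) *$', re.M)  # optional-group variant

line = "1. INTRODUCTION"  # a line with no trailing page number

# (\d*) takes part in the match and captures an empty string.
star_group = star.match(line).group(2)
# (\d+)? does not take part in the match, so the group is None.
optional_group = optional.match(line).group(2)

print(repr(star_group))
print(repr(optional_group))
```

Both values are falsy, so a test like `if mat.group(2):` behaves the same with either pattern.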

answered Apr 3, 2011 at 3:24

ridgerunner


3 Comments

Tim Over a year ago

Thanks! Works like a charm! Does ".+" mean one character repeated one or more times, or a sequence of one or more characters that are not necessarily the same? Whichever one it means, what is the regex for the other?

Mikel Over a year ago

.+ means one or more (possibly different) characters. (.)\1+ means at least two of the same character.

ridgerunner Over a year ago

The dot matches any one character except a newline, unless the 's' modifier (re.DOTALL, or re.S, in Python) is set, in which case the dot matches any character including a newline. The plus is a quantifier that can follow any token and means one or more of the preceding token. The star is similar, but it means zero or more of the preceding token.
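
The distinctions in the comments above can be tried out directly. This is only a small sketch with made-up test strings:

```python
import re

# .+ matches one or more characters that need not be the same.
any_run = re.match(r'.+', 'abc').group(0)

# (.)\1+ matches a character followed by one or more copies of itself.
same_run = re.search(r'(.)\1+', 'abbbc').group(0)

# With no repeated character there is nothing for (.)\1+ to match.
no_repeat = re.search(r'(.)\1+', 'abc')

# By default the dot stops at a newline; re.DOTALL lets it continue.
default_dot = re.match(r'.+', 'ab\ncd').group(0)
dotall_dot = re.match(r'.+', 'ab\ncd', re.DOTALL).group(0)

print(any_run, same_run, no_repeat, repr(default_dot), repr(dotall_dot))
```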

senderle · Accepted Answer · 2011-04-03 03:10:38Z

1

But in my code, the usage is pat.sub(zaa,line), which seems inconsistent with the quoted description.

The difference is in the sub call: the documentation you quote describes the re.sub function, but what is being used here is the sub method of a compiled regular expression object. The initial pattern argument of re.sub() is replaced by the regular expression object to which the sub method is bound. So in other words,

pat.sub(zaa, line)

is equivalent to

re.sub(pat, zaa, line)

Terrible variable names by the way.
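
To see the equivalence in action, here is a minimal sketch. The names `page_pattern` and `add_offset` are only suggested replacements for `pat` and `zaa`, and the sample line is taken from the question.

```python
import re

page_pattern = re.compile(r'^(.+?(\d+)) *$', re.M)

def add_offset(match):
    # Wrap the whole line in quotes and add 9 to the trailing page number.
    return '("%s" "#%s")' % (match.group(1), int(match.group(2)) + 9)

line = "1.2 Examples of Linear Problems 7\n"

# Method of the compiled pattern object.
via_method = page_pattern.sub(add_offset, line)
# Module-level function, passing the compiled pattern as the first argument.
via_function = re.sub(page_pattern, add_offset, line)
# Module-level function, passing the pattern as a string plus flags.
via_string = re.sub(r'^(.+?(\d+)) *$', add_offset, line, flags=re.M)

print(via_method)
```

All three calls should produce the same string. The replacement argument can be either a string or, as here, a function that receives the match object.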

answered Apr 3, 2011 at 3:10

senderle


1 Comment

Tim Over a year ago

Thanks! Is there a description of the sub method for a regex object on the official Python website?
