
I would like to do some text conversion, such as reading in from a text file:

CONTENTS
1. INTRODUCTION
1.1 The Linear Programming Problem 2
1.2 Examples of Linear Problems 7

and writing to another text file:

("CONTENTS" "#") 
("1. INTRODUCTION" "#") 
("1.1 The Linear Programming Problem 2" "#11")  
("1.2 Examples of Linear Problems 7" "#16")

The current Python code I use for such conversion is:

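import re

# infilename and outfilename are assumed to be defined elsewhere
infile = open(infilename)
outfile = open(outfilename, "w")

# Group 1 is the whole line; group 2 is the trailing number
pat = re.compile(r'^(.+?(\d+)) *$', re.M)
def zaa(mat):
    # Quote the line and point it at (trailing number + 9)
    return '("%s" "#%s")' % (mat.group(1), str(int(mat.group(2)) + 9))

outfile.write('(bookmarks \n')
for line in infile:
    outfile.write(pat.sub(zaa, line))
outfile.write(')')

  1. It converts the original text to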

    CONTENTS
    1. INTRODUCTION
    ("1.1 The Linear Programming Problem 2" "#11")
    ("1.2 Examples of Linear Problems 7" "#16")
    

    The last two lines are correct, but the first two are left unconverted. How can I handle the first two lines as well, either by modifying the current code or by using different code?

  2. The code was not written by me, but I would like to understand the usage of re.sub() here. As I found from a Python website,

    re.sub(regex, replacement, subject) performs a search-and-replace across subject, replacing all matches of regex in subject with replacement. The result is returned by the sub() function. The subject string you pass is not modified.

    But in my code, the call is `pat.sub(zaa,line)`, which seems inconsistent with the quoted description. How should I understand the usage in my code?

Thanks!

  • Is this the real code? You are adding 11, but 2+11 = 13 not 11. Commented Apr 3, 2011 at 2:55
  • @Mikel: Thanks for pointing it out. My typo. Just corrected. Commented Apr 3, 2011 at 2:56
  • I got confused about the re.sub() thing too. Turns out there are two sub functions: re.sub(pattern, repl, string[, count]) and another to be used with a compiled regex object: RegexObject.sub(repl, string[, count=0]). This function is using the latter syntax. Commented Apr 3, 2011 at 3:17

3 Answers


With your regex you are searching for a line that ends with a number (and maybe trailing whitespace). You could make the number optional: ^(.+?(\d+)?) *$ and make sure your group 2 reference inside zaa can handle the group not matching (mat.group(2) is then None).

def zaa(mat):
    return '("%s" "#%s")' % (mat.group(1), (str(int(mat.group(2))+9) if mat.group(2) else "") )

With this, you should get "#" when mat.group(2) is empty, and what you currently get when it is not.
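As a quick check of the optional-group idea, here is a minimal sketch that runs the pattern on two in-memory lines instead of a file. The variable names with_number and without_number are made up for illustration; the pattern and zaa follow the answer above.

```python
import re

# Same pattern as suggested above, with the trailing number made optional.
pat = re.compile(r'^(.+?(\d+)?) *$', re.M)

def zaa(mat):
    # group(2) is None when the line has no trailing number
    page = mat.group(2)
    target = str(int(page) + 9) if page else ""
    return '("%s" "#%s")' % (mat.group(1), target)

with_number = pat.sub(zaa, "1.2 Examples of Linear Problems 7")
without_number = pat.sub(zaa, "CONTENTS")

print(with_number)
print(without_number)
```

Because an optional group that does not take part in the match yields None rather than an empty string, the truthiness test covers both cases.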


4 Comments

Thanks! I was wondering how to make sure my group 2 reference inside zaa can handle an empty string?
@Tim, I've edited my answer to make it a little more clear. My copy of zaa should gracefully handle mat.group(2) being empty.
Thanks! I got a "SyntaxError: invalid syntax" at the "?".
D'oh. Ok, this one compiles now :)

This tested script generates the desired output:

import re
infilename = "infile.txt"
outfilename = "outfile.txt"

infile = open(infilename)
outfile = open(outfilename, "w")

pat = re.compile(r'^(.+?(\d*)) *$', re.M)
def zaa(mat):
    if mat.group(2):
        return '("%s" "#%s")' % (mat.group(1),str(int(mat.group(2))+9))
    else:
        return '("%s" "#")' % (mat.group(1))

outfile.write('(bookmarks \n')
for line in infile:
    outfile.write(pat.sub(zaa,line))
outfile.write(')')
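
To see what the script produces without touching any files, the same logic can be sketched on a list of lines. The helper name convert and the sample list are assumptions for illustration only; the pattern and zaa are the ones from the script above.

```python
import re

pat = re.compile(r'^(.+?(\d*)) *$', re.M)

def zaa(mat):
    if mat.group(2):
        return '("%s" "#%s")' % (mat.group(1), int(mat.group(2)) + 9)
    return '("%s" "#")' % mat.group(1)

def convert(lines):
    # Hypothetical helper: same steps as the script, but returns a string
    return '(bookmarks \n' + ''.join(pat.sub(zaa, line) for line in lines) + ')'

sample = [
    "CONTENTS\n",
    "1. INTRODUCTION\n",
    "1.1 The Linear Programming Problem 2\n",
    "1.2 Examples of Linear Problems 7\n",
]
result = convert(sample)
print(result)
```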

3 Comments

Thanks! Works like a charm! Does ".+" mean repeating one character one or more times, or a sequence of one or more characters that are not necessarily the same? Whichever it means, what is the regex for the other?
.+ means one or more (possibly different) characters. (.)\1+ means at least two of the same character.
The dot means match any one character (except a newline - unless the 's' modifier is set - in which case the dot matches any char including a newline). The plus is a quantifier added to any token which means one or more of the preceding token. The star is similar but it means zero or more of the preceding token.
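
The distinctions in these comments can be tried out with a short sketch. The sample strings and variable names are made up for illustration; in Python the 's' modifier is spelled re.DOTALL (or re.S).

```python
import re

any_run = re.compile(r'.+')        # one or more characters, not necessarily equal
same_run = re.compile(r'(.)\1+')   # one character followed by one or more copies of it

any_match = any_run.search("abc").group(0)       # the whole string
same_match = same_run.search("abbbc").group(0)   # only the run of b's
no_repeat = same_run.search("abc")               # no repeated character, so None

# Without re.DOTALL the dot stops at a newline; with it, the dot matches it too.
first_line = re.search(r'.+', "ab\ncd").group(0)
both_lines = re.search(r'.+', "ab\ncd", re.DOTALL).group(0)
```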

But in my code, its usage is pat.sub(zaa,line), which seems to me not consistent to the quoted description.

The difference is in the sub call; the documentation you quote is for the re.sub function, but what is being used here is the sub method of a compiled regular expression object. The initial pattern argument in re.sub() is replaced with the regular expression object to which the sub method is bound. So in other words,

pat.sub(zaa, line)

is equivalent to

re.sub(pat, zaa, line)

Terrible variable names by the way.
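
The equivalence can be checked with a small sketch. The names line_pattern and to_bookmark are hypothetical, more descriptive stand-ins for pat and zaa; the third call passes the pattern as a plain string with the flags keyword, which is the form the quoted description covers.

```python
import re

# Hypothetical, more descriptive names for pat and zaa from the question.
line_pattern = re.compile(r'^(.+?(\d+)) *$', re.M)

def to_bookmark(match):
    title, page = match.group(1), match.group(2)
    return '("%s" "#%s")' % (title, int(page) + 9)

line = "1.1 The Linear Programming Problem 2\n"

via_method = line_pattern.sub(to_bookmark, line)       # method on the compiled pattern
via_function = re.sub(line_pattern, to_bookmark, line) # module function, compiled pattern
via_string = re.sub(r'^(.+?(\d+)) *$', to_bookmark, line, flags=re.M)  # module function, string pattern
```

All three calls return the same string; the replacement argument may be a function that receives the match object, as zaa does in the question.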

1 Comment

Thanks! Is there a description of the sub method for a regex object on the official Python website?
