Python regex optional number match returns more than expected

Question

I have a list of files, and I am trying to filter for the subset of file names that end in 000000, 060000, 120000 or 180000. I know I could do a straight string match, but I would like to understand why the regular expression I attempted below, r'[00|06|12|18]+0000', does not work (it returns MSM_20130519210000.csv as well). I intend it to match any one of 00, 06, 12, 18, followed by 0000. How can that be accomplished? Please keep the answer along the lines of this intended regex rather than other functions, thanks.

Here is the code snippet:

import re

files_in_input_directory = ['MSM_20130519150000.csv', 'MSM_20130519180000.csv', 'MSM_20130519210000.csv', 
'MSM_20130520000000.csv', 'MSM_20130520030000.csv', 'MSM_20130520060000.csv', 'MSM_20130520090000.csv', 
'MSM_20130520120000.csv', 'MSM_20130520150000.csv', 'MSM_20130520180000.csv', 'MSM_20130520210000.csv', 
'MSM_20130521000000.csv', 'MSM_20130521030000.csv', 'MSM_20130521060000.csv', 'MSM_20130521090000.csv', 
'MSM_20130521120000.csv', 'MSM_20130521150000.csv', 'MSM_20130521180000.csv', 'MSM_20130521210000.csv', 
'MSM_20130522000000.csv', 'MSM_20130522030000.csv', 'MSM_20130522060000.csv', 'MSM_20130522090000.csv', 
'MSM_20130522120000.csv', 'MSM_20130522150000.csv', 'MSM_20130522180000.csv', 'MSM_20130522210000.csv', 
'MSM_20130523000000.csv', 'MSM_20130523030000.csv', 'MSM_20130523060000.csv', 'MSM_20130523090000.csv', 
'MSM_20130523120000.csv', 'MSM_20130523150000.csv', 'MSM_20130523180000.csv', 'MSM_20130523210000.csv', 
'MSM_20130524000000.csv', 'MSM_20130524030000.csv', 'MSM_20130524060000.csv', 'MSM_20130524090000.csv', 
'MSM_20130524120000.csv', 'MSM_20130524150000.csv', 'MSM_20130524180000.csv', 'MSM_20130524210000.csv', 
'MSM_20130525000000.csv', 'MSM_20130525030000.csv', 'MSM_20130525060000.csv', 'MSM_20130525090000.csv', 
'MSM_20130525120000.csv', 'MSM_20130525150000.csv', 'MSM_20130525180000.csv', 'MSM_20130525210000.csv', 
'MSM_20130526000000.csv', 'MSM_20130526030000.csv', 'MSM_20130526060000.csv', 'MSM_20130526090000.csv', 
'MSM_20130526120000.csv', 'MSM_20130526150000.csv', 'MSM_20130526180000.csv', 'MSM_20130526210000.csv', 
'MSM_20130527000000.csv', 'MSM_20130527030000.csv', 'MSM_20130527060000.csv', 'MSM_20130527090000.csv', 
'MSM_20130527120000.csv', 'MSM_20130527150000.csv', 'MSM_20130527180000.csv', 'MSM_20130527210000.csv', 
'MSM_20130528000000.csv', 'MSM_20130528030000.csv', 'MSM_20130528060000.csv', 'MSM_20130528090000.csv', 
'MSM_20130528120000.csv', 'MSM_20130528150000.csv', 'MSM_20130528180000.csv', 'MSM_20130528210000.csv', 
'MSM_20130529000000.csv', 'MSM_20130529030000.csv', 'MSM_20130529060000.csv', 'MSM_20130529090000.csv']

print(files_in_input_directory)
print("\n")

# Goal: match any name containing 000000, 060000, 120000 or 180000.
# Question: I use + to mean "one or more" and | to separate the options, but this
# also matches 'MSM_20130519210000.csv', and I don't know why.
print(list(filter(lambda x: re.search(r'[00|06|12|18]+0000', x), files_in_input_directory)))
print("\n")

# This verbose version works
print(list(filter(lambda x: re.search(r'000000|060000|120000|180000', x), files_in_input_directory)))
print("\n")

Tim Pierce · Accepted Answer · 2013-12-09 05:42:24Z

1

If you are trying to match filenames that contain 000000, 060000, 120000 or 180000, then instead of

re.search(r'[00|06|12|18]+0000', x)

use

re.search(r'(00|06|12|18)0000', x)

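The square brackets [...] match only a single character at a time, chosen from the set of characters listed inside them, and the + character means "match 1 or more of the preceding expression". So [00|06|12|18]+ matches any run of the characters 0, 1, 2, 6, 8 and |, which includes the 21 in 210000.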
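As a rough illustration of the difference, the sketch below runs both patterns over a shortened sample of the file list from the question. The names `sample`, `broken`, `fixed`, `broken_hits` and `fixed_hits` are made up for this example.

```python
import re

# Shortened, illustrative subset of the list in the question.
sample = [
    'MSM_20130519150000.csv',
    'MSM_20130519180000.csv',
    'MSM_20130519210000.csv',
    'MSM_20130520000000.csv',
    'MSM_20130520060000.csv',
    'MSM_20130520120000.csv',
]

# Character class: one or more of the characters 0, 1, 2, 6, 8, |.
broken = re.compile(r'[00|06|12|18]+0000')
# Group with alternation: exactly one of the strings 00, 06, 12, 18.
fixed = re.compile(r'(00|06|12|18)0000')

broken_hits = [name for name in sample if broken.search(name)]
fixed_hits = [name for name in sample if fixed.search(name)]

print(broken_hits)  # includes 'MSM_20130519210000.csv'
print(fixed_hits)   # only the 00, 06, 12 and 18 o'clock files
```

With this sample, the character-class pattern lets the 21:00 file through, while the grouped pattern keeps only the four intended hours.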

roippi · Accepted Answer · 2013-12-09 05:46:40Z

0

[00|06|12|18] is a character set that matches any single one of the characters 0, 1, 2, 6, 8 or |. Thus it will match 210000 in "MSM_20130519210000.csv", because [00|06|12|18] is equivalent to writing [01268|]. Not what you meant, I should think.
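A small check of that claim, as a sketch: the set of distinct characters inside the brackets is all the regex engine sees, so the pattern as written and a de-duplicated character class should behave the same. The variable names here are illustrative only.

```python
import re

# Distinct characters that appear between the square brackets.
print(sorted(set('00|06|12|18')))  # ['0', '1', '2', '6', '8', '|']

as_written = re.compile(r'[00|06|12|18]+0000')
simplified = re.compile(r'[01268|]+0000')

name = 'MSM_20130519210000.csv'
m1 = as_written.search(name)
m2 = simplified.search(name)

# Both patterns match the same substring of the unwanted file name.
print(m1.group(), m2.group())  # 210000 210000
```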

Instead of expressing a character set that can match one or more times, make it either a capturing group

r'(00|06|12|18)0000'

Or a positive lookbehind expression

r'(?<=00|06|12|18)0000'

They are equivalent for your purposes, since you don't care about the match or any groups.
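To see where the two forms differ and where they agree, here is a minimal sketch. Python's re module requires lookbehinds to be fixed-width, which holds here because every alternative is two characters long. The names `accepted`, `rejected`, `group_match` and `look_match` are hypothetical.

```python
import re

accepted = 'MSM_20130520180000.csv'
rejected = 'MSM_20130519210000.csv'

group_match = re.search(r'(00|06|12|18)0000', accepted)
look_match = re.search(r'(?<=00|06|12|18)0000', accepted)

# The group consumes the hour digits; the lookbehind only checks them.
print(group_match.group(0), group_match.group(1))  # 180000 18
print(look_match.group(0))                         # 0000

# For filtering, only "match or no match" matters, and both agree.
print(re.search(r'(00|06|12|18)0000', rejected))     # None
print(re.search(r'(?<=00|06|12|18)0000', rejected))  # None
```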

James Mills · Accepted Answer · 2013-12-09 05:49:48Z

0

The basic problem here is that you were not grouping the patterns, but creating a character set to match against by using [ ... ].

This regex works: ((00)|(06)|(12)|(18))0000
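The parentheses around each alternative create extra capture groups that a simple filter does not need. As a sketch, assuming the first alternative is meant to be 00, the nested form can be compared with a flatter non-capturing group (?:...); the names `nested` and `flat` are illustrative.

```python
import re

# Nested capturing groups, one per alternative plus an outer group.
nested = re.compile(r'((00)|(06)|(12)|(18))0000')
# Single non-capturing group: same matches, no captures.
flat = re.compile(r'(?:00|06|12|18)0000')

name = 'MSM_20130520120000.csv'
print(nested.search(name).groups())  # ('12', None, None, '12', None)
print(flat.search(name).groups())    # ()

# Both reject the 21:00 file from the question.
print(nested.search('MSM_20130519210000.csv'))  # None
print(flat.search('MSM_20130519210000.csv'))    # None
```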
