I have a file name that always ends with a number followed by its file extension, such as:

filename = 'photo_v_01_20415.jpg'

From the filename I need to extract the file extension and the last number that sits right before the file extension itself. As a result of the split I should have two strings:

original_string = 'photo_v_01_20415.jpg'

string_result_01 = `photo_v_01_`  (first half of the file name)

string_result_02 = `20415.jpg`    (second half of the file name).

The problem is that the incoming filenames will be inconsistent. The last number could be separated from the rest of the file name by an underscore "_", a space " ", a period ".", or anything else. Examples of possible file names:

photo_v_01_20415.jpg
photo_v_01.20415.jpg
photo_v_01 20415.jpg
photo_v_01____20415.jpg

It appears I need to use a regular expression with re.search or re.sub. I would appreciate any suggestions!

3 Answers

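Use re.match, which anchors the pattern at the start of the string, and end the pattern with $ so that it also has to reach the end of the string. re.match on its own does not match the whole string: without the $, 'photo_v_01.20415.jpg' would split into 'photo_v_' and '01.20415'. Thus

import re

def split_name(filename):
    # Trailing $ forces the second group to be the final "digits.extension" part.
    match = re.match(r'(.*?)(\d+\.[^.]+)$', filename)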
    if match:
        return match.groups()
    else:
        return None, None

for name in [ 'foo123.jpg', 'bar;)234.png', 'baz^_^456.JPEG', 'notanumber.bmp' ]:
    prefix, suffix = split_name(name)
    print("prefix = %r, suffix = %r" % (prefix, suffix))

Prints:

prefix = 'foo', suffix = '123.jpg'
prefix = 'bar;)', suffix = '234.png'
prefix = 'baz^_^', suffix = '456.JPEG'
prefix = None, suffix = None

Works for arbitrary separators and file extensions; if the filename does not match the pattern, then the match fails, and None, None is returned.
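
As a quick check against the four filenames from the question, here is a minimal sketch that uses re.fullmatch (available since Python 3.4), which only succeeds when the pattern covers the entire name. The helper name split_trailing_number is made up for illustration; the two groups are the same as in the function above.

```python
import re

# Same two groups as above; fullmatch requires the pattern to span the whole name.
_PATTERN = re.compile(r'(.*?)(\d+\.[^.]+)')

def split_trailing_number(filename):
    """Return (prefix, number_and_extension), or (None, None) if there is no match.

    Hypothetical helper name, used only for this sketch.
    """
    match = _PATTERN.fullmatch(filename)
    return match.groups() if match else (None, None)

for name in ['photo_v_01_20415.jpg', 'photo_v_01.20415.jpg',
             'photo_v_01 20415.jpg', 'photo_v_01____20415.jpg']:
    prefix, suffix = split_trailing_number(name)
    print("prefix = %r, suffix = %r" % (prefix, suffix))
```

Each of the four names should split right before 20415.jpg, whatever the separator is.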

import re

names = '''\
photo_v_01_20415.jpg
photo_v_01.20415.jpg
photo_v_01 20415.jpg
photo_v_01____20415.jpg'''.splitlines()

for name in names:
    prefix, suffix = re.match(r'(.+?[_. ])(\d+\.[^.]+)$', name).groups()
    print('{} --> {}\t{}'.format(name, prefix, suffix))

yields

photo_v_01_20415.jpg --> photo_v_01_    20415.jpg
photo_v_01.20415.jpg --> photo_v_01.    20415.jpg
photo_v_01 20415.jpg --> photo_v_01     20415.jpg
photo_v_01____20415.jpg --> photo_v_01____  20415.jpg

The regex pattern r'(.+?[_. ])(\d+\.[^.]+)$' means

r'             define a raw string
(              with first group
     .+?           non-greedily match 1-or-more of any character
     [_. ]         followed by a literal underscore, period or space
)              end first group 
(              followed by second group
     \d+           1-or-more digits in [0-9]
     \.            literal period
     [^.]+         1-or-more of anything but a period
)              end second group 
$              match the end of the string
'              end raw string
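
The breakdown above can also live inside the pattern itself. The following is a sketch using the re.VERBOSE flag, which ignores whitespace outside character classes and allows # comments; the name SPLIT_RE is chosen here only for illustration.

```python
import re

# The same pattern, written in verbose mode so each part carries its own comment.
# Whitespace inside a character class is still significant, so [_. ] keeps its space.
SPLIT_RE = re.compile(r'''
    (               # first group: the prefix
        .+?         #   non-greedily match 1-or-more of any character
        [_. ]       #   followed by an underscore, period or space
    )
    (               # second group: number plus extension
        \d+         #   1-or-more digits
        \.          #   literal period
        [^.]+       #   1-or-more of anything but a period
    )
    $               # end of the string
''', re.VERBOSE)

prefix, suffix = SPLIT_RE.match('photo_v_01____20415.jpg').groups()
print(repr(prefix), repr(suffix))
```

Compiling the pattern once and reusing it also keeps the loop body shorter when many names are processed.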

1 Comment

I've corrected my answer using parts of Antti Haapala's solution; apologies to Antti Haapala, I just couldn't stand my answer being wrong. I'll leave my answer up mainly because it explains what the regex means.
import re

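filename = 'photo_v_01_20415.jpg'

# Raw string so \d is not read as a string escape; \. matches a literal period.
matcher = re.compile(r'(.*[._ ])(\d+\.jpg)')
result = matcher.match(filename)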

Add other options to the [._ ] as necessary.

1 Comment

This solution works very well: prefix, suffix = re.search(r'(.+?[_. ])(\d+\.jpg)$', seq_name).groups() But the file extension will not always be 'jpg'. How could I tweak this expression so that it is valid for file formats other than jpg?
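
One possible way to address the comment above, offered as a sketch rather than the answerer's own fix: peel the extension off with os.path.splitext, then take the run of digits at the end of what remains. This does not depend on the extension or on which separator precedes the number. The function name split_any_extension is hypothetical.

```python
import os
import re

def split_any_extension(filename):
    """Split 'name<sep>12345.ext' into ('name<sep>', '12345.ext').

    Hypothetical helper for illustration. Returns (None, None) when there is
    no extension or no digits directly before it.
    """
    stem, ext = os.path.splitext(filename)   # e.g. 'photo_v_01_20415', '.jpg'
    match = re.search(r'\d+$', stem)         # trailing digits of the stem
    if match is None or not ext:
        return None, None
    return stem[:match.start()], match.group() + ext

for name in ['photo_v_01_20415.jpg', 'photo_v_01.20415.png', 'clip 7.mov']:
    print(split_any_extension(name))
```

Note that os.path.splitext only removes the last extension, so a name such as archive_12.tar.gz would not split at 12 with this approach.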
