I have a variety of complex filenames that I need to match with a regex. This is the general pattern, with optional groups in round brackets:
<main>_<country>-<region>(_<id>)(_<provider>)_<year><month><day>_<hour><minute><second>_<sensor>_<resolution>(_<bittype>).<format>
Here are some examples of what those filenames can look like:
fn1 = 'FOO_is-atest_123456_COMPANY_20190729_153343_SATEL_m0001_32bit.tif'
fn2 = 'FOO_is-atest_COMPANY_20190729_153343_SATEL_m0001_32bit.tif'
fn3 = 'FOO_is-atest_COMPANY_20190729_153343_SATEL_m0001.tif'
fn4 = 'FOO_is-atest_32tnt_20211125_120005_SATEL_m0001.tif'
fn5 = 'FOO_is-atest_20211125_120005_SATEL_cm070.tif'
fn6 = 'FOO_is-atest_20211125_120005_SATEL_cm070_32bit.tif'
The different components can sometimes have varying lengths. The tricky part is that the tile (<id> in the pattern above) and the provider can be of any length and contain any character.
I just can't get it to match all the cases. Here is the closest I came, testing it with an online regex tester:
import re
pattern = r'(?P<product>\w{3})' \
r'_(?P<country>\w{2})' \
r'-(?P<region>\w+)' \
r'_?(?P<tile>\w+)?' \
r'_?(?P<provider>\w+)?' \
r'_(?P<year>\d{4})' \
r'(?P<month>\d{2})' \
r'(?P<day>\d{2})' \
r'_(?P<hour>\d{2})' \
r'(?P<minute>\d{2})' \
r'(?P<second>\d{2})' \
r'_(?P<sensor>\w{5})' \
r'_(?P<res_unit>km|m|cm)' \
r'(?P<resolution>\d{3,4})' \
r'_?(?P<bittype>\d{1,2}bit)?' \
r'\.(?P<format>\w+)'
p = re.compile(pattern)
print(p.match(fn1).group('tile'), p.match(fn1).group('provider'))
print(p.match(fn2).group('provider'), p.match(fn2).group('bittype'))
print(p.match(fn3).group('provider'), p.match(fn3).group('resolution'))
print(p.match(fn4).group('tile'), p.match(fn4).group('year'))
print(p.match(fn5).group('provider'), p.match(fn5).group('resolution'))
print(p.match(fn6).group('provider'), p.match(fn6).group('bittype'))
# OUTPUTS:
>>> (None, None)
>>> (None, '32bit')
>>> (None, '0001')
>>> (None, '2021')
>>> (None, '070')
>>> (None, '32bit')
As you can see, tile and provider are not correctly recognized, so something is still not right. Everything else seems to work fine. Regexes are still somewhat of a mystery to me, to be honest.
Replace _?(?P<tile>\w+)? by (_(?P<tile>\w+))?, and do the same for "provider", because the underscore is not independently optional.
If you use \w, mind that it also matches underscores. If you do not expect your "fields" to contain underscores, replace every \w with [^\W_]. Also, _?(?P<name>[^\W_]+)? is not a good way to match an optional group; use (?:_(?P<name>[^\W_]+))? instead.
In fn4 = 'FOO_is-atest_32tnt_20211125_120005_SATEL_m0001.tif', is the field 32tnt the tile or the provider? How is the regex meant to know?
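The suggestions in the comments can be combined into a rough sketch like the one below. It assumes that no field ever contains an underscore. The helper build_pattern and the names loose and strict are made up for illustration. The strict variant additionally assumes that a tile always contains at least one digit and a provider consists of letters only; that rule is guessed from the six example filenames and is not something the question states.

```python
import re

# \w also matches "_", so \w+ can run across several fields; [^\W_]+ cannot.
print(re.match(r"\w+", "atest_123456").group())      # atest_123456
print(re.match(r"[^\W_]+", "atest_123456").group())  # atest

WORD = r"[^\W_]"  # like \w, but excluding the underscore


def build_pattern(tile=WORD + "+", provider=WORD + "+"):
    """Compile the filename regex; tile/provider sub-patterns are pluggable."""
    parts = [
        r"(?P<product>" + WORD + r"{3})",
        r"_(?P<country>" + WORD + r"{2})",
        r"-(?P<region>" + WORD + r"+)",
        # The underscore lives inside the optional group, so separator and
        # field are present or absent together.
        r"(?:_(?P<tile>" + tile + r"))?",
        r"(?:_(?P<provider>" + provider + r"))?",
        r"_(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})",
        r"_(?P<hour>\d{2})(?P<minute>\d{2})(?P<second>\d{2})",
        r"_(?P<sensor>" + WORD + r"{5})",
        r"_(?P<res_unit>km|m|cm)(?P<resolution>\d{3,4})",
        r"(?:_(?P<bittype>\d{1,2}bit))?",
        r"\.(?P<format>" + WORD + r"+)",
    ]
    return re.compile("".join(parts))


filenames = [
    "FOO_is-atest_123456_COMPANY_20190729_153343_SATEL_m0001_32bit.tif",
    "FOO_is-atest_COMPANY_20190729_153343_SATEL_m0001_32bit.tif",
    "FOO_is-atest_COMPANY_20190729_153343_SATEL_m0001.tif",
    "FOO_is-atest_32tnt_20211125_120005_SATEL_m0001.tif",
    "FOO_is-atest_20211125_120005_SATEL_cm070.tif",
    "FOO_is-atest_20211125_120005_SATEL_cm070_32bit.tif",
]

# Loose: tile and provider accept the same characters, so a single
# in-between field is always captured as the tile.
loose = build_pattern()

# Strict (ASSUMPTION, not from the question): a tile contains at least one
# digit, a provider is letters only.
strict = build_pattern(tile=r"(?=[^\W_]*\d)[^\W_]+", provider=r"[^\W\d_]+")

for fn in filenames:
    m_loose = loose.fullmatch(fn)
    m_strict = strict.fullmatch(fn)
    print(m_loose.group("tile", "provider"), m_strict.group("tile", "provider"))
```

With the loose pattern, COMPANY in the second and third filenames ends up in tile rather than provider, which is exactly the ambiguity raised in the last comment. The regex can only tell the two fields apart if something like the strict rule actually holds for the real data.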