I have a variety of complex filenames that I need to match with a regex. This is the general pattern, with optional groups in round brackets:

<main>_<country>-<region>(_<id>)(_<provider>)_<year><month><day>_<hour><minute><second>_<sensor>_<resolution>(_<bittype>).<format>

Here are some examples of what those filenames can look like:

fn1 = 'FOO_is-atest_123456_COMPANY_20190729_153343_SATEL_m0001_32bit.tif'
fn2 = 'FOO_is-atest_COMPANY_20190729_153343_SATEL_m0001_32bit.tif'
fn3 = 'FOO_is-atest_COMPANY_20190729_153343_SATEL_m0001.tif'
fn4 = 'FOO_is-atest_32tnt_20211125_120005_SATEL_m0001.tif'
fn5 = 'FOO_is-atest_20211125_120005_SATEL_cm070.tif'
fn6 = 'FOO_is-atest_20211125_120005_SATEL_cm070_32bit.tif'

Some components vary in length. The tricky part is that tile (the <id> in the pattern above) and provider can have any length and contain any characters.

I just can't get it to match all the cases. Here is the closest I came, using an online regex tester:

import re

pattern = r'(?P<product>\w{3})' \
          r'_(?P<country>\w{2})' \
          r'-(?P<region>\w+)' \
          r'_?(?P<tile>\w+)?' \
          r'_?(?P<provider>\w+)?' \
          r'_(?P<year>\d{4})' \
          r'(?P<month>\d{2})' \
          r'(?P<day>\d{2})' \
          r'_(?P<hour>\d{2})' \
          r'(?P<minute>\d{2})' \
          r'(?P<second>\d{2})' \
          r'_(?P<sensor>\w{5})' \
          r'_(?P<res_unit>km|m|cm)' \
          r'(?P<resolution>\d{3,4})' \
          r'_?(?P<bittype>\d{1,2}bit)?' \
          r'.(?P<format>\w+)'

p = re.compile(pattern)

print(p.match(fn1).group('tile'), p.match(fn1).group('provider'))
print(p.match(fn2).group('provider'), p.match(fn2).group('bittype'))
print(p.match(fn3).group('provider'), p.match(fn3).group('resolution'))
print(p.match(fn4).group('tile'), p.match(fn4).group('year'))
print(p.match(fn5).group('provider'), p.match(fn5).group('resolution'))
print(p.match(fn6).group('provider'), p.match(fn6).group('bittype'))

# OUTPUTS:
>>> (None, None)
>>> (None, '32bit')
>>> (None, '0001')
>>> (None, '2021')
>>> (None, '070')
>>> (None, '32bit')

As you can see, tile and provider are not recognized correctly, so something is still not right. Everything else seems to work fine. Regexes are still somewhat of a mystery to me, to be honest.

  • I think it would at least become better if you replace _?(?P<tile>\w+)? by (_(?P<tile>\w+))?, same for "provider" because the underscore is not independently optional. Commented Nov 25, 2021 at 14:01
  • So, from a regex point of view: if both provider and tile are just one or more word chars and only one of them is present, how is the regex meant to know whether it is the tile or the provider? Or is there a more specific format that identifies one over the other? Commented Nov 25, 2021 at 14:03
  • Whenever you use \w, mind that it also matches underscores. If you do not expect your "fields" to contain underscores, replace all \w with [^\W_]. Also, _?(?P<name>[^\W_]+)? is not a good way to match optional groups, use (?:_(?P<name>[^\W_]+))? instead. Commented Nov 25, 2021 at 14:14
  • In the line fn4 = 'FOO_is-atest_32tnt_20211125_120005_SATEL_m0001.tif', is the field 32tnt the tile or the provider? How is the regex meant to know? Commented Nov 25, 2021 at 14:20
  • Do you need regex101.com/r/zwv03L/1? Commented Nov 25, 2021 at 14:53
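
The two regex points raised in these comments can be checked in isolation. The snippet below is only a minimal sketch: the toy patterns, the a...b test strings and the group name x are made up for illustration and are not part of the filename scheme.

```python
import re

# \w also matches "_", so a greedy \w+ runs straight across field separators.
assert re.match(r'\w+', 'atest_COMPANY').group() == 'atest_COMPANY'
# [^\W_] means "any word character except underscore", so it stops at "_".
assert re.match(r'[^\W_]+', 'atest_COMPANY').group() == 'atest'

# Hypothetical toy patterns: literal "a", an optional field x, then "_b".
loose = re.compile(r'a_?(?P<x>[^\W_]+)?_b$')       # separator and value optional independently
strict = re.compile(r'a(?:_(?P<x>[^\W_]+))?_b$')   # separator and value optional together


def field(pattern, text):
    """Return the captured x field, or 'no match' if the pattern fails."""
    m = pattern.match(text)
    return m.group('x') if m else 'no match'


for text in ('a_FOO_b', 'a_b', 'aFOO_b', 'a__b'):
    print(text, '| loose:', field(loose, text), '| strict:', field(strict, text))
```

The loose form accepts aFOO_b (a value without its separator) and a__b (a separator without a value), while the strict form rejects both and still accepts a_FOO_b and a_b.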

1 Answer

You can use

^(?P<product>[^\W_]{3})_(?P<country>[^\W_]{2})-(?P<region>\w+?)(?:_(?P<tile>[^_]+))??(?:_(?P<provider>[^\W_]+))?_(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})_(?P<hour>\d{2})(?P<minute>\d{2})(?P<second>\d{2})_(?P<sensor>[^\W_]{5})_(?P<res_unit>km|m|cm)(?P<resolution>\d{3,4})(?:_(?P<bittype>\d{1,2}bit))?\.(?P<format>\w+)$

Details:

  • ^ - start of string
  • (?P<product>[^\W_]{3}) - Group "product": three alphanumeric chars
  • _ - an underscore
  • (?P<country>[^\W_]{2}) - Group "country": two alphanumeric chars
  • - - a hyphen
  • (?P<region>\w+?) - Group "region": one or more alphanumeric or underscore chars, as few as possible
  • (?:_(?P<tile>[^_]+))?? - an optional sequence of patterns that is matched only if the subsequent patterns in the regex fail to match (see lazy ?? quantifier):
    • _ - an underscore
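    • (?P<tile>[^_]+) - Group "tile": one or more chars other than _
  • (?:_(?P<provider>[^\W_]+))? - an optional sequence of _ and then Group "provider": one or more alphanumeric chars
  • _(?P<year>\d{4}) - an underscore and then Group "year": four digits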
  • (?P<month>\d{2}) - Group "month": two digits
  • (?P<day>\d{2}) - Group "day": two digits
  • _ - an underscore
  • (?P<hour>\d{2}) - Group "hour": two digits
  • (?P<minute>\d{2}) - Group "minute": two digits
  • (?P<second>\d{2}) - Group "second": two digits
  • _ - an underscore
  • (?P<sensor>[^\W_]{5}) - Group "sensor": five alphanumeric chars
  • _ - an underscore
  • (?P<res_unit>km|m|cm) - Group "res_unit": km, m or cm (also [kc]m can be used)
  • (?P<resolution>\d{3,4}) - Group "resolution": three or four digits
  • (?:_(?P<bittype>\d{1,2}bit))? - an optional sequence of _ and then Group "bittype" capturing one or two digits and then bit string
  • \. - a dot
  • (?P<format>\w+) - Group "format": one or more alphanumeric/underscore chars
  • $ - end of string.
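
As a quick check, the pattern can be run against the six filenames from the question. This is a sketch only: the regex is the one above, split across lines for readability, and the helper name parse is an assumption made for this example.

```python
import re

# The pattern from this answer, one field (or field group) per line.
FILENAME_RE = re.compile(
    r'^(?P<product>[^\W_]{3})'
    r'_(?P<country>[^\W_]{2})'
    r'-(?P<region>\w+?)'
    r'(?:_(?P<tile>[^_]+))??'
    r'(?:_(?P<provider>[^\W_]+))?'
    r'_(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})'
    r'_(?P<hour>\d{2})(?P<minute>\d{2})(?P<second>\d{2})'
    r'_(?P<sensor>[^\W_]{5})'
    r'_(?P<res_unit>km|m|cm)(?P<resolution>\d{3,4})'
    r'(?:_(?P<bittype>\d{1,2}bit))?'
    r'\.(?P<format>\w+)$'
)


def parse(name):
    """Return a dict of all named groups, or None if the name does not match."""
    m = FILENAME_RE.match(name)
    return m.groupdict() if m else None


filenames = [
    'FOO_is-atest_123456_COMPANY_20190729_153343_SATEL_m0001_32bit.tif',
    'FOO_is-atest_COMPANY_20190729_153343_SATEL_m0001_32bit.tif',
    'FOO_is-atest_COMPANY_20190729_153343_SATEL_m0001.tif',
    'FOO_is-atest_32tnt_20211125_120005_SATEL_m0001.tif',
    'FOO_is-atest_20211125_120005_SATEL_cm070.tif',
    'FOO_is-atest_20211125_120005_SATEL_cm070_32bit.tif',
]

for name in filenames:
    fields = parse(name)
    print(fields['tile'], fields['provider'], fields['bittype'])

# Prints:
# 123456 COMPANY 32bit
# None COMPANY 32bit
# None COMPANY None
# None 32tnt None
# None None None
# None None 32bit
```

When only one of the two optional fields is present, the lazy ?? on the tile group means the value always lands in "provider", which is why 32tnt in the fourth filename is reported as a provider.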

2 Comments

Almost... in the provided examples, example 4 contains 32tnt, which is not the provider but the tile. As others pointed out, the two cannot be differentiated without a specific rule. I need to accept that and work around it in other ways, although I really appreciate your dedication to this question.
@s6hebern It is still usable. Extract the "tile" and "provider" groups with this regex, then convert both into spaCy documents and check the entities: if one of them is a company name, it is the provider. If you have a finite list of providers, you can use an alternation such as provider1|provider2|etc instead of a generic pattern.
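
The finite-list suggestion could look roughly like the sketch below. It rests on an assumption: the provider names in KNOWN_PROVIDERS are hypothetical placeholders (only COMPANY appears in the question), and the helper name parse_known is made up for this example. With a closed provider list, any other single optional field is treated as the tile.

```python
import re

# Hypothetical list; replace with the real provider names.
KNOWN_PROVIDERS = ['COMPANY', 'OTHERCORP']

# Escape each name and try longer names first, so that a short name
# cannot shadow a longer one that starts with it.
provider_alt = '|'.join(
    re.escape(p) for p in sorted(KNOWN_PROVIDERS, key=len, reverse=True)
)

KNOWN_PROVIDER_RE = re.compile(
    r'^(?P<product>[^\W_]{3})'
    r'_(?P<country>[^\W_]{2})'
    r'-(?P<region>\w+?)'
    r'(?:_(?P<tile>[^_]+))??'
    r'(?:_(?P<provider>' + provider_alt + r'))?'
    r'_(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})'
    r'_(?P<hour>\d{2})(?P<minute>\d{2})(?P<second>\d{2})'
    r'_(?P<sensor>[^\W_]{5})'
    r'_(?P<res_unit>km|m|cm)(?P<resolution>\d{3,4})'
    r'(?:_(?P<bittype>\d{1,2}bit))?'
    r'\.(?P<format>\w+)$'
)


def parse_known(name):
    """Return a dict of all named groups, or None if the name does not match."""
    m = KNOWN_PROVIDER_RE.match(name)
    return m.groupdict() if m else None


for name in (
    'FOO_is-atest_123456_COMPANY_20190729_153343_SATEL_m0001_32bit.tif',
    'FOO_is-atest_COMPANY_20190729_153343_SATEL_m0001.tif',
    'FOO_is-atest_32tnt_20211125_120005_SATEL_m0001.tif',
):
    fields = parse_known(name)
    print(fields['tile'], fields['provider'])
```

Under this assumption, 32tnt in the third filename is captured as the tile, because it is not one of the listed provider names.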
