Regex in Python: matching duplicates of optional substrings

Question

I am developing a Python package that, among other things, needs to process a file containing a list of dataset names, and I need to extract the components of these names.

Examples of dataset names would be:

diskLineLuminosity:halpha:rest:z1.0
diskLineLuminosity:halpha:rest:z1.0:dust
diskLineLuminosity:halpha:rest:z1.0:contam_NII
diskLineLuminosity:halpha:rest:z1.0:contam_NII:contam_OII:contam_OIII
diskLineLuminosity:halpha:rest:z1.0:contam_NII:contam_OIII:dust
diskLineLuminosity:halpha:rest:z1.0:contam_OII:contam_NII
diskLineLuminosity:halpha:rest:z1.0:contam_NII:recent

I'm looking for a way to parse the dataset names using regex to extract all the dataset information, including a list of all instances of "contam_*" (where zero instances are allowed). I realise that I could just split the string and use fnmatch.filter, or equivalent, but I also need to be able to flag erroneous dataset names that do not match the above syntax. Also, regex is currently used extensively in similar situations throughout the package, so I would prefer not to introduce a second parsing method.

As an MWE, with an example dataset name, I have pieced together:

import re
datasetName = "diskLineLuminosity:halpha:rest:z1.0:contam_NII:recent"
M = re.search(r"^(disk|spheroid)LineLuminosity:([^:]+):([^:]+):z([\d\.]+)(:recent)?(:contam_[^:]+)?(:dust[^:]+)?",datasetName)

This returns:

print(M.group(1,2,3,4,5,6,7))
('disk', 'halpha', 'rest', '1.0', None, ':contam_NII', None)

In the package, this regex search needs to go into a function similar to:

def getDatasetNameInformation(datasetName):
    INFO = re.search(r"^(disk|spheroid)LineLuminosity:([^:]+):([^:]+):z([\d\.]+)(:recent)?(:contam_[^:]+)?(:dust[^:]+)?",datasetName)
    if not INFO:
        raise ParseError("Cannot parse '"+datasetName+"'!")
    return INFO

I am still new to using regex so how can I modify the re.search string to successfully parse all of the above dataset names and extract the information in the substrings (including a list of all the instances of contamination)?

Thanks for any help you can provide!

Jan · Accepted Answer · 2018-02-07 21:00:22Z

2

If you are still learning regular expressions (and, to be honest, later as well), get in the habit of using verbose mode as often as possible: it makes for better code and more readable expressions.

That said, you could use

^
(disk|spheroid)
LineLuminosity:
([^:]+):
([^:]+):
z([\d\.]+)
((?::contam_[^:]+)+)?
(:recent)?
(:dust[^:]*)?

I just changed the order a bit and used a non-capturing group inside the contam part; see a demo on regex101.com.
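
As a rough sketch of how the expression above might be used in Python with verbose mode, it can be compiled with the re.VERBOSE flag. The name DATASET_RE, the inline comments, and the trailing $ are assumptions added for illustration (the $ is there so that names with unexpected trailing parts fail to match); they are not part of the pattern as posted.

```python
import re

# Hypothetical name. Same pattern as above, compiled with re.VERBOSE so that
# whitespace is ignored and each piece can carry a comment.
DATASET_RE = re.compile(r"""
    ^
    (disk|spheroid)          # disk or spheroid
    LineLuminosity:
    ([^:]+):                 # second field, e.g. halpha
    ([^:]+):                 # third field, e.g. rest
    z([\d.]+)                # z followed by digits and dots
    ((?::contam_[^:]+)+)?    # all contam_ parts, captured as one string
    (:recent)?
    (:dust[^:]*)?
    $                        # added assumption: reject anything left over
""", re.VERBOSE)

m = DATASET_RE.search(
    "diskLineLuminosity:halpha:rest:z1.0:contam_NII:contam_OIII:dust"
)
print(m.groups())
# ('disk', 'halpha', 'rest', '1.0', ':contam_NII:contam_OIII', None, ':dust')
```

Note that the contam group comes back as a single string, so it still has to be split (for example with re.findall) to get the individual contaminants.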


trincot · Accepted Answer · 2018-02-07 21:10:40Z

0

You could capture all of those contam_ parts with ((?::contam_[^:]+)*), which captures all of them in one group. Then run a second regular expression on just that match, and use its result as a nested list within the first results:

import re
datasetName = "diskLineLuminosity:halpha:rest:z1.0:recent:contam_NII:contam_NII:dust"
M = re.search(r"^(disk|spheroid)LineLuminosity:([^:]+):([^:]+):z([\d\.]+)(?::(recent))?((?::contam_[^:]+)*)(?::(dust))?",datasetName)
lst = list(M.groups())
if lst[5]:
    lst[5] = re.findall(":contam_([^:]+)", lst[5])

print(lst)

Output:

['disk', 'halpha', 'rest', '1.0', 'recent', ['NII', 'NII'], 'dust']
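
To tie this back to the function in the question, a minimal sketch might look like the following. ParseError is not defined in the question, so a stand-in exception class is declared here, and _NAME_RE is a hypothetical name. Using fullmatch (Python 3.4+) so that trailing garbage is rejected is also an assumption about the desired behaviour.

```python
import re

class ParseError(ValueError):
    """Stand-in for the package's own ParseError (assumed, not shown in the question)."""

# Hypothetical name; same pattern as in the snippet above, without the ^
# because fullmatch already anchors at both ends.
_NAME_RE = re.compile(
    r"(disk|spheroid)LineLuminosity:([^:]+):([^:]+):z([\d.]+)"
    r"(?::(recent))?((?::contam_[^:]+)*)(?::(dust))?"
)

def getDatasetNameInformation(datasetName):
    m = _NAME_RE.fullmatch(datasetName)
    if m is None:
        raise ParseError("Cannot parse '" + datasetName + "'!")
    info = list(m.groups())
    # Group 6 is '' when there are no contam_ parts, so findall returns [].
    info[5] = re.findall(r":contam_([^:]+)", info[5])
    return info

print(getDatasetNameInformation(
    "diskLineLuminosity:halpha:rest:z1.0:contam_NII:contam_OII:contam_OIII"
))
# ['disk', 'halpha', 'rest', '1.0', None, ['NII', 'OII', 'OIII'], None]
```

The optional groups are matched in the order written (recent, then contam_, then dust), so a name such as diskLineLuminosity:halpha:rest:z1.0:contam_NII:recent from the question raises ParseError here; reorder the groups in the pattern if that form must be accepted.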
