I am developing a Python package that, among other things, needs to process a file containing a list of dataset names and extract the components of each name.

Examples of dataset names would be:

  • diskLineLuminosity:halpha:rest:z1.0
  • diskLineLuminosity:halpha:rest:z1.0:dust
  • diskLineLuminosity:halpha:rest:z1.0:contam_NII
  • diskLineLuminosity:halpha:rest:z1.0:contam_NII:contam_OII:contam_OIII
  • diskLineLuminosity:halpha:rest:z1.0:contam_NII:contam_OIII:dust
  • diskLineLuminosity:halpha:rest:z1.0:contam_OII:contam_NII
  • diskLineLuminosity:halpha:rest:z1.0:contam_NII:recent

I'm looking for a way to parse the dataset names using regex to extract all the dataset information, including a list of all instances of "contam_*" (where zero instances are allowed). I realise that I could just split the string and use fnmatch.filter, or equivalent, but I also need to be able to flag erroneous dataset names that do not match the above syntax. Also, regex is already used extensively in similar situations throughout the package, so I would prefer not to introduce a second parsing method.

As an MWE, with an example dataset name, I have pieced together:

import re
datasetName = "diskLineLuminosity:halpha:rest:z1.0:contam_NII:recent"
M = re.search(r"^(disk|spheroid)LineLuminosity:([^:]+):([^:]+):z([\d\.]+)(:recent)?(:contam_[^:]+)?(:dust[^:]+)?", datasetName)

This returns:

print(M.group(1, 2, 3, 4, 5, 6, 7))
('disk', 'halpha', 'rest', '1.0', None, ':contam_NII', None)

In the package, this regex search needs to go into a function similar to:

def getDatasetNameInformation(datasetName):
    INFO = re.search(r"^(disk|spheroid)LineLuminosity:([^:]+):([^:]+):z([\d\.]+)(:recent)?(:contam_[^:]+)?(:dust[^:]+)?", datasetName)
    if not INFO:
        raise ParseError("Cannot parse '"+datasetName+"'!")
    return INFO

I am still new to regex. How can I modify the re.search pattern so that it successfully parses all of the above dataset names and extracts the information in the substrings (including a list of all the instances of contamination)?

Thanks for any help you can provide!

2 Answers

2

If you are still learning regular expressions (and, to be honest, later as well), get in the habit of using verbose mode as often as possible. It makes for better code and more readable expressions.

That said, you could use

^
(disk|spheroid)
LineLuminosity:
([^:]+):
([^:]+):
z([\d\.]+)
((?::contam_[^:]+)+)?
(:recent)?
(:dust[^:]*)?

I just changed the order a bit and used a non-capturing group inside the contam part, so that one outer group captures every repeated contam_* component. See a demo on regex101.com.
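As a minimal sketch of how the expression above might be used from Python in verbose mode: the name DATASET_RE, the inline comments, and the trailing `$` are additions for illustration and are not part of the expression as posted. The `$` is included on the assumption that names with unexpected trailing components should fail to match, since the question asks for erroneous names to be flagged.

```python
import re

# Hypothetical name. The pattern is the one above, compiled with re.VERBOSE
# so that whitespace and comments inside the pattern are ignored.
# The trailing `$` is an added assumption: it rejects unexpected trailing parts.
DATASET_RE = re.compile(r"""
    ^
    (disk|spheroid)          # group 1
    LineLuminosity:
    ([^:]+):                 # group 2, e.g. halpha
    ([^:]+):                 # group 3, e.g. rest
    z([\d\.]+)               # group 4, e.g. 1.0
    ((?::contam_[^:]+)+)?    # group 5: every contam_* part, as one string
    (:recent)?               # group 6
    (:dust[^:]*)?            # group 7
    $
""", re.VERBOSE)

m = DATASET_RE.search("diskLineLuminosity:halpha:rest:z1.0:contam_NII:contam_OIII:dust")
print(m.groups())
# ('disk', 'halpha', 'rest', '1.0', ':contam_NII:contam_OIII', None, ':dust')
```

Group 5 still holds all contam_* parts as a single string, so it needs a further split if a list is wanted.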

0

You could capture all of the contam_ parts in one group with ((?::contam_[^:]+)*). Then run a second regular expression on that match alone, and use its result as a nested list within the first set of results:

import re
datasetName = "diskLineLuminosity:halpha:rest:z1.0:recent:contam_NII:contam_NII:dust"
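# The sixth group captures every ":contam_*" part as a single string.
M = re.search(r"^(disk|spheroid)LineLuminosity:([^:]+):([^:]+):z([\d\.]+)(?::(recent))?((?::contam_[^:]+)*)(?::(dust))?", datasetName)
lst = list(M.groups())
# Split that string into the individual contamination names.
if lst[5]:
    lst[5] = re.findall(r":contam_([^:]+)", lst[5])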

print(lst)

Output:

['disk', 'halpha', 'rest', '1.0', 'recent', ['NII', 'NII'], 'dust']
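The same two-step idea could be folded into the function from the question. This is only a sketch under several assumptions: ParseError is defined locally here because the question does not show it, the ordering (contam_* parts, then recent, then dust) is chosen to fit the example names in the question rather than the ordering used in the snippet above, and a `$` anchor is added so that names with leftover components raise an error.

```python
import re

class ParseError(ValueError):
    """Hypothetical exception class; the question's ParseError is not shown."""

# Assumed ordering: contam_* parts, then "recent", then "dust".
# The `$` anchor is an assumption so that leftover components are rejected.
_NAME_RE = re.compile(
    r"^(disk|spheroid)LineLuminosity:([^:]+):([^:]+):z([\d.]+)"
    r"((?::contam_[^:]+)*)(?::(recent))?(?::(dust))?$"
)

def getDatasetNameInformation(datasetName):
    match = _NAME_RE.search(datasetName)
    if not match:
        raise ParseError("Cannot parse '" + datasetName + "'!")
    info = list(match.groups())
    # Second pass: turn the single contam_* capture into a list (empty if none).
    info[4] = re.findall(r":contam_([^:]+)", info[4])
    return info

print(getDatasetNameInformation(
    "diskLineLuminosity:halpha:rest:z1.0:contam_NII:contam_OIII:dust"))
# ['disk', 'halpha', 'rest', '1.0', ['NII', 'OIII'], None, 'dust']
```

Because the contam group uses `*` rather than an optional `+` group, it always participates in the match, so the second pass yields an empty list rather than None when no contam_* parts are present.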
