I have a list of compounds like this:

ex = ['CrO3', 'Cr8O21', 'NbCrO4']

And I would like to get the elements and numbers separately. Something like this:

['Cr','O',3]

['Cr',8,'O',21]

['Nb','Cr','O',4]

However, this HAS to be a general process, since these will not always be the compounds I am working with. I think this can be accomplished with a regex and re.split(), but I am having trouble finding the right regular expression to get what I want.

Here is what I have right now:

import re

# elements to split by
split_elements = ['Cr','Nb','O']

def split(compound, split_elements):
    separated = []
    splitstr = ")|(?=".join([str(elem) for elem in split_elements]) 
    splitstr = '('+splitstr+')'
    # splitstr will end up like this: 
    # (Cr)|(?=Nb)|(?=O)

    result = list(filter(None,re.split(splitstr, compound)))
    separated.append(result)

    return(separated)

for item in ex:
    print(split(item, split_elements))
# Output
# [['Cr', 'O3']]
# [['Cr', '8O21']]
# [['Nb', 'Cr', 'O4']]

As you can see, the numbers are still attached, and I'm not sure why. I've searched for a similar issue, but I can't find any (and what I have right now is already the result of furious googling).

Does anyone have any solutions or suggestions?

3 Answers

Don't use split; use re.findall() with a regexp that matches either case: an uppercase letter optionally followed by one lowercase letter, or a run of one or more digits.

re.findall(r'[A-Z][a-z]?|\d+', compound)
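Note that findall() returns the counts as strings, while the question asks for integers. A minimal sketch of one way to convert them, assuming formulas contain only element symbols and digit runs (no parentheses, charges, or hydrates); the helper name parse_formula is made up for illustration:

```python
import re

def parse_formula(compound):
    """Split a formula such as 'Cr8O21' into symbols and integer counts."""
    tokens = re.findall(r'[A-Z][a-z]?|\d+', compound)
    # Digit runs become ints; element symbols stay as strings.
    return [int(tok) if tok.isdigit() else tok for tok in tokens]

for item in ['CrO3', 'Cr8O21', 'NbCrO4']:
    print(parse_formula(item))
# ['Cr', 'O', 3]
# ['Cr', 8, 'O', 21]
# ['Nb', 'Cr', 'O', 4]
```

Any character the pattern does not match is silently skipped by findall(), so validate the input separately if that matters.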

1 Comment

I think this will work! I will update on whether it works on everything!

You can use re.findall to decompose the chemical compounds.

>>> import re    
>>> ex = ['CrO3', 'Cr8O21', 'NbCrO4']
>>> re.findall(r'[A-Z][a-z]?|\d+', ex[0])
['Cr', 'O', '3']
>>> re.findall(r'[A-Z][a-z]*|\d+', ex[1])
['Cr', '8', 'O', '21']
>>> re.findall(r'[A-Z][a-z]*|\d+', ex[2])
['Nb', 'Cr', 'O', '4']

Although you should probably check out one of the many packages on PyPI that deal with chemistry if you plan to do anything more complicated.

1 Comment

[a-z]* is wrong, because IUPAC atomic symbols only ever have at most one lowercase following the (mandatory) uppercase.

The regex answers are good, though you could always write a dictionary or list with the elements.

elements = ['O', 'Cr', 'Nb', ...]
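One way such a list could be used, sketched under the assumption that it holds every symbol you expect to see: build the regex alternation from it and add a branch for digit runs, so the counts are captured too. The list contents and the helper name split_known below are hypothetical.

```python
import re

# Hypothetical, deliberately incomplete list; extend it with the symbols you need.
elements = ['O', 'Cr', 'Nb', 'C', 'N']

# Try longer symbols first so 'Cr' is matched before the shorter 'C'.
symbols = sorted(elements, key=len, reverse=True)
pattern = re.compile('|'.join(map(re.escape, symbols)) + r'|\d+')

def split_known(compound):
    """Return the known element symbols and digit runs found in compound."""
    return pattern.findall(compound)

print(split_known('NbCrO4'))
# ['Nb', 'Cr', 'O', '4']
```

Compared with the generic [A-Z][a-z]? pattern, this only matches symbols you have listed, but anything not in the list is skipped without an error.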

Or look into libraries such as https://pubchempy.readthedocs.io/en/latest/

1 Comment

How does that help with getting the numbers after the chemical symbols?
