Python split string by multiple specific characters and keep the "delimiters"

Question

I have a list of compounds like this:

ex = ['CrO3', 'Cr8O21', 'NbCrO4']

And I would like to get the elements and numbers separately. Something like this:

['Cr','O',3]

['Cr',8,'O',21]

['Nb','Cr','O',4]

However, this has to be a general process, because these will not always be the compounds I am working with. I think this can be accomplished using a regex and the split() function, but I am having trouble finding the right pattern to get what I want.

Here is what I have right now:

import re

# elements to split by
split_elements = ['Cr','Nb','O']

def split(compound, split_elements):
    separated = []
    splitstr = ")|(?=".join([str(elem) for elem in split_elements]) 
    splitstr = '('+splitstr+')'
    # splitstr will end up like this: 
    # (Cr)|(?=Nb)|(?=O)

    result = list(filter(None,re.split(splitstr, compound)))
    separated.append(result)

    return(separated)

for item in ex:
    print(split(item, split_elements))
# Output
# [['Cr', 'O3']]
# [['Cr', '8O21']]
# [['Nb', 'Cr', 'O4']]

As you can see, the numbers are still attached, and I'm not sure why. I've searched for a similar issue, but I can't find any (and what I have right now is already the result of furious googling).

Does anyone have any solutions or suggestions?

Barmar · Accepted Answer · 2020-06-10 05:17:36Z

1

Don't use split; use re.findall() and write a regexp that matches either case: an uppercase letter optionally followed by a lowercase letter, or one or more digits.

re.findall(r'[A-Z][a-z]?|\d+', compound)
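As a minimal sketch of how that pattern could be applied to the whole list from the question, the snippet below also converts the digit tokens to int, since the desired output in the question shows numbers rather than strings. The names TOKEN and parse_formula are made up for this illustration and are not part of the answer.

```python
import re

# Same pattern as above: one uppercase letter optionally followed by
# one lowercase letter, or a run of one or more digits.
# TOKEN and parse_formula are hypothetical names used only for this sketch.
TOKEN = re.compile(r'[A-Z][a-z]?|\d+')

def parse_formula(compound):
    """Return element symbols as str and counts as int, in order."""
    return [int(tok) if tok.isdigit() else tok
            for tok in TOKEN.findall(compound)]

ex = ['CrO3', 'Cr8O21', 'NbCrO4']
for compound in ex:
    print(parse_formula(compound))
# ['Cr', 'O', 3]
# ['Cr', 8, 'O', 21]
# ['Nb', 'Cr', 'O', 4]
```

Note that findall silently skips any characters the pattern does not match (parentheses, for example), so this only covers flat formulas like the ones in the question.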

answered Jun 10, 2020 at 5:17

1 Comment

hschroad Over a year ago

I think this will work! I will update on whether it works on everything!

timgeb · Accepted Answer · 2020-06-10 05:38:41Z

1

You can use re.findall to decompose the chemical compounds.

>>> import re
>>> ex = ['CrO3', 'Cr8O21', 'NbCrO4']
>>> re.findall(r'[A-Z][a-z]?|\d+', ex[0])
['Cr', 'O', '3']
>>> re.findall(r'[A-Z][a-z]*|\d+', ex[1])
['Cr', '8', 'O', '21']
>>> re.findall(r'[A-Z][a-z]*|\d+', ex[2])
['Nb', 'Cr', 'O', '4']

That said, if you plan to do anything more complicated, you should probably check out one of the many packages on PyPI that deal with chemistry.

edited Jun 10, 2020 at 5:38

answered Jun 10, 2020 at 5:17

1 Comment

Tim Biegeleisen Over a year ago

[a-z]* is wrong, because IUPAC atomic symbols only ever have at most one lowercase following the (mandatory) uppercase.

iansedano · Accepted Answer · 2020-06-10 05:21:17Z

0

The regex answers are good, though you could always write a dictionary or list with the elements.

elements = ['O', 'Cr', 'Nb', ...]

Or look into libraries such as https://pubchempy.readthedocs.io/en/latest/
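One way the element-list idea could be combined with a regex is sketched below: build the alternation from the known symbols and add a branch for the digits. This is only an illustration under stated assumptions; the short symbol list, the pattern variable, and the split_known name are hypothetical, and a real list would need every element you expect to see.

```python
import re

# Hypothetical, deliberately incomplete symbol list; extend as needed.
elements = ['O', 'Cr', 'Nb', 'C', 'N']

# Try longer symbols first so that 'Cr' is matched before 'C'.
symbols = sorted(elements, key=len, reverse=True)

# Alternation of the known symbols, plus a branch for runs of digits.
pattern = re.compile('|'.join(map(re.escape, symbols)) + r'|\d+')

def split_known(compound):
    """Return the known symbols and digit runs found in compound, in order."""
    return pattern.findall(compound)

for compound in ['CrO3', 'Cr8O21', 'NbCrO4']:
    print(split_known(compound))
# ['Cr', 'O', '3']
# ['Cr', '8', 'O', '21']
# ['Nb', 'Cr', 'O', '4']
```

Compared with the generic [A-Z][a-z]? pattern, this only accepts symbols that are in the list, but anything that is not in the list is skipped without an error, so a stricter version would need to check that the tokens add up to the full input.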

answered Jun 10, 2020 at 5:21

1 Comment

Barmar Over a year ago

How does that help with getting the numbers after the chemical symbols?
