827

I found some answers online, but I have no experience with regular expressions, which I believe is what is needed here.

I have a string that needs to be split by either a ';' or ', '. That is, the delimiter has to be either a semicolon or a comma followed by a space. Individual commas without trailing spaces should be left untouched.

Example string:

"b-staged divinylsiloxane-bis-benzocyclobutene [124221-30-3], mesitylene [000108-67-8]; polymerized 1,2-dihydro-2,2,4- trimethyl quinoline [026780-96-1]"

should be split into a list containing the following:

('b-staged divinylsiloxane-bis-benzocyclobutene [124221-30-3]' , 'mesitylene [000108-67-8]', 'polymerized 1,2-dihydro-2,2,4- trimethyl quinoline [026780-96-1]') 

5 Answers

1332

Luckily, Python has this built-in :)

import re

# Regex pattern splits on substrings "; " and ", "
re.split('; |, ', string_to_split)

Update:

Following your comment:

>>> string_to_split = 'Beautiful, is; better*than\nugly'
>>> import re
>>> re.split(r'; |, |\*|\n', string_to_split)
['Beautiful', 'is', 'better', 'than', 'ugly']
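To tie this back to the question, here is a minimal sketch that runs the same pattern on the example string from the question. The variable names `text` and `parts` are only for illustration. Commas that are not followed by a space (as in "1,2-dihydro-2,2,4-") are left alone.

```python
import re

# The example string from the question.
text = ("b-staged divinylsiloxane-bis-benzocyclobutene [124221-30-3], "
        "mesitylene [000108-67-8]; "
        "polymerized 1,2-dihydro-2,2,4- trimethyl quinoline [026780-96-1]")

# Split on "; " or ", " only; a bare comma is not a delimiter.
parts = re.split(r'; |, ', text)

for part in parts:
    print(part)
```

This should print the three entries the question asks for, one per line.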

16 Comments

I'd prefer to write it as: re.split(r';|,\s', a) by replacing ' ' (space character) with '\s' (white space) unless space character is a strict requirement.
I wonder why (regular) split just can't accept a list, that seems like a more obvious way instead of encoding multiple options in a line.
It is worth noting that the pattern is treated as a regular expression, as mentioned above. So trying to split a string on . will split at every single character. You need to escape it: \.
Just to add to this a little bit, instead of adding a bunch of or "|" symbols you can do the following: re.split('[;,.\-\%]',str), where inside of [ ] you put all the characters you want to split by.
Is there a way to retain the delimiters in the output but combine them together? I know that doing re.split('(; |, |\*|\n)', a) will retain the delimiters, but how can I combine subsequent delimiters into one element in the output list?
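One possible answer to the last comment, as a sketch rather than a definitive solution: wrap the alternatives in a non-capturing group, repeat it with +, and capture the whole run. The helper name `split_keep_runs` is made up for this example.

```python
import re

def split_keep_runs(text):
    """Split on runs of '; ', ', ', '*' or newline, keeping each run as one item.

    Hypothetical helper: the inner (?:...) groups the alternatives without
    capturing, the + merges consecutive delimiters, and the outer (...)
    makes re.split keep the merged run in the output.
    """
    return re.split(r'((?:; |, |\*|\n)+)', text)

result = split_keep_runs('Beautiful, ; is*\nbetter')
print(result)  # ['Beautiful', ', ; ', 'is', '*\n', 'better']
```

Because every delimiter run is kept, joining the result with '' gives back the original string.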
529

Do a str.replace('; ', ', ') and then a str.split(', ')

8 Comments

Suppose you have 5 delimiters: you have to traverse your string 5 times.
That is very bad for performance.
This shows a different vision of yours toward this problem. I think it is a great one. "If you don't know a direct answer, use combination of things you know to solve it".
If you have a small number of delimiters and are performance-constrained, the replace trick is the fastest of all: 15x faster than regexp, and almost 2x faster than a nested for in val.split(...) generator.
Performance is not always a concern. My use case was to process input from a human-entered command line argument so this solution was quite ideal. I also try to avoid regex whenever possible. Easy to create, very difficult to read.
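For more than two delimiters, the replace idea can be generalized roughly as below. This is a sketch under assumptions: the function name `split_by_replace` is invented here, and it assumes that replacing one delimiter never accidentally creates or destroys another one.

```python
def split_by_replace(text, delimiters):
    """Normalize every delimiter to the first one, then split once.

    Hypothetical helper: each str.replace call scans the string again,
    so this does one pass per extra delimiter plus one for the split.
    """
    first, *rest = delimiters
    for delimiter in rest:
        text = text.replace(delimiter, first)
    return text.split(first)

result = split_by_replace('a, b; c,d', [', ', '; '])
print(result)  # ['a', 'b', 'c,d']
```

No regular expression is involved, so characters like . or * need no escaping.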
194

Here's a safe way for any iterable of delimiters, using regular expressions:

>>> import re
>>> delimiters = "a", "...", "(c)"
>>> example = "stackoverflow (c) is awesome... isn't it?"
>>> regex_pattern = '|'.join(map(re.escape, delimiters))
>>> regex_pattern
'a|\\.\\.\\.|\\(c\\)'
>>> re.split(regex_pattern, example)
['st', 'ckoverflow ', ' is ', 'wesome', " isn't it?"]

re.escape lets you build the pattern automatically and have the delimiters escaped properly.

Here's this solution as a function for your copy-pasting pleasure:

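def split(delimiters, string, maxsplit=0):
    import re
    # Escape each delimiter so regex metacharacters are matched literally.
    regex_pattern = '|'.join(map(re.escape, delimiters))
    # Pass maxsplit by keyword; passing it positionally is deprecated since Python 3.13.
    return re.split(regex_pattern, string, maxsplit=maxsplit)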

If you're going to split often using the same delimiters, compile your regular expression beforehand with re.compile and use the split method of the compiled pattern.
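A minimal sketch of the precompiled variant follows. The name `compile_splitter` is made up, and sorting the delimiters longest-first is an addition not found in the answer: regex alternation tries alternatives left to right, so without the sort a short delimiter like ',' would win over ', ' and leave a stray space behind.

```python
import re

def compile_splitter(delimiters):
    """Build one compiled pattern matching any of the literal delimiters.

    Hypothetical helper. Longer delimiters are placed first (an assumption
    about what is usually wanted) so that ', ' is tried before ','.
    """
    ordered = sorted(delimiters, key=len, reverse=True)
    return re.compile('|'.join(map(re.escape, ordered)))

# Compile once, reuse for many strings.
splitter = compile_splitter([',', ', '])
rows = [splitter.split(line) for line in ('a, b,c', 'd,e, f')]
print(rows)  # [['a', 'b', 'c'], ['d', 'e', 'f']]
```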


If you'd like to leave the original delimiters in the string, you can change the regex to use a lookbehind assertion instead:

>>> import re
>>> delimiters = "a", "...", "(c)"
>>> example = "stackoverflow (c) is awesome... isn't it?"
>>> regex_pattern = '|'.join('(?<={})'.format(re.escape(delim)) for delim in delimiters)
>>> regex_pattern
'(?<=a)|(?<=\\.\\.\\.)|(?<=\\(c\\))'
>>> re.split(regex_pattern, example)
['sta', 'ckoverflow (c)', ' is a', 'wesome...', " isn't it?"]

(replace ?<= with ?= to attach the delimiters to the righthand side, instead of left)
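For completeness, here is what the lookahead variant looks like on the same example. This sketch relies on re.split accepting patterns that can match an empty string, which was added in Python 3.7.

```python
import re

delimiters = "a", "...", "(c)"
example = "stackoverflow (c) is awesome... isn't it?"

# (?=...) is a lookahead: it matches the empty position just BEFORE a
# delimiter, so each delimiter stays at the start of the following piece.
regex_pattern = '|'.join('(?={})'.format(re.escape(d)) for d in delimiters)

parts = re.split(regex_pattern, example)
print(parts)
# ['st', 'ackoverflow ', '(c) is ', 'awesome', "... isn't it?"]
```

As with the lookbehind version, nothing is consumed, so ''.join(parts) reproduces the original string.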


98

In response to Jonathan's answer above, this only seems to work for certain delimiters. For example:

>>> a='Beautiful, is; better*than\nugly'
>>> import re
>>> re.split(r'; |, |\*|\n', a)
['Beautiful', 'is', 'better', 'than', 'ugly']

>>> b='1999-05-03 10:37:00'
>>> re.split('- :', b)
['1999-05-03 10:37:00']

By putting the delimiters in square brackets it seems to work more effectively.

>>> re.split('[- :]', b)
['1999', '05', '03', '10', '37', '00']

2 Comments

It works for all the delimiters you specify. A regex of - : matches exactly - : and thus won't split the date/time string. A regex of [- :] matches -, <space>, or : and thus splits the date/time string. If you want to split only on - and : then your regex should be either [-:] or -|:, and if you want to split on -, <space> and : then your regex should be either [- :] or -| |:.
@alldayremix I see my mistake: I missed the fact that your regex contains the OR |. I blindly identified it as a desired separator.
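The distinction made in the comments can be checked side by side. The variable names below are only for illustration.

```python
import re

b = '1999-05-03 10:37:00'

# '- :' is a literal three-character sequence, which never occurs in b.
literal = re.split(r'- :', b)

# A character class matches any ONE of the listed characters.
# The hyphen is placed first so it is not read as a range.
char_class = re.split(r'[- :]', b)

# The same thing written as an alternation.
alternation = re.split(r'-| |:', b)

# Leaving the space out keeps the date/time boundary intact.
no_space = re.split(r'[-:]', b)

print(literal)      # ['1999-05-03 10:37:00']
print(char_class)   # ['1999', '05', '03', '10', '37', '00']
print(alternation)  # ['1999', '05', '03', '10', '37', '00']
print(no_space)     # ['1999', '05', '03 10', '37', '00']
```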
40

This is what the regex looks like:

import re
# "semicolon or (a comma followed by a space)"
pattern = re.compile(r";|, ")

# "(semicolon or a comma) followed by a space"
pattern = re.compile(r"[;,] ")

print(pattern.split(text))
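The two patterns are not equivalent, which a short test string makes visible. The sample `text` below is invented for this sketch.

```python
import re

text = 'a, b; c,d'

# "semicolon or (a comma followed by a space)":
# the space after ';' is not consumed, so it stays on the next item.
semicolon_or_comma_space = re.compile(r";|, ")

# "(semicolon or a comma) followed by a space":
# both delimiters must be followed by a space, which is consumed.
either_then_space = re.compile(r"[;,] ")

first = semicolon_or_comma_space.split(text)
second = either_then_space.split(text)

print(first)   # ['a', 'b', ' c,d']
print(second)  # ['a', 'b', 'c,d']
```

For the string in the question, where the semicolon is followed by a space, the second pattern avoids the leading space without a separate strip step.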
