Split string with multiple delimiters in Python [duplicate]

Question

I found some answers online, but I have no experience with regular expressions, which I believe is what is needed here.

I have a string that needs to be split by either a ';' or ', '. That is, the delimiter has to be either a semicolon or a comma followed by a space. Individual commas without trailing spaces should be left untouched.

Example string:

"b-staged divinylsiloxane-bis-benzocyclobutene [124221-30-3], mesitylene [000108-67-8]; polymerized 1,2-dihydro-2,2,4- trimethyl quinoline [026780-96-1]"

should be split into a list containing the following:

('b-staged divinylsiloxane-bis-benzocyclobutene [124221-30-3]', 'mesitylene [000108-67-8]', 'polymerized 1,2-dihydro-2,2,4- trimethyl quinoline [026780-96-1]')

Intrastellar Explorer · Accepted Answer · 2023-12-26 15:51:51Z

1332

Luckily, Python has this built-in :)

import re

# Regex pattern splits on substrings "; " and ", "
re.split('; |, ', string_to_split)

Update:

Following your comment:

>>> string_to_split = 'Beautiful, is; better*than\nugly'
>>> import re
>>> re.split(r'; |, |\*|\n', string_to_split)
['Beautiful', 'is', 'better', 'than', 'ugly']
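As a quick check, the same kind of pattern can be applied to the example string from the question. This is only a sketch: the names `chemicals` and `parts` are illustrative and do not appear in the original answer.

```python
import re

# Example string from the question (variable names are illustrative).
chemicals = (
    "b-staged divinylsiloxane-bis-benzocyclobutene [124221-30-3], "
    "mesitylene [000108-67-8]; "
    "polymerized 1,2-dihydro-2,2,4- trimethyl quinoline [026780-96-1]"
)

# Split on "; " or ", ". Commas with no trailing space, as in
# "1,2-dihydro-2,2,4-", do not match and are left alone.
parts = re.split(r'; |, ', chemicals)
for part in parts:
    print(part)
```

If a semicolon might appear without a trailing space, a pattern such as r';\s*|, ' is one possible variation; it gives the same result on this string.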

edited Dec 26, 2023 at 15:51

Intrastellar Explorer


answered Feb 14, 2011 at 23:52

Jonathan Livni



16 Comments

Humble Learner Over a year ago

I'd prefer to write it as: re.split(r';|,\s', a) by replacing ' ' (space character) with '\s' (white space) unless space character is a strict requirement.

himself Over a year ago

I wonder why (regular) split just can't accept a list, that seems like a more obvious way instead of encoding multiple options in a line.

marsh Over a year ago

It is worth noting that the pattern is treated as a regular expression, as mentioned above. So trying to split a string on a bare . will match every single character. You need to escape it: \.

jmracek Over a year ago

Just to add to this a little bit, instead of adding a bunch of or "|" symbols you can do the following: re.split(r'[;,.\-\%]', str), where inside of [ ] you put all the characters you want to split by.

Konstantin Over a year ago

Is there a way to retain the delimiters in the output but combine them together? I know that doing re.split(r'(; |, |\*|\n)', a) will retain the delimiters, but how can I combine subsequent delimiters into one element in the output list?
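One possible way to do what the comment above asks, offered as a sketch rather than a reply from the original thread: wrap the alternatives in a non-capturing group, repeat it with +, and put a capturing group around the whole run. The sample string `a` below is made up for illustration.

```python
import re

# Made-up input with two runs of adjacent delimiters: ", ; " and "*\n".
a = 'Beautiful, ; is*\nbetter'

# (?:...) groups the alternatives without capturing them, "+" lets a run of
# adjacent delimiters match as one unit, and the outer capturing group makes
# re.split keep that whole run as a single list element.
pieces = re.split(r'((?:; |, |\*|\n)+)', a)
print(pieces)
```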


Joe · Accepted Answer · 2011-02-14 23:47:47Z

529

Do a str.replace('; ', ', ') and then a str.split(', ')
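Spelled out in code, the suggestion above might look like the sketch below. The helper `split_on_any` is a hypothetical generalization to more than two delimiters and is not part of the original answer; it assumes a non-empty list of delimiters.

```python
def split_on_any(text, delimiters):
    """Split text on any of the delimiters using only str methods.

    Hypothetical helper: every delimiter after the first is rewritten to
    the first one, then a single str.split does the work.
    """
    first, *rest = delimiters
    for delim in rest:
        text = text.replace(delim, first)
    return text.split(first)


# The two-step form exactly as described in the answer.
direct = 'a; b, c,d'.replace('; ', ', ').split(', ')
print(direct)

# The generalized form with a third, made-up delimiter.
general = split_on_any('a; b, c,d | e', ['; ', ', ', ' | '])
print(general)
```

Each extra delimiter costs one more pass over the string, which is the trade-off discussed in the comments below.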

answered Feb 14, 2011 at 23:47

Joe


8 Comments

om-nom-nom Over a year ago

Suppose you have 5 delimiters: you then have to traverse your string 5 times.

Phyo Arkar Lwin Over a year ago

that is very bad for performance

AliBZ Over a year ago

This shows a different vision of yours toward this problem. I think it is a great one. "If you don't know a direct answer, use combination of things you know to solve it".

monoid Over a year ago

If you have a small number of delimiters and are performance-constrained, the replace trick is the fastest of all: 15x faster than regexp, and almost 2x faster than a nested for in val.split(...) generator.

Craig Jackson Over a year ago

Performance is not always a concern. My use case was to process input from a human-entered command line argument so this solution was quite ideal. I also try to avoid regex whenever possible. Easy to create, very difficult to read.


UCYT5040 · Accepted Answer · 2022-08-20 09:21:10Z

Here's a safe way for any iterable of delimiters, using regular expressions:

>>> import re
>>> delimiters = "a", "...", "(c)"
>>> example = "stackoverflow (c) is awesome... isn't it?"
>>> regex_pattern = '|'.join(map(re.escape, delimiters))
>>> regex_pattern
'a|\\.\\.\\.|\\(c\\)'
>>> re.split(regex_pattern, example)
['st', 'ckoverflow ', ' is ', 'wesome', " isn't it?"]

re.escape lets you build the pattern automatically, with every delimiter escaped correctly.

Here's this solution as a function for your copy-pasting pleasure:

def split(delimiters, string, maxsplit=0):
    import re
    regex_pattern = '|'.join(map(re.escape, delimiters))
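    # Pass maxsplit by keyword; passing it positionally is deprecated in newer Python versions.
    return re.split(regex_pattern, string, maxsplit=maxsplit)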

If you're going to split often using the same delimiters, compile your regular expression beforehand with re.compile and call split on the compiled pattern (RegexObject.split in Python 2, re.Pattern.split in Python 3).
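A minimal sketch of the precompiled variant, reusing the delimiters from the example above. The name `splitter` and the second input line are illustrative only.

```python
import re

delimiters = "a", "...", "(c)"

# Build and compile the pattern once, then reuse it for every input.
splitter = re.compile('|'.join(map(re.escape, delimiters)))

lines = [
    "stackoverflow (c) is awesome... isn't it?",
    "one...two",
]
results = [splitter.split(line) for line in lines]
for result in results:
    print(result)
```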

If you'd like to leave the original delimiters in the string, you can change the regex to use a lookbehind assertion instead (splitting on these zero-width matches requires Python 3.7 or later):

>>> import re
>>> delimiters = "a", "...", "(c)"
>>> example = "stackoverflow (c) is awesome... isn't it?"
>>> regex_pattern = '|'.join('(?<={})'.format(re.escape(delim)) for delim in delimiters)
>>> regex_pattern
'(?<=a)|(?<=\\.\\.\\.)|(?<=\\(c\\))'
>>> re.split(regex_pattern, example)
['sta', 'ckoverflow (c)', ' is a', 'wesome...', " isn't it?"]

(replace ?<= with ?= to attach the delimiters to the righthand side, instead of left)
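For comparison, here is a sketch of the ?= (lookahead) variation just mentioned, using the same delimiters and example string. It assumes Python 3.7 or later, where re.split can split on empty matches; the names `lookahead_pattern` and `right_attached` are illustrative.

```python
import re

delimiters = "a", "...", "(c)"
example = "stackoverflow (c) is awesome... isn't it?"

# A lookahead matches the empty position just before each delimiter, so the
# delimiter stays attached to the start of the piece that follows it.
lookahead_pattern = '|'.join(
    '(?={})'.format(re.escape(delim)) for delim in delimiters
)
right_attached = re.split(lookahead_pattern, example)
print(right_attached)
```

Because nothing is consumed by the split, joining the pieces gives back the original string.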

Paul · Accepted Answer · 2013-01-09 10:22:43Z

98

In response to Jonathan's answer above, this only seems to work for certain delimiters. For example:

>>> a='Beautiful, is; better*than\nugly'
>>> import re
>>> re.split(r'; |, |\*|\n', a)
['Beautiful', 'is', 'better', 'than', 'ugly']

>>> b='1999-05-03 10:37:00'
>>> re.split('- :', b)
['1999-05-03 10:37:00']

By putting the delimiters in square brackets it seems to work more effectively.

>>> re.split('[- :]', b)
['1999', '05', '03', '10', '37', '00']

answered Jan 9, 2013 at 10:22

Paul


2 Comments

alldayremix Over a year ago

It works for all the delimiters you specify. A regex of - : matches exactly - : and thus won't split the date/time string. A regex of [- :] matches -, <space>, or : and thus splits the date/time string. If you want to split only on - and : then your regex should be either [-:] or -|:, and if you want to split on -, <space> and : then your regex should be either [- :] or -| |:.

Paul Over a year ago

@alldayremix I see my mistake: I missed the fact that your regex contains the OR |. I blindly identified it as a desired separator.
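The distinction drawn in the comments above can be checked directly. This is a small sketch using the date/time string from the answer; the result variable names are illustrative.

```python
import re

b = '1999-05-03 10:37:00'

# The literal three-character sequence "- :" never occurs, so nothing splits.
as_sequence = re.split(r'- :', b)

# A character class matches any ONE of the listed characters.
dash_or_colon = re.split(r'[-:]', b)
dash_space_colon = re.split(r'[- :]', b)

# Alternation with | is equivalent to the character class here.
with_alternation = re.split(r'-| |:', b)

print(as_sequence)
print(dash_or_colon)
print(dash_space_colon)
print(with_alternation)
```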

Jochen Ritzel · Accepted Answer · 2011-02-14 23:52:13Z

40

This is what the regex looks like:

import re
# "semicolon or (a comma followed by a space)"
pattern = re.compile(r";|, ")

# "(semicolon or a comma) followed by a space"
pattern = re.compile(r"[;,] ")

# text is the string to split
print(pattern.split(text))
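The two patterns above behave differently on the same input. The sketch below shows this on a made-up string; `text` and the result names are illustrative.

```python
import re

# Made-up input mixing bare semicolons, "; " and ", ".
text = 'a;b, c; d'

# "semicolon or (a comma followed by a space)": a bare ";" splits, and the
# space after a ";" stays in the next piece.
by_alternation = re.compile(r";|, ").split(text)

# "(semicolon or a comma) followed by a space": a bare ";" does not split.
by_class = re.compile(r"[;,] ").split(text)

print(by_alternation)
print(by_class)
```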

answered Feb 14, 2011 at 23:52

Jochen Ritzel

