Python: Split string by list of separators

Question

In Python, I'd like to split a string using a list of separators. The separators could be either commas or semicolons. Whitespace should be removed unless it is in the middle of non-whitespace, non-separator characters, in which case it should be preserved.

Test case 1: ABC,DEF123,GHI_JKL,MN OP
Test case 2: ABC;DEF123;GHI_JKL;MN OP
Test case 3: ABC ; DEF123,GHI_JKL ; MN OP

Sounds like a case for regular expressions, which is fine, but if it's easier or cleaner to do it another way that would be even better.

Thanks!

Joschua · Accepted Answer · 2019-09-11 20:22:11Z

30

In the timings below this is faster than the regex version (roughly 1.2x on the short test string and about 2.6x on a much longer one), and you can pass a list of separators as you wanted:

def split(txt, seps):
    default_sep = seps[0]

    # we skip seps[0] because that's the default separator
    for sep in seps[1:]:
        txt = txt.replace(sep, default_sep)
    return [i.strip() for i in txt.split(default_sep)]

How to use it:

>>> split('ABC ; DEF123,GHI_JKL ; MN OP', (',', ';'))
['ABC', 'DEF123', 'GHI_JKL', 'MN OP']
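As a quick check, the same function can be run against all three test cases from the question. This is only a minimal sketch: the `split` definition is repeated from above so the block runs on its own, and the names `CASES` and `results` are made up for illustration.

```python
def split(txt, seps):
    default_sep = seps[0]

    # fold every other separator into the first one, then split once
    for sep in seps[1:]:
        txt = txt.replace(sep, default_sep)
    return [i.strip() for i in txt.split(default_sep)]


# the three test cases from the question
CASES = [
    'ABC,DEF123,GHI_JKL,MN OP',
    'ABC;DEF123;GHI_JKL;MN OP',
    'ABC ; DEF123,GHI_JKL ; MN OP',
]

results = [split(case, (',', ';')) for case in CASES]
for result in results:
    print(result)
# each line prints ['ABC', 'DEF123', 'GHI_JKL', 'MN OP']
```

The space inside 'MN OP' survives because strip() only trims the ends of each piece.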

Performance test:

import timeit
import re


TEST = 'ABC ; DEF123,GHI_JKL ; MN OP'
SEPS = (',', ';')


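# compile once, outside the timed code; re.escape keeps separators such as '|' or '.' literal
rsplit = re.compile("|".join(map(re.escape, SEPS))).split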
print(timeit.timeit(lambda: [s.strip() for s in rsplit(TEST)]))
# 1.6242462980007986

print(timeit.timeit(lambda: split(TEST, SEPS)))
# 1.3588597209964064

And with a much longer input string:

TEST = 100 * 'ABC ; DEF123,GHI_JKL ; MN OP , '

print(timeit.timeit(lambda: [s.strip() for s in rsplit(TEST)]))
# 130.67168392999884

print(timeit.timeit(lambda: split(TEST, SEPS)))
# 50.31940778599528
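Timings like these depend heavily on the machine and the Python version, so it may be worth re-running them locally. The following is a rough sketch of one way to do that, not the exact benchmark used above: `split_replace`, `split_regex` and the `number`/`repeat` values are arbitrary choices for illustration. Both candidates are set up once outside the timed callable, and their results are compared before timing.

```python
import re
import timeit

TEST = 100 * 'ABC ; DEF123,GHI_JKL ; MN OP , '
SEPS = (',', ';')


def split_replace(txt, seps):
    # replace-based approach: fold all separators into the first one
    default_sep = seps[0]
    for sep in seps[1:]:
        txt = txt.replace(sep, default_sep)
    return [i.strip() for i in txt.split(default_sep)]


# compile the pattern once, outside the timed code
_regex_split = re.compile('|'.join(re.escape(s) for s in SEPS)).split


def split_regex(txt):
    return [s.strip() for s in _regex_split(txt)]


# both approaches should agree before their speed is compared
same_output = split_replace(TEST, SEPS) == split_regex(TEST)
print('same output:', same_output)

candidates = (
    ('replace', lambda: split_replace(TEST, SEPS)),
    ('regex', lambda: split_regex(TEST)),
)
for name, fn in candidates:
    # take the best of a few repeats to reduce noise
    best = min(timeit.repeat(fn, number=200, repeat=3))
    print(f'{name}: {best:.4f} s for 200 calls')
```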

edited Sep 11, 2019 at 20:22

answered Jan 14, 2011 at 23:35


9 Comments

Sven Marnach Over a year ago

On my machine, the second solution I gave is even faster for short strings.

Laurence Gonsalves Over a year ago

Instead of having default_sep be a parameter, just use one of the seps. eg: default_sep = seps[0] and then change the for line to for sep in seps[1:]:.

John Machin Over a year ago

This relies on the caller knowing in advance that there's a character (e.g. "|") that can never appear in the input. This is disaster-prone.

Glenn Maynard Over a year ago

This comparison is flawed: it compiles the regex every time through the loop. If you properly compile the regex outside of the loop (r = re.compile(",|;")), the regex version is faster. It's also the clear, ordinary, flexible solution to this which everyone understands immediately, which is an even stronger argument than performance.

John Machin Over a year ago

@blah238, @Joschua: @Glenn Maynard: On my machine: Joschua: 2.30, r=re.compile(...) in setup: 2.18, rs=re.compile(...).split in setup: 2.08. Further note: Joschua's method is O(SN) where S is the number of separators.


Sven Marnach · Accepted Answer · 2011-01-14 23:27:49Z

6

Using regular expressions, try

[s.strip() for s in re.split(",|;", string)]

or

[t.strip() for s in string.split(",") for t in s.split(";")]

without.
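If the separators arrive as a list rather than being hard-coded, the regex version can be built from that list. A minimal sketch, assuming a helper named `split_on` (a made-up name); re.escape is used so that separators which are regex metacharacters, such as '|' or '.', are matched literally.

```python
import re


def split_on(text, seps):
    # join the escaped separators into one alternation, e.g. ',|;'
    pattern = '|'.join(re.escape(sep) for sep in seps)
    return [part.strip() for part in re.split(pattern, text)]


print(split_on('ABC ; DEF123,GHI_JKL ; MN OP', [',', ';']))
# ['ABC', 'DEF123', 'GHI_JKL', 'MN OP']

print(split_on('a.b|c', ['.', '|']))
# ['a', 'b', 'c']
```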

answered Jan 14, 2011 at 23:27


4 Comments

moinudin Over a year ago

Rather do it through string's split() to avoid importing re, e.g. 'ABC,DEF123,GHI_JKL,MN OP'.split(',|;')

Sven Marnach Over a year ago

@macrog: Wouldn't this split the string at all verbatim occurrences of ",|;"?

Joschua Over a year ago

But if you want to split at ,;. you have to add a for-loop for every character!

Sven Marnach Over a year ago

@Joshua: But the question states we only want to split at , and ;. And I would use the regex version anyway.

tmarthal · Accepted Answer · 2011-01-14 23:39:53Z

2

Taking the above answer, with your test cases, you want to use a regular expression and one or more separation characters. In your case, the separation characters seem to be ',', '|', ';' and whitespace. Whitespace in a Python regular expression is '\s' ('\w' matches word characters and '\W' matches non-word characters), so the comprehension is:

import re
parts = [s for s in re.split(r"[,|;\s]+", string)]

I cannot reply to Sven's answer above, but I split on one or more of the characters inside the brackets, and don't have to use the strip() method.

Yikes, I didn't read the question correctly... Sven's answer with the strip works; mine assumes the whitespace is another separation.
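To make the difference concrete, here is a small sketch comparing the two behaviours on the third test case from the question. The variable names are illustrative only. In Python's re module, \s matches whitespace, \w matches word characters (letters, digits, underscore) and \W matches everything else.

```python
import re

sample = 'ABC ; DEF123,GHI_JKL ; MN OP'

# whitespace treated as a separator: 'MN OP' is broken apart
whitespace_as_sep = re.split(r'[,;\s]+', sample)
print(whitespace_as_sep)
# ['ABC', 'DEF123', 'GHI_JKL', 'MN', 'OP']

# \W+ splits on any run of non-word characters, with the same effect here
non_word_as_sep = re.split(r'\W+', sample)
print(non_word_as_sep)
# ['ABC', 'DEF123', 'GHI_JKL', 'MN', 'OP']

# splitting only on , and ; and then stripping keeps the inner space
stripped = [s.strip() for s in re.split(r'[,;]', sample)]
print(stripped)
# ['ABC', 'DEF123', 'GHI_JKL', 'MN OP']
```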

answered Jan 14, 2011 at 23:39


Raph Levien · Accepted Answer · 2011-01-14 23:36:40Z

1

>>> re.split(r'\s*,\s*|\s*;\s*', 'a , b; cdf')
['a', 'b', 'cdf']
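One thing to watch with this approach: the pattern only consumes whitespace next to a separator, so whitespace at the very start or end of the whole string is left in place. A small sketch of a possible workaround, using an equivalent character-class pattern and a hypothetical helper `split_trimmed` that strips the input first:

```python
import re

# equivalent to '\s*,\s*|\s*;\s*', written with a character class
SEP = re.compile(r'\s*[,;]\s*')


def split_trimmed(text):
    # strip the outer whitespace first; the pattern handles the rest
    return SEP.split(text.strip())


print(SEP.split('  a , b; cdf  '))
# ['  a', 'b', 'cdf  ']

print(split_trimmed('  a , b; cdf  '))
# ['a', 'b', 'cdf']

print(split_trimmed('ABC ; DEF123,GHI_JKL ; MN OP'))
# ['ABC', 'DEF123', 'GHI_JKL', 'MN OP']
```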

answered Jan 14, 2011 at 23:36
