17

In Python, I'd like to split a string using a list of separators. The separators could be either commas or semicolons. Whitespace should be removed unless it is in the middle of non-whitespace, non-separator characters, in which case it should be preserved.

Test case 1: ABC,DEF123,GHI_JKL,MN OP
Test case 2: ABC;DEF123;GHI_JKL;MN OP
Test case 3: ABC ; DEF123,GHI_JKL ; MN OP

Sounds like a case for regular expressions, which is fine, but if it's easier or cleaner to do it another way that would be even better.

Thanks!

4 Answers

30

In the timings below, this is faster than a precompiled regex, and you can pass a list of separators as you wanted:

def split(txt, seps):
    default_sep = seps[0]

    # we skip seps[0] because that's the default separator
    for sep in seps[1:]:
        txt = txt.replace(sep, default_sep)
    return [i.strip() for i in txt.split(default_sep)]

How to use it:

>>> split('ABC ; DEF123,GHI_JKL ; MN OP', (',', ';'))
['ABC', 'DEF123', 'GHI_JKL', 'MN OP']
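
To check the function against all three test cases from the question, a small loop like the following sketch may help. The `split` function is repeated so the snippet runs on its own; `CASES`, `EXPECTED` and `results` are illustrative names, not part of the original answer.

```python
def split(txt, seps):
    # Normalise every separator to the first one, then split once.
    default_sep = seps[0]
    for sep in seps[1:]:
        txt = txt.replace(sep, default_sep)
    return [i.strip() for i in txt.split(default_sep)]


# The three test cases from the question (illustrative names).
CASES = [
    'ABC,DEF123,GHI_JKL,MN OP',
    'ABC;DEF123;GHI_JKL;MN OP',
    'ABC ; DEF123,GHI_JKL ; MN OP',
]
EXPECTED = ['ABC', 'DEF123', 'GHI_JKL', 'MN OP']

results = [split(case, (',', ';')) for case in CASES]
for case, result in zip(CASES, results):
    # Each case should print the same four fields as EXPECTED.
    print(repr(case), '->', result)
```

All three inputs should produce the same four fields, with the inner space of 'MN OP' preserved.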

Performance test:

import timeit
import re


TEST = 'ABC ; DEF123,GHI_JKL ; MN OP'
SEPS = (',', ';')


# re.escape keeps separators such as '|' or '.' from being read as regex syntax
rsplit = re.compile("|".join(map(re.escape, SEPS))).split
print(timeit.timeit(lambda: [s.strip() for s in rsplit(TEST)]))
# 1.6242462980007986

print(timeit.timeit(lambda: split(TEST, SEPS)))
# 1.3588597209964064

And with a much longer input string:

TEST = 100 * 'ABC ; DEF123,GHI_JKL ; MN OP , '

print(timeit.timeit(lambda: [s.strip() for s in rsplit(TEST)]))
# 130.67168392999884

print(timeit.timeit(lambda: split(TEST, SEPS)))
# 50.31940778599528

9 Comments

On my machine, the second solution I gave is even faster for short strings.
Instead of having default_sep be a parameter, just use one of the seps. eg: default_sep = seps[0] and then change the for line to for sep in seps[1:]:.
This relies on the caller knowing in advance that there's a character (e.g. "|") that can never appear in the input. This is disaster-prone.
This comparison is flawed: it compiles the regex every time through the loop. If you properly compile the regex outside of the loop (r = re.compile(",|;")), the regex version is faster. It's also the clear, ordinary, flexible solution to this which everyone understands immediately, which is an even stronger argument than performance.
@blah238, @Joschua: @Glenn Maynard: On my machine: Joschua: 2.30, r=re.compile(...) in setup: 2.18, rs=re.compile(...).split in setup: 2.08. Further note: Joschua's method is O(SN) where S is the number of separators.
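
The points raised in these comments (compile the pattern once, and avoid one pass over the text per separator) can be sketched roughly as follows. `make_splitter` and `split_on_seps` are hypothetical names introduced for illustration; the sketch assumes every separator is a plain string to be matched literally.

```python
import re


def make_splitter(seps):
    # re.escape keeps separators such as '|' or '.' from being read as
    # regex syntax; the pattern is compiled once and reused on every call.
    pattern = re.compile('|'.join(re.escape(sep) for sep in seps))

    def splitter(txt):
        # One regex split over the text, then trim each field.
        return [part.strip() for part in pattern.split(txt)]

    return splitter


split_on_seps = make_splitter((',', ';', '|'))
print(split_on_seps('ABC ; DEF123|GHI_JKL , MN OP'))
# ['ABC', 'DEF123', 'GHI_JKL', 'MN OP']
```

Which variant is faster on a given machine is something to measure rather than assume, as the differing numbers in this thread suggest.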
6

Using regular expressions, try

[s.strip() for s in re.split(",|;", string)]

or

[t.strip() for s in string.split(",") for t in s.split(";")]

without.

4 Comments

Rather do it through string's split() to avoid importing re, e.g. 'ABC,DEF123,GHI_JKL,MN OP'.split(',|;')
@macrog: Wouldn't this split the string at all verbatim occurrences of ",|;"?
But if you want to split at ,;. you have to add a for-loop for every character!
@Joshua: But the question states we only want to split at , and ;. And I would use the regex version anyway.
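
To illustrate the point made in these comments: `str.split` takes one literal separator string, not a pattern, so `',|;'` is only matched verbatim. A minimal sketch using test case 3 from the question (the variable names are illustrative):

```python
text = 'ABC ; DEF123,GHI_JKL ; MN OP'

# str.split treats its argument as one literal separator string.
# The sequence ",|;" never occurs in text, so nothing is split.
literal_parts = text.split(',|;')
print(literal_parts)  # ['ABC ; DEF123,GHI_JKL ; MN OP']

# Chaining str.split once per separator handles ',' and ';' without re.
chained_parts = [t.strip() for s in text.split(',') for t in s.split(';')]
print(chained_parts)  # ['ABC', 'DEF123', 'GHI_JKL', 'MN OP']
```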
2

Taking the above answer, with your test cases, you want to use a regular expression that matches one or more separator characters. In your case, the separators seem to be ',', '|', ';' and whitespace. Whitespace in a Python regex is '\s' ('\w' matches word characters and '\W' matches non-word characters), so the split is:

import re
# raw string so that \s reaches the regex engine unchanged
parts = re.split(r"[,|;\s]+", string)

I cannot reply to sven's answer above, but I split on one or more of the characters inside the brackets, and don't have to use the strip() method.

Yikes, I didn't read the question correctly... Sven's answer with the strip works; mine assumes the whitespace is another separation.
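
The difference described in the last paragraph can be seen on test case 3 from the question. This is only a sketch; the variable names are illustrative.

```python
import re

text = 'ABC ; DEF123,GHI_JKL ; MN OP'

# Whitespace inside the character class is itself a separator,
# so 'MN OP' is broken into two fields.
whitespace_as_sep = re.split(r'[,;\s]+', text)
print(whitespace_as_sep)  # ['ABC', 'DEF123', 'GHI_JKL', 'MN', 'OP']

# Splitting only on ',' and ';' and stripping afterwards keeps 'MN OP'.
strip_after = [s.strip() for s in re.split(r'[,;]', text)]
print(strip_after)  # ['ABC', 'DEF123', 'GHI_JKL', 'MN OP']
```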


1
>>> re.split(r'\s*,\s*|\s*;\s*', 'a , b; cdf')
['a', 'b', 'cdf']
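
One thing to watch with this approach: the pattern only consumes whitespace next to a separator, so whitespace at the very start or end of the input survives. A possible sketch that strips the whole string first is shown here; `SEP` and `split_fields` are hypothetical names, and the character class `[,;]` is used as an equivalent shorthand for the two alternatives.

```python
import re

# Equivalent to r'\s*,\s*|\s*;\s*', written with a character class.
SEP = re.compile(r'\s*[,;]\s*')


def split_fields(text):
    # strip() first: the pattern only eats whitespace around separators.
    return SEP.split(text.strip())


unstripped = SEP.split('  a , b; c d  ')
print(unstripped)                      # ['  a', 'b', 'c d  ']
print(split_fields('  a , b; c d  '))  # ['a', 'b', 'c d']
```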

