28

I have a word list like below. I want to split the list by .. Is there any better or useful code in Python 3?

a = ['this', 'is', 'a', 'cat', '.', 'hello', '.', 'she', 'is', 'nice', '.']
result = []
tmp = []
for elm in a:
    if elm is not '.':
        tmp.append(elm)
    else:
        result.append(tmp)
        tmp = []
print(result)
# result: [['this', 'is', 'a', 'cat'], ['hello'], ['she', 'is', 'nice']]

Update

Add test cases to handle it correctly.

a = ['this', 'is', 'a', 'cat', '.', 'hello', '.', 'she', 'is', 'nice', '.']
b = ['this', 'is', 'a', 'cat', '.', 'hello', '.', 'she', 'is', 'nice', '.', 'yes']
c = ['.', 'this', 'is', 'a', 'cat', '.', 'hello', '.', 'she', 'is', 'nice', '.', 'yes']
def split_list(list_data, split_word='.'):
    result = []
    sub_data = []
    for elm in list_data:
        if elm is not split_word:
            sub_data.append(elm)
        else:
            if len(sub_data) != 0:
                result.append(sub_data)
            sub_data = []
    if len(sub_data) != 0:
        result.append(sub_data)
    return result

print(split_list(a)) # [['this', 'is', 'a', 'cat'], ['hello'], ['she', 'is', 'nice']]
print(split_list(b)) # [['this', 'is', 'a', 'cat'], ['hello'], ['she', 'is', 'nice'], ['yes']]
print(split_list(c)) # [['this', 'is', 'a', 'cat'], ['hello'], ['she', 'is', 'nice'], ['yes']]
4
  • 1
    It think there s a one-liner solution with no additional libraries that comes close to your speeds using list comprehension and string functions. Commented Dec 2, 2017 at 5:28
  • @ScottBoston I thought there was some useful functions :). But I'm happy to see many interesting answers. Commented Dec 2, 2017 at 13:41
  • 2
    You shouldn't use the is operator to compare strings BTW. Commented Dec 2, 2017 at 21:18
  • 1
    It appears you already split a string once. Your problem would be much simpler if your first split were by sentences. Commented Dec 3, 2017 at 1:43

6 Answers 6

28

Using itertools.groupby:

from itertools import groupby
a = ['this', 'is', 'a', 'cat', '.', 'hello', '.', 'she', 'is', 'nice', '.']
result = [
    list(g)
    for k,g in groupby(a,lambda x:x=='.')
    if not k
]
print (result)
#[['this', 'is', 'a', 'cat'], ['hello'], ['she', 'is', 'nice']]
Sign up to request clarification or add additional context in comments.

Comments

14

You can do this all with a "one-liner" using list comprehension and string functions join, split, strip, and no additional libraries.

a = ['this', 'is', 'a', 'cat', '.', 'hello', '.', 'she', 'is', 'nice', '.']
b = ['this', 'is', 'a', 'cat', '.', 'hello', '.', 'she', 'is', 'nice', '.', 'yes']
c = ['.', 'this', 'is', 'a', 'cat', '.', 'hello', '.', 'she', 'is', 'nice', '.', 'yes']



In [5]: [i.strip().split(' ') for i in ' '.join(a).split('.') if len(i) > 0 ]
Out[5]: [['this', 'is', 'a', 'cat'], ['hello'], ['she', 'is', 'nice']]

In [8]: [i.strip().split(' ') for i in ' '.join(b).split('.') if len(i) > 0 ]
Out[8]: [['this', 'is', 'a', 'cat'], ['hello'], ['she', 'is', 'nice'], ['yes']]

In [9]: In [8]: [i.strip().split(' ') for i in ' '.join(c).split('.') if len(i) > 0 ]
Out[9]: [['this', 'is', 'a', 'cat'], ['hello'], ['she', 'is', 'nice'], ['yes']]

@Craig has a simpler update:

[s.split() for s in ' '.join(a).split('.') if s]

5 Comments

Slightly simpler: [s.split() for s in ' '.join(a).split('.') if s]
@Craig Thanks! I hate when I over complicate and overthink things.
Oh, this is nice. But if there is a word which includes white space, join will break the original list. I mean like "New York". I may add such a test case. But this is really simple and nice. Thank you!
@jef It is challenge to break 'This is New York.' into 'This', 'is', 'New York'.
This method is broken if any element has a space or a dot.
8

Here's another way using only standard list operations (with no dependencies on other libraries!). First we find the split points and then we create sublists around them; notice that the first element is treated as a special case:

a = ['this', 'is', 'a', 'cat', '.', 'hello', '.', 'she', 'is', 'nice', '.']
indexes = [-1] + [i for i, x in enumerate(a) if x == '.']

[a[indexes[i]+1:indexes[i+1]] for i in range(len(indexes)-1)]
=> [['this', 'is', 'a', 'cat'], ['hello'], ['she', 'is', 'nice']]

Comments

4

You can reconstruct the string using ' '.join and use regex:

import re
a = ['this', 'is', 'a', 'cat', '.', 'hello', '.', 'she', 'is', 'nice', '.']
new_s = [b for b in [re.split('\s', i) for i in re.split('\s*\.\s*', ' '.join(a))] if all(b)]

Output:

[['this', 'is', 'a', 'cat'], ['hello'], ['she', 'is', 'nice']]

1 Comment

Same comment as for @ScottBoston: This method is broken if any element has a space or a dot.
1

I couldn't help myself, just wanted to have fun with this great question:

import itertools

a = ['this', 'is', 'a', 'cat', '.', 'hello', '.', 'she', 'is', 'nice', '.']
b = ['this', 'is', 'a', 'cat', '.', 'hello', '.', 'she', 'is', 'nice', '.', 'yes']
c = ['.', 'this', 'is', 'a', 'cat', '.', 'hello', '.', 'she', 'is', 'nice', '.', 'yes']

def split_dots(lst):

    dots = [0] + [i+1 for i, e in enumerate(lst) if e == '.']

    result = [list(itertools.takewhile(lambda x : x != '.', lst[dot:])) for dot in dots]

    return list(filter(lambda x : x, result))

print(split_dots(a)) # [['this', 'is', 'a', 'cat'], ['hello'], ['she', 'is', 'nice']]
print(split_dots(b)) # [['this', 'is', 'a', 'cat'], ['hello'], ['she', 'is', 'nice'], ['yes']]
print(split_dots(c)) # [['this', 'is', 'a', 'cat'], ['hello'], ['she', 'is', 'nice'], ['yes']]

Comments

1

This answer requires installing a 3rd party library: iteration_utilities1. The included split function makes solving this task straightforward:

>>> from iteration_utilities import split
>>> a = ['this', 'is', 'a', 'cat', '.', 'hello', '.', 'she', 'is', 'nice', '.']
>>> list(filter(None, split(a, '.', eq=True)))
[['this', 'is', 'a', 'cat'], ['hello'], ['she', 'is', 'nice']]

Instead of using the eq parameter you can also define a custom function where to split:

>>> list(filter(None, split(a, lambda x: x=='.')))
[['this', 'is', 'a', 'cat'], ['hello'], ['she', 'is', 'nice']]

In case you want to keep the '.'s you could also use the keep_before argument:

>>> list(filter(None, split(a, '.', eq=True, keep_before=True)))
[['this', 'is', 'a', 'cat', '.'], ['hello', '.'], ['she', 'is', 'nice', '.']]

Note that the library just makes it easier - it's easily possible (see the other answers) to accomplish this task without installing an additional library.

The filter can be removed if you don't expect '.' to appear at the beginning or end of your to-be-split list.


1 I'm the author of that library. It's available via pip or the conda-forge channel with conda.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.