Python - Splitting an array into two using an optimized for loop

Question

This is a followup question to a question I posted here, but it's a very different question, so I thought I would post it separately.

I have a Python script which reads a very large array, and I needed to optimize an operation on each element (see referenced SO question). I now need to split the output array into two separate arrays.

I have the code:

output = [True if (len(element_in_array) % 2) else False for element_in_array in master_list]

which outputs an array of length len(master_list) consisting of True or False: True if the length of element_in_array is odd, False if it is even. My problem is that I need to split master_list into two arrays: one containing the elements that correspond to the True entries in output, and another containing the elements that correspond to the False entries.

This can clearly be done with traditional list operations such as append, but I need this to be as optimized and as fast as possible. I have many millions of elements in my master_list, so is there a way to accomplish this without directly looping through master_list and using append to create two new arrays?

Any advice would be greatly appreciated. Thanks!

So you're appending all the Trues (the odd-length elements) to the first list. It should be a single for loop, which means O(n); you really cannot go faster than a linear-time loop here.
– Dylan Lawrence, Commented Nov 26, 2013 at 19:52
If you have a very large array, can you use a NumPy array instead of a pure Python list? If so, you can probably do it in simpler code, which takes about 1/10th as long to run, and uses about 1/4th the storage.
– abarnert, Commented Nov 26, 2013 at 19:56
As a side note, True if foo else False is simpler (and often faster) as bool(foo).
– abarnert, Commented Nov 26, 2013 at 20:00
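For reference, the single-pass baseline that the question and the first comment describe might look like the sketch below. The function name split_by_length_parity and the sample data are hypothetical, chosen only for illustration; the odd/even rule is the one from the question.

```python
def split_by_length_parity(items):
    """Split items into (odd_length, even_length) lists in one pass."""
    odd, even = [], []
    # Look up the bound append methods once instead of on every iteration.
    add_odd, add_even = odd.append, even.append
    for item in items:
        if len(item) % 2:
            add_odd(item)
        else:
            add_even(item)
    return odd, even


# Hypothetical sample data standing in for master_list.
master_list = ["a", "bb", "ccc", "dddd", ""]
odd, even = split_by_length_parity(master_list)
print(odd)   # ['a', 'ccc']
print(even)  # ['bb', 'dddd', '']
```

This walks master_list exactly once and never builds the intermediate output list, which makes it a useful baseline for timing the alternatives.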

Ashwini Chaudhary · Accepted Answer · 2013-11-26 20:05:07Z

0

You can use itertools.compress:

>>> from itertools import compress, imap
>>> import operator
>>> import random
>>> lis = range(10)
>>> output = [random.choice([True, False]) for _ in xrange(10)]
>>> output
[True, True, False, False, False, False, False, False, False, False]
>>> truthy = list(compress(lis, output))
>>> truthy
[0, 1]
>>> falsy = list(compress(lis, imap(operator.not_,output)))
>>> falsy
[2, 3, 4, 5, 6, 7, 8, 9]
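The session above is Python 2: imap and xrange do not exist in Python 3, where the built-in map is already lazy. A Python 3 equivalent could look like this sketch, with the random flags replaced by a fixed list so that the result is reproducible:

```python
from itertools import compress
from operator import not_

lis = list(range(10))
# Fixed flags instead of random.choice, so the output is deterministic.
output = [True, True, False, False, False, False, False, False, False, False]

# compress keeps the items whose selector is truthy.
truthy = list(compress(lis, output))
# Negate each flag lazily to select the remaining items.
falsy = list(compress(lis, map(not_, output)))

print(truthy)  # [0, 1]
print(falsy)   # [2, 3, 4, 5, 6, 7, 8, 9]
```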

Go for NumPy if you want an even faster solution; it also lets you filter an array directly with a boolean array:

>>> import numpy as np
>>> a = np.random.random(10)*10
>>> a
array([ 2.94518349,  0.09536957,  8.74605883,  4.05063779,  2.11192606,
        2.24215582,  7.02203768,  2.1267423 ,  7.6526713 ,  3.81429322])
>>> output = np.array([True, True, False, False, False, False, False, False, False, False])
>>> a[output]
array([ 2.94518349,  0.09536957])
>>> a[~output]
array([ 8.74605883,  4.05063779,  2.11192606,  2.24215582,  7.02203768,
        2.1267423 ,  7.6526713 ,  3.81429322])

Timing comparison:

>>> lis = range(1000)
>>> output = [random.choice([True, False]) for _ in xrange(1000)]
>>> a = np.random.random(1000)*100
>>> output_n = np.array(output)
>>> %timeit list(compress(lis, output))
10000 loops, best of 3: 44.9 us per loop
>>> %timeit a[output_n]
10000 loops, best of 3: 20.9 us per loop
>>> %timeit list(compress(lis, imap(operator.not_,output)))
1000 loops, best of 3: 150 us per loop
>>> %timeit a[~output_n]
10000 loops, best of 3: 28.7 us per loop

edited Nov 26, 2013 at 20:05

answered Nov 26, 2013 at 19:51

Ashwini Chaudhary


3 Comments

chepner Over a year ago

This still requires two passes over output, so benchmarking will be needed to see if this is faster than a single pass that appends elements to the appropriate list.

Gabe Over a year ago

This actually requires 3 passes -- one to determine which list each element is in, one to make the first list, and one to make the second list.

Brett Over a year ago

I've just been looking at numpy. Looks like the way to go.
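The benchmarking these comments call for could be set up roughly as follows. This is only a sketch: the function names, the test data, and the repeat count are all assumptions, and the timings will vary by machine and Python version, so none are quoted here.

```python
import random
import timeit
from itertools import compress
from operator import not_


def single_pass(items):
    """One loop, appending each item to the matching list."""
    odd, even = [], []
    for item in items:
        (odd if len(item) % 2 else even).append(item)
    return odd, even


def with_compress(items):
    """Three passes: build the flags, then filter twice with compress."""
    flags = [bool(len(item) % 2) for item in items]
    odd = list(compress(items, flags))
    even = list(compress(items, map(not_, flags)))
    return odd, even


# Hypothetical stand-in for master_list: strings of random length.
random.seed(0)
master_list = ["x" * random.randint(0, 20) for _ in range(20000)]

# Both strategies must agree before their timings mean anything.
assert single_pass(master_list) == with_compress(master_list)

for fn in (single_pass, with_compress):
    seconds = timeit.timeit(lambda: fn(master_list), number=5)
    print(f"{fn.__name__}: {seconds / 5:.4f} s per call")
```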

abarnert · Accepted Answer · 2013-11-26 20:05:16Z

0

If you can use NumPy, this will be a lot simpler. And, as a bonus, it'll also be a lot faster, and it'll use a lot less memory to store your giant array. For example:

>>> import numpy as np
>>> import random
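>>> # create an object array of 1000 arrays of length 1-1000
>>> # (dtype=object is required for ragged input in recent NumPy versions)
>>> a = np.array([np.random.random(random.randint(1, 1000))
...               for _ in range(1000)], dtype=object)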
>>> lengths = np.vectorize(len)(a)
>>> even_flags = lengths % 2 == 0
>>> evens, odds = a[even_flags], a[~even_flags]
>>> len(evens), len(odds)
(502, 498)

answered Nov 26, 2013 at 20:05

abarnert


chepner · Accepted Answer · 2013-11-26 20:06:25Z

You could try using the groupby function in itertools. The key function would be the function that determines if the length of an element is even or not. The iterator returned by groupby consists of key-value tuples, where key is a value returned by the key function (here, True or False) and the value is a sequence of items which all share the same key. Create a dictionary which maps a value returned by the key function to a list, and you can extend the appropriate list with a set of values from the initial iterator.

import itertools

trues = []
falses = []
# Map each key returned by the key function to the list that collects it.
d = {True: trues, False: falses}

def has_even_length(element_in_array):
    return len(element_in_array) % 2 == 0

for k, v in itertools.groupby(master_list, has_even_length):
    d[k].extend(v)

The documentation for groupby says you typically want to make sure the list is sorted on the same key returned by the key function. In this case, it's OK to leave it unsorted; the iterator returned by groupby will just yield more groups, as there could be any number of alternating true/false runs in the sequence.
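A small worked example shows what groupby does with unsorted input. The sample list and the names runs and buckets are hypothetical, chosen only to make the alternating runs visible:

```python
from itertools import groupby


def has_even_length(element):
    return len(element) % 2 == 0


# Hypothetical unsorted data: the key changes three times.
master_list = ["a", "bb", "dd", "ccc", "eeee"]

# groupby starts a new group every time the key changes.
runs = [(key, list(group)) for key, group in groupby(master_list, has_even_length)]
print(runs)
# [(False, ['a']), (True, ['bb', 'dd']), (False, ['ccc']), (True, ['eeee'])]

# Extending a per-key list merges the runs back into two lists.
buckets = {True: [], False: []}
for key, group in groupby(master_list, has_even_length):
    buckets[key].extend(group)

print(buckets[True])   # ['bb', 'dd', 'eeee']
print(buckets[False])  # ['a', 'ccc']
```

The same key shows up in several groups, but because each group is appended to the list for its key, the final two lists still contain every element in its original relative order.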
