This is a followup question to a question I posted here, but it's a very different question, so I thought I would post it separately.

I have a Python script which reads a very large array, and I needed to optimize an operation on each element (see referenced SO question). I now need to split the output array into two separate arrays.

I have the code:

output = [True if (len(element_in_array) % 2) else False for element_in_array in master_list]

which produces a list of length len(master_list) whose entries are True when the length of element_in_array is odd and False when it is even. My problem is that I need to split master_list into two arrays: one containing the elements that correspond to the True entries in output, and another containing the elements that correspond to the False entries.

This can clearly be done with traditional list operations such as append, but I need this to be as fast as possible. I have many millions of elements in master_list, so is there a way to accomplish this without directly looping through master_list and using append to create two new arrays?

Any advice would be greatly appreciated. Thanks!

  • So you're appending all the Trues (the odd-length elements) to the first list and the rest to the second. That should be a single for loop, which is O(n); you really cannot go faster than a linear-time loop here. Commented Nov 26, 2013 at 19:52
  • If you have a very large array, can you use a NumPy array instead of a pure Python list? If so, you can probably do it in simpler code, which takes about 1/10th as long to run, and uses about 1/4th the storage. Commented Nov 26, 2013 at 19:56
  • As a side note, True if foo else False is simpler (and often faster) as bool(foo). Commented Nov 26, 2013 at 20:00
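
For reference, the single linear pass described in the comments might look like the sketch below (Python 3). The function name partition_by_odd_length and the sample data are made up for illustration, and caching the bound append methods is just a common micro-optimization, not something the question requires.

```python
def partition_by_odd_length(master_list):
    """Split items into (odd_length, even_length) lists in one pass."""
    odd, even = [], []
    # Look up the append methods once instead of on every iteration.
    add_odd, add_even = odd.append, even.append
    for item in master_list:
        if len(item) % 2:
            add_odd(item)
        else:
            add_even(item)
    return odd, even


odd, even = partition_by_odd_length(["a", "bb", "ccc", "", "dddd"])
print(odd)   # ['a', 'ccc']
print(even)  # ['bb', '', 'dddd']
```

Any other approach still has to look at every element at least once, so this is a reasonable baseline to time alternatives against.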

3 Answers

You can use itertools.compress:

>>> import random
>>> from itertools import compress, imap
>>> import operator
>>> lis = range(10)
>>> output = [random.choice([True, False]) for _ in xrange(10)]
>>> output
[True, True, False, False, False, False, False, False, False, False]
>>> truthy = list(compress(lis, output))
>>> truthy
[0, 1]
>>> falsy = list(compress(lis, imap(operator.not_,output)))
>>> falsy
[2, 3, 4, 5, 6, 7, 8, 9]
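
The session above is Python 2 (imap, xrange). A rough Python 3 equivalent applied to the question's odd/even-length test could look like this; the sample master_list and the names odd_items and even_items are assumptions for illustration only. In Python 3 the built-in map is already lazy, so it stands in for imap.

```python
from itertools import compress
from operator import not_

# Hypothetical sample data standing in for the real master_list.
master_list = ["a", "bb", "ccc", "", "dddd"]

# True where the element has odd length, as in the question.
output = [bool(len(item) % 2) for item in master_list]

# compress keeps the items whose selector is truthy.
odd_items = list(compress(master_list, output))
# Negating the selectors lazily yields the complementary list.
even_items = list(compress(master_list, map(not_, output)))

print(odd_items)   # ['a', 'ccc']
print(even_items)  # ['bb', '', 'dddd']
```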

Go for NumPy if you want an even faster solution; it also lets you filter an array directly with a boolean array:

>>> import numpy as np
>>> a = np.random.random(10)*10
>>> a
array([ 2.94518349,  0.09536957,  8.74605883,  4.05063779,  2.11192606,
        2.24215582,  7.02203768,  2.1267423 ,  7.6526713 ,  3.81429322])
>>> output = np.array([True, True, False, False, False, False, False, False, False, False])
>>> a[output]
array([ 2.94518349,  0.09536957])
>>> a[~output]
array([ 8.74605883,  4.05063779,  2.11192606,  2.24215582,  7.02203768,
        2.1267423 ,  7.6526713 ,  3.81429322])

Timing comparison:

>>> lis = range(1000)
>>> output = [random.choice([True, False]) for _ in xrange(1000)]
>>> a = np.random.random(1000)*100
>>> output_n = np.array(output)
>>> %timeit list(compress(lis, output))
10000 loops, best of 3: 44.9 us per loop
>>> %timeit a[output_n]
10000 loops, best of 3: 20.9 us per loop
>>> %timeit list(compress(lis, imap(operator.not_,output)))
1000 loops, best of 3: 150 us per loop
>>> %timeit a[~output_n]
10000 loops, best of 3: 28.7 us per loop

3 Comments

This still requires two passes over output, so benchmarking will be needed to see if this is faster than a single pass that appends elements to the appropriate list.
This actually requires 3 passes -- one to determine which list each element is in, one to make the first list, and one to make the second list.
I've just been looking at numpy. Looks like the way to go.
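
As the comments say, only a benchmark settles whether several passes with compress beat one pass with append. A minimal way to measure it on your own data might be the following Python 3 sketch; the function names, the synthetic sample, and the repeat counts are all assumptions, and the numbers will vary by machine and input.

```python
import random
import timeit
from itertools import compress
from operator import not_


def split_single_pass(items):
    """One loop, appending each item to the matching list."""
    odd, even = [], []
    for item in items:
        (odd if len(item) % 2 else even).append(item)
    return odd, even


def split_compress(items):
    """One pass to build the flags, then one compress pass per output list."""
    flags = [len(item) % 2 for item in items]
    odd = list(compress(items, flags))
    even = list(compress(items, map(not_, flags)))
    return odd, even


random.seed(0)
# Synthetic stand-in for master_list: strings of random length 0-20.
sample = ["x" * random.randint(0, 20) for _ in range(10_000)]

# Both strategies must agree before their timings mean anything.
assert split_single_pass(sample) == split_compress(sample)

for fn in (split_single_pass, split_compress):
    seconds = min(timeit.repeat(lambda: fn(sample), number=20, repeat=3))
    print(f"{fn.__name__}: {seconds / 20 * 1e3:.2f} ms per call")
```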

If you can use NumPy, this will be a lot simpler. And, as a bonus, it'll also be a lot faster, and it'll use a lot less memory to store your giant array. For example:

>>> import numpy as np
>>> import random
>>> # create an array of 1000 arrays of length 1-1000
>>> # dtype=object is needed for ragged input on recent NumPy versions
>>> a = np.array([np.random.random(random.randint(1, 1000))
...               for _ in range(1000)], dtype=object)
>>> lengths = np.vectorize(len)(a)
>>> even_flags = lengths % 2 == 0
>>> evens, odds = a[even_flags], a[~even_flags]
>>> len(evens), len(odds)
(502, 498)


You could try using the groupby function in itertools. The key function would be the function that determines if the length of an element is even or not. The iterator returned by groupby consists of key-value tuples, where key is a value returned by the key function (here, True or False) and the value is a sequence of items which all share the same key. Create a dictionary which maps a value returned by the key function to a list, and you can extend the appropriate list with a set of values from the initial iterator.

import itertools

trues = []
falses = []
d = {True: trues, False: falses}

def has_even_length(element_in_array):
    return len(element_in_array) % 2 == 0

# Each group is a run of consecutive items sharing the same key.
for k, v in itertools.groupby(master_list, has_even_length):
    d[k].extend(v)

The documentation for groupby says you typically want to make sure the list is sorted on the same key returned by the key function. In this case, it's OK to leave it unsorted; you'll just get more groups back from the iterator, as there could be any number of alternating True/False runs in the sequence.
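
To see what unsorted input does to groupby, here is a small Python 3 sketch with made-up sample data (the names runs and buckets are illustrative only). Each run of consecutive items with the same key comes back as its own group, and extending the per-key lists merges those runs again.

```python
from itertools import groupby


def has_even_length(item):
    return len(item) % 2 == 0


# Hypothetical unsorted input with alternating even/odd-length runs.
master_list = ["ab", "cd", "e", "fg", "h", "i"]

# Consecutive items with the same key form one group.
runs = [(key, list(group)) for key, group in groupby(master_list, has_even_length)]
print(runs)
# [(True, ['ab', 'cd']), (False, ['e']), (True, ['fg']), (False, ['h', 'i'])]

# Extending a per-key list stitches the separate runs back together.
buckets = {True: [], False: []}
for key, group in groupby(master_list, has_even_length):
    buckets[key].extend(group)

print(buckets[True])   # ['ab', 'cd', 'fg']
print(buckets[False])  # ['e', 'h', 'i']
```

The relative order of elements within each output list is preserved, since the groups are visited in input order.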
