Conditional selection in array

Question

I have following list with arrays:

[array([10,  1,  7,  3]),
 array([ 0, 14, 12, 13]),
 array([ 3, 10,  7,  8]),
 array([7, 5]),
 array([ 5, 12,  3]),
 array([14,  8, 10])]

What I want is to mark rows as "1" or "0", conditional on whether the row matches "10" AND "7" OR "10" AND "3".

np.where(output== 10 & output == 7 ) | (output == 10 & output == 3 ) | (output == 10 & output == 8 ), 1, 0)

returns

array(0)

What's the correct syntax to get into the array of the array?

Expected output:

[ 1, 0, 1, 0, 0, 1 ]

Note: What is output? After training an CountVectorizer/LDA topic classifier in Scikit, the following script assigns topic probabilities to new documents. Topics above the threshold of 0.2 are then stored in an array.

def sortthreshold(x, thresh):
    idx = np.arange(x.size)[x > thresh]
    return idx[np.argsort(x[idx])]

output = []
for x in newdoc:
    y = lda.transform(bowvectorizer.transform([x]))
    output.append(sortthreshold(y[0], 0.2))

Thanks!

That looks like a plain Python list of Numpy arrays. What's output? — PM 2Ring
– PM 2Ring, Commented Aug 4, 2018 at 9:12
Output is an array that is created out of an LDA topic model. The numbers in the array correspond to topics with a topic loading higher than a given threshold. — Christopher
– Christopher, Commented Aug 4, 2018 at 9:14
We (probably) don't need to see the original function that creates output. But we do need you to make the code you've shown us unambiguous and self-consistent. You keep calling output an array, but it looks like a list. And according to the code you just added, output is a list, not an array. — PM 2Ring
– PM 2Ring, Commented Aug 4, 2018 at 9:24

PM 2Ring · Accepted Answer · 2018-08-04 17:37:49Z

Your input data is a plain Python list of Numpy arrays of unequal length, thus it can't be simply converted to a 2D Numpy array, and so it can't be directly processed by Numpy. But it can be process using the usual Python list processing tools.

Here's a list comprehension that uses numpy.isin to test if a row contains any of (3, 7, 8). We first use simple == testing to see if the row contains 10, and only call isin if it does so; the Python and operator will not evaluate its second operand if the first operand is false-ish.

We use np.any to see if any row item passes each test. np.any returns a Boolean value of False or True, but we can pass those values to int to convert them to 0 or 1.

import numpy as np

data = [
    np.array([10, 1, 7, 3]), np.array([0, 14, 12, 13]),
    np.array([3, 10, 7, 8]), np.array([7, 5]),
    np.array([5, 12, 3]), np.array([14, 8, 10]),
]

mask = np.array([3, 7, 8])
result = [int(np.any(row==10) and np.any(np.isin(row, mask)))
    for row in data]

print(result)

output

[1, 0, 1, 0, 0, 1]

I've just performed some timeit tests. Curiously, Reblochon Masque's code is faster on the data given in the question, presumably because of the short-circuiting behaviour of plain Python any, and & or. Also, it appears that numpy.in1d is faster than numpy.isin, even though the docs recommend using the latter in new code.

Here's a new version that's about 10% slower than Reblochon's.

mask = np.array([3, 7, 8])
result = [int(any(row==10) and any(np.in1d(row, mask)))
    for row in data]

Of course, the true speed on large amounts of real data may vary from what my tests indicate. And time may not be an issue: even on my slow old 32 bit single core 2GHz machine I can process the data in the question almost 3000 times in one second.

hpaulj has suggested an even faster way. Here's some timeit test info, comparing the various versions. These tests were performed on my old machine, YMMV.

import numpy as np
from timeit import Timer

the_data = [
    np.array([10, 1, 7, 3]), np.array([0, 14, 12, 13]),
    np.array([3, 10, 7, 8]), np.array([7, 5]),
    np.array([5, 12, 3]), np.array([14, 8, 10]),
]

def rebloch0(data):
    result = []
    for output in data:
        result.append(1 if np.where((any(output == 10) and any(output == 7)) or
            (any(output == 10) and any(output == 3)) or
            (any(output == 10) and any(output == 8)), 1, 0) == True else 0)
    return result

def rebloch1(data):
    result = []
    for output in data:
        result.append(1 if np.where((any(output == 10) and any(output == 7)) or
            (any(output == 10) and any(output == 3)) or
            (any(output == 10) and any(output == 8)), 1, 0) else 0)
    return result

def pm2r0(data):
    mask = np.array([3, 7, 8])
    return [int(np.any(row==10) and np.any(np.isin(row, mask)))
        for row in data]

def pm2r1(data):
    mask = np.array([3, 7, 8])
    return [int(any(row==10) and any(np.in1d(row, mask)))
        for row in data]

def hpaulj0(data):
    mask=np.array([3, 7, 8])
    return [int(any(row==10) and any((row[:, None]==mask).flat))
        for row in data]

def hpaulj1(data, mask=np.array([3, 7, 8])):
    return [int(any(row==10) and any((row[:, None]==mask).flat))
        for row in data]

functions = (
    rebloch0,
    rebloch1,
    pm2r0,
    pm2r1,
    hpaulj0,
    hpaulj1,
)

# Verify that all functions give the same result
for func in functions:
    print('{:8}: {}'.format(func.__name__, func(the_data)))
print()

def time_test(loops, data):
    timings = []
    for func in functions:
        t = Timer(lambda: func(data))
        result = sorted(t.repeat(3, loops))
        timings.append((result, func.__name__))
    timings.sort()
    for result, name in timings:
        print('{:8}: {:.6f}, {:.6f}, {:.6f}'.format(name, *result))
    print()

time_test(1000, the_data)

typical output

rebloch0: [1, 0, 1, 0, 0, 1]
rebloch1: [1, 0, 1, 0, 0, 1]
pm2r0   : [1, 0, 1, 0, 0, 1]
pm2r1   : [1, 0, 1, 0, 0, 1]
hpaulj0 : [1, 0, 1, 0, 0, 1]
hpaulj1 : [1, 0, 1, 0, 0, 1]

hpaulj1 : 0.140421, 0.154910, 0.156105
hpaulj0 : 0.154224, 0.154822, 0.167101
rebloch1: 0.281700, 0.282764, 0.284599
rebloch0: 0.339693, 0.359127, 0.375715
pm2r1   : 0.367677, 0.368826, 0.371599
pm2r0   : 0.626043, 0.628232, 0.670199

Nice work, hpaulj!

A blending of these ideas is even faster: int(any(arr==10) and any((arr[:,None]==[3,7,8]).flat))

Reblochon Masque · Accepted Answer · 2018-08-04 09:39:55Z

1

You need to use np.any combined with np.where, and avoid using | and & which are binary operators in python.

import numpy as np

a = [np.array([10,  1,  7,  3]),
     np.array([ 0, 14, 12, 13]),
     np.array([ 3, 10,  7,  8]),
     np.array([7, 5]),
     np.array([ 5, 12,  3]),
     np.array([14,  8, 10])]

for output in a:
    print(np.where(((any(output == 10) and any(output == 7))) or 
                   (any(output == 10) and any(output == 3)) or
                   (any(output == 10) and any(output == 8 )), 1, 0))

output:

If you want it as a list as the edited question shows:

result = []
for output in a:
    result.append(1 if np.where(((any(output == 10) and any(output == 7))) or 
                   (any(output == 10) and any(output == 3)) or
                   (any(output == 10) and any(output == 8 )), 1, 0) == True else 0)

result

result:

[1, 0, 1, 0, 0, 1]

edited Aug 4, 2018 at 9:39

answered Aug 4, 2018 at 9:34

Reblochon Masque

37k10 gold badges60 silver badges85 bronze badges

8 Comments

PM 2Ring Over a year ago

FWIW, in Numpy, & and | can be used for logical operations. However, unlike and and or they do not short-circuit.

Reblochon Masque Over a year ago

I did not know that about & and | with numpy, thank you @PM2Ring

PM 2Ring Over a year ago

No worries. It is a little surprising. Of course, Numpy can't use the traditional C operators && and || because the Python parser would reject them.

PM 2Ring Over a year ago

I just did some timeit tests. Your code is nearly twice as fast as my original version on the OP data.

PM 2Ring Over a year ago

hpaulj made a suggestion that really speeds things up. I've added a timeit test to my answer.

|

Collectives™ on Stack Overflow

Conditional selection in array

2 Answers 2

2 Comments

output:

result:

8 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

2 Comments

output:

result:

8 Comments

Your Answer

Sign up or log in

Post as a guest

Related