I have the following list of arrays:

[array([10,  1,  7,  3]),
 array([ 0, 14, 12, 13]),
 array([ 3, 10,  7,  8]),
 array([7, 5]),
 array([ 5, 12,  3]),
 array([14,  8, 10])]

What I want is to mark rows as "1" or "0", conditional on whether the row matches "10" AND "7" OR "10" AND "3".

np.where(output== 10 & output == 7 ) | (output == 10 & output == 3 ) | (output == 10 & output == 8 ), 1, 0)

returns

array(0)

What's the correct syntax for testing each array inside the list?

Expected output:

[ 1, 0, 1, 0, 0, 1 ]

Note: What is output? After training a CountVectorizer/LDA topic classifier in scikit-learn, the following script assigns topic probabilities to new documents. Topics above the threshold of 0.2 are then stored in an array.

def sortthreshold(x, thresh):
    idx = np.arange(x.size)[x > thresh]
    return idx[np.argsort(x[idx])]

output = []
for x in newdoc:
    y = lda.transform(bowvectorizer.transform([x]))
    output.append(sortthreshold(y[0], 0.2))

Thanks!

  • by mark did you mean to replace the value as 0 or 1? Commented Aug 4, 2018 at 9:03
  • That looks like a plain Python list of Numpy arrays. What's output? Commented Aug 4, 2018 at 9:12
  • Output is an array that is created out of an LDA topic model. The numbers in the array correspond to topics with a topic loading higher than a given threshold. Commented Aug 4, 2018 at 9:14
  • How come [14, 8, 10] matches? It has a 10, but no 7 or 3. Commented Aug 4, 2018 at 9:19
  • We (probably) don't need to see the original function that creates output. But we do need you to make the code you've shown us unambiguous and self-consistent. You keep calling output an array, but it looks like a list. And according to the code you just added, output is a list, not an array. Commented Aug 4, 2018 at 9:24

2 Answers

2

Your input data is a plain Python list of Numpy arrays of unequal length, so it can't simply be converted to a 2D Numpy array, and it can't be processed directly by Numpy. But it can be processed using the usual Python list processing tools.
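A minimal sketch of why the attempt in the question plausibly ends up as array(0): comparing a whole list to a number with == is an ordinary Python comparison that yields a single False, and np.where on a scalar condition returns a zero-dimensional array. The variable names below are illustrative only, and the exact expression in the question may behave differently.

```python
import numpy as np

# A plain Python list holding NumPy arrays of unequal length.
output = [np.array([10, 1, 7, 3]), np.array([7, 5])]

# list == int is a plain Python comparison: one bool, not elementwise.
whole_list = (output == 10)

# np.where with a scalar condition returns a zero-dimensional array.
scalar_where = np.where(whole_list, 1, 0)

# Comparing each array separately does work elementwise.
per_row = [(row == 10).tolist() for row in output]

print(whole_list, repr(scalar_where), per_row)
```

This is why the test has to be applied row by row, as in the list comprehension that follows.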

Here's a list comprehension that uses numpy.isin to test if a row contains any of (3, 7, 8). We first use a simple == test to see if the row contains 10, and only call isin if it does; the Python and operator does not evaluate its second operand if the first operand is falsy.

We use np.any to see if any row item passes each test. np.any returns a Boolean value of False or True, but we can pass those values to int to convert them to 0 or 1.

import numpy as np

data = [
    np.array([10, 1, 7, 3]), np.array([0, 14, 12, 13]),
    np.array([3, 10, 7, 8]), np.array([7, 5]),
    np.array([5, 12, 3]), np.array([14, 8, 10]),
]

mask = np.array([3, 7, 8])
result = [int(np.any(row==10) and np.any(np.isin(row, mask)))
    for row in data]

print(result)

output

[1, 0, 1, 0, 0, 1] 

I've just performed some timeit tests. Curiously, Reblochon Masque's code is faster on the data given in the question, presumably because of the short-circuiting behaviour of plain Python any, and, and or. Also, it appears that numpy.in1d is faster than numpy.isin, even though the docs recommend using the latter in new code.

Here's a new version that's about 10% slower than Reblochon's.

mask = np.array([3, 7, 8])
result = [int(any(row==10) and any(np.in1d(row, mask)))
    for row in data]

Of course, the true speed on large amounts of real data may vary from what my tests indicate. And time may not be an issue: even on my slow old 32 bit single core 2GHz machine I can process the data in the question almost 3000 times in one second.


hpaulj has suggested an even faster way. Here's some timeit test info, comparing the various versions. These tests were performed on my old machine, YMMV.

import numpy as np
from timeit import Timer

the_data = [
    np.array([10, 1, 7, 3]), np.array([0, 14, 12, 13]),
    np.array([3, 10, 7, 8]), np.array([7, 5]),
    np.array([5, 12, 3]), np.array([14, 8, 10]),
]

def rebloch0(data):
    result = []
    for output in data:
        result.append(1 if np.where((any(output == 10) and any(output == 7)) or
            (any(output == 10) and any(output == 3)) or
            (any(output == 10) and any(output == 8)), 1, 0) == True else 0)
    return result

def rebloch1(data):
    result = []
    for output in data:
        result.append(1 if np.where((any(output == 10) and any(output == 7)) or
            (any(output == 10) and any(output == 3)) or
            (any(output == 10) and any(output == 8)), 1, 0) else 0)
    return result

def pm2r0(data):
    mask = np.array([3, 7, 8])
    return [int(np.any(row==10) and np.any(np.isin(row, mask)))
        for row in data]

def pm2r1(data):
    mask = np.array([3, 7, 8])
    return [int(any(row==10) and any(np.in1d(row, mask)))
        for row in data]

def hpaulj0(data):
    mask=np.array([3, 7, 8])
    return [int(any(row==10) and any((row[:, None]==mask).flat))
        for row in data]

def hpaulj1(data, mask=np.array([3, 7, 8])):
    return [int(any(row==10) and any((row[:, None]==mask).flat))
        for row in data]

functions = (
    rebloch0,
    rebloch1,
    pm2r0,
    pm2r1,
    hpaulj0,
    hpaulj1,
)

# Verify that all functions give the same result
for func in functions:
    print('{:8}: {}'.format(func.__name__, func(the_data)))
print()

def time_test(loops, data):
    timings = []
    for func in functions:
        t = Timer(lambda: func(data))
        result = sorted(t.repeat(3, loops))
        timings.append((result, func.__name__))
    timings.sort()
    for result, name in timings:
        print('{:8}: {:.6f}, {:.6f}, {:.6f}'.format(name, *result))
    print()

time_test(1000, the_data)

typical output

rebloch0: [1, 0, 1, 0, 0, 1]
rebloch1: [1, 0, 1, 0, 0, 1]
pm2r0   : [1, 0, 1, 0, 0, 1]
pm2r1   : [1, 0, 1, 0, 0, 1]
hpaulj0 : [1, 0, 1, 0, 0, 1]
hpaulj1 : [1, 0, 1, 0, 0, 1]

hpaulj1 : 0.140421, 0.154910, 0.156105
hpaulj0 : 0.154224, 0.154822, 0.167101
rebloch1: 0.281700, 0.282764, 0.284599
rebloch0: 0.339693, 0.359127, 0.375715
pm2r1   : 0.367677, 0.368826, 0.371599
pm2r0   : 0.626043, 0.628232, 0.670199

Nice work, hpaulj!
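For readers unfamiliar with the row[:, None] == mask idiom used in hpaulj0 and hpaulj1, here is a small sketch of what it computes for a single row. It uses the .any() method rather than the built-in any over .flat, so it illustrates the logic, not the timing; the names table and flag are just for this example.

```python
import numpy as np

row = np.array([3, 10, 7, 8])
mask = np.array([3, 7, 8])

# row[:, None] has shape (4, 1); comparing it with mask (shape (3,))
# broadcasts to a (4, 3) table where table[i, j] is row[i] == mask[j].
table = row[:, None] == mask

has_10 = bool((row == 10).any())
has_mask_value = bool(table.any())

# Plain Python `and` short-circuits: the table result only matters if 10 is present.
flag = int(has_10 and has_mask_value)

print(table.shape, flag)
```

Any True in the table means at least one row element equals one of the mask values, which is the same test that isin and in1d perform.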


2 Comments

A blending of these ideas is even faster: int(any(arr==10) and any((arr[:,None]==[3,7,8]).flat))
@hpaulj Wow, that's impressive!
1

You need to use np.any combined with np.where, and avoid using | and &, which are bitwise operators in Python.

import numpy as np

a = [np.array([10,  1,  7,  3]),
     np.array([ 0, 14, 12, 13]),
     np.array([ 3, 10,  7,  8]),
     np.array([7, 5]),
     np.array([ 5, 12,  3]),
     np.array([14,  8, 10])]

for output in a:
    print(np.where(((any(output == 10) and any(output == 7))) or 
                   (any(output == 10) and any(output == 3)) or
                   (any(output == 10) and any(output == 8 )), 1, 0))

output:

1
0
1
0
0
1

If you want it as a list as the edited question shows:

result = []
for output in a:
    result.append(1 if np.where(((any(output == 10) and any(output == 7))) or 
                   (any(output == 10) and any(output == 3)) or
                   (any(output == 10) and any(output == 8 )), 1, 0) == True else 0)

result

result:

[1, 0, 1, 0, 0, 1]
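As an aside, | and & can also be made to work here, provided every comparison is parenthesized. In Python, & and | bind more tightly than ==, so an unparenthesized row == 10 & row == 7 is parsed as the chained comparison row == (10 & row) == 7. The sketch below is an assumption-labelled alternative, not a replacement for the code above; the helper name flag is hypothetical.

```python
import numpy as np

rows = [np.array([10, 1, 7, 3]), np.array([0, 14, 12, 13]),
        np.array([3, 10, 7, 8]), np.array([7, 5]),
        np.array([5, 12, 3]), np.array([14, 8, 10])]

def flag(row):
    # Each .any() yields a single NumPy boolean, so & and | act as
    # logical and/or here. Unlike `and`/`or`, they do not short-circuit.
    has_10 = (row == 10).any()
    has_other = (row == 7).any() | (row == 3).any() | (row == 8).any()
    return int(has_10 & has_other)

result = [flag(row) for row in rows]
print(result)
```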

8 Comments

FWIW, in Numpy, & and | can be used for logical operations. However, unlike and and or they do not short-circuit.
I did not know that about & and | with numpy, thank you @PM2Ring
No worries. It is a little surprising. Of course, Numpy can't use the traditional C operators && and || because the Python parser would reject them.
I just did some timeit tests. Your code is nearly twice as fast as my original version on the OP data.
hpaulj made a suggestion that really speeds things up. I've added a timeit test to my answer.