Count how many times each row is present in numpy.array

Question

I am trying to count the number of times each row appears in a np.array, for example:

import numpy as np
my_array = np.array([[1, 2, 0, 1, 1, 1],
                     [1, 2, 0, 1, 1, 1], # duplicate of row 0
                     [9, 7, 5, 3, 2, 1],
                     [1, 1, 1, 0, 0, 0], 
                     [1, 2, 0, 1, 1, 1], # duplicate of row 0
                     [1, 1, 1, 1, 1, 0]])

Row [1, 2, 0, 1, 1, 1] shows up 3 times.

A simple naive solution would involve converting all my rows to tuples, and applying collections.Counter, like this:

from collections import Counter
def row_counter(my_array):
    list_of_tups = [tuple(ele) for ele in my_array]
    return Counter(list_of_tups)

Which yields:

In [2]: row_counter(my_array)
Out[2]: Counter({(1, 2, 0, 1, 1, 1): 3, (1, 1, 1, 1, 1, 0): 1, (9, 7, 5, 3, 2, 1): 1, (1, 1, 1, 0, 0, 0): 1})

However, I am concerned about the efficiency of my approach, and there may be a library that provides a built-in way of doing this. I tagged the question as pandas because I think pandas might have the tool I am looking for.

I like this problem! You may be able to use np.lexsort to your advantage, but I am not sure whether the collection after sorting can be done fast enough.
– eickenberg, Commented Nov 18, 2014 at 20:49

Yuya Takashina · Accepted Answer · 2018-09-28 09:57:56Z

15

I think just specifying axis in np.unique gives what you need.

import numpy as np
unq, cnt = np.unique(my_array, axis=0, return_counts=True)

Note: this feature is available only in numpy>=1.13.0.
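As a minimal sketch, here is the call above applied to the array from the question. The `row_counts` dictionary is only an illustrative convenience for looking up a row by value; it is not part of the original answer.

```python
import numpy as np

my_array = np.array([[1, 2, 0, 1, 1, 1],
                     [1, 2, 0, 1, 1, 1],
                     [9, 7, 5, 3, 2, 1],
                     [1, 1, 1, 0, 0, 0],
                     [1, 2, 0, 1, 1, 1],
                     [1, 1, 1, 1, 1, 0]])

# unq holds the distinct rows (sorted lexicographically),
# cnt[i] is the number of times unq[i] occurs in my_array.
unq, cnt = np.unique(my_array, axis=0, return_counts=True)

# Hypothetical helper mapping: row (as a tuple of ints) -> count.
row_counts = {tuple(int(v) for v in row): int(c) for row, c in zip(unq, cnt)}
print(row_counts[(1, 2, 0, 1, 1, 1)])  # 3
```

Unlike the broadcasting approach, this returns one count per distinct row rather than one count per input row.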

edited Sep 28, 2018 at 9:57

answered Sep 28, 2018 at 8:45

Yuya Takashina


1 Comment

jwalton Over a year ago

This seems to be the best solution for numpy>=1.13.0.

Jaime · Accepted Answer · 2017-05-23 12:13:57Z

13

You can use the answer to this other question of yours to get the counts of the unique items.

Since numpy 1.9, np.unique accepts an optional return_counts keyword argument, so you can simply do:

>>> my_array
array([[1, 2, 0, 1, 1, 1],
       [1, 2, 0, 1, 1, 1],
       [9, 7, 5, 3, 2, 1],
       [1, 1, 1, 0, 0, 0],
       [1, 2, 0, 1, 1, 1],
       [1, 1, 1, 1, 1, 0]])
>>> dt = np.dtype((np.void, my_array.dtype.itemsize * my_array.shape[1]))
>>> b = np.ascontiguousarray(my_array).view(dt)
>>> unq, cnt = np.unique(b, return_counts=True)
>>> unq = unq.view(my_array.dtype).reshape(-1, my_array.shape[1])
>>> unq
array([[1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 0],
       [1, 2, 0, 1, 1, 1],
       [9, 7, 5, 3, 2, 1]])
>>> cnt
array([1, 1, 3, 1])

In earlier versions, you can do it as:

>>> unq, inv = np.unique(b, return_inverse=True)
>>> cnt = np.bincount(inv.ravel())
>>> unq = unq.view(my_array.dtype).reshape(-1, my_array.shape[1])
>>> unq
array([[1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 0],
       [1, 2, 0, 1, 1, 1],
       [9, 7, 5, 3, 2, 1]])
>>> cnt
array([1, 1, 3, 1])

edited May 23, 2017 at 12:13

CommunityBot


answered Nov 19, 2014 at 2:15

Jaime


3 Comments

hpaulj Over a year ago

The last reshape can be simplified a bit with: unq.view((my_array.dtype, my_array.shape[1])); it uses the same sort of multi-item dtype as the first view.

endolith Over a year ago

Does this have a benefit over np.unique with axis parameter? (Which may have been added after this question was written)

Charlie Vagg Over a year ago

I keep getting a TypeError: "The axis argument to unique is not supported for dtype object". How can I fix this?

Alex Riley · Accepted Answer · 2015-12-06 21:53:48Z

5

(This assumes that the array is fairly small, e.g. fewer than 1000 rows.)

Here's a short NumPy way to count how many times each row appears in an array:

>>> (my_array[:, np.newaxis] == my_array).all(axis=2).sum(axis=1)
array([3, 3, 1, 1, 3, 1])

This counts how many times each row appears in my_array, returning an array where the first value shows how many times the first row appears, the second value shows how many times the second row appears, and so on.
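To make the one-liner easier to follow, the sketch below splits it into its intermediate steps on the question's array. The names `pairwise` and `per_row_counts` are illustrative only.

```python
import numpy as np

my_array = np.array([[1, 2, 0, 1, 1, 1],
                     [1, 2, 0, 1, 1, 1],
                     [9, 7, 5, 3, 2, 1],
                     [1, 1, 1, 0, 0, 0],
                     [1, 2, 0, 1, 1, 1],
                     [1, 1, 1, 1, 1, 0]])

# Broadcasting (n, 1, m) against (n, m) gives an (n, n, m) boolean array:
# elementwise[i, j, k] is True when my_array[i, k] == my_array[j, k].
elementwise = my_array[:, np.newaxis, :] == my_array

# pairwise[i, j] is True when row i and row j agree in every column.
pairwise = elementwise.all(axis=2)

# Summing along axis 1 counts, for each row i, how many rows equal it
# (including row i itself).
per_row_counts = pairwise.sum(axis=1)
print(per_row_counts)  # [3 3 1 1 3 1]
```

The intermediate comparison holds n * n * m booleans, which is why this approach only suits arrays with a modest number of rows.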

edited Dec 6, 2015 at 21:53

answered Nov 18, 2014 at 18:13

Alex Riley


1 Comment

gboffi Over a year ago

With n=np.arange(my_array.shape[0]) one can also obtain a nice result by writing [n[ui] for ui in (my_array[:,np.newaxis,:] == my_array).all(axis=2)]... Nice answer. I have already half understood it, but what puzzles me is how you came up with the solution!

Bob Haffner · Accepted Answer · 2014-11-18 17:32:09Z

3

A pandas approach might look like this:

import pandas as pd

df = pd.DataFrame(my_array,columns=['c1','c2','c3','c4','c5','c6'])
df.groupby(['c1','c2','c3','c4','c5','c6']).size()

Note: supplying column names is not necessary.
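A sketch of the same idea without explicit column names, assuming pandas assigns its default integer column labels (0 to 5 here) and that they are passed to groupby as a plain list. The `as_dict` conversion is only an illustrative way to inspect the result.

```python
import numpy as np
import pandas as pd

my_array = np.array([[1, 2, 0, 1, 1, 1],
                     [1, 2, 0, 1, 1, 1],
                     [9, 7, 5, 3, 2, 1],
                     [1, 1, 1, 0, 0, 0],
                     [1, 2, 0, 1, 1, 1],
                     [1, 1, 1, 1, 1, 0]])

df = pd.DataFrame(my_array)  # columns default to 0, 1, ..., 5

# Group on every column; size() gives the number of rows in each group.
# The grouping keys are passed as a list, not as a NumPy array.
counts = df.groupby(list(df.columns)).size()

# Illustrative conversion: MultiIndex tuple -> count.
as_dict = {tuple(int(x) for x in key): int(n) for key, n in counts.items()}
print(as_dict[(1, 2, 0, 1, 1, 1)])  # 3
```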

answered Nov 18, 2014 at 17:32

Bob Haffner


4 Comments

JD Long Over a year ago

I have no idea why this got downvoted. This is a good example of how to do this using pandas.

Akavall Over a year ago

Can you show how you would do it without supplying columns names?

Bob Haffner Over a year ago

Just omit the columns arg in DataFrame() and use [0,1,2,3,4,5] in the groupby(). [0,1,2,3,4,5] will be the default column names that pandas assigns.

Akavall Over a year ago

Got it! Thanks, I was trying to pass np.arange(6), and that was not giving me what I wanted, but passing a list works. Thanks.

elyase · Accepted Answer · 2014-11-18 18:36:38Z

Your solution is not bad, but if your matrix is large you will probably want to use a more efficient hash (compared to the default one Counter uses) for the rows before counting. You can do that with joblib:

A = np.random.rand(5, 10000)

%timeit (A[:,np.newaxis,:] == A).all(axis=2).sum(axis=1)
10000 loops, best of 3: 132 µs per loop

%timeit Counter(joblib.hash(row) for row in A).values()
1000 loops, best of 3: 1.37 ms per loop

%timeit Counter(tuple(ele) for ele in A).values()
100 loops, best of 3: 3.75 ms per loop

%timeit pd.DataFrame(A).groupby(list(range(A.shape[1]))).size()
1 loops, best of 3: 2.24 s per loop

The pandas solution is extremely slow (about 2 s per loop) with this many columns. For a small matrix like the one you showed, your method is faster than joblib hashing but slower than numpy:

numpy:  100000 loops, best of 3: 15.1 µs per loop
joblib: 1000 loops, best of 3: 885 µs per loop
tuple:  10000 loops, best of 3: 27 µs per loop
pandas: 100 loops, best of 3: 2.2 ms per loop

If you have a large number of rows then you can probably find a better substitute for Counter to find hash frequencies.
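One possible variation on the Counter approach, which is not taken from the answer and not benchmarked here, is to use each row's raw bytes as the dictionary key instead of a tuple or a joblib hash. The function name `row_counter_bytes` is hypothetical, and whether this is actually faster on your data is an assumption you would need to time yourself.

```python
from collections import Counter

import numpy as np


def row_counter_bytes(arr):
    """Count rows using each row's raw bytes as the hashable key."""
    arr = np.ascontiguousarray(arr)
    return Counter(row.tobytes() for row in arr)


my_array = np.array([[1, 2, 0, 1, 1, 1],
                     [1, 2, 0, 1, 1, 1],
                     [9, 7, 5, 3, 2, 1],
                     [1, 1, 1, 0, 0, 0],
                     [1, 2, 0, 1, 1, 1],
                     [1, 1, 1, 1, 1, 0]])

counts = row_counter_bytes(my_array)

# Decode the byte keys back into rows for display.
decoded = {tuple(np.frombuffer(key, dtype=my_array.dtype).tolist()): n
           for key, n in counts.items()}
print(decoded[(1, 2, 0, 1, 1, 1)])  # 3
```

Note that byte keys only compare equal for rows with the same dtype and length, so all rows must come from the same array or from arrays with an identical dtype.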

Edit: Added numpy benchmarks from @acjr's solution on my system so that it is easier to compare. The numpy solution is the fastest one in both cases.

Eelco Hoogendoorn · Accepted Answer · 2016-04-02 19:28:29Z

0

A solution identical to Jaime's can be found in the numpy_indexed package (disclaimer: I am its author):

import numpy_indexed as npi
npi.count(my_array)

answered Apr 2, 2016 at 19:28

Eelco Hoogendoorn
