19

I am trying to count the number of times each row appears in a np.array, for example:

import numpy as np
my_array = np.array([[1, 2, 0, 1, 1, 1],
                     [1, 2, 0, 1, 1, 1], # duplicate of row 0
                     [9, 7, 5, 3, 2, 1],
                     [1, 1, 1, 0, 0, 0], 
                     [1, 2, 0, 1, 1, 1], # duplicate of row 0
                     [1, 1, 1, 1, 1, 0]])

Row [1, 2, 0, 1, 1, 1] shows up 3 times.

A simple naive solution would involve converting all my rows to tuples, and applying collections.Counter, like this:

from collections import Counter
def row_counter(my_array):
    list_of_tups = [tuple(ele) for ele in my_array]
    return Counter(list_of_tups)

Which yields:

In [2]: row_counter(my_array)
Out[2]: Counter({(1, 2, 0, 1, 1, 1): 3, (1, 1, 1, 1, 1, 0): 1, (9, 7, 5, 3, 2, 1): 1, (1, 1, 1, 0, 0, 0): 1})

However, I am concerned about the efficiency of my approach, and there may be a library that provides a built-in way of doing this. I tagged the question as pandas because I think that pandas might have the tool I am looking for.

1
  • I like this problem! You may be able to use np.lexsort to your advantage, but I am not sure whether the collection after sorting can be done fast enough. Commented Nov 18, 2014 at 20:49
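
The np.lexsort idea from the comment above could look roughly like the sketch below. It is only one possible reading of that suggestion: the variable names (order, sorted_rows, is_new, starts) are made up for illustration, and nothing here has been benchmarked against the other approaches.

```python
import numpy as np

my_array = np.array([[1, 2, 0, 1, 1, 1],
                     [1, 2, 0, 1, 1, 1],
                     [9, 7, 5, 3, 2, 1],
                     [1, 1, 1, 0, 0, 0],
                     [1, 2, 0, 1, 1, 1],
                     [1, 1, 1, 1, 1, 0]])

# np.lexsort treats the LAST key as the primary one, so the columns are
# passed in reverse order to make column 0 the primary sort key.
order = np.lexsort(my_array.T[::-1])
sorted_rows = my_array[order]

# After sorting, identical rows are adjacent. Mark every position where
# a row differs from the one before it (the first row always starts a group).
is_new = np.ones(len(sorted_rows), dtype=bool)
is_new[1:] = (sorted_rows[1:] != sorted_rows[:-1]).any(axis=1)

# Group start indices; the gaps between consecutive starts are the counts.
starts = np.flatnonzero(is_new)
counts = np.diff(np.append(starts, len(sorted_rows)))
unique_rows = sorted_rows[starts]

print(unique_rows)
print(counts)
```

The sort only has to put equal rows next to each other, so the collection step afterwards is a single vectorized comparison of neighbouring rows.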

6 Answers

15

I think just specifying axis in np.unique gives what you need.

import numpy as np
unq, cnt = np.unique(my_array, axis=0, return_counts=True)

Note: this feature is available only in numpy>=1.13.0.
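
For completeness, here is a self-contained sketch of the call above applied to the array from the question. The dictionary row_counts is not part of NumPy; it is just an illustrative way of pairing each unique row with its count, similar to the Counter output in the question.

```python
import numpy as np

my_array = np.array([[1, 2, 0, 1, 1, 1],
                     [1, 2, 0, 1, 1, 1],
                     [9, 7, 5, 3, 2, 1],
                     [1, 1, 1, 0, 0, 0],
                     [1, 2, 0, 1, 1, 1],
                     [1, 1, 1, 1, 1, 0]])

# axis=0 makes np.unique treat each row as one item (numpy>=1.13.0).
unq, cnt = np.unique(my_array, axis=0, return_counts=True)

# Illustrative helper structure: map each unique row (as a tuple) to its count.
row_counts = {tuple(row.tolist()): int(c) for row, c in zip(unq, cnt)}

print(unq)
print(cnt)
print(row_counts[(1, 2, 0, 1, 1, 1)])
```

unq holds the unique rows in sorted order and cnt holds the matching counts, so the two arrays line up index by index.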

1 Comment

This seems to be the best solution for numpy>=1.13.0.
13

You can use the answer to this other question of yours to get the counts of the unique items.

NumPy 1.9 added an optional return_counts keyword argument to np.unique, so you can simply do:

>>> my_array
array([[1, 2, 0, 1, 1, 1],
       [1, 2, 0, 1, 1, 1],
       [9, 7, 5, 3, 2, 1],
       [1, 1, 1, 0, 0, 0],
       [1, 2, 0, 1, 1, 1],
       [1, 1, 1, 1, 1, 0]])
>>> dt = np.dtype((np.void, my_array.dtype.itemsize * my_array.shape[1]))
>>> b = np.ascontiguousarray(my_array).view(dt)
>>> unq, cnt = np.unique(b, return_counts=True)
>>> unq = unq.view(my_array.dtype).reshape(-1, my_array.shape[1])
>>> unq
array([[1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 0],
       [1, 2, 0, 1, 1, 1],
       [9, 7, 5, 3, 2, 1]])
>>> cnt
array([1, 1, 3, 1])

In earlier versions, you can do it as:

>>> unq, inv = np.unique(b, return_inverse=True)
>>> cnt = np.bincount(inv)  # inv maps each row to the index of its unique row
>>> unq = unq.view(my_array.dtype).reshape(-1, my_array.shape[1])
>>> unq
array([[1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 0],
       [1, 2, 0, 1, 1, 1],
       [9, 7, 5, 3, 2, 1]])
>>> cnt
array([1, 1, 3, 1])

3 Comments

The last reshape can be simplified a bit with: unq.view((my_array.dtype, my_array.shape[1])); it uses the same sort of multi-item dtype as the first view.
Does this have a benefit over np.unique with axis parameter? (Which may have been added after this question was written)
I keep getting a TypeError: "The axis argument to unique is not supported for dtype object". How can I fix this?
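
On the TypeError raised in the last comment: one possible workaround, assuming every element of the object array really is an integer, is to cast to a concrete dtype before calling np.unique with axis. The array obj_array below is made up for illustration and is not from the question.

```python
import numpy as np

# Hypothetical object-dtype array, e.g. as produced by loading mixed data.
obj_array = np.array([[1, 2], [1, 2], [3, 4]], dtype=object)

# Cast to a concrete numeric dtype first. This only works if every
# element can actually be converted to an integer.
int_array = obj_array.astype(np.int64)

unq, cnt = np.unique(int_array, axis=0, return_counts=True)
print(unq)
print(cnt)
```

If the rows hold values that cannot share one numeric dtype, the tuple-and-Counter approach from the question still applies.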
5

(This assumes that the array is fairly small, e.g. fewer than 1000 rows.)

Here's a short NumPy way to count how many times each row appears in an array:

>>> (my_array[:, np.newaxis] == my_array).all(axis=2).sum(axis=1)
array([3, 3, 1, 1, 3, 1])

The result has one entry per input row: the first value is how many times the first row appears, the second value is how many times the second row appears, and so on.
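
To see why this works, and why it only suits small arrays, the one-liner can be split into steps. This is just the same expression unrolled; the intermediate names eq and same are introduced here for illustration.

```python
import numpy as np

my_array = np.array([[1, 2, 0, 1, 1, 1],
                     [1, 2, 0, 1, 1, 1],
                     [9, 7, 5, 3, 2, 1],
                     [1, 1, 1, 0, 0, 0],
                     [1, 2, 0, 1, 1, 1],
                     [1, 1, 1, 1, 1, 0]])

# Shape (n, 1, m) compared with (n, m) broadcasts to (n, n, m):
# eq[i, j, k] is True when my_array[i, k] == my_array[j, k].
eq = my_array[:, np.newaxis] == my_array

# Collapse the column axis: same[i, j] is True when rows i and j are identical.
same = eq.all(axis=2)

# Summing along axis 1 counts, for each row, the rows equal to it (itself included).
counts = same.sum(axis=1)

print(eq.shape)
print(counts)
```

The intermediate boolean array has n * n * m elements, which is what limits this approach to arrays with few rows.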

1 Comment

With n=np.arange(my_array.shape[0]) one can also obtain a nice result by writing [n[ui] for ui in (my_array[:,np.newaxis,:] == my_array).all(axis=2)]... Nice answer; I have already half understood it, but what puzzles me is how you came up with the solution!
3

A pandas approach might look like this:

import pandas as pd

df = pd.DataFrame(my_array,columns=['c1','c2','c3','c4','c5','c6'])
df.groupby(['c1','c2','c3','c4','c5','c6']).size()

Note: supplying column names is not necessary.

4 Comments

I have no idea why this got downvoted. This is a good example of how to do this using pandas.
Can you show how you would do it without supplying columns names?
Just omit the columns arg in DataFrame() and use [0,1,2,3,4,5] in groupby(). [0,1,2,3,4,5] will be the default column names that pandas assigns.
Got it! Thanks, I was trying to pass np.arange(6), and that was not giving me what I wanted, but passing a list works. Thanks.
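
The variant discussed in these comments, without explicit column names, could be sketched as follows. Using list(df.columns) instead of typing [0,1,2,3,4,5] by hand is a choice made here for illustration; it yields the same list of default labels.

```python
import numpy as np
import pandas as pd

my_array = np.array([[1, 2, 0, 1, 1, 1],
                     [1, 2, 0, 1, 1, 1],
                     [9, 7, 5, 3, 2, 1],
                     [1, 1, 1, 0, 0, 0],
                     [1, 2, 0, 1, 1, 1],
                     [1, 1, 1, 1, 1, 0]])

# Without a columns argument, pandas labels the columns 0, 1, ..., 5.
df = pd.DataFrame(my_array)

# Pass the labels as a plain list so they are read as column names.
sizes = df.groupby(list(df.columns)).size()

print(sizes)
print(sizes.loc[(1, 2, 0, 1, 1, 1)])
```

The result is a Series indexed by the row values, so an individual count can be looked up with a tuple.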
3

Your solution is not bad, but if your matrix is large you will probably want to hash the rows with a more efficient hash (compared to the default one Counter uses) before counting. You can do that with joblib:

import numpy as np
import pandas as pd
import joblib
from collections import Counter

A = np.random.rand(5, 10000)

%timeit (A[:,np.newaxis,:] == A).all(axis=2).sum(axis=1)
10000 loops, best of 3: 132 µs per loop

%timeit Counter(joblib.hash(row) for row in A).values()
1000 loops, best of 3: 1.37 ms per loop

%timeit Counter(tuple(ele) for ele in A).values()
100 loops, best of 3: 3.75 ms per loop

%timeit pd.DataFrame(A).groupby(range(A.shape[1])).size()
1 loops, best of 3: 2.24 s per loop

The pandas solution is extremely slow (about 2 s per loop) with this many columns. For a small matrix like the one you showed, your method is faster than joblib hashing but slower than NumPy:

numpy:  100000 loops, best of 3: 15.1 µs per loop
joblib: 1000 loops, best of 3: 885 µs per loop
tuple:  10000 loops, best of 3: 27 µs per loop
pandas: 100 loops, best of 3: 2.2 ms per loop

If you have a large number of rows then you can probably find a better substitute for Counter to find hash frequencies.

Edit: Added NumPy benchmarks for @acjr's solution, run on my system, so that the results are easier to compare. The NumPy solution is the fastest in both cases.

0

A solution identical to Jaime's can be found in the numpy_indexed package (disclaimer: I am its author)

import numpy_indexed as npi
npi.count(my_array)
