
I have the following type of arrays:

a = array([[1,1,1],
           [1,1,1],
           [1,1,1],
           [2,2,2],
           [2,2,2],
           [2,2,2],
           [3,3,0],
           [3,3,0],
           [3,3,0]])

I would like to count the number of occurrences of each distinct row, such as

[1,1,1]: 3, [2,2,2]: 3, and [3,3,0]: 3

How could I achieve this in Python? Is it possible without using a for loop and counting into a dictionary? It has to be fast and should take less than 0.1 seconds or so. I looked into Counter, numpy's bincount, etc., but those work on individual elements, not on whole rows.

Thanks.


5 Answers


If you don't mind mapping to tuples just to get the count, you can use a Counter dict, which runs in 28.5 µs on my machine using Python 3 and is well below your threshold:

In [5]: timeit Counter(map(tuple, a))
10000 loops, best of 3: 28.5 µs per loop

In [6]: c = Counter(map(tuple, a))

In [7]: c
Out[7]: Counter({(2, 2, 2): 3, (1, 1, 1): 3, (3, 3, 0): 3})

1 Comment

No prob, it is actually even faster in Python 2 just using map

collections.Counter can do this conveniently, and almost like the example given.

>>> from collections import Counter
>>> c = Counter()
>>> for x in a:
...   c[tuple(x)] += 1
...
>>> c
Counter({(2, 2, 2): 3, (1, 1, 1): 3, (3, 3, 0): 3})

This converts each sub-array to a tuple, which can be used as a dictionary key since tuples are immutable. Lists are mutable, so they can't be used as dict keys.
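As a quick illustration of that difference (a minimal sketch; d is just a throwaway dict):

```python
# Tuples are hashable, so they work as dict keys
d = {}
d[(1, 1, 1)] = 3
print(d[(1, 1, 1)])  # 3

# Lists are unhashable, so using one as a key raises TypeError
try:
    d[[1, 1, 1]] = 3
except TypeError as exc:
    print(exc)  # unhashable type: 'list'
```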

Why do you want to avoid using for loops?

And similar to @padraic-cunningham's much cooler answer:

>>> Counter(tuple(x) for x in a)
Counter({(2, 2, 2): 3, (1, 1, 1): 3, (3, 3, 0): 3})
>>> Counter(map(tuple, a))
Counter({(2, 2, 2): 3, (1, 1, 1): 3, (3, 3, 0): 3})

6 Comments

Python loops over numpy arrays should be avoided, as they can be many times slower than a numpy solution; if you are using numpy with python loops you are likely "doing it wrong". In both of the python answers you are copying the entire array into a much less compact datatype, which is especially worrying.
@Ophion. do you have a faster numpy solution?
@PadraicCunningham Divakar's solution will be faster for reasonably sized problems. This problem has been asked, answered, and benchmarked many times in the past. I think the canonical answer is here.
@Ophion Re: "copying the entire array to a much less compact datatype". That's a good point. Though, I'd thought of taking it one further level of worrying by using a hash. But then also figured it to be unnecessarily complicated. One edit/addition I was going to put in the answer was retrieving the count of a certain elem, which is essentially a dict lookup in this case. So to print (for example) one of the counts, instead of print c[(3, 3, 0)], one would put print c.getcounts(3, 3, 0) (or override __getitem__ with the code for getcounts) and achieve a similar result.
Added a benchmark for a decently large input array in my solution. That benchmarking trend continued for medium-sized inputs as well. I think @Ophion made sense when he talked about NumPy solutions being optimized for such cases, which goes with its philosophy.

You could convert those rows to a 1D array of linear indices, using each row's elements as multi-dimensional indices with np.ravel_multi_index. Then, use np.unique to get the positions of the first occurrence of each unique row; its optional return_counts argument also gives us the counts. Thus, the implementation would look something like this -

def unique_rows_counts(a):

    # Calculate linear indices using rows from a
    lidx = np.ravel_multi_index(a.T, a.max(0) + 1)

    # Get the unique indices and their counts
    _, unq_idx, counts = np.unique(lidx, return_index=True, return_counts=True)

    # Return the unique rows from a and their respective counts
    return a[unq_idx], counts

Sample run -

In [64]: a
Out[64]: 
array([[1, 1, 1],
       [1, 1, 1],
       [1, 1, 1],
       [2, 2, 2],
       [2, 2, 2],
       [2, 2, 2],
       [3, 3, 0],
       [3, 3, 0],
       [3, 3, 0]])

In [65]: unqrows, counts = unique_rows_counts(a)

In [66]: unqrows
Out[66]: 
array([[1, 1, 1],
       [2, 2, 2],
       [3, 3, 0]])
In [67]: counts
Out[67]: array([3, 3, 3])

Benchmarking

Assuming you are okay with either numpy arrays or collections as outputs, one can benchmark the solutions provided thus far, like so -

Function definitions:

import numpy as np
from collections import Counter

def unique_rows_counts(a):
    lidx = np.ravel_multi_index(a.T, a.max(0) + 1)
    _, unq_idx, counts = np.unique(lidx, return_index=True, return_counts=True)
    return a[unq_idx], counts

def map_Counter(a):
    return Counter(map(tuple, a))    

def forloop_Counter(a):      
    c = Counter()
    for x in a:
        c[tuple(x)] += 1
    return c

Timings:

In [53]: a = np.random.randint(0,4,(10000,5))

In [54]: %timeit map_Counter(a)
10 loops, best of 3: 31.7 ms per loop

In [55]: %timeit forloop_Counter(a)
10 loops, best of 3: 45.4 ms per loop

In [56]: %timeit unique_rows_counts(a)
1000 loops, best of 3: 1.72 ms per loop

19 Comments

Amazingly, all the answers, including the question, got downvotes; makes one wonder!
@PadraicCunningham I could add runtime tests, but then OP might be looking to have a dict as output, so that won't be fair :)
I will have to wait for the downvoter to release their superior solution so ;)
@PadraicCunningham Not tested for correctness, but something like - base = a.min(0); lidx = np.ravel_multi_index(a.T- base.T[:,None],a.max(0)-base+1 ).
@PadraicCunningham Ah I see, well good to know it's still being used!
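The base-offset variant from the comments above was posted untested, so here is a sketch of it packaged as a standalone function (the name unique_rows_counts_signed is made up here). Shifting each column by its minimum makes the linear-index trick work even when the array contains negative values:

```python
import numpy as np

def unique_rows_counts_signed(a):
    # Hypothetical helper sketching the comment's base-offset idea:
    # shift each column so its minimum becomes 0, so all entries are
    # valid non-negative multi-dimensional indices
    base = a.min(0)
    lidx = np.ravel_multi_index((a - base).T, a.max(0) - base + 1)
    _, unq_idx, counts = np.unique(lidx, return_index=True, return_counts=True)
    return a[unq_idx], counts

b = np.array([[-1, 2], [-1, 2], [0, -3]])
rows, counts = unique_rows_counts_signed(b)
# rows -> [[-1, 2], [0, -3]], counts -> [2, 1]
```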

The numpy_indexed package (disclaimer: I am its author) contains efficient vectorized functionality for these kinds of operations:

import numpy_indexed as npi
unique_rows, row_count = npi.count(a, axis=0)

Note that this works for arrays of any dimensionality or datatype.

2 Comments

Perfect answer, but how can we combine it into one output, for example [[[1,1,9],10],[[1,1,0],2]]?
zip(*npi.count(..)) would give that; but that wouldn't be very numpythonic. Alternatively, you could make a structured array with a composite dtype and assign the result to that, if you insist. But more likely than not you will end up with a more efficient solution if you stick to the way numpy likes to organise things natively.
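A plain-NumPy sketch of that pairing, using np.unique (numpy >= 1.13) in place of npi.count, since both return unique rows alongside their counts:

```python
import numpy as np

a = np.array([[1, 1, 9], [1, 1, 9], [1, 1, 0]])
rows, counts = np.unique(a, axis=0, return_counts=True)

# Zip the unique rows and counts into one [[row, count], ...] structure
paired = [[row.tolist(), int(n)] for row, n in zip(rows, counts)]
# paired -> [[[1, 1, 0], 1], [[1, 1, 9], 2]]
```

As noted above, this is not very numpythonic; keeping the rows and counts as two separate arrays is usually the more efficient layout.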

Since numpy-1.13.0, np.unique can be used with axis argument:

>>> np.unique(a, axis=0, return_counts=True)

(array([[1, 1, 1],
        [2, 2, 2],
        [3, 3, 0]]), array([3, 3, 3]))

