2

How to get occurrence counts for for the elements of a float array?. If the array is [-1,2,3,-1,3,4,4,4,4,4],
the result should be [2,1,2,5], not necessarily in that order, and the mapping from counts to the elements that are counted is not needed, only counts matter.

numpy.histogram would do something similar, but it must use bins, which requires precomputing bin-size to separate the elements and can also create unnecessarily many empty bins.

This can also be done manually with hashing or sorting, but it seems there should be a fast, one-shot way without python-level loops.

Thanks!

Edit:

I tried the solutions suggested at the time of writing and thought I'd share the results as they are somewhat unexpected. What I did not mention originally is that the flow works with rather small lists, but the operation is invoked millions of times, which is somewhat a cornercase.

The test and its printout are below. histogramize1 is my original function whose performance I wanted to improve. It is by x2 faster then the second fastest, and it would be interesting to know why.

import numpy as np
from collections import Counter
from timeit import timeit


def histogramize1(X):
    cnts = {}
    for x in X:
        if x in cnts:
            cnts[x] += 1
        else:
            cnts[x] = 1
    lst = [ v for k,v in cnts.iteritems() ]

    lX = len(X)
    return [ float(x)/lX for x in lst ]


def histogramize2(X):

    ua,uind= np.unique(X,return_inverse=True)
    lX = len(X)    
    res = [float(x)/lX for x in np.bincount(uind)]

    return res


def histogramize3(X):
    counts = Counter(X)
    lX = len(X)
    res = [float(x)/lX for x in counts.viewvalues()]
    return res

def histogramize4(X):
    lX = len(X)
    return [float(X.count(i))/lX for i in np.unique(X)]

if __name__ == '__main__':

    lst0 = [-1,2,3,-1,3,4,4,4,4,4]
    lst = lst0 + lst0 + lst0 + lst0

    num = 100000
    print timeit("histogramize1(lst)",setup="from __main__ import histogramize1, lst",number=num)
    print timeit("histogramize2(lst)",setup="from __main__ import histogramize2, lst",number=num)
    print timeit("histogramize3(lst)",setup="from __main__ import histogramize3, lst",number=num)
    print timeit("histogramize4(lst)",setup="from __main__ import histogramize4, lst",number=num)

This prints:

1.35243415833

10.0806729794

2.89171504974

15.5577590466

5
  • @JonClements - There's one additional wrinkle, though... bincount expects non-negative integers. The OP will need to numpy.bincount(x - x.min()) or something similar. bincount will also return 0 in place of any elements that are "skipped" (e.g. if the OP's example had 5's in place of the 4's, the returned result would be [2, 1, 2, 0, 5], telling you that there are no 4's.) Commented Aug 2, 2013 at 12:48
  • @JoeKington That only occurred to me shortly after posting - hence the removal of my comment - but thanks for taking the time to explain out why numpy.bincount isn't immediately as obvious a solution as one first thinks ;) Commented Aug 2, 2013 at 12:57
  • 1
    This is a dangerous idea... Floating point arithmetic is inherently inexact, and for instance 2./3. == 1. - 1./3. returns False on my system. Unless all your floats have been generated in the exact same way, you cannot count on two values that should be the same actually being so. Commented Aug 2, 2013 at 16:19
  • @Jaime numpy.round/numpy.around/numpy.round_ work just fine for that. Commented Aug 2, 2013 at 17:03
  • @JAB But you have to round your values before you get into counting them, which is something no one seemed to care, happily demonstrating solutions by running them on ints... Commented Aug 2, 2013 at 17:53

3 Answers 3

5

For Python 2.7+:

>>> from collections import Counter
>>> counts = Counter([-1,2,3,-1,3,4,4,4,4,4])
>>> counts.viewvalues() # counts.values() in Python 3+
dict_values([1, 2, 5, 2])

http://docs.python.org/library/collections.html#collections.Counter (There are implementations for 2.4 and 2.5 if you're stuck with older versions, though.)

And since Counter is subclassed from dict, you can get the values that are counted if you ever need them. counts.viewitems() (2.7) or counts.items() (3+) will give you an iterable mapping.

Sign up to request clarification or add additional context in comments.

Comments

4

If you do want a numpy solution:

>>> a=np.array( [-1,2,3,-1,3,4,4,4,4,4])
>>> ua,uind=np.unique(a,return_inverse=True)

#This returns the unique values and indices of those values.
>>> ua
array([-1,  2,  3,  4])
>>> uind
array([0, 1, 2, 0, 2, 3, 3, 3, 3, 3])

>>> np.bincount(uind)
array([2, 1, 2, 5])

This has the additional benefit of showing what count goes with what number.

A bit over twice as fast for small arrays to boot:

import numpy as np
from collections import Counter

a=np.random.randint(0,100,(500))
alist=a.tolist()

In [27]: %timeit  Counter(alist).viewvalues()
1000 loops, best of 3: 209 us per loop

In [28]: %timeit ua,uind=np.unique(a,return_inverse=True);np.bincount(uind)
10000 loops, best of 3: 85.8 us per loop

Comments

0

Not sure whether this is the most elegan solution, but you could use this oneliner:

import numpy
aa = [-1,2,3,-1,3,4,4,4,4,4]
histogr = [aa.count(i) for i in numpy.unique(aa)]

1 Comment

It is short indeed! However, this would produce quadratic runtime as I believe count is implemented by raw iteration over the array. Could be useful when number of unique elements is known to be small.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.