How can I get occurrence counts for the elements of a float array?
If the array is
[-1,2,3,-1,3,4,4,4,4,4],
the result should be
[2,1,2,5],
not necessarily in that order. The mapping from counts to the counted elements is not needed; only the counts matter.
numpy.histogram does something similar, but it works on bins, which means precomputing a bin size small enough to separate the elements, and it can create many unnecessary empty bins.
This can also be done manually with hashing or sorting, but it seems there should be a fast, one-shot way without python-level loops.
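For reference, here is a minimal sketch of one loop-free way to get exactly these counts. It assumes NumPy 1.9 or later, where numpy.unique accepts return_counts; the function name occurrence_counts is only illustrative. It relies on exact equality, so floats that differ in the last bit are counted separately.

```python
import numpy as np

def occurrence_counts(values):
    """Return how often each distinct value occurs, without a Python-level loop."""
    # return_counts needs NumPy 1.9+; counts follow the sorted order of the unique values.
    _, counts = np.unique(np.asarray(values), return_counts=True)
    return counts

counts = occurrence_counts([-1, 2, 3, -1, 3, 4, 4, 4, 4, 4])
print(counts.tolist())  # [2, 1, 2, 5]
```

Because numpy.unique sorts internally, this is not necessarily the fastest choice for very short inputs.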
Thanks!
Edit:
I tried the solutions suggested at the time of writing and thought I'd share the results, as they are somewhat unexpected. What I did not mention originally is that the flow works with rather small lists, but the operation is invoked millions of times, which is something of a corner case.
The test and its printout are below. histogramize1 is my original function, whose performance I wanted to improve. It is about twice as fast as the second fastest, and it would be interesting to know why.
    import numpy as np
    from collections import Counter
    from timeit import timeit

    def histogramize1(X):
        cnts = {}
        for x in X:
            if x in cnts:
                cnts[x] += 1
            else:
                cnts[x] = 1
        lst = [ v for k,v in cnts.iteritems() ]
        lX = len(X)
        return [ float(x)/lX for x in lst ]

    def histogramize2(X):
        ua,uind= np.unique(X,return_inverse=True)
        lX = len(X)
        res = [float(x)/lX for x in np.bincount(uind)]
        return res

    def histogramize3(X):
        counts = Counter(X)
        lX = len(X)
        res = [float(x)/lX for x in counts.viewvalues()]
        return res

    def histogramize4(X):
        lX = len(X)
        return [float(X.count(i))/lX for i in np.unique(X)]

    if __name__ == '__main__':
        lst0 = [-1,2,3,-1,3,4,4,4,4,4]
        lst = lst0 + lst0 + lst0 + lst0
        num = 100000
        print timeit("histogramize1(lst)",setup="from __main__ import histogramize1, lst",number=num)
        print timeit("histogramize2(lst)",setup="from __main__ import histogramize2, lst",number=num)
        print timeit("histogramize3(lst)",setup="from __main__ import histogramize3, lst",number=num)
        print timeit("histogramize4(lst)",setup="from __main__ import histogramize4, lst",number=num)
This prints:
1.35243415833
10.0806729794
2.89171504974
15.5577590466
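The test above is written for Python 2 (iteritems, viewvalues, the print statement). As a rough sketch only, the two fastest variants could be ported to Python 3 as follows. The names histogramize_dict and histogramize_counter are hypothetical stand-ins for histogramize1 and histogramize3, the repeat count is reduced, and no timings are claimed here since they depend on the interpreter and machine.

```python
from collections import Counter
from timeit import timeit

def histogramize_dict(X):
    # Same idea as histogramize1: count with a plain dict, then normalise.
    cnts = {}
    for x in X:
        cnts[x] = cnts.get(x, 0) + 1
    n = len(X)
    return [c / n for c in cnts.values()]

def histogramize_counter(X):
    # Same idea as histogramize3: let collections.Counter do the counting.
    n = len(X)
    return [c / n for c in Counter(X).values()]

lst = [-1, 2, 3, -1, 3, 4, 4, 4, 4, 4] * 4

# Both functions take a plain Python list, so no list-to-array conversion is involved.
print(timeit(lambda: histogramize_dict(lst), number=10000))
print(timeit(lambda: histogramize_counter(lst), number=10000))
```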
bincount expects non-negative integers. The OP will need to use numpy.bincount(x - x.min()) or something similar. bincount will also return 0 in place of any elements that are "skipped" (e.g. if the OP's example had 5's in place of the 4's, numpy.bincount(x - x.min()) would return [2, 0, 0, 1, 2, 0, 5], with a 0 for each of the absent values 0, 1 and 4). numpy.bincount isn't as immediately obvious a solution as one first thinks ;)
2./3. == 1. - 1./3. returns False on my system. Unless all your floats have been generated in the exact same way, you cannot count on two values that should be the same actually being so.
numpy.round / numpy.around / numpy.round_ work just fine for that.
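The points raised in these comments can be put together in a short sketch. The helper names int_counts and float_counts and the decimals=9 tolerance are assumptions made for illustration, not something proposed in the thread, and numpy.unique with return_counts assumes NumPy 1.9 or later.

```python
import numpy as np

def int_counts(x):
    """Counts for an integer array that may contain negative values."""
    x = np.asarray(x)
    # Shift so the smallest value maps to index 0; bincount needs non-negative ints.
    counts = np.bincount(x - x.min())
    # Drop the zeros bincount reports for values that never occur.
    return counts[counts > 0]

def float_counts(x, decimals=9):
    """Counts for floats, treating values that agree after rounding as equal."""
    # decimals=9 is an arbitrary illustrative tolerance (an assumption).
    rounded = np.round(np.asarray(x, dtype=float), decimals)
    _, counts = np.unique(rounded, return_counts=True)
    return counts

print(int_counts([-1, 2, 3, -1, 3, 4, 4, 4, 4, 4]).tolist())  # [2, 1, 2, 5]
print(2. / 3. == 1. - 1. / 3.)                                  # False
print(float_counts([2. / 3., 1. - 1. / 3., 0.5]).tolist())      # [1, 2]
```

Rounding merges values that are only nearly equal, so the tolerance has to be chosen with the data in mind.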