exact histogram of an array

Question

How can I get occurrence counts for the elements of a float array? If the array is [-1,2,3,-1,3,4,4,4,4,4],
the result should be [2,1,2,5], not necessarily in that order. The mapping from counts to the counted elements is not needed; only the counts matter.

numpy.histogram does something similar, but it works on bins, which means precomputing a bin size small enough to separate the elements, and it can create many unnecessary empty bins.

This can also be done manually with hashing or sorting, but it seems there should be a fast, one-shot way without python-level loops.

Thanks!

Edit:

I tried the solutions suggested at the time of writing and thought I'd share the results, as they are somewhat unexpected. What I did not mention originally is that the flow works with rather small lists, but the operation is invoked millions of times, which is something of a corner case.

The test and its printout are below. histogramize1 is my original function, whose performance I wanted to improve. It is about twice as fast as the second fastest, and it would be interesting to know why.

import numpy as np
from collections import Counter
from timeit import timeit


def histogramize1(X):
    cnts = {}
    for x in X:
        if x in cnts:
            cnts[x] += 1
        else:
            cnts[x] = 1
    lst = [ v for k,v in cnts.iteritems() ]

    lX = len(X)
    return [ float(x)/lX for x in lst ]


def histogramize2(X):

    ua,uind= np.unique(X,return_inverse=True)
    lX = len(X)    
    res = [float(x)/lX for x in np.bincount(uind)]

    return res


def histogramize3(X):
    counts = Counter(X)
    lX = len(X)
    res = [float(x)/lX for x in counts.viewvalues()]
    return res

def histogramize4(X):
    lX = len(X)
    return [float(X.count(i))/lX for i in np.unique(X)]

if __name__ == '__main__':

    lst0 = [-1,2,3,-1,3,4,4,4,4,4]
    lst = lst0 + lst0 + lst0 + lst0

    num = 100000
    print timeit("histogramize1(lst)",setup="from __main__ import histogramize1, lst",number=num)
    print timeit("histogramize2(lst)",setup="from __main__ import histogramize2, lst",number=num)
    print timeit("histogramize3(lst)",setup="from __main__ import histogramize3, lst",number=num)
    print timeit("histogramize4(lst)",setup="from __main__ import histogramize4, lst",number=num)

This prints:

1.35243415833
10.0806729794
2.89171504974
15.5577590466
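
For readers on Python 3, the comparison above can be sketched roughly as follows. This is a port under assumptions, not the original test: the function names are made up for illustration, iteritems() and viewvalues() become values(), and the timings will differ from the numbers printed above. One plausible reason for the gap on small lists is that np.unique has to convert the list to an array and sort it on every call, so fixed per-call overhead dominates; that is an interpretation, not something measured here.

```python
from collections import Counter
from timeit import timeit

import numpy as np


def histogramize_dict(X):
    # Plain dict counting, mirroring histogramize1.
    cnts = {}
    for x in X:
        cnts[x] = cnts.get(x, 0) + 1
    n = len(X)
    return [c / n for c in cnts.values()]


def histogramize_unique(X):
    # np.unique + np.bincount, mirroring histogramize2.
    _, inverse = np.unique(X, return_inverse=True)
    n = len(X)
    return [c / n for c in np.bincount(inverse)]


def histogramize_counter(X):
    # collections.Counter, mirroring histogramize3.
    n = len(X)
    return [c / n for c in Counter(X).values()]


def histogramize_count(X):
    # list.count per unique value, mirroring histogramize4.
    n = len(X)
    return [X.count(v) / n for v in np.unique(X)]


lst = [-1, 2, 3, -1, 3, 4, 4, 4, 4, 4] * 4


def benchmark(num=10000):
    # Time each variant on the same small list.
    for f in (histogramize_dict, histogramize_unique,
              histogramize_counter, histogramize_count):
        print(f.__name__, timeit(lambda: f(lst), number=num))


if __name__ == "__main__":
    benchmark()
```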

@JonClements - There's one additional wrinkle, though... bincount expects non-negative integers. The OP will need to use numpy.bincount(x - x.min()) or something similar. bincount will also return 0 for any values that are "skipped" (e.g. if the OP's example had 5's in place of the 4's, the returned result would be [2, 0, 0, 1, 2, 0, 5], with zeros for the absent values 0, 1 and 4).
– Joe Kington, Commented Aug 2, 2013 at 12:48
@JoeKington That only occurred to me shortly after posting - hence the removal of my comment - but thanks for taking the time to explain why numpy.bincount isn't immediately as obvious a solution as one first thinks ;)
– Jon Clements, Commented Aug 2, 2013 at 12:57
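
The bincount caveat from the exchange above can be illustrated with a small sketch. It assumes integer input, which is what bincount requires, so it does not apply directly to arbitrary floats; the array is the example with 5's in place of the 4's.

```python
import numpy as np

# The question's example with 5's in place of the 4's.
x = np.array([-1, 2, 3, -1, 3, 5, 5, 5, 5, 5])

# bincount rejects negative values, so shift the minimum to 0 first.
shifted = x - x.min()

# One slot per integer between min and max, including absent ones.
counts = np.bincount(shifted)
print(counts)           # [2 0 0 1 2 0 5]

# Dropping the empty slots leaves only the occurrence counts.
nonzero = counts[counts > 0]
print(nonzero)          # [2 1 2 5]
```

The number of slots grows with max - min rather than with the number of distinct values, which is the same empty-bin problem the question raises about numpy.histogram.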
This is a dangerous idea... Floating point arithmetic is inherently inexact, and for instance 2./3. == 1. - 1./3. returns False on my system. Unless all your floats have been generated in the exact same way, you cannot count on two values that should be the same actually being so.
– Jaime, Commented Aug 2, 2013 at 16:19
@Jaime numpy.round/numpy.around/numpy.round_ work just fine for that.
– JAB, Commented Aug 2, 2013 at 17:03
@JAB But you have to round your values before you get into counting them, which is something no one seemed to care about, happily demonstrating solutions by running them on ints...
– Jaime, Commented Aug 2, 2013 at 17:53
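
The rounding step discussed in the last three comments might look like the sketch below. The choice of 9 decimals is an arbitrary assumption for illustration, and return_counts requires NumPy 1.9 or later, which postdates this thread.

```python
import numpy as np

# Values that "should" be equal but differ in the last bit.
a = np.array([2.0 / 3.0, 1.0 - 1.0 / 3.0, 0.1 + 0.2, 0.3])

# Counted as-is, every element is treated as distinct.
raw_values = np.unique(a)
print(len(raw_values))   # 4

# Round to a tolerance that suits the data before counting.
# 9 decimals is an assumed tolerance, not a recommendation.
rounded = np.round(a, 9)
values, counts = np.unique(rounded, return_counts=True)
print(counts)            # [2 2]
```

Rounding is not a full fix: two nearly equal values that fall on opposite sides of a rounding boundary still end up in different groups.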

JAB · Accepted Answer · 2013-08-02 12:54:53Z

5

For Python 2.7+:

>>> from collections import Counter
>>> counts = Counter([-1,2,3,-1,3,4,4,4,4,4])
>>> counts.viewvalues() # counts.values() in Python 3+
dict_values([1, 2, 5, 2])

http://docs.python.org/library/collections.html#collections.Counter (There are implementations for 2.4 and 2.5 if you're stuck with older versions, though.)

And since Counter is a subclass of dict, you can get the values that are counted if you ever need them. counts.viewitems() (2.7) or counts.items() (3+) will give you an iterable of (element, count) pairs.
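
As a rough Python 3 sketch of the same idea (assuming Python 3.7 or later, where dicts and therefore Counter keep first-seen order; the variable names are illustrative):

```python
from collections import Counter

data = [-1, 2, 3, -1, 3, 4, 4, 4, 4, 4]
counts = Counter(data)

# Only the occurrence counts, in first-seen order of the elements.
occurrences = list(counts.values())
print(occurrences)   # [2, 1, 2, 5]

# (element, count) pairs, if the mapping is needed after all.
pairs = list(counts.items())
print(pairs)         # [(-1, 2), (2, 1), (3, 2), (4, 5)]

# Relative frequencies, as in the question's histogramize functions.
fractions = [c / len(data) for c in counts.values()]
```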

edited Aug 2, 2013 at 12:54

answered Aug 2, 2013 at 12:48

JAB

Daniel · Accepted Answer · 2013-08-02 14:47:46Z

4

If you do want a numpy solution:

>>> a=np.array( [-1,2,3,-1,3,4,4,4,4,4])
>>> ua,uind=np.unique(a,return_inverse=True)

# ua holds the sorted unique values; uind holds, for each element of a, the index of its value in ua.
>>> ua
array([-1,  2,  3,  4])
>>> uind
array([0, 1, 2, 0, 2, 3, 3, 3, 3, 3])

>>> np.bincount(uind)
array([2, 1, 2, 5])

This has the additional benefit of showing what count goes with what number.
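
On newer NumPy versions (1.9 and later, which postdate this answer) the two steps can be collapsed into one call. A minimal sketch, using a float array since the question asks about floats:

```python
import numpy as np

a = np.array([-1, 2, 3, -1, 3, 4, 4, 4, 4, 4], dtype=float)

# return_counts gives the occurrence count of each unique value directly.
values, counts = np.unique(a, return_counts=True)
print(values)   # [-1.  2.  3.  4.]
print(counts)   # [2 1 2 5]
```

The counts line up with the sorted unique values, so the pairing of count and number is preserved here as well.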

A bit over twice as fast for small arrays to boot:

import numpy as np
from collections import Counter

a=np.random.randint(0,100,(500))
alist=a.tolist()

In [27]: %timeit  Counter(alist).viewvalues()
1000 loops, best of 3: 209 us per loop

In [28]: %timeit ua,uind=np.unique(a,return_inverse=True);np.bincount(uind)
10000 loops, best of 3: 85.8 us per loop

edited Aug 2, 2013 at 14:47

answered Aug 2, 2013 at 14:32

Daniel

toro2k · Accepted Answer · 2013-08-02 14:01:25Z

0

Not sure whether this is the most elegant solution, but you could use this one-liner:

import numpy
aa = [-1,2,3,-1,3,4,4,4,4,4]
histogr = [aa.count(i) for i in numpy.unique(aa)]

edited Aug 2, 2013 at 14:01

toro2k

answered Aug 2, 2013 at 13:43

H van Buuren

komark Over a year ago

It is short indeed! However, this would produce quadratic runtime as I believe count is implemented by raw iteration over the array. Could be useful when number of unique elements is known to be small.
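
One way to avoid the repeated scans is to sort once and measure the lengths of the runs of equal values, which is the sorting approach the question alludes to. The sketch below is an illustration, not code from the thread, and counts_by_sorting is a hypothetical helper name.

```python
import numpy as np


def counts_by_sorting(values):
    """Occurrence counts via one sort, ordered by ascending value."""
    a = np.sort(np.asarray(values, dtype=float))
    if a.size == 0:
        return np.array([], dtype=int)
    # True wherever a new run of equal values begins.
    is_start = np.concatenate(([True], a[1:] != a[:-1]))
    starts = np.flatnonzero(is_start)
    # Run lengths are the gaps between consecutive starts, plus the tail.
    return np.diff(np.append(starts, a.size))


aa = [-1, 2, 3, -1, 3, 4, 4, 4, 4, 4]
print(counts_by_sorting(aa))   # [2 1 2 5]
```

This costs one O(n log n) sort plus a few linear passes, independent of the number of unique elements. Values are compared exactly, so the floating-point caveats raised in the question's comments still apply.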
