Fastest way to compute entropy of each numpy array row?

Question

I have a array in size MxN and I like to compute the entropy value of each row. What would be the fastest way to do so ?

By fastest, do you mean that you want an optimized version of it, or do you mean that you want something that fits in one line and is easy to read? — Emilien
– Emilien, Commented Nov 9, 2015 at 10:36
not easiest, fastest in terms of computation since I have a pretty large matrix, row iteration takes too much time. — erogol
– erogol, Commented Nov 9, 2015 at 10:38
Your first comment should be part of the question. I interpret the question as "I have an array of probabilities probs, and I want the entropy of the rows." If you don't already have probabilities, please clarify the question. — Warren Weckesser
– Warren Weckesser, Commented Nov 9, 2015 at 12:38

Warren Weckesser · Accepted Answer · 2015-11-09 12:56:50Z

14

scipy.special.entr computes -x*log(x) for each element in an array. After calling that, you can sum the rows.

Here's an example. First, create an array p of positive values whose rows sum to 1:

In [23]: np.random.seed(123)

In [24]: x = np.random.rand(3, 10)

In [25]: p = x/x.sum(axis=1, keepdims=True)

In [26]: p
Out[26]: 
array([[ 0.12798052,  0.05257987,  0.04168536,  0.1013075 ,  0.13220688,
         0.07774843,  0.18022149,  0.1258417 ,  0.08837421,  0.07205402],
       [ 0.08313743,  0.17661773,  0.1062474 ,  0.01445742,  0.09642919,
         0.17878489,  0.04420998,  0.0425045 ,  0.12877228,  0.1288392 ],
       [ 0.11793032,  0.15790292,  0.13467074,  0.11358463,  0.13429674,
         0.06003561,  0.06725376,  0.0424324 ,  0.05459921,  0.11729367]])

In [27]: p.shape
Out[27]: (3, 10)

In [28]: p.sum(axis=1)
Out[28]: array([ 1.,  1.,  1.])

Now compute the entropy of each row. entr uses the natural logarithm, so to get the base-2 log, divide the result by log(2).

In [29]: from scipy.special import entr

In [30]: entr(p).sum(axis=1)
Out[30]: array([ 2.22208731,  2.14586635,  2.22486581])

In [31]: entr(p).sum(axis=1)/np.log(2)
Out[31]: array([ 3.20579434,  3.09583074,  3.20980287])

If you don't want the dependency on scipy, you can use the explicit formula:

In [32]: (-p*np.log2(p)).sum(axis=1)
Out[32]: array([ 3.20579434,  3.09583074,  3.20980287])

edited Nov 9, 2015 at 12:56

answered Nov 9, 2015 at 12:36

Warren Weckesser

116k20 gold badges207 silver badges224 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

JStrahl Over a year ago

My probabilities were all 0. To solve this I had to cast the denominator sum to float, e.g., p = x/float(x.sum(axis=1, keepdims=True)). In case someone has the same problem.

SpinUp __ A Davis Over a year ago

scipy.stats.entropy also calculates the same value as entr(p).sum(axis=1)

ali_m · Accepted Answer · 2015-11-09 15:41:04Z

3

As @Warren pointed out, it's unclear from your question whether you are starting out from an array of probabilities, or from the raw samples themselves. In my answer I've assumed the latter, in which case the main bottleneck will be computing the bin counts over each row.

Assuming that each vector of samples is relatively long, the fastest way to do this will probably be to use np.bincount:

import numpy as np

def entropy(x):
    """
    x is assumed to be an (nsignals, nsamples) array containing integers between
    0 and n_unique_vals
    """
    x = np.atleast_2d(x)
    nrows, ncols = x.shape
    nbins = x.max() + 1

    # count the number of occurrences for each unique integer between 0 and x.max()
    # in each row of x
    counts = np.vstack((np.bincount(row, minlength=nbins) for row in x))

    # divide by number of columns to get the probability of each unique value
    p = counts / float(ncols)

    # compute Shannon entropy in bits
    return -np.sum(p * np.log2(p), axis=1)

Although Warren's method of computing the entropies from the probability values using entr is slightly faster than using the explicit formula, in practice this is likely to represent a tiny fraction of the total runtime compared to the time taken to compute the bin counts.

Test correctness for a single row:

vals = np.arange(3)
prob = np.array([0.1, 0.7, 0.2])
row = np.random.choice(vals, p=prob, size=1000000)

print("theoretical H(x): %.6f, empirical H(x): %.6f" %
      (-np.sum(prob * np.log2(prob)), entropy(row)[0]))
# theoretical H(x): 1.156780, empirical H(x): 1.157532

Test speed:

In [1]: %%timeit x = np.random.choice(vals, p=prob, size=(1000, 10000))
   ....: entropy(x)
   ....: 
10 loops, best of 3: 34.6 ms per loop

If your data don't consist of integer indices between 0 and the number of unique values, you can convert them into this format using np.unique:

y = np.random.choice([2.5, 3.14, 42], p=prob, size=(1000, 10000))
unq, x = np.unique(y, return_inverse=True)
x.shape = y.shape

edited Nov 9, 2015 at 15:41

answered Nov 9, 2015 at 11:59

ali_m

74.6k28 gold badges230 silver badges315 bronze badges

2 Comments

atomsmasher Over a year ago

you might be able to save time by using -np.dot(a, b) in place of -np.sum(a * b)

ali_m Over a year ago

@atomsmasher With np.dot I can't easily vectorize the entropy calculation over multiple rows. One way would be something like -np.einsum('ij,ij->i', p, np.log2(p)), although really you might as well just use entr for this part since it has an axis argument. Either way, the costly part is usually computing the bin counts.

Collectives™ on Stack Overflow

Fastest way to compute entropy of each numpy array row?

2 Answers 2

2 Comments

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

2 Comments

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related