Sorry for the cryptic description.....

I'm working in Python and need a fast solution for the problem below.

I have an array of float values (this array can contain millions of values):

values = [0.1, 0.2, 5.7, 12.9, 3.5, 100.6]

Each value represents an estimate of a quantity at a particular location, where the location is identified by an ID. Multiple estimates per location are possible/common:

locations = [1, 5, 3, 1, 1, 3]

I need to average all of the values that share the same location ID.

I can use numpy.where to do this for one location value

average_value_at_location = np.average(values[np.where(locations == 1)])

And of course I could loop over all of the unique values in locations... But I'm looking for a fast (vectorized) way of doing this and can't figure out how to compose the numpy functions to do it without looping in Python.

I'm not tied to numpy for this solution.

Any help will be gratefully received.

Thanks,

Doug

  • It sounds like doing an average per group; here is a link that gives a couple of solutions to do so in numpy. Commented Aug 25, 2021 at 20:33
  • I think looping over locations in np.unique isn't going to be that much slower, especially if you have only a few locations. Most of your work is doing np.mean over millions of values, after all. If you really want to squeeze that little bit of performance out of your loop, you could try using Numba to compile the loop (a rough sketch is shown after these comments). You could also use Numba to avoid making copies of arrays, but that would require you to implement your own mean algorithm, which is actually kinda complicated (there are many algorithms with different tradeoffs, and you'd probably want to match np.mean). Commented Aug 25, 2021 at 20:44
  • @Ben.T - Thanks for the links. One of them points to a pandas groupby solution which I had also come across since posting. It looks pretty good - it averaged 100,000 values across 25,000 locations in less than 0.01 seconds on my laptop. I'm currently looking at the numpy_indexed package which one of the links recommends - for some reason I'm motivated to find a numpy solution... Commented Aug 25, 2021 at 23:37
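
A rough illustration of the compiled-loop idea from the second comment above - a minimal Numba sketch, assuming Numba is available and that location IDs are non-negative integers small enough to index an array (the function name group_means is made up for illustration):

import numpy as np
from numba import njit

@njit
def group_means(values, locations, n_locations):
    # one pass over the data: accumulate a running sum and a count per location ID
    sums = np.zeros(n_locations)
    counts = np.zeros(n_locations)
    for k in range(values.size):
        sums[locations[k]] += values[k]
        counts[locations[k]] += 1.0
    means = sums / np.maximum(counts, 1.0)  # guard against divide-by-zero
    means[counts == 0] = np.nan             # locations with no estimates
    return means

# e.g. means = group_means(values, locations, locations.max() + 1)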

3 Answers

Assuming locations go from 0 to a maximum value of locmax (e.g. locmax=5), you can create a 2d array of nans to store the values at the corresponding location:

placements = np.zeros((values.size, locmax+1)) * np.nan

Then assign all the values using indexing:

placements[np.arange(values.size), locations] = values

Finally, calculate the np.nanmean along axis 0:

means = np.nanmean(placements, axis=0)

For your example this results in:

array([  nan,  5.5 ,   nan, 53.15,   nan,  0.2 ])
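
For reference, the three steps above combined into one minimal runnable sketch (assuming values and locations are already NumPy arrays; np.nanmean will emit a harmless "Mean of empty slice" warning for location IDs that never occur):

import numpy as np

values = np.array([0.1, 0.2, 5.7, 12.9, 3.5, 100.6])
locations = np.array([1, 5, 3, 1, 1, 3])

locmax = locations.max()                                   # 5 for this example
placements = np.zeros((values.size, locmax + 1)) * np.nan  # one row per value, one column per location ID
placements[np.arange(values.size), locations] = values     # scatter each value into its location's column
means = np.nanmean(placements, axis=0)                     # column means, ignoring the NaN placeholders
# array([  nan,  5.5 ,   nan, 53.15,   nan,  0.2 ])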

Using add.reduceat for every group.

Preparing the arrays

import numpy as np

values = np.array([0.1, 0.2, 5.7, 12.9, 3.5, 100.6])
locations = np.array([1, 5, 3, 1, 1, 3])

Getting the indices to sort the arrays in groups

locsort = np.argsort(locations)
# locations[locsort] -> [  1,    1,   1,   3,     3,   5]
# values[locsort]    -> [0.1, 12.9, 3.5, 5.7, 100.6, 0.2]

Computing the start index for each group

i = np.flatnonzero(np.diff(locations[locsort], prepend=0))
# [0, 3, 5]

Adding values per group and dividing by the group size

np.add.reduceat(values[locsort], i) / np.diff(i, append=len(locsort))
# [ 16.5, 106.3,   0.2] / [3, 2, 1]            

Output

array([ 5.5 , 53.15,  0.2 ])
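
If the location ID belonging to each mean is also wanted, the same locsort and i from above already give it:

locations[locsort][i]
# array([1, 3, 5])  -> one ID per entry of the result above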


OK - I've tried four solutions based on the replies here. So far, the pandas groupby approach is the winner, but the numpy add.reduceat solution proposed by Michael S is a close second......

Using pandas (from the link provided by Ben T)

import numpy as np
import pandas as pd
from timeit import default_timer as timer  # timer() below assumed to be timeit's default_timer

# Set up the data arrays
rng = np.random.default_rng(12345)
values = rng.random(size = 100000)
locations = rng.integers(low = 1, high = 25000, size = 100000)

# Create the pandas dataframe
df = pd.DataFrame({"locations":locations, "values": values})

# groupby and mean
start=timer()
average_by_location_pandas = df.groupby(["locations"]).mean()
end=timer()

print("Pandas time :", end-start)

Pandas time : 0.009602722000000008
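
As a small usage sketch (variable names are just for illustration): the grouped result is a DataFrame indexed by location ID, so plain arrays can be pulled back out with:

location_ids = average_by_location_pandas.index.to_numpy()        # the unique location IDs
location_means = average_by_location_pandas["values"].to_numpy()  # mean value per location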

Using numpy boolean indexing in a list comprehension to loop over unique locations

unique_locations = np.unique(locations)
average_by_location_numpy = [(i, values[locations==i].mean()) for i in unique_locations]

Numpy time : 2.644003632

Using numpy_indexed (link provided by Ben T)

import numpy_indexed as npi

average_by_location_numpy_indexed = npi.group_by(locations).mean(values)

Numpy_indexed time : 0.03701074199999965

Using numpy add.reduceat (solution proposed by Michael S)

locsort = np.argsort(locations)
i = np.flatnonzero(np.diff(locations[locsort], prepend=0))
out = np.add.reduceat(values[locsort], i) / np.diff(i, append=len(locsort))

Numpy add_reduceat time : 0.01057279099999997

1 Comment

I'm surprised that the pandas solution wins against the different numpy solutions. I guess the number of different locations is the reason.
