Sorry for the cryptic description.....

I'm working in Python and need a fast solution for the problem below.

I have an array of float values (this array can contain millions of values):

values = [0.1, 0.2, 5.7, 12.9, 3.5, 100.6]

Each value represents an estimate of a quantity at a particular location, where the location is identified by an ID. Multiple estimates per location are possible/common:

locations = [1, 5, 3, 1, 1, 3]

I need to average all of the values that share the same location ID.

I can use numpy.where to do this for one location value

average_value_at_location = np.average(values[np.where(locations == 1)])

And of course I could loop over all of the unique values in locations... But I'm looking for a fast (vectorized) way of doing this and can't figure out how to compose the numpy functions to do it without looping in Python.

I'm not tied to numpy for this solution.

Any help will be gratefully received.

Thanks,

Doug

  • It sounds like doing an average per group; here is a link that gives a couple of solutions to do so in numpy. Commented Aug 25, 2021 at 20:33
  • I think looping over locations in np.unique isn't going to be that much slower, especially if you have only a few locations. Most of your work is doing np.mean over millions of values, after all. If you really want to squeeze that little bit of performance out of your loop, you could try using Numba to compile the loop (a rough sketch is shown after these comments). You could also use Numba to avoid making copies of arrays, but that would require you to implement your own mean algorithm, which is actually kinda complicated (there are many algorithms with different tradeoffs, and you'd probably want to match np.mean). Commented Aug 25, 2021 at 20:44
  • @Ben.T - Thanks for the links. One of them points to a pandas groupby solution which I had also come across since posting. It looks pretty good - it averaged 100,000 values across 25,000 locations in less than 0.01 seconds on my laptop. I'm currently looking at the numpy_indexed package which one of the links recommends - for some reason I'm motivated to find a numpy solution... Commented Aug 25, 2021 at 23:37
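
A rough illustration of the compiled-loop idea from the second comment above - a minimal Numba sketch, assuming Numba is available and that location IDs are non-negative integers small enough to index an array (the function name group_means is made up for illustration):

import numpy as np
from numba import njit

@njit
def group_means(values, locations, n_locations):
    # one pass over the data: accumulate a running sum and a count per location ID
    sums = np.zeros(n_locations)
    counts = np.zeros(n_locations)
    for k in range(values.size):
        sums[locations[k]] += values[k]
        counts[locations[k]] += 1.0
    means = sums / np.maximum(counts, 1.0)  # guard against divide-by-zero
    means[counts == 0] = np.nan             # locations with no estimates
    return means

# e.g. means = group_means(values, locations, locations.max() + 1)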

3 Answers

Assuming locations go from 0 to a maximum value of locmax (e.g. locmax=5), you can create a 2d array of nans to store the values at the corresponding location:

placements = np.zeros((values.size, locmax+1)) * np.nan

Then assign all the values using indexing:

placements[np.arange(values.size), locations] = values

Finally, calculate the np.nanmean along axis 0:

means = np.nanmean(placements, axis=0)

For your example this results in:

array([  nan,  5.5 ,   nan, 53.15,   nan,  0.2 ])
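
For reference, the three steps above combined into one minimal runnable sketch (assuming values and locations are already NumPy arrays; np.nanmean will emit a harmless "Mean of empty slice" warning for location IDs that never occur):

import numpy as np

values = np.array([0.1, 0.2, 5.7, 12.9, 3.5, 100.6])
locations = np.array([1, 5, 3, 1, 1, 3])

locmax = locations.max()                                   # 5 for this example
placements = np.zeros((values.size, locmax + 1)) * np.nan  # one row per value, one column per location ID
placements[np.arange(values.size), locations] = values     # scatter each value into its location's column
means = np.nanmean(placements, axis=0)                     # column means, ignoring the NaN placeholders
# array([  nan,  5.5 ,   nan, 53.15,   nan,  0.2 ])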

Using add.reduceat for every group.

Preparing the arrays

import numpy as np

values = np.array([0.1, 0.2, 5.7, 12.9, 3.5, 100.6])
locations = np.array([1, 5, 3, 1, 1, 3])

Getting the indices to sort the arrays in groups

locsort = np.argsort(locations)
# locations[locsort] -> [  1,    1,   1,   3,     3,   5]
# values[locsort]    -> [0.1, 12.9, 3.5, 5.7, 100.6, 0.2]

Computing the start index for each group

i = np.flatnonzero(np.diff(locations[locsort], prepend=0))
# [0, 3, 5]

Adding values per group and dividing by the group size

np.add.reduceat(values[locsort], i) / np.diff(i, append=len(locsort))
# [ 16.5, 106.3,   0.2] / [3, 2, 1]            

Output

array([ 5.5 , 53.15,  0.2 ])
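
If the location ID belonging to each mean is also wanted, the same locsort and i from above already give it:

locations[locsort][i]
# array([1, 3, 5])  -> one ID per entry of the result above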


OK - I've tried four solutions based on the replies here. So far, the pandas groupby approach is the winner, but the numpy add.reduceat solution proposed by Michael S is a close second......

Using pandas (from the link provided by Ben T)

import numpy as np
import pandas as pd
from timeit import default_timer as timer  # timer() below assumed to be timeit's default_timer

# Set up the data arrays
rng = np.random.default_rng(12345)
values = rng.random(size = 100000)
locations = rng.integers(low = 1, high = 25000, size = 100000)

# Create the pandas dataframe
df = pd.DataFrame({"locations":locations, "values": values})

# groupby and mean
start=timer()
average_by_location_pandas = df.groupby(["locations"]).mean()
end=timer()

print("Pandas time :", end-start)

Pandas time : 0.009602722000000008
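
As a small usage sketch (variable names are just for illustration): the grouped result is a DataFrame indexed by location ID, so plain arrays can be pulled back out with:

location_ids = average_by_location_pandas.index.to_numpy()        # the unique location IDs
location_means = average_by_location_pandas["values"].to_numpy()  # mean value per location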

Using numpy boolean indexing in a list comprehension to loop over unique locations

unique_locations = np.unique(locations)
average_by_location_numpy = [(i, values[locations==i].mean()) for i in unique_locations]

Numpy time : 2.644003632

Using numpy_indexed (link provided by Ben T)

import numpy_indexed as npi

average_by_location_numpy_indexed = npi.group_by(locations).mean(values)

Numpy_indexed time : 0.03701074199999965

Using numpy add.reduceat (solution proposed by Michael S)

locsort = np.argsort(locations)
i = np.flatnonzero(np.diff(locations[locsort], prepend=0))
out = np.add.reduceat(values[locsort], i) / np.diff(i, append=len(locsort))

Numpy add_reduceat time : 0.01057279099999997

1 Comment

I'm surprised that the pandas solution wins against the different numpy solutions. I guess the number of different locations is the reason.
