
I'm processing data with two sets of numbers: one for the x values and another for the corresponding y values. I need to decimate the data, reducing the number of x values by eliminating duplicates while returning a corresponding y value for each. I need to do this quickly on very large arrays, so the code has to be efficient.

The numpy function 'unique' will eliminate duplicates in the x array. However, for each remaining x value I need to return the maximum of the y values that corresponded to that x value. So, given two example arrays like this:

x = [   0,  16,  24,  28,  30,  31,  32,  32,  33,  33,   33,  33]
y = [1050, 110, 104, 107, 820, 101, 102, 649, 103, 101, 1020, 100]

What I need to end up with is:

x = [   0,  16,  24,  28,  30,  31,  32,   33]
y = [1050, 110, 104, 107, 820, 101, 649, 1020]

All help gratefully appreciated.

  • This can be done with x_unique = np.unique(x); y_max = np.zeros_like(x_unique); np.maximum.at(y_max, np.r_[False, (x[1:] != x[:-1]).cumsum()], y), assuming x is sorted and x, y are np.array. pandas will be faster for large arrays. (A runnable sketch of this idea appears after these comments.) Commented Oct 8, 2022 at 4:17
  • That gives me an IndexError: index 8 is out of bounds for axis 0 with size 8 Commented Oct 8, 2022 at 5:34
  • For realistic benchmarking, please tell us about the original format of the x, y values. Are x and y previously created lists, values read from a JSON file, numpy arrays (or one array holding both), or columns of a pandas DataFrame? It is also necessary to know in which format the resulting x, y values are to be returned (lists, numpy array(s), pandas DataFrame, dictionary, ...). Please provide this information, along with your own coding attempt, in your question; that raises the quality of the question and makes the answers more valuable. Commented Oct 8, 2022 at 12:58
  • Hi Claudio. Thanks for the advice and help; points taken. x and y are numpy arrays of integers, and the returned results need to be numpy arrays as well. The data is derived from realtime samples: typically around a million samples, with 1k to 2k unique x values ranging from 0 to tens of thousands, while y is between 0 and hundreds of millions. The x values will be in order; maybe in future they will not be, so an efficient solution for ordered data is great, and an alternative for out-of-order data is also good. By this stage the values have been processed. The cycle repeats 2 to 20 times a second. Commented Oct 8, 2022 at 19:12
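
A runnable sketch of the idea from the first comment, hedged: it assumes x is already a sorted numpy array and that y is non-negative (as in the example data); the variable names are illustrative.

import numpy as np

x = np.array([   0,  16,  24,  28,  30,  31,  32,  32,  33,  33,   33,  33])
y = np.array([1050, 110, 104, 107, 820, 101, 102, 649, 103, 101, 1020, 100])

x_unique = np.unique(x)                           # sorted unique x values
group_idx = np.r_[0, np.cumsum(x[1:] != x[:-1])]  # group index of every sample
y_max = np.zeros(x_unique.size, dtype=y.dtype)    # assumes y >= 0, as in the example
np.maximum.at(y_max, group_idx, y)                # in-place per-group maximum

print(x_unique)  # [ 0 16 24 28 30 31 32 33]
print(y_max)     # [1050  110  104  107  820  101  649 1020]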

3 Answers


After sorting, take the first index of each unique value and use those indices as the argument of np.maximum.reduceat:

>>> x = np.asarray(x)
>>> y = np.asarray(y)
>>> perm = x.argsort()
>>> sort = x[perm]
>>> mask = np.concatenate([[True], sort[1:] != sort[:-1]])
>>> sort[mask]
array([ 0, 16, 24, 28, 30, 31, 32, 33])
>>> np.maximum.reduceat(y[perm], mask.nonzero()[0])
array([1050,  110,  104,  107,  820,  101,  649, 1020])

A simple benchmark on large arrays of 10 ** 6 elements:

In [251]: def mechanic(x, y):
     ...:     x = np.asarray(x)
     ...:     y = np.asarray(y)
     ...:     perm = x.argsort()
     ...:     sort = x[perm]
     ...:     mask = np.concatenate([[True], sort[1:] != sort[:-1]])
     ...:     return sort[mask], np.maximum.reduceat(y[perm], mask.nonzero()[0])
     ...:

In [252]: def claudio(x, y):
     ...:     xout = []
     ...:     yout = []
     ...:     for g, v in groupby(sorted(zip(x, y)), lambda x: x[0]):
     ...:         xout += [g]
     ...:         yout += [max(v)[1]]
     ...:     return xout, yout
     ...:

In [253]: def joran_beasley(x, y):
     ...:     df = pd.DataFrame({'x': x, 'y': y})
     ...:     return (*df.groupby('x').agg({'x': 'first', 'y': 'max'}).values.T,)
     ...:

In [254]: import pandas as pd

In [255]: x, y = np.random.randint(0, 100, (2, 10 ** 6))

In [256]: %timeit mechanic(x, y)
65.6 ms ± 2.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [257]: %timeit claudio(x, y)
2.6 s ± 56.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [258]: %timeit joran_beasley(x, y)
36.3 ms ± 755 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Starting from lists:

In [275]: x, y = np.random.randint(0, 100, (2, 10 ** 6)).tolist()

In [276]: %timeit joran_beasley(x, y)
404 ms ± 6.67 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [277]: %timeit mechanic(x, y)
193 ms ± 2.57 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [278]: %timeit claudio(x, y)
1.02 s ± 20.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

A small optimization of @Claudio's solution:

In [283]: def claudio(x, y):
     ...:     xout = []
     ...:     yout = []
     ...:     firstgetter = itemgetter(0)
     ...:     secondgetter = itemgetter(1)
     ...:     for g, v in groupby(sorted(zip(x, y), key=firstgetter), firstgetter):
     ...:         xout.append(g)
     ...:         yout.append(max(map(secondgetter, v)))
     ...:     return xout, yout
     ...:

In [284]: %timeit claudio(x, y)
495 ms ± 13.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The solution using defaultdict wins when using lists as input:

In [291]: def defdict_solution(x, y):
     ...:     defdict = defaultdict(list)
     ...:     for k, v in zip(x, y):
     ...:         defdict[k].append(v)
     ...:     lst = [(k, max(v)) for k, v in defdict.items()]
     ...:     lst.sort(key=itemgetter(0))
     ...:     return [k for k, v in lst], [v for k, v in lst]
     ...:

In [292]: %timeit defdict_solution(x, y)
73.9 ms ± 723 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

6 Comments

According to my timing measurements, this numpy solution is a bit slower for small arrays but twice as fast on large arrays compared to the itertools groupby approach.
I'm curious about timings with much larger arrays ... but nice work profiling it all :)
@Richard - It's not a bad question, but could be much clearer (that's why different benchmarks were needed): tagging arrays/numpy but using lists as examples, x is sorted but nowhere mentioned whether this is guaranteed, not specifying what 'very large' means.
@Richard I did not downvote it. Here are some guesses: your question does not show your own attempt. For programs that need to be optimized like this, even a rough for-loop version would explain the purpose of the program more clearly.
@MichaelSzczesny Yes, if x is guaranteed to be sorted, my solution can skip the argsort and two indexing steps, which will greatly speed up the program (see the sketch after these comments).
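Picking up the last comment: if x is guaranteed to be sorted, the argsort and the permutation indexing in the answer can be dropped. A hedged sketch of that variant (the name mechanic_sorted is illustrative):

import numpy as np

def mechanic_sorted(x, y):
    # same idea as the answer above, but relies on x already being sorted
    x = np.asarray(x)
    y = np.asarray(y)
    mask = np.concatenate([[True], x[1:] != x[:-1]])  # True at the start of each run
    return x[mask], np.maximum.reduceat(y, mask.nonzero()[0])

x = [   0,  16,  24,  28,  30,  31,  32,  32,  33,  33,   33,  33]
y = [1050, 110, 104, 107, 820, 101, 102, 649, 103, 101, 1020, 100]
print(*mechanic_sorted(x, y))
# [ 0 16 24 28 30 31 32 33] [1050  110  104  107  820  101  649 1020]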

Turn it into a DataFrame, then groupby and aggregate :)

import pandas

x = [   0,  16,  24,  28,  30,  31,  32,  32,  33,  33,   33,  33]
y = [1050, 110, 104, 107, 820, 101, 102, 649, 103, 101, 1020, 100]

df = pandas.DataFrame({'x':x,'y':y})
print(df.groupby('x').agg({'x':'first','y':'max'}))
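
The OP mentions in the comments that numpy arrays are needed back; converting the grouped result is straightforward. A small hedged sketch building on the df above (x_out and y_out are illustrative names):

g = df.groupby('x')['y'].max()
x_out = g.index.to_numpy()  # unique, sorted x values
y_out = g.to_numpy()        # corresponding per-group maxima
print(x_out)  # [ 0 16 24 28 30 31 32 33]
print(y_out)  # [1050  110  104  107  820  101  649 1020]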

5 Comments

According to my timing measurements, this approach is at least 150x slower than using itertools groupby working directly on the lists. Hmmm ... And if I take the import time of pandas and itertools into consideration, the pandas solution is 15000x slower.
@Claudio - Only for small arrays. With 10^6 elements it is ~30x faster.
@MichaelSzczesny: I had just tested it with 10**6 elements. Pandas is still 1.14 times slower, so I am wondering how you got the 30x faster timing?
@Claudio - colab notebook with the code for the benchmark
@MichaelSzczesny You are using different parameter settings for the timing measurements, so the results can differ because of that. In addition, you don't start with a pair of lists; you feed itertools with numpy arrays.

An approach working directly on lists, since lists are given as the input and output data in the question:

from itertools import groupby

# xin, yin are the input lists from the question
xin = [   0,  16,  24,  28,  30,  31,  32,  32,  33,  33,   33,  33]
yin = [1050, 110, 104, 107, 820, 101, 102, 649, 103, 101, 1020, 100]

xout = []
yout = []
for g, v in groupby(sorted(zip(xin, yin)), lambda x: x[0]):
    xout += [g]
    yout += [max(v)[1]]   # within a group all x are equal, so max picks the largest y

