
I'm processing data with two sets of numbers: one for the x values and another for the corresponding y values. I need to decimate the data, reducing the number of x values by eliminating duplicates while returning a corresponding y value for each. I need to do this quickly on very large arrays, so the code has to be efficient.

The numpy function 'unique' will eliminate duplicates in the x array. However, for each remaining x value I need to return the maximum of the y values that corresponded to that x value. So, given two example arrays like this:

x = [   0,  16,  24,  28,  30,  31,  32,  32,  33,  33,   33,  33]
y = [1050, 110, 104, 107, 820, 101, 102, 649, 103, 101, 1020, 100]

What I need to end up with is:

x = [   0,  16,  24,  28,  30,  31,  32,   33]
y = [1050, 110, 104, 107, 820, 101, 649, 1020]

All help gratefully appreciated.

  • This can be done with x_unique = np.unique(x); y_max = np.zeros_like(x_unique); np.maximum.at(y_max, np.r_[False, (x[1:] != x[:-1]).cumsum()], y), assuming x is sorted and x, y are np.array. pandas will be faster for large arrays. (A runnable sketch of this idea appears after these comments.) Commented Oct 8, 2022 at 4:17
  • That gives me an IndexError: index 8 is out of bounds for axis 0 with size 8 Commented Oct 8, 2022 at 5:34
  • For realistic benchmarking, please tell us about the original format of the x, y values. Are x and y previously created lists, values read from a JSON file, numpy arrays (or one array holding both), or columns of a pandas DataFrame? It is also necessary to know in which format the resulting x, y values are to be returned (lists, numpy array(s), pandas DataFrame, dictionary, ...). Please provide this information, along with your own coding attempt, in your question; that raises the quality of the question and makes the answers more valuable. Commented Oct 8, 2022 at 12:58
  • Hi Claudio. Thanks for the advice and help; points taken. x and y are numpy arrays of integers, and the returned results need to be numpy arrays as well. The data is derived from realtime samples: typically around a million samples, with 1k to 2k unique x values ranging from 0 to tens of thousands, while y is between 0 and hundreds of millions. The x values will be in order; maybe in future they will not be, so an efficient solution for ordered data is great, and an alternative for out-of-order data is also good. By this stage the values have been processed. The cycle repeats 2 to 20 times a second. Commented Oct 8, 2022 at 19:12
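
A runnable sketch of the idea from the first comment, hedged: it assumes x is already a sorted numpy array and that y is non-negative (as in the example data); the variable names are illustrative.

import numpy as np

x = np.array([   0,  16,  24,  28,  30,  31,  32,  32,  33,  33,   33,  33])
y = np.array([1050, 110, 104, 107, 820, 101, 102, 649, 103, 101, 1020, 100])

x_unique = np.unique(x)                           # sorted unique x values
group_idx = np.r_[0, np.cumsum(x[1:] != x[:-1])]  # group index of every sample
y_max = np.zeros(x_unique.size, dtype=y.dtype)    # assumes y >= 0, as in the example
np.maximum.at(y_max, group_idx, y)                # in-place per-group maximum

print(x_unique)  # [ 0 16 24 28 30 31 32 33]
print(y_max)     # [1050  110  104  107  820  101  649 1020]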

3 Answers


After sorting, take the first index of each unique value and use those indices as the argument of np.maximum.reduceat:

>>> x = np.asarray(x)
>>> y = np.asarray(y)
>>> perm = x.argsort()
>>> sort = x[perm]
>>> mask = np.concatenate([[True], sort[1:] != sort[:-1]])
>>> sort[mask]
array([ 0, 16, 24, 28, 30, 31, 32, 33])
>>> np.maximum.reduceat(y[perm], mask.nonzero()[0])
array([1050,  110,  104,  107,  820,  101,  649, 1020])

A simple benchmark on large arrays of 10 ** 6 elements:

In [251]: def mechanic(x, y):
     ...:     x = np.asarray(x)
     ...:     y = np.asarray(y)
     ...:     perm = x.argsort()
     ...:     sort = x[perm]
     ...:     mask = np.concatenate([[True], sort[1:] != sort[:-1]])
     ...:     return sort[mask], np.maximum.reduceat(y[perm], mask.nonzero()[0])
     ...:

In [252]: def claudio(x, y):
     ...:     xout = []
     ...:     yout = []
     ...:     for g, v in groupby(sorted(zip(x, y)), lambda x: x[0]):
     ...:         xout += [g]
     ...:         yout += [max(v)[1]]
     ...:     return xout, yout
     ...:

In [253]: def joran_beasley(x, y):
     ...:     df = pd.DataFrame({'x': x, 'y': y})
     ...:     return (*df.groupby('x').agg({'x': 'first', 'y': 'max'}).values.T,)
     ...:

In [254]: import pandas as pd

In [255]: x, y = np.random.randint(0, 100, (2, 10 ** 6))

In [256]: %timeit mechanic(x, y)
65.6 ms ± 2.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [257]: %timeit claudio(x, y)
2.6 s ± 56.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [258]: %timeit joran_beasley(x, y)
36.3 ms ± 755 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Starting from lists:

In [275]: x, y = np.random.randint(0, 100, (2, 10 ** 6)).tolist()

In [276]: %timeit joran_beasley(x, y)
404 ms ± 6.67 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [277]: %timeit mechanic(x, y)
193 ms ± 2.57 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [278]: %timeit claudio(x, y)
1.02 s ± 20.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

A small optimization of @Claudio's solution:

In [283]: def claudio(x, y):
     ...:     xout = []
     ...:     yout = []
     ...:     firstgetter = itemgetter(0)
     ...:     secondgetter = itemgetter(1)
     ...:     for g, v in groupby(sorted(zip(x, y), key=firstgetter), firstgetter):
     ...:         xout.append(g)
     ...:         yout.append(max(map(secondgetter, v)))
     ...:     return xout, yout
     ...:

In [284]: %timeit claudio(x, y)
495 ms ± 13.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The solution using defaultdict wins when using lists as input:

In [291]: def defdict_solution(x, y):
     ...:     defdict = defaultdict(list)
     ...:     for k, v in zip(x, y):
     ...:         defdict[k].append(v)
     ...:     lst = [(k, max(v)) for k, v in defdict.items()]
     ...:     lst.sort(key=itemgetter(0))
     ...:     return [k for k, v in lst], [v for k, v in lst]
     ...:

In [292]: %timeit defdict_solution(x, y)
73.9 ms ± 723 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

6 Comments

According to my timing measurements, this numpy solution is a bit slower for small arrays but twice as fast on large arrays compared to the itertools groupby approach.
I'm curious about timings with much larger arrays ... but nice work profiling it all :)
@Richard - It's not a bad question, but could be much clearer (that's why different benchmarks were needed): tagging arrays/numpy but using lists as examples, x is sorted but nowhere mentioned whether this is guaranteed, not specifying what 'very large' means.
@Richard I did not downvote it. Here are some guesses: your question does not show your own attempt. For programs that need to be optimized like this, even a rough for-loop version would explain the purpose of the program more clearly.
@MichaelSzczesny Yes, if x is guaranteed to be sorted, my solution can skip the argsort and two indexing steps, which will greatly speed up the program (see the sketch after these comments).
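Picking up the last comment: if x is guaranteed to be sorted, the argsort and the permutation indexing in the answer can be dropped. A hedged sketch of that variant (the name mechanic_sorted is illustrative):

import numpy as np

def mechanic_sorted(x, y):
    # same idea as the answer above, but relies on x already being sorted
    x = np.asarray(x)
    y = np.asarray(y)
    mask = np.concatenate([[True], x[1:] != x[:-1]])  # True at the start of each run
    return x[mask], np.maximum.reduceat(y, mask.nonzero()[0])

x = [   0,  16,  24,  28,  30,  31,  32,  32,  33,  33,   33,  33]
y = [1050, 110, 104, 107, 820, 101, 102, 649, 103, 101, 1020, 100]
print(*mechanic_sorted(x, y))
# [ 0 16 24 28 30 31 32 33] [1050  110  104  107  820  101  649 1020]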

Turn it into a DataFrame, then groupby and aggregate :)

import pandas

x = [   0,  16,  24,  28,  30,  31,  32,  32,  33,  33,   33,  33]
y = [1050, 110, 104, 107, 820, 101, 102, 649, 103, 101, 1020, 100]

df = pandas.DataFrame({'x':x,'y':y})
print(df.groupby('x').agg({'x':'first','y':'max'}))
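
The OP mentions in the comments that numpy arrays are needed back; converting the grouped result is straightforward. A small hedged sketch building on the df above (x_out and y_out are illustrative names):

g = df.groupby('x')['y'].max()
x_out = g.index.to_numpy()  # unique, sorted x values
y_out = g.to_numpy()        # corresponding per-group maxima
print(x_out)  # [ 0 16 24 28 30 31 32 33]
print(y_out)  # [1050  110  104  107  820  101  649 1020]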

5 Comments

According to my timing measurements, this approach is at least 150x slower than using itertools groupby working directly on the lists. Hmmm ... And if I take the import time of pandas and itertools into consideration, the pandas solution is 15000x slower.
@Claudio - Only for small arrays. With 10^6 elements it is ~30x faster.
@MichaelSzczesny: I had just tested it with 10**6 elements. Pandas is still 1.14 times slower, so I am wondering how you got the 30x faster timing?
@Claudio - colab notebook with the code for the benchmark
@MichaelSzczesny You are using different parameter settings for the timing measurements, so the results can differ because of that. In addition, you don't start with a pair of lists; you feed itertools with numpy arrays.

An approach working directly on lists, since lists are given as the input and output data in the question:

from itertools import groupby

# xin, yin are the input lists from the question
xin = [   0,  16,  24,  28,  30,  31,  32,  32,  33,  33,   33,  33]
yin = [1050, 110, 104, 107, 820, 101, 102, 649, 103, 101, 1020, 100]

xout = []
yout = []
for g, v in groupby(sorted(zip(xin, yin)), lambda x: x[0]):
    xout += [g]
    yout += [max(v)[1]]   # within a group all x are equal, so max picks the largest y

