Filter a numpy array based on largest value

Question

I have a numpy array which holds 4-dimensional vectors of the form (x, y, z, w).

The array has shape N x 4, one vector per row. Here (x, y, z) is a spatial location, stored as floats, and w holds a measurement taken at that location. There can be multiple measurements associated with the same (x, y, z) position.

What I would like to do is filter the array so that the new array contains only the maximum measurement for each (x, y, z) position.

So if my data is like:

x, y, z, w1
x, y, z, w2
x, y, z, w3

where w1 is greater than w2 and w3, the filtered data would be:

x, y, z, w1

So more concretely, say I have data like:

[[ 0.7732126   0.48649481  0.29771819  0.91622924]
 [ 0.7732126   0.48649481  0.29771819  1.91622924]
 [ 0.58294263  0.32025559  0.6925856   0.0524125 ]
 [ 0.58294263  0.32025559  0.6925856   0.05 ]
 [ 0.58294263  0.32025559  0.6925856   1.7 ]
 [ 0.3239913   0.7786444   0.41692853  0.10467392]
 [ 0.12080023  0.74853649  0.15356663  0.4505753 ]
 [ 0.13536096  0.60319054  0.82018125  0.10445047]
 [ 0.1877724   0.96060999  0.39697999  0.59078612]]

This should return

[[ 0.7732126   0.48649481  0.29771819  1.91622924]
 [ 0.58294263  0.32025559  0.6925856   1.7 ]
 [ 0.3239913   0.7786444   0.41692853  0.10467392]
 [ 0.12080023  0.74853649  0.15356663  0.4505753 ]
 [ 0.13536096  0.60319054  0.82018125  0.10445047]
 [ 0.1877724   0.96060999  0.39697999  0.59078612]]

Will the entries for the same (x, y, z) position always be consecutive, as in your sample data, or will they be scattered? About how many entries will you have in practice?
– jme, Commented Aug 17, 2015 at 14:39
They could be scattered, unfortunately. They will never be more than 4. Fortunately, performance is not critical for this.
– Luca, Commented Aug 17, 2015 at 14:41
FYI: This is known as a "group-by" operation (cf. pandas.pydata.org/pandas-docs/stable/groupby.html). You are grouping by the first three columns, and then applying the maximum function to the groups. This is pretty easy to do with a library such as pandas (pandas.pydata.org).
– Warren Weckesser, Commented Aug 17, 2015 at 14:43
Ahhhhh... I was not aware of pandas. I will see if I am able to use it successfully. Thanks for the tip!
– Luca, Commented Aug 17, 2015 at 14:44
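
Since the comments above say that performance is not critical, a plain-Python pass over the rows may already be enough. The following is only a minimal sketch: the function name max_per_position_dict and the small data array are made up for illustration, and it assumes that repeated positions compare exactly equal as floats.

```python
import numpy as np

def max_per_position_dict(a):
    """Keep the largest w for each (x, y, z), in order of first appearance."""
    best = {}
    for x, y, z, w in np.asarray(a).tolist():
        key = (x, y, z)
        # Exact float equality is assumed to identify a repeated position.
        if key not in best or w > best[key]:
            best[key] = w
    # Rebuild rows as (x, y, z, max_w); dicts keep insertion order.
    return np.array([key + (w,) for key, w in best.items()])

# Hypothetical data: the first and third rows share a position.
data = np.array([[0.5, 0.1, 0.2, 1.0],
                 [0.3, 0.3, 0.3, 4.0],
                 [0.5, 0.1, 0.2, 2.5]])
result = max_per_position_dict(data)
print(result)
```

This loops in Python, so it will be slow for large N, but it does not require the duplicates to be consecutive.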

Jaime · Accepted Answer · 2015-08-17 17:18:46Z

This is convoluted, but it is probably as good as you are going to get using numpy only...

First, we use lexsort to put all entries with the same coordinates together. With a being your sample array:

>>> perm = np.lexsort(a[:, 3::-1].T)
>>> a[perm]
array([[ 0.12080023,  0.74853649,  0.15356663,  0.4505753 ],
       [ 0.13536096,  0.60319054,  0.82018125,  0.10445047],
       [ 0.1877724 ,  0.96060999,  0.39697999,  0.59078612],
       [ 0.3239913 ,  0.7786444 ,  0.41692853,  0.10467392],
       [ 0.58294263,  0.32025559,  0.6925856 ,  0.05      ],
       [ 0.58294263,  0.32025559,  0.6925856 ,  0.0524125 ],
       [ 0.58294263,  0.32025559,  0.6925856 ,  1.7       ],
       [ 0.7732126 ,  0.48649481,  0.29771819,  0.91622924],
       [ 0.7732126 ,  0.48649481,  0.29771819,  1.91622924]])

Note that lexsort uses its last key as the primary one, so by reversing the column order we are sorting by x, breaking ties with y, then z, then w.
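
To make the key order concrete, here is a small sketch on a made-up array (the values are illustrative, not taken from the question). It only shows how np.lexsort orders rows when the reversed columns are passed as keys.

```python
import numpy as np

# Tiny made-up array: columns are (x, y, z, w).
a = np.array([[2.0, 0.0, 0.0, 5.0],
              [1.0, 0.0, 0.0, 9.0],
              [1.0, 0.0, 0.0, 3.0],
              [2.0, 0.0, 0.0, 1.0]])

# a[:, 3::-1] reorders the columns to (w, z, y, x); after .T each row is one key.
# np.lexsort treats the LAST key as the primary one, so x is compared first,
# then y, then z, and w only breaks ties inside an (x, y, z) group.
perm = np.lexsort(a[:, 3::-1].T)
a_sorted = a[perm]
print(perm)
print(a_sorted)
```

Within each group of equal coordinates the rows end up in ascending w, which is why the last row of a group carries the maximum.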

Because it is the maximum we are looking for, we just need to take the last entry in every group, which is a pretty straightforward thing to do:

>>> a_sorted = a[perm]
>>> # A group ends wherever any coordinate differs from the next row.
>>> last = np.concatenate((np.any(a_sorted[:-1, :3] != a_sorted[1:, :3], axis=1),
                           [True]))
>>> a_unique_max = a_sorted[last]
>>> a_unique_max
array([[ 0.12080023,  0.74853649,  0.15356663,  0.4505753 ],
       [ 0.13536096,  0.60319054,  0.82018125,  0.10445047],
       [ 0.1877724 ,  0.96060999,  0.39697999,  0.59078612],
       [ 0.3239913 ,  0.7786444 ,  0.41692853,  0.10467392],
       [ 0.58294263,  0.32025559,  0.6925856 ,  1.7       ],
       [ 0.7732126 ,  0.48649481,  0.29771819,  1.91622924]])

If you would rather not have the output sorted, and instead keep the rows in the order in which they appeared in the original array, you can also get that with the aid of perm:

>>> a_unique_max[np.argsort(perm[last])]
array([[ 0.7732126 ,  0.48649481,  0.29771819,  1.91622924],
       [ 0.58294263,  0.32025559,  0.6925856 ,  1.7       ],
       [ 0.3239913 ,  0.7786444 ,  0.41692853,  0.10467392],
       [ 0.12080023,  0.74853649,  0.15356663,  0.4505753 ],
       [ 0.13536096,  0.60319054,  0.82018125,  0.10445047],
       [ 0.1877724 ,  0.96060999,  0.39697999,  0.59078612]])
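
The steps so far can be collected into one function. This is a sketch rather than a definitive implementation: the name max_per_position and the keep_input_order flag are hypothetical, the input is assumed to have at least one row, and np.any is used for the group boundary so that two rows count as different positions when any single coordinate differs.

```python
import numpy as np

def max_per_position(a, keep_input_order=True):
    """Return one row per unique (x, y, z), keeping the row with the largest w.

    Assumes `a` has shape (N, 4) with N >= 1.
    """
    a = np.asarray(a)
    # Sort by x, then y, then z, then w (lexsort uses the last key first).
    perm = np.lexsort(a[:, 3::-1].T)
    a_sorted = a[perm]
    # A row closes its group when ANY coordinate differs from the next row.
    last = np.concatenate((np.any(a_sorted[:-1, :3] != a_sorted[1:, :3], axis=1),
                           [True]))
    result = a_sorted[last]
    if keep_input_order:
        # perm[last] holds the original row index of each selected row.
        result = result[np.argsort(perm[last])]
    return result

# Made-up sample where the two positions differ only in y.
sample = np.array([[1.0, 1.0, 1.0, 0.5],
                   [1.0, 2.0, 1.0, 0.2],
                   [1.0, 1.0, 1.0, 0.9],
                   [1.0, 2.0, 1.0, 0.1]])
filtered = max_per_position(sample)
print(filtered)
```

With keep_input_order=True the rows are ordered by where each maximum sat in the input, not by where its position first appeared.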

This works only for the maximum (or, by taking the first entry of every group, the minimum), because it comes as a by-product of the sorting. If you are after a different function, say the product of all same-coordinates entries, you could do something like:

>>> first = np.concatenate(([True],
                            np.any(a_sorted[:-1, :3] != a_sorted[1:, :3], axis=1)))
>>> a_unique_prods = np.multiply.reduceat(a_sorted, np.nonzero(first)[0])

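You will then have to do a little more work to assemble your return array: reduceat multiplies the coordinate columns as well, so only the last column of a_unique_prods is meaningful, and the coordinates have to be taken from a_sorted[first, :3].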
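
One possible way to assemble that result is sketched below on made-up data. It reduces only the w column and stacks the products next to the coordinates of each group; the variable names starts and prods are introduced here for illustration.

```python
import numpy as np

# Made-up data: two positions appear twice, one appears once.
a = np.array([[1.0, 1.0, 1.0, 2.0],
              [2.0, 1.0, 1.0, 5.0],
              [1.0, 1.0, 1.0, 3.0],
              [2.0, 1.0, 1.0, 4.0],
              [0.0, 0.0, 0.0, 7.0]])

a_sorted = a[np.lexsort(a[:, 3::-1].T)]
# True where a new (x, y, z) group starts.
first = np.concatenate(([True],
                        np.any(a_sorted[:-1, :3] != a_sorted[1:, :3], axis=1)))
starts = np.nonzero(first)[0]
# Reduce only the w column so the coordinates are left untouched.
prods = np.multiply.reduceat(a_sorted[:, 3], starts)
a_unique_prods = np.column_stack((a_sorted[first, :3], prods))
print(a_unique_prods)
```

The same pattern should work with other ufuncs that support reduceat, such as np.add or np.maximum.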

Randy · Accepted Answer · 2015-08-18 03:30:39Z

I see that you already got the pointer towards pandas in the comments. FWIW, here's how you can get the desired behavior, assuming you don't care about the final sort order since groupby changes it up.

In [14]: arr
Out[14]:
array([[ 0.7732126 ,  0.48649481,  0.29771819,  0.91622924],
       [ 0.7732126 ,  0.48649481,  0.29771819,  1.91622924],
       [ 0.58294263,  0.32025559,  0.6925856 ,  0.0524125 ],
       [ 0.58294263,  0.32025559,  0.6925856 ,  0.05      ],
       [ 0.58294263,  0.32025559,  0.6925856 ,  1.7       ],
       [ 0.3239913 ,  0.7786444 ,  0.41692853,  0.10467392],
       [ 0.12080023,  0.74853649,  0.15356663,  0.4505753 ],
       [ 0.13536096,  0.60319054,  0.82018125,  0.10445047],
       [ 0.1877724 ,  0.96060999,  0.39697999,  0.59078612]])

In [15]: import pandas as pd

In [16]: pd.DataFrame(arr)
Out[16]:
          0         1         2         3
0  0.773213  0.486495  0.297718  0.916229
1  0.773213  0.486495  0.297718  1.916229
2  0.582943  0.320256  0.692586  0.052413
3  0.582943  0.320256  0.692586  0.050000
4  0.582943  0.320256  0.692586  1.700000
5  0.323991  0.778644  0.416929  0.104674
6  0.120800  0.748536  0.153567  0.450575
7  0.135361  0.603191  0.820181  0.104450
8  0.187772  0.960610  0.396980  0.590786

In [17]: pd.DataFrame(arr).groupby([0,1,2]).max().reset_index()
Out[17]:
          0         1         2         3
0  0.120800  0.748536  0.153567  0.450575
1  0.135361  0.603191  0.820181  0.104450
2  0.187772  0.960610  0.396980  0.590786
3  0.323991  0.778644  0.416929  0.104674
4  0.582943  0.320256  0.692586  1.700000
5  0.773213  0.486495  0.297718  1.916229
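
If the sorted group order is a concern, groupby accepts sort=False, which keeps the groups in order of first appearance. The sketch below uses made-up data and assumed column names ("x", "y", "z", "w"), and converts the result back to a numpy array.

```python
import numpy as np
import pandas as pd

# Made-up data: each position appears twice, interleaved.
arr = np.array([[2.0, 2.0, 2.0, 0.5],
                [1.0, 1.0, 1.0, 0.3],
                [2.0, 2.0, 2.0, 0.9],
                [1.0, 1.0, 1.0, 0.1]])

df = pd.DataFrame(arr, columns=["x", "y", "z", "w"])
# sort=False keeps groups in order of first appearance instead of sorting the keys.
out = (df.groupby(["x", "y", "z"], sort=False)["w"]
         .max()
         .reset_index()
         .to_numpy())
print(out)
```

Note that this orders rows by where each position first appears, which is not necessarily where its maximum appears.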

Thanks. Very good solution as well. I am going to explore this in detail as well.

Divakar · Accepted Answer · 2015-08-18 06:44:25Z

You can start by lex-sorting the input array so that entries with identical first three elements become consecutive. Then create a 2D array to store the last-column entries, such that the elements corresponding to each duplicate triplet go into the same row. Finally, take the max along axis=1 of this 2D array to get the max output for each unique triplet. Here's the implementation, assuming A is the input array -

import numpy as np

# Lex sort A
sortedA = A[np.lexsort(A[:,:-1].T)]

# Mask of start of unique first three columns from A
start_unqA = np.append(True,~np.all(np.diff(sortedA[:,:-1],axis=0)==0,axis=1))

# Counts of unique first three columns from A
counts = np.bincount(start_unqA.cumsum()-1)
mask = np.arange(counts.max()) < counts[:,None]

# Group A's last column into rows based on uniqueness from first three columns
grpA = np.empty(mask.shape)
grpA.fill(np.nan)
grpA[mask] = sortedA[:,-1]

# Concatenate unique first three columns from A and 
# corresponding max values for each such unique triplet
out = np.column_stack((sortedA[start_unqA,:-1],np.nanmax(grpA,axis=1)))

Sample run -

In [75]: A
Out[75]: 
array([[ 1,  1,  1, 96],
       [ 1,  2,  2, 48],
       [ 2,  1,  2, 33],
       [ 1,  1,  1, 24],
       [ 1,  1,  1, 94],
       [ 2,  2,  2,  5],
       [ 2,  1,  1, 17],
       [ 2,  2,  2, 62]])

In [76]: sortedA
Out[76]: 
array([[ 1,  1,  1, 96],
       [ 1,  1,  1, 24],
       [ 1,  1,  1, 94],
       [ 2,  1,  1, 17],
       [ 2,  1,  2, 33],
       [ 1,  2,  2, 48],
       [ 2,  2,  2,  5],
       [ 2,  2,  2, 62]])

In [77]: out
Out[77]: 
array([[  1.,   1.,   1.,  96.],
       [  2.,   1.,   1.,  17.],
       [  2.,   1.,   2.,  33.],
       [  1.,   2.,   2.,  48.],
       [  2.,   2.,   2.,  62.]])
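
As a variation on the same idea, the padded 2D array can probably be avoided by reducing the sorted last column directly with np.maximum.reduceat. This is a sketch on the same sample A, not part of the original answer; group_max is a name introduced here.

```python
import numpy as np

A = np.array([[1, 1, 1, 96],
              [1, 2, 2, 48],
              [2, 1, 2, 33],
              [1, 1, 1, 24],
              [1, 1, 1, 94],
              [2, 2, 2,  5],
              [2, 1, 1, 17],
              [2, 2, 2, 62]])

sortedA = A[np.lexsort(A[:, :-1].T)]
# True at the first row of every run of identical (x, y, z) triplets.
start_unqA = np.append(True, np.any(np.diff(sortedA[:, :-1], axis=0) != 0, axis=1))
# Maximum of the last column within each run, without building a padded array.
group_max = np.maximum.reduceat(sortedA[:, -1], np.flatnonzero(start_unqA))
out = np.column_stack((sortedA[start_unqA, :-1], group_max))
print(out)
```

Because no NaN padding is involved, the output keeps the integer dtype of A here.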

TheBlackCat · Accepted Answer · 2015-08-17 14:30:49Z

-1

You can use logical indexing.

I will use random data for an example:

>>> myarr = np.random.random((6, 4))
>>> print(myarr)
[[ 0.7732126   0.48649481  0.29771819  0.91622924]
 [ 0.58294263  0.32025559  0.6925856   0.0524125 ]
 [ 0.3239913   0.7786444   0.41692853  0.10467392]
 [ 0.12080023  0.74853649  0.15356663  0.4505753 ]
 [ 0.13536096  0.60319054  0.82018125  0.10445047]
 [ 0.1877724   0.96060999  0.39697999  0.59078612]]

To get the row or rows where the last column is the greatest, do this:

>>> greatest = myarr[myarr[:, 3]==myarr[:, 3].max()]
>>> print(greatest)
[[ 0.7732126   0.48649481  0.29771819  0.91622924]]

This takes the last column of myarr, finds the maximum of that column, finds all the elements of that column equal to the maximum, and then selects the corresponding rows.

Luca Over a year ago

This is not the behaviour I seek. I have made an edit to the question to hopefully make it more clear.

asiviero · Accepted Answer · 2015-08-17 14:31:25Z

-1

You can use np.argmax

x[np.argmax(x[:,3]),:]

>>> x = np.random.random((5,4))
>>> x
array([[ 0.25461146,  0.35671081,  0.54856798,  0.2027313 ],
       [ 0.17079029,  0.66970362,  0.06533572,  0.31704254],
       [ 0.4577928 ,  0.69022073,  0.57128696,  0.93995176],
       [ 0.29708841,  0.96324181,  0.78859008,  0.25433235],
       [ 0.58739451,  0.17961551,  0.67993786,  0.73725493]])
>>> x[np.argmax(x[:,3]),:]
array([ 0.4577928 ,  0.69022073,  0.57128696,  0.93995176])

Luca Over a year ago

This is not the behaviour I seek. I have made an edit to the question to hopefully make it more clear.
