How to sort numpy array by row sum and extract top N rows

Question

For example, given matrix

array([[ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [ 0,  1,  2,  3,  4,  5],
       [24, 25, 26, 27, 28, 29]])

and top_n=3, it should return

array([[24, 25, 26, 27, 28, 29],
       [18, 19, 20, 21, 22, 23],
       [12, 13, 14, 15, 16, 17]])

This function should return a np.ndarray of shape (top_n, arr.shape[-1]), given the input 2D matrix arr.

Here's what I tried:

def select_rows(arr, top_n):
    """
    This function selects the top_n rows that have the largest sum of entries
    """
    sel_rows = np.argsort(-arr,axis=1)[:top_n]
    
    return sel_rows

I also tried:

sel_rows = (-arr).argsort(axis=-1)[:, :top_n]

to no avail.

Negating the array with - is less efficient than slicing the sorted result at the end. For the small sample this isn't an issue, but negating all the values in a large array will be somewhat slower, which is verified with a %%timeit test.
– Trenton McKinney, Commented Sep 1, 2021 at 5:04

Trenton McKinney · Accepted Answer · 2021-09-01 05:08:05Z

5

You can use this simple one-liner:

a[np.argsort(a.sum(axis=1))[:-top_n-1:-1]]

a.sum(axis=1) sums along axis 1

np.argsort(...) returns the indices that would sort the row sums in ascending order (the sums form a 1-D array, so the default axis=-1 is its only axis and the axis argument can be omitted)

...[:-top_n-1:-1] picks the last top_n indices in reverse order

a[...] then grabs the rows
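The steps above can be traced one at a time on the sample matrix from the question. This is only an illustrative sketch; the intermediate names row_sums, order and top_idx are introduced here for readability and are not part of the original one-liner.

```python
import numpy as np

# Sample matrix from the question
a = np.array([[ 6,  7,  8,  9, 10, 11],
              [12, 13, 14, 15, 16, 17],
              [18, 19, 20, 21, 22, 23],
              [ 0,  1,  2,  3,  4,  5],
              [24, 25, 26, 27, 28, 29]])
top_n = 3

row_sums = a.sum(axis=1)        # one sum per row: [ 51  87 123  15 159]
order = np.argsort(row_sums)    # row indices, smallest sum first: [3 0 1 2 4]
top_idx = order[:-top_n-1:-1]   # last top_n indices, reversed: [4 2 1]
result = a[top_idx]             # rows with the largest sums, largest first

print(result)
```

The result has shape (top_n, a.shape[-1]), which is what the question asks for.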

`%%timeit` comparison

# data sample
a = np.random.randint(0, 101, (100000, 1000))

%%timeit
a[np.argsort(a.sum(axis=1))[:-3-1:-1]]
[out]:
9.73 ms ± 122 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
a[np.argsort(-a.sum(axis=1))[:3]]
[out]:
9.9 ms ± 303 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
sorted(a, key=lambda x: sum(x))[:-3-1:-1]
[out]:
1.04 s ± 36.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
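Outside IPython, where the %%timeit magic is not available, a similar comparison can be sketched with the standard-library timeit module. The array here is deliberately smaller than the sample above so the script finishes quickly, and the helper names by_slice and by_negation are hypothetical. Timings depend on the machine, so no expected numbers are shown.

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)
# Smaller than the data sample above, only to keep the run short
a = rng.integers(0, 101, size=(10_000, 100))
top_n = 3

def by_slice():
    # Sort ascending, then take the last top_n indices in reverse order
    return a[np.argsort(a.sum(axis=1))[:-top_n-1:-1]]

def by_negation():
    # Negate the row sums so the ascending sort puts the largest first
    return a[np.argsort(-a.sum(axis=1))[:top_n]]

# Both variants pick rows with the same sums (tied rows may come out in a different order)
assert np.array_equal(by_slice().sum(axis=1), by_negation().sum(axis=1))

for fn in (by_slice, by_negation):
    seconds = timeit.timeit(fn, number=20) / 20
    print(f"{fn.__name__}: {seconds * 1e3:.3f} ms per call")
```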

edited Sep 1, 2021 at 5:08 by Trenton McKinney

answered Sep 1, 2021 at 1:13 by Julien


bb1 · Accepted Answer · 2021-09-01 01:12:44Z

3

Your code almost works, but you need to compute the sum of each row before sorting. You can try this:

import numpy as np


top_n = 3
arr = np.array([[ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [ 0,  1,  2,  3,  4,  5],
       [24, 25, 26, 27, 28, 29]])

# Negating the row sums reverses the sort order, so the largest sums come first
arr[np.argsort(-arr.sum(axis=1))[:top_n]]

It gives:

array([[24, 25, 26, 27, 28, 29],
       [18, 19, 20, 21, 22, 23],
       [12, 13, 14, 15, 16, 17]])
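Wrapped into the select_rows function from the question, the same idea might look like the sketch below. It assumes arr is a 2D array with a signed integer or floating-point dtype; negating the row sums of an unsigned integer array wraps around and would not give the intended order.

```python
import numpy as np

def select_rows(arr, top_n):
    """Return the top_n rows of a 2D array with the largest row sums, largest first."""
    row_sums = arr.sum(axis=1)
    # Negating the sums makes the ascending argsort list the largest sum first
    order = np.argsort(-row_sums)
    return arr[order[:top_n]]

arr = np.array([[ 6,  7,  8,  9, 10, 11],
                [12, 13, 14, 15, 16, 17],
                [18, 19, 20, 21, 22, 23],
                [ 0,  1,  2,  3,  4,  5],
                [24, 25, 26, 27, 28, 29]])

top_rows = select_rows(arr, 3)
print(top_rows.shape)  # (3, 6)
```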

answered Sep 1, 2021 at 1:12 by bb1

1 Comment

Trenton McKinney Over a year ago

The answer should explain that the purpose of the - is to reverse the order

Kefeng91 · Accepted Answer · 2021-09-01 01:46:12Z

0

Without numpy, you can use the built-in function sorted with its key argument:

sorted(A, key=lambda x: sum(x))[:-top_n-1:-1]
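As a small illustration on plain Python lists (the case where numpy is not involved), the expression can be run on the rows from the question. The variable names rows and top_rows are chosen here for the example only.

```python
# Rows from the question as a plain list of lists
rows = [[ 6,  7,  8,  9, 10, 11],
        [12, 13, 14, 15, 16, 17],
        [18, 19, 20, 21, 22, 23],
        [ 0,  1,  2,  3,  4,  5],
        [24, 25, 26, 27, 28, 29]]
top_n = 3

# Sort rows by their sum in ascending order, then take the last top_n in reverse
top_rows = sorted(rows, key=lambda x: sum(x))[:-top_n-1:-1]

print(top_rows)
```

The result is a list of rows rather than an ndarray. Passing key=sum with reverse=True and slicing [:top_n] is an alternative spelling that gives the same rows when all row sums are distinct.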

answered Sep 1, 2021 at 1:46 by Kefeng91

1 Comment

Trenton McKinney Over a year ago

This implementation is highly inefficient and should not be used with numpy arrays. For an array, np.random.randint(0, 101, (100000, 100)), this is 107 times slower than the numpy implementations.
