Efficiently convert Numpy 2D array of counts to zero-padded 2D array of indices?

Question

I have a NumPy 2D array of n rows (observations) by m columns (features), where each element is the number of times that feature was observed. I need to convert it to a zero-padded 2D array of feature indices, where each feature index is repeated as many times as the corresponding count in the original 2D array.

This seems like it should be a simple combo of np.where with np.repeat or just expansion using indexing, but I'm not seeing it. Here's a very slow, loopy solution (way too slow to use in practice):

# Loopy solution (way too slow!)
import numpy as np

def convert_2Dcountsarray_to_zeropaddedindices(countsarray2D):
    rowsums = np.sum(countsarray2D, axis=1)
    max_rowsum = np.max(rowsums)
    out = []
    for row_idx, row in enumerate(countsarray2D):
        # Pad with leading zeros so all out_rows have the same length
        out_row = [0] * int(max_rowsum - rowsums[row_idx])
        for ele_idx in range(len(row)):
            # Append the feature index once per observed count
            out_row.extend([ele_idx] * int(row[ele_idx]))
        out.append(out_row)
    return np.array(out)

# Working example
countsarray2D = np.array( [[1,2,0,1,3],
                           [0,0,0,0,3],
                           [0,1,1,0,0]] )

# Shift all features up by 1 (i.e. add a dummy feature 0 we will use for padding)
countsarray2D = np.hstack( (np.zeros((len(countsarray2D),1), dtype=int), countsarray2D) )  # dtype=int keeps the counts integer

print(convert_2Dcountsarray_to_zeropaddedindices(countsarray2D))

# Desired result:
array([[1, 2, 2, 4, 5, 5, 5],
       [0, 0, 0, 0, 5, 5, 5],
       [0, 0, 0, 0, 0, 2, 3]])

Shaido · Accepted Answer · 2022-06-29 02:50:18Z


One solution would be to flatten the array and use np.repeat.

This requires first prepending a column to countsarray2D that holds the number of padding zeros each row needs. Starting from the original counts array (without the dummy column added in the question), this can be done as follows:

counts = countsarray2D.sum(axis=1)
max_count = max(counts)
zeros_to_add = max_count - counts
countsarray2D = np.c_[zeros_to_add, countsarray2D]

The new countsarray2D is then:

array([[0, 1, 2, 0, 1, 3],
       [4, 0, 0, 0, 0, 3],
       [5, 0, 1, 1, 0, 0]])
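The padding column is what makes a rectangular result possible: once it is added, every row sums to the same total, so every row expands to the same number of indices. A minimal check of that property, assuming the original 5-column counts from the question; the name `padded` is introduced here only for illustration:

```python
import numpy as np

# Original counts from the question, without the dummy column added there
countsarray2D = np.array([[1, 2, 0, 1, 3],
                          [0, 0, 0, 0, 3],
                          [0, 1, 1, 0, 0]])

counts = countsarray2D.sum(axis=1)      # per-row totals: 7, 3, 2
zeros_to_add = counts.max() - counts    # padding needed per row: 0, 4, 5
padded = np.c_[zeros_to_add, countsarray2D]

# Every row of the padded array now sums to the same total
row_totals = padded.sum(axis=1)
print(row_totals)  # [7 7 7]
```

Because the padding counts sit in column 0, the index repeated for padding is 0, and the real features end up as indices 1 through m.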

Now, we can flatten the array and use np.repeat. An index array A is used as the input array while countsarray2D determines the number of times each index value should be repeated.

n_rows, n_cols = countsarray2D.shape
A = np.tile(np.arange(n_cols), (n_rows, 1))
np.repeat(A, countsarray2D.flatten()).reshape(n_rows, -1)

Final result:

array([[1, 2, 2, 4, 5, 5, 5],
       [0, 0, 0, 0, 5, 5, 5],
       [0, 0, 0, 0, 0, 2, 3]])
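For reuse, the steps above could be wrapped in a single function. This is only a sketch: the function name `counts_to_padded_indices` and its internal variable names are hypothetical, and it assumes non-negative integer counts and at least one row.

```python
import numpy as np

def counts_to_padded_indices(counts2d):
    """Expand a 2D array of counts into left-zero-padded rows of 1-based feature indices."""
    counts2d = np.asarray(counts2d, dtype=int)   # np.repeat needs integer repeat counts
    totals = counts2d.sum(axis=1)
    width = totals.max()                         # length of every output row
    # Column 0 holds the padding count, so index 0 is the pad value
    # and the real features become indices 1..m
    padded = np.c_[width - totals, counts2d]
    n_rows, n_cols = padded.shape
    index_grid = np.tile(np.arange(n_cols), (n_rows, 1))
    # np.repeat flattens index_grid and repeats each entry by its matching count
    return np.repeat(index_grid, padded.ravel()).reshape(n_rows, width)

result = counts_to_padded_indices([[1, 2, 0, 1, 3],
                                   [0, 0, 0, 0, 3],
                                   [0, 1, 1, 0, 0]])
print(result)
```

Reshaping to an explicit width rather than -1 is a small design choice in this sketch so that an all-zero input yields an empty (n_rows, 0) array instead of raising an error.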


1 Comment

cataclysmic Over a year ago

Thank you @Shaido. I checked this solution and it's both correct and far faster than the loopy function. This just saved a great deal of runtime.
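A correctness check along those lines might look like the sketch below. The two function names, the random test shape, the count range, and the seed are arbitrary assumptions, not the commenter's actual test, and no timings are claimed here.

```python
import numpy as np

def padded_indices_loop(counts2d):
    """Reference version: plain Python loops, 1-based indices, left zero padding."""
    width = max(int(sum(row)) for row in counts2d)
    out = []
    for row in counts2d:
        expanded = [j + 1 for j, c in enumerate(row) for _ in range(int(c))]
        out.append([0] * (width - len(expanded)) + expanded)
    return np.array(out, dtype=int)

def padded_indices_repeat(counts2d):
    """Vectorized version following the np.repeat approach from the answer."""
    totals = counts2d.sum(axis=1)
    width = totals.max()
    padded = np.c_[width - totals, counts2d]
    n_rows, n_cols = padded.shape
    index_grid = np.tile(np.arange(n_cols), (n_rows, 1))
    return np.repeat(index_grid, padded.ravel()).reshape(n_rows, width)

rng = np.random.default_rng(0)                # arbitrary seed
counts = rng.integers(0, 4, size=(200, 30))   # arbitrary shape, counts in 0..3
same = np.array_equal(padded_indices_loop(counts), padded_indices_repeat(counts))
print(same)  # True
```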
