I have a numpy 2D array of n rows (observations) X m columns (features), where each element is the count of times that feature was observed. I need to convert it to a zero-padded 2D array of feature_indices, where each feature_index is repeated a number of times corresponding to the 'count' in the original 2D array.

This seems like it should be a simple combo of np.where with np.repeat or just expansion using indexing, but I'm not seeing it. Here's a very slow, loopy solution (way too slow to use in practice):

# Loopy solution (way too slow!)
import numpy as np

def convert_2Dcountsarray_to_zeropaddedindices(countsarray2D):
    rowsums = np.sum(countsarray2D, axis=1)
    max_rowsum = np.max(rowsums)
    out = []
    for row_idx, row in enumerate(countsarray2D):
        # Left-pad with zeros so all output rows have the same length
        out_row = [0] * int(max_rowsum - rowsums[row_idx])
        for ele_idx, count in enumerate(row):
            # Append the feature index once per observed count
            out_row.extend([ele_idx] * int(count))
        out.append(out_row)
    return np.array(out)

# Working example
countsarray2D = np.array( [[1,2,0,1,3],
                           [0,0,0,0,3],
                           [0,1,1,0,0]] )

# Shift all features up by 1 (i.e. add a dummy feature 0 we will use for padding)
countsarray2D = np.hstack((np.zeros((len(countsarray2D), 1), dtype=int), countsarray2D))  # dtype=int keeps the counts integral

print(convert_2Dcountsarray_to_zeropaddedindices(countsarray2D))

# Desired result:
array([[1, 2, 2, 4, 5, 5, 5],
       [0, 0, 0, 0, 5, 5, 5],
       [0, 0, 0, 0, 0, 2, 3]])

1 Answer

One solution would be to flatten the array and use np.repeat.

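This requires first prepending a column to countsarray2D that holds the number of padding zeros each row needs. Here countsarray2D is the original counts array, without the dummy feature column added in the question. This can be done as follows: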

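counts = countsarray2D.sum(axis=1)   # total count per row
max_count = counts.max()             # length of every output row
zeros_to_add = max_count - counts    # padding zeros needed per row
countsarray2D = np.c_[zeros_to_add, countsarray2D]  # prepend the padding column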

The new countsarray2D is then:

array([[0, 1, 2, 0, 1, 3],
       [4, 0, 0, 0, 0, 3],
       [5, 0, 1, 1, 0, 0]])
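
A quick way to see why the padding column matters: once it is prepended, every row sums to the same total, so the flat repeated output can later be reshaped into a rectangle. The check below is only a sketch using the example array from the question; the names counts2d, row_totals and padded are illustrative and do not appear in the original code.

```python
import numpy as np

# Example counts from the question (no dummy feature column added yet)
counts2d = np.array([[1, 2, 0, 1, 3],
                     [0, 0, 0, 0, 3],
                     [0, 1, 1, 0, 0]])

row_totals = counts2d.sum(axis=1)             # total count per row
zeros_to_add = row_totals.max() - row_totals  # padding zeros needed per row
padded = np.c_[zeros_to_add, counts2d]        # prepend the padding column

# Every row of the padded array now sums to the largest original row total.
print(padded.sum(axis=1))  # [7 7 7]
```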

Now, we can flatten the array and use np.repeat. An index array A is used as the input array while countsarray2D determines the number of times each index value should be repeated.

n_rows, n_cols = countsarray2D.shape
A = np.tile(np.arange(n_cols), (n_rows, 1))
np.repeat(A, countsarray2D.flatten()).reshape(n_rows, -1)

Final result:

array([[1, 2, 2, 4, 5, 5, 5],
       [0, 0, 0, 0, 5, 5, 5],
       [0, 0, 0, 0, 0, 2, 3]])
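
For convenience, the steps above could be wrapped in a single function. This is only a sketch: the name counts_to_padded_indices is made up here, and it assumes the input holds non-negative integer counts with at least one row. It takes the original counts array (no dummy column), so feature j comes out as index j + 1 and 0 is reserved for padding.

```python
import numpy as np

def counts_to_padded_indices(counts2d):
    """Expand per-feature counts into left-zero-padded rows of 1-based feature indices."""
    # np.repeat needs integer repeat counts, so cast up front.
    counts2d = np.asarray(counts2d, dtype=np.intp)
    n_rows = counts2d.shape[0]
    row_totals = counts2d.sum(axis=1)
    # Column 0 holds the number of padding zeros each row needs.
    padded = np.c_[row_totals.max() - row_totals, counts2d]
    n_cols = padded.shape[1]
    # One copy of 0..n_cols-1 per row, laid out flat to match padded.ravel().
    index_grid = np.tile(np.arange(n_cols), n_rows)
    # Each row of padded sums to the same total, so the reshape is valid.
    return np.repeat(index_grid, padded.ravel()).reshape(n_rows, -1)

example = np.array([[1, 2, 0, 1, 3],
                    [0, 0, 0, 0, 3],
                    [0, 1, 1, 0, 0]])
result = counts_to_padded_indices(example)
print(result)
```

On the example array this should print the same three rows as the final result shown above.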

1 Comment

Thank you @Shaido. I checked this solution and it's both correct and far faster than the loopy function. This just saved a great deal of runtime.
