One workaround is to use matrix multiplication, like so.
First, a small example to see what is going on: x is a helper matrix, and yy corresponds to your data:
>>> import numpy as np
>>> from scipy import sparse
>>>
>>> K,N,D = 5,10,3
>>>
>>> x = sparse.csc_matrix((np.ones(2*K),np.r_[np.arange(K),np.arange(K)],np.r_[np.arange(K+1),2*K]),(K,K+1))
>>>
>>> x.A
array([[1., 0., 0., 0., 0., 1.],
       [0., 1., 0., 0., 0., 1.],
       [0., 0., 1., 0., 0., 1.],
       [0., 0., 0., 1., 0., 1.],
       [0., 0., 0., 0., 1., 1.]])
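In case the raw CSC constructor is hard to read: x is just the K×K identity with a trailing all-ones column. A sketch of an equivalent construction (`x_alt` is an illustrative name; the direct constructor above skips the stacking overhead):

```python
import numpy as np
from scipy import sparse

K = 5
# the K x K identity with one extra column of ones appended on the right
x_alt = sparse.hstack(
    [sparse.eye(K), sparse.csc_matrix(np.ones((K, 1)))], format='csc')
print(x_alt.toarray())
```

This prints the same array as `x.A` above.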
>>>
>>> y = np.random.randint(0,N,(D,K))
>>> y.sort(0)
>>> yy = sparse.csc_matrix((np.ones(D*K),y.ravel(),np.arange(K+1)*D),(N,K))
>>>
>>> yy.A
array([[1., 0., 0., 0., 0.],
       [2., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 1., 1., 1., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 0.],
       [0., 0., 2., 1., 0.],
       [0., 0., 0., 1., 1.],
       [0., 0., 0., 0., 1.]])
>>>
>>> (yy@x).A
array([[1., 0., 0., 0., 0., 1.],
       [2., 1., 0., 0., 0., 3.],
       [0., 1., 0., 0., 0., 1.],
       [0., 1., 1., 1., 0., 3.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 1.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 2., 1., 0., 3.],
       [0., 0., 0., 1., 1., 2.],
       [0., 0., 0., 0., 1., 1.]])
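Because x is the identity with a ones-column appended, `yy @ x` is simply yy with an extra last column holding each row's sum. A quick dense sanity check of that claim (small sizes only; at the scale below the dense arrays would not fit in memory):

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
K, N, D = 5, 10, 3

# same constructions as above, with a seeded generator for reproducibility
x = sparse.csc_matrix(
    (np.ones(2 * K),
     np.r_[np.arange(K), np.arange(K)],
     np.r_[np.arange(K + 1), 2 * K]),
    (K, K + 1))
y = rng.integers(0, N, (D, K))
y.sort(0)
yy = sparse.csc_matrix(
    (np.ones(D * K), y.ravel(), np.arange(K + 1) * D), (N, K))

dense = yy.toarray()
# the last column of yy @ x equals the row sums of yy
expected = np.hstack([dense, dense.sum(1, keepdims=True)])
assert np.allclose((yy @ x).toarray(), expected)
```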
And a larger example to show it scales:
>>> K,N,D = 23_000,2_000_000,100
>>>
>>> x = sparse.csc_matrix((np.ones(2*K),np.r_[np.arange(K),np.arange(K)],np.r_[np.arange(K+1),2*K]),(K,K+1))
>>> x
<23000x23001 sparse matrix of type '<class 'numpy.float64'>'
        with 46000 stored elements in Compressed Sparse Column format>
>>>
>>> y = np.random.randint(0,N,(D,K))
>>> y.sort(0)
>>> yy = sparse.csc_matrix((np.ones(D*K),y.ravel(),np.arange(K+1)*D),(N,K))
>>> yy
<2000000x23000 sparse matrix of type '<class 'numpy.float64'>'
        with 2300000 stored elements in Compressed Sparse Column format>
>>>
>>> yy@x
<2000000x23001 sparse matrix of type '<class 'numpy.float64'>'
        with 3667102 stored elements in Compressed Sparse Column format>
Note that the np.hstack / toarray route often fails at this scale because the resulting dense array is too large for memory. If you do want to stack sparse matrices directly, there are also sparse.hstack and sparse.vstack; their first argument must be a list or tuple of the blocks.
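For scale, a back-of-the-envelope comparison of the two representations, assuming float64 data and the 32-bit index arrays scipy uses when the dimensions allow it:

```python
# dense float64 array of shape (2_000_000, 23_001)
dense_bytes = 2_000_000 * 23_001 * 8
# sparse result: ~3.67M stored elements; CSC keeps one float64 (8 bytes)
# and one int32 row index (4 bytes) per element, plus a K+2-long indptr
sparse_bytes = 3_667_102 * (8 + 4) + (23_001 + 1) * 4
print(f"dense: {dense_bytes / 1e9:.0f} GB, sparse: {sparse_bytes / 1e6:.0f} MB")
```

That is roughly 368 GB dense versus about 44 MB sparse, which is why the computation stays sparse throughout.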