0

I have a large sparse matrix X (2 mil rows, 23k cols), and I would like to add a rowsum column on it and return a sparse matrix.

I have tried below

np.hstack( (X.toarray(),X.sum(axis=1)) )

but it doesn't work well with large sparse matrix.

The thing is, when I call X.toarray(), it blows up and terminates python kernel without giving any error message.

Similary I have tried

sparse.hstack( X ,sparse.csr_matrix(X.sum(axis=1)))
sparse.csr_matrix(X.sum(axis=1)).ndim    # is 2
X.ndim # 2 as well

but it give me below error message:

~/miniconda3/lib/python3.7/site-packages/scipy/sparse/construct.py in bmat(blocks, format, dtype)
    546 
    547     if blocks.ndim != 2:
--> 548         raise ValueError('blocks must be 2-D')
    549 
    550     M,N = blocks.shape

ValueError: blocks must be 2-D

Is there any way to work around this problem?

3
  • The linked answer says nothing about using np.hstack! toarray often fails because the resulting array is too large for your memory. There's also sparse.vstack. Commented Feb 9, 2020 at 0:58
  • @hpaulj Somehow I got the wrong link, I removed it. Commented Feb 9, 2020 at 1:03
  • 1
    Review the docs for sparse.stack. First arg is supposed to be list or tuple. My answer in your old link is still right. Commented Feb 9, 2020 at 1:42

2 Answers 2

2
In [93]: from scipy import sparse                                                              
In [94]: M = sparse.random(5,7, .2, 'csr')                                                     
In [95]: M                                                                                     
Out[95]: 
<5x7 sparse matrix of type '<class 'numpy.float64'>'
    with 7 stored elements in Compressed Sparse Row format>

One sum is a (n,1) np.matrix:

In [96]: M.sum(axis=1)                                                                         
Out[96]: 
matrix([[0.92949904],
        [1.068337  ],
        [0.10927561],
        [0.        ],
        [0.68352182]])

The other a (1,n) matrix:

In [97]: M.sum(axis=0)                                                                         
Out[97]: 
matrix([[0.        , 0.90221854, 0.42335774, 1.35578158, 0.        ,
         0.        , 0.10927561]])

add the column to the matrix (note the argument details):

In [98]: sparse.hstack((M, M.sum(axis=1)))                                                     
Out[98]: 
<5x8 sparse matrix of type '<class 'numpy.float64'>'
    with 11 stored elements in COOrdinate format>

add the row matrix:

In [99]: sparse.vstack((M, M.sum(axis=0)))                                                     
Out[99]: 
<6x7 sparse matrix of type '<class 'numpy.float64'>'
    with 11 stored elements in COOrdinate format>
Sign up to request clarification or add additional context in comments.

1 Comment

To make my answer completely obsolete you should demonstrate that this also works with large arrays. ;-) sparse.random doesn't, actually, but feel free to borrow my workaround.
1

One potential workaround could be using matrix multiplication like so.

First a small example to see what is going on. x is a helper matrix, yy would correspond to your data:

>>> K,N,D = 5,10,3
>>> 
>>> x = sparse.csc_matrix((np.ones(2*K),np.r_[np.arange(K),np.arange(K)],np.r_[np.arange(K+1),2*K]),(K,K+1))
>>> 
>>> x.A
array([[1., 0., 0., 0., 0., 1.],
       [0., 1., 0., 0., 0., 1.],
       [0., 0., 1., 0., 0., 1.],
       [0., 0., 0., 1., 0., 1.],
       [0., 0., 0., 0., 1., 1.]])
>>> 
>>> y = np.random.randint(0,N,(D,K))
>>> y.sort(0)
>>> yy = sparse.csc_matrix((np.ones(D*K),y.ravel(),np.arange(K+1)*D),(N,K))
>>> 
>>> yy.A
array([[1., 0., 0., 0., 0.],
       [2., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 1., 1., 1., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 0.],
       [0., 0., 2., 1., 0.],
       [0., 0., 0., 1., 1.],
       [0., 0., 0., 0., 1.]])
>>> 
>>> (yy@x).A
array([[1., 0., 0., 0., 0., 1.],
       [2., 1., 0., 0., 0., 3.],
       [0., 1., 0., 0., 0., 1.],
       [0., 1., 1., 1., 0., 3.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 1.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 2., 1., 0., 3.],
       [0., 0., 0., 1., 1., 2.],
       [0., 0., 0., 0., 1., 1.]])

And a larger example to show it scales:

>>> K,N,D = 23_000,2_000_000,100
>>> 
>>> x = sparse.csc_matrix((np.ones(2*K),np.r_[np.arange(K),np.arange(K)],np.r_[np.arange(K+1),2*K]),(K,K+1))
>>> x
<23000x23001 sparse matrix of type '<class 'numpy.float64'>'
        with 46000 stored elements in Compressed Sparse Column format>
>>> 
>>> y = np.random.randint(0,N,(D,K))
>>> y.sort(0)
>>> yy = sparse.csc_matrix((np.ones(D*K),y.ravel(),np.arange(K+1)*D),(N,K))
>>> yy
<2000000x23000 sparse matrix of type '<class 'numpy.float64'>'
        with 2300000 stored elements in Compressed Sparse Column format>
>>> 
>>> yy@x
<2000000x23001 sparse matrix of type '<class 'numpy.float64'>'
        with 3667102 stored elements in Compressed Sparse Column format>

1 Comment

The X.sum(axis=1) method uses yy@x[:,-1] - that is a matrix multiply with a column of ones. Actually it uses a np.matrix(np.ones((1,n)))` so the result is also np.matrix. Since hstack combines the coo attributes of the element arrays to make a new coo, your all-in-one multiplication might well be faster.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.