One workaround is to use matrix multiplication, like so.
First, a small example to see what is going on: x is a helper matrix, and yy corresponds to your data:
>>> import numpy as np
>>> from scipy import sparse
>>>
>>> K,N,D = 5,10,3
>>>
>>> x = sparse.csc_matrix((np.ones(2*K),np.r_[np.arange(K),np.arange(K)],np.r_[np.arange(K+1),2*K]),(K,K+1))
>>>
>>> x.A
array([[1., 0., 0., 0., 0., 1.],
       [0., 1., 0., 0., 0., 1.],
       [0., 0., 1., 0., 0., 1.],
       [0., 0., 0., 1., 0., 1.],
       [0., 0., 0., 0., 1., 1.]])
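In case the raw CSC constructor is hard to read: x is just the K×K identity with a trailing all-ones column. A sketch of an equivalent construction (`x_alt` is an illustrative name; the direct constructor above skips the stacking overhead):

```python
import numpy as np
from scipy import sparse

K = 5
# the K x K identity with one extra column of ones appended on the right
x_alt = sparse.hstack(
    [sparse.eye(K), sparse.csc_matrix(np.ones((K, 1)))], format='csc')
print(x_alt.toarray())
```

This prints the same array as `x.A` above.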
>>>
>>> y = np.random.randint(0,N,(D,K))
>>> y.sort(0)
>>> yy = sparse.csc_matrix((np.ones(D*K),y.ravel(),np.arange(K+1)*D),(N,K))
>>>
>>> yy.A
array([[1., 0., 0., 0., 0.],
       [2., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 1., 1., 1., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 0.],
       [0., 0., 2., 1., 0.],
       [0., 0., 0., 1., 1.],
       [0., 0., 0., 0., 1.]])
>>>
>>> (yy@x).A
array([[1., 0., 0., 0., 0., 1.],
       [2., 1., 0., 0., 0., 3.],
       [0., 1., 0., 0., 0., 1.],
       [0., 1., 1., 1., 0., 3.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 1.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 2., 1., 0., 3.],
       [0., 0., 0., 1., 1., 2.],
       [0., 0., 0., 0., 1., 1.]])
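Because x is the identity with a ones-column appended, `yy @ x` is simply yy with an extra last column holding each row's sum. A quick dense sanity check of that claim (small sizes only; at the scale below the dense arrays would not fit in memory):

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
K, N, D = 5, 10, 3

# same constructions as above, with a seeded generator for reproducibility
x = sparse.csc_matrix(
    (np.ones(2 * K),
     np.r_[np.arange(K), np.arange(K)],
     np.r_[np.arange(K + 1), 2 * K]),
    (K, K + 1))
y = rng.integers(0, N, (D, K))
y.sort(0)
yy = sparse.csc_matrix(
    (np.ones(D * K), y.ravel(), np.arange(K + 1) * D), (N, K))

dense = yy.toarray()
# the last column of yy @ x equals the row sums of yy
expected = np.hstack([dense, dense.sum(1, keepdims=True)])
assert np.allclose((yy @ x).toarray(), expected)
```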
And a larger example to show it scales:
>>> K,N,D = 23_000,2_000_000,100
>>>
>>> x = sparse.csc_matrix((np.ones(2*K),np.r_[np.arange(K),np.arange(K)],np.r_[np.arange(K+1),2*K]),(K,K+1))
>>> x
<23000x23001 sparse matrix of type '<class 'numpy.float64'>'
        with 46000 stored elements in Compressed Sparse Column format>
>>>
>>> y = np.random.randint(0,N,(D,K))
>>> y.sort(0)
>>> yy = sparse.csc_matrix((np.ones(D*K),y.ravel(),np.arange(K+1)*D),(N,K))
>>> yy
<2000000x23000 sparse matrix of type '<class 'numpy.float64'>'
        with 2300000 stored elements in Compressed Sparse Column format>
>>>
>>> yy@x
<2000000x23001 sparse matrix of type '<class 'numpy.float64'>'
        with 3667102 stored elements in Compressed Sparse Column format>
Note that the np.hstack / toarray route often fails at this scale because the resulting dense array is too large for memory. If you do want to stack sparse matrices directly, there are also sparse.hstack and sparse.vstack; their first argument must be a list or tuple of the blocks.
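For scale, a back-of-the-envelope comparison of the two representations, assuming float64 data and the 32-bit index arrays scipy uses when the dimensions allow it:

```python
# dense float64 array of shape (2_000_000, 23_001)
dense_bytes = 2_000_000 * 23_001 * 8
# sparse result: ~3.67M stored elements; CSC keeps one float64 (8 bytes)
# and one int32 row index (4 bytes) per element, plus a K+2-long indptr
sparse_bytes = 3_667_102 * (8 + 4) + (23_001 + 1) * 4
print(f"dense: {dense_bytes / 1e9:.0f} GB, sparse: {sparse_bytes / 1e6:.0f} MB")
```

That is roughly 368 GB dense versus about 44 MB sparse, which is why the computation stays sparse throughout.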