Large memory usage of scipy.sparse.csr_matrix.toarray()

Question

I have a fairly large sparse matrix A as a scipy.sparse.csr_matrix. It has the following properties:

A.shape: (77169, 77169)
A.nnz: 284811011
A.dtype: dtype('float16')

Now I have to convert it to a dense array using .toarray(). My estimate for the memory usage would be

77169**2 * (16./8.) / 1024.**3 = 11.09... GB

which would be fine as my machine has ~48GB of memory. In fact, if I create a=np.ones((77169, 77169), dtype=np.float16), that works fine and indeed a.nbytes/1024.**3 = 11.09.... However, when I run A.toarray() on the sparse matrix, it fills all of the memory and at some point starts using swap (it doesn't raise a MemoryError). What's going wrong here? Shouldn't the result easily fit into my memory?
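The arithmetic behind this estimate can be scripted as a rough cross-check. This is only a sketch: the 4-byte index size for the CSR and COO index arrays is an assumption (SciPy usually picks int32 indices when nnz and the dimensions fit), and the float32/float64 figures only show what a dense array of this shape would cost at wider dtypes, not what toarray() is known to allocate.

```python
# Shape and nnz are taken from the question; everything else is illustrative.
n = 77169
nnz = 284811011
GiB = 1024.0 ** 3

# Dense array cost at different element sizes (bytes per element).
dense_f16 = n * n * 2 / GiB
dense_f32 = n * n * 4 / GiB
dense_f64 = n * n * 8 / GiB

# CSR storage: data (float16) + column indices + row pointer.
# ASSUMPTION: 4-byte (int32) indices.
index_bytes = 4
csr_total = (nnz * 2 + nnz * index_bytes + (n + 1) * index_bytes) / GiB

# A COO view of the same matrix needs explicit row and column index arrays.
coo_indices = 2 * nnz * index_bytes / GiB

print(f"dense float16: {dense_f16:.2f} GiB")
print(f"dense float32: {dense_f32:.2f} GiB")
print(f"dense float64: {dense_f64:.2f} GiB")
print(f"CSR total:     {csr_total:.2f} GiB")
print(f"COO row+col:   {coo_indices:.2f} GiB")
```

Comparing these figures against the observed peak usage may help narrow down which intermediate array is responsible.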

Which version of scipy are you using? Check with import scipy; print(scipy.__version__)
– Warren Weckesser, Commented May 16, 2017 at 14:58

hpaulj · Accepted Answer · 2017-05-16 16:28:23Z


For the csr format, toarray() does

self.tocoo(copy=False).toarray(order=order, out=out)

You could go on to trace coo.toarray, but I suspect it ends up calling compiled code that does the equivalent of:

In [715]: M=sparse.random(10,10,.2,format='csr')
In [717]: M=M.astype(np.float16)
In [718]: A = np.zeros(M.shape, M.dtype)
In [719]: Mo=M.tocoo()
In [720]: A[Mo.row, Mo.col] = Mo.data

Curiously though if I do

In [728]: Mo.toarray()
     ...
    257         coo_todense(M, N, self.nnz, self.row, self.col, self.data,
--> 258                     B.ravel('A'), fortran)
    259         return B
...
ValueError: Output dtype not compatible with inputs.

It's having trouble with the float16. Mo.astype(float).toarray() works fine. I get this error even if I use toarray(out=out) with a float16 out, which makes me suspect that coo_todense has been compiled for only a limited set of dtypes. Maybe I'll dig into that later.

In [741]: scipy.__version__
Out[741]: '0.18.1'

A comment in Warren's bug report

but the xxx_todense functions are actually A += X,

suggests that the copy from Mo.data to A is more complicated than what I indicated above. toarray sums duplicates, just as Mo.tocsr() or Mo.sum_duplicates() would.
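The difference between plain assignment and the A += X behaviour only shows up when the COO data contains duplicate (row, col) entries. A minimal sketch of that difference, using a small float64 matrix made up for illustration (float64 is chosen here to sidestep the float16 error above) and np.add.at as a stand-in for the accumulating copy:

```python
import numpy as np
from scipy import sparse

# Hypothetical example data: entry (0, 1) appears twice.
row = np.array([0, 0, 1])
col = np.array([1, 1, 2])
data = np.array([1.0, 2.0, 5.0])
M = sparse.coo_matrix((data, (row, col)), shape=(2, 3))

# Plain fancy-index assignment: a duplicate overwrites, it is not added.
assigned = np.zeros(M.shape, dtype=M.dtype)
assigned[M.row, M.col] = M.data

# Unbuffered in-place add: duplicates accumulate, like A += X.
summed = np.zeros(M.shape, dtype=M.dtype)
np.add.at(summed, (M.row, M.col), M.data)

dense = M.toarray()
print(np.array_equal(summed, dense))    # accumulating copy matches toarray()
print(np.array_equal(assigned, dense))  # plain assignment does not
```

For a matrix that came from tocsr() or had sum_duplicates() applied, there are no duplicates left and the two approaches give the same result.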

edited May 16, 2017 at 16:28

answered May 16, 2017 at 16:15


3 Comments

Warren Weckesser Over a year ago

FYI: I created an issue at the scipy github repository for the error: github.com/scipy/scipy/issues/7408

hpaulj Over a year ago

The comment that float16 is not supported raises the question of which other operations do or don't work with this dtype. M*M works, but produces a float32 result. Evidently it converts the data dtype before passing the arrays to compiled code.

obachtos Over a year ago

OK, that leaves me with two workarounds: (1) use a csr_matrix with float32, convert it to dense and then convert the result to float16, or (2) use a float16 csr_matrix and convert it to dense manually as you suggested above. Both seem to work, but the former appears to be the safer option, as stated in @WarrenWeckesser's issue: "I think float16 is not a supported sparse matrix type". Who knows what else might be going wrong there...
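Workaround (1) could be sketched as follows. The function name csr_to_dense_float16 and the row-chunking are additions for illustration, not something taken from the answer or from SciPy; chunking is used on the assumption that a full float32 intermediate (twice the size of the float16 result) is undesirable. The input is assumed to be a float32 CSR matrix.

```python
import numpy as np
from scipy import sparse

def csr_to_dense_float16(A, chunk_rows=1024):
    """Densify a float32 CSR matrix into a float16 array, a block of rows at a time.

    Hypothetical helper: only one float32 block of chunk_rows rows exists
    at any moment, in addition to the float16 output.
    """
    out = np.empty(A.shape, dtype=np.float16)
    for start in range(0, A.shape[0], chunk_rows):
        stop = min(start + chunk_rows, A.shape[0])
        # Row slicing a CSR matrix is cheap; toarray() runs on float32 data.
        block = A[start:stop].toarray()
        # Assignment casts the float32 block down to float16.
        out[start:stop] = block
    return out

# Small made-up example matrix to exercise the helper.
rng = np.random.default_rng(0)
values = rng.random((50, 40)).astype(np.float32)
values[values < 0.8] = 0.0
M = sparse.csr_matrix(values)

D = csr_to_dense_float16(M, chunk_rows=7)
print(D.dtype, D.shape)
```

Note that the sparse matrix itself takes roughly twice as much memory for its data array in float32 as in float16, so this trades sparse storage for staying on a supported dtype.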
