I have a fairly large sparse matrix A as a scipy.sparse.csr_matrix. It has the following properties:

A.shape: (77169, 77169)
A.nnz: 284811011
A.dtype: dtype('float16')

Now I have to convert it to a dense array using .toarray(). My estimate for the memory usage would be

77169**2 * (16./8.) / 1024.**3 = 11.09... GB

which would be fine as my machine has ~48GB of memory. In fact, if I create a=np.ones((77169, 77169), dtype=np.float16) that works fine and indeed a.nbytes/1024.**3 = 11.09.... However, when I run A.toarray() on the sparse matrix it fills all available memory and at some point starts to use swap (it doesn't raise a MemoryError). What's going wrong here? Shouldn't it easily fit into my memory?
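The estimate above can be reproduced with a small sketch. The helper names dense_gib and csr_gib are hypothetical, and the int32 index size is an assumption (plausible here because nnz is below 2**31). The float32 and float64 figures are printed only for comparison; whether anything in the conversion actually upcasts is not established here.

```python
import numpy as np

GIB = 1024.0 ** 3


def dense_gib(shape, dtype):
    """Size in GiB of a dense array with the given shape and dtype."""
    rows, cols = shape
    return rows * cols * np.dtype(dtype).itemsize / GIB


def csr_gib(shape, nnz, dtype, index_dtype=np.int32):
    """Approximate size in GiB of the three CSR arrays (data, indices, indptr).

    index_dtype=int32 is an assumption; SciPy may pick int64 for huge matrices.
    """
    rows, _ = shape
    idx_size = np.dtype(index_dtype).itemsize
    data_bytes = nnz * np.dtype(dtype).itemsize
    indices_bytes = nnz * idx_size
    indptr_bytes = (rows + 1) * idx_size
    return (data_bytes + indices_bytes + indptr_bytes) / GIB


shape, nnz = (77169, 77169), 284811011
for dt in (np.float16, np.float32, np.float64):
    print(np.dtype(dt).name, "dense:", round(dense_gib(shape, dt), 2), "GiB")
print("float16 CSR:", round(csr_gib(shape, nnz, np.float16), 2), "GiB")
```

A dense float64 result alone would come close to the 48GB mentioned above, so any intermediate of a wider dtype would be enough to push the process into swap.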

  • Which version of scipy are you using? Check with import scipy; print(scipy.__version__) Commented May 16, 2017 at 14:58
  • Oh, right, I forgot: SciPy version is 0.15.1 Commented May 16, 2017 at 16:24

1 Answer

For csr, toarray() does:

self.tocoo(copy=False).toarray(order=order, out=out)

You could go on to trace coo.toarray, which I suspect ends up in compiled code. I expect it does the equivalent of:

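import numpy as np
from scipy import sparse

# small random csr matrix, cast to float16 like the matrix in the question
M = sparse.random(10, 10, .2, format='csr')
M = M.astype(np.float16)

# preallocate the dense result with the same shape and dtype
A = np.zeros(M.shape, M.dtype)

# scatter the nonzero values using the coo row/col indices
Mo = M.tocoo()
A[Mo.row, Mo.col] = Mo.data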

Curiously, though, if I do

In [728]: Mo.toarray()
     ...
    257         coo_todense(M, N, self.nnz, self.row, self.col, self.data,
--> 258                     B.ravel('A'), fortran)
    259         return B
...
ValueError: Output dtype not compatible with inputs.

It's having trouble with the float16. Mo.astype(float).toarray() works fine. I get this error even if I use toarray(out=out) with a float16 out, which makes me suspect coo_todense has been compiled for only a few dtype alternatives. Maybe I'll dig into that later.

In [741]: scipy.__version__
Out[741]: '0.18.1'

A comment in Warren's bug report,

but the xxx_todense functions are actually A += X,

suggests that the copy from Mo.data to A[] is more complicated than what I indicated above. toarray sums duplicates, as Mo.tocsr() or Mo.sum_duplicates() would.
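To illustrate the difference, here is a minimal sketch using float64 (a supported dtype) and a hand-built coo matrix with a repeated coordinate. The variable names are made up for the example. Plain indexed assignment keeps only one of the duplicate values, while np.add.at accumulates them the way toarray() does.

```python
import numpy as np
from scipy import sparse

# coo matrix in which the entry (0, 0) appears twice
row = np.array([0, 0, 1])
col = np.array([0, 0, 2])
data = np.array([1.0, 2.0, 5.0])
Mo = sparse.coo_matrix((data, (row, col)), shape=(2, 3))

# toarray() adds the duplicate entries together
summed = Mo.toarray()

# plain indexed assignment writes each value separately, so one duplicate is lost
assigned = np.zeros(Mo.shape, Mo.dtype)
assigned[Mo.row, Mo.col] = Mo.data

# np.add.at performs an unbuffered "+=" and therefore accumulates duplicates
accumulated = np.zeros(Mo.shape, Mo.dtype)
np.add.at(accumulated, (Mo.row, Mo.col), Mo.data)

print(summed[0, 0], assigned[0, 0], accumulated[0, 0])
```

For a matrix that came from a csr without duplicate entries, the two approaches should give the same result, so the simple assignment is enough in that case.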

3 Comments

FYI: I created an issue at the scipy github repository for the error: github.com/scipy/scipy/issues/7408
The comment that float16 is not supported raises the question of what other operations do or don't work with this dtype. M*M works, but produces a float32. Evidently it converts the data dtype before passing the arrays to compiled code.
OK, that leaves me with two workarounds: (1) use a csr_matrix with float32, convert to dense and then convert to float16, or (2) use a float16 csr_matrix and convert to dense manually as you suggested above. They both seem to work, but the former appears to be the safer option, as stated in @WarrenWeckesser's issue: "I think float16 is not a supported sparse matrix type". Who knows what else might be going wrong there...
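A rough sketch of the two workarounds, for reference. The function names are hypothetical, and to stay within supported dtypes the sparse matrix is kept as float32 in both variants; only the dense output is float16. This is one possible way to write them, not a tested recipe for the full-size matrix.

```python
import numpy as np
from scipy import sparse


def dense_float16_via_float32(mat):
    """Workaround (1): densify as float32, then downcast to float16.

    Note that the float32 and float16 dense arrays exist at the same time
    during the final cast, so peak memory is higher than the final result.
    """
    dense32 = mat.astype(np.float32).toarray()
    return dense32.astype(np.float16)


def dense_float16_manual(mat):
    """Workaround (2): scatter the nonzeros into a preallocated float16 array."""
    coo = mat.tocoo()
    out = np.zeros(coo.shape, dtype=np.float16)
    # np.add.at accumulates duplicate coordinates; plain assignment
    # out[coo.row, coo.col] = ... would also do if there are none.
    np.add.at(out, (coo.row, coo.col), coo.data.astype(np.float16))
    return out


M = sparse.random(20, 20, 0.2, format='csr', dtype=np.float32, random_state=0)
a = dense_float16_via_float32(M)
b = dense_float16_manual(M)
print(a.dtype, b.dtype, np.array_equal(a, b))
```

For the 77169 x 77169 case, variant (1) would briefly need roughly 22 GB for the float32 array plus 11 GB for the float16 copy, while variant (2) only allocates the float16 result plus the coo index arrays. np.add.at can be slow on large inputs, so the plain assignment may be preferable when duplicates are known to be absent.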
