I'm trying to convert a pandas DataFrame to a SciPy sparse matrix so that I can work efficiently with a large number of features.

However, I haven't found an efficient way to access the values in the DataFrame, so I always run out of memory during the conversion. I tried the two approaches below and neither works. I've researched a lot but didn't find anything better. If anyone has a suggestion, I'd be happy to test it.

from scipy import sparse

sparse_array = sparse.csc_matrix(df.values)      # attempt 1: dense array first
sparse_array = sparse.csc_matrix(df.to_numpy())  # attempt 2: same, via to_numpy()

1 Answer

If your dataframe is very sparse you could convert it column-wise and then stack:

from scipy import sparse

sparse_array = sparse.hstack([sparse.csc_matrix(df[i].values.reshape(-1, 1)) for i in df.columns])
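As a rough sketch of the column-wise idea on a small example: the helper name `columns_to_csc` and the toy frame are made up for illustration, and the approach assumes every column is numeric.

```python
import pandas as pd
from scipy import sparse


def columns_to_csc(df):
    """Convert a numeric DataFrame to a CSC matrix one column at a time."""
    # Each column becomes an (n_rows, 1) sparse block, so the full frame is
    # never turned into a single dense 2-D array.
    blocks = [
        sparse.csc_matrix(df[col].to_numpy().reshape(-1, 1))
        for col in df.columns
    ]
    # format="csc" makes the output format explicit.
    return sparse.hstack(blocks, format="csc")


# Hypothetical toy frame, mostly zeros.
df = pd.DataFrame({"a": [0, 1, 0, 0], "b": [0, 0, 0, 2], "c": [0, 0, 0, 0]})
sparse_array = columns_to_csc(df)
```

Only the two nonzero entries end up stored; `sparse_array.toarray()` can be used on small inputs to check that the result matches the original frame.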

But it is probably best to just convert it to a sparse DataFrame:

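import pandas as pd
from scipy import sparse

for i in df.columns:
    # fill_value=0 makes zeros the unstored value (the float default is NaN)
    df[i] = df[i].astype(pd.SparseDtype(df[i].dtype, fill_value=0))

sparse_array = sparse.csc_matrix(df.sparse.to_coo())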

(Note that there may be an issue if the dtypes are not homogeneous across the DataFrame.)
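Here is a minimal sketch of the sparse-DataFrame route on a toy frame (the frame and variable names are hypothetical). It assumes zero is the value you want left unstored, so `fill_value=0` is passed explicitly; for float columns the default fill value of `pd.SparseDtype` is NaN, which would leave the zeros stored.

```python
import pandas as pd

# Hypothetical toy frame with one integer and one float column.
df = pd.DataFrame({"a": [0, 1, 0, 0], "b": [0.0, 0.0, 2.5, 0.0]})

# Convert every column to a sparse dtype whose implicit (unstored) value is 0.
sparse_df = df.astype(
    {col: pd.SparseDtype(df[col].dtype, fill_value=0) for col in df.columns}
)

# to_coo() returns a scipy COO matrix; tocsc() converts it to CSC.
coo = sparse_df.sparse.to_coo()
sparse_array = coo.tocsc()
```

`sparse_df.sparse.density` reports the fraction of stored values, which is a quick way to check that the zeros were really dropped.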

2 Comments

Hey CJR, thanks for the reply. I tested it and it does seem to work. When you say "not homogeneous", do you mean that I could have an issue if I have both floats and integers, for example? If so, what sort of issue could I run into?
If you're keeping it as a sparse DataFrame there's no issue - the SciPy sparse matrix has a single dtype, though. If you have floats and ints, one will have to be converted to the other if you want a matrix. (If you have a column of strings, even worse - now it's a matrix of Python objects, but it'll probably crash, so good news there.)
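To illustrate the single-dtype point, here is a small sketch with made-up column names showing what the combined dtype looks like before it reaches SciPy:

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical frame mixing an integer column and a float column.
df = pd.DataFrame({"ints": [0, 3, 0], "floats": [0.0, 0.0, 1.5]})

dense = df.to_numpy()
print(dense.dtype)   # float64: the integers were upcast to float

matrix = sparse.csc_matrix(dense)
print(matrix.dtype)  # float64

# With a string column, the combined array falls back to Python objects.
mixed = pd.DataFrame({"ints": [0, 3, 0], "text": ["", "x", ""]})
print(mixed.to_numpy().dtype)  # object
```

Checking `df.dtypes` before converting is a cheap way to spot a non-numeric column early.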
