1

I'm working with recommender systems but I'm struggling with the access times of the scipy sparse matrices.

In this case, I'm implementing TrustSVD so I need an efficient structure to operate both in columns and rows (CSR, CSC). I've thought about using both structures, dictionaries,... but either way this is always too slow, especially compared with the numpy matrix operations.

for u, j in zip(*ratings.nonzero()):
    items_rated_by_u = ratings[u, :].nonzero()[1]
    users_who_rated_j = ratings[:, j].nonzero()[0]
    # More code...

Extra: Each loop takes around 0.033s, so iterating once through 35,000 ratings means to wait 19min per iteration (SGD) and for a minimum of 25 iterations we're talking about 8h. Moreover, here I'm just talking about accessing, if I include the factorization part it would take around 2 days.

3
  • Are the matrices you are using banded? If so, exploiting that could be very helpful. If not, what about a list of list implementation? For a m*n matrix, this has access time~log(n), and fairly space efficient. Commented Jul 26, 2016 at 14:32
  • I suspect you need two versions of the matrix (one a transpose), and need to access the underlying data structures directly. Commented Jul 26, 2016 at 15:20
  • No, the matrices are not banded. Ratings are uniformly disperse in the matrix (actually some items tend to be more rated than others, but this is not relevant here). The other thing is to use this kind of dual or precomputed structures trading memory for speed. But I've noticed that sparse matrices are also pretty slow so I've thought to use two dictionary of arrays (one per axis). Do you have a faster solution in mind? Commented Jul 26, 2016 at 17:32

1 Answer 1

2

When you index a sparse matrix, especially just asking for a row or column, it not only has to select the values, but it also has to construct a new sparse matrix. np.ndarray construction is done in compiled code, but most of the sparse construction is pure Python. The nonzero()[1] construct requires converting the matrix to coo format and picking the row and col attributes (look at its code).

I think you could access your row columns faster by looking at the rows attribute of the lil format, or its transpose:

In [418]: sparse.lil_matrix(np.matrix('0,1,0;1,0,0;0,1,1'))
Out[418]: 
<3x3 sparse matrix of type '<class 'numpy.int32'>'
    with 4 stored elements in LInked List format>
In [419]: M=sparse.lil_matrix(np.matrix('0,1,0;1,0,0;0,1,1'))
In [420]: M.A
Out[420]: 
array([[0, 1, 0],
       [1, 0, 0],
       [0, 1, 1]], dtype=int32)
In [421]: M.rows
Out[421]: array([[1], [0], [1, 2]], dtype=object)
In [422]: M[1,:].nonzero()[1]
Out[422]: array([0], dtype=int32)
In [423]: M[2,:].nonzero()[1]
Out[423]: array([1, 2], dtype=int32)
In [424]: M.T.rows
Out[424]: array([[1], [0, 2], [2]], dtype=object)

You could also access these values in the csr format, but it's a bit more complicated

In [425]: M.tocsr().indices
Out[425]: array([1, 0, 1, 2], dtype=int32)
Sign up to request clarification or add additional context in comments.

1 Comment

Thank you very much. This speed-up the code around x2-2.5, so the final solution was this plus a caché per axis (users, items).

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.