Efficient accessing in sparse matrices

Question

I'm working with recommender systems but I'm struggling with the access times of the scipy sparse matrices.

In this case, I'm implementing TrustSVD so I need an efficient structure to operate both in columns and rows (CSR, CSC). I've thought about using both structures, dictionaries,... but either way this is always too slow, especially compared with the numpy matrix operations.

for u, j in zip(*ratings.nonzero()):
    items_rated_by_u = ratings[u, :].nonzero()[1]
    users_who_rated_j = ratings[:, j].nonzero()[0]
    # More code...

Extra: Each loop takes around 0.033s, so iterating once through 35,000 ratings means to wait 19min per iteration (SGD) and for a minimum of 25 iterations we're talking about 8h. Moreover, here I'm just talking about accessing, if I include the factorization part it would take around 2 days.

Are the matrices you are using banded? If so, exploiting that could be very helpful. If not, what about a list of list implementation? For a m*n matrix, this has access time~log(n), and fairly space efficient. — Oscar Smith
– Oscar Smith, Commented Jul 26, 2016 at 14:32
I suspect you need two versions of the matrix (one a transpose), and need to access the underlying data structures directly. — hpaulj
– hpaulj, Commented Jul 26, 2016 at 15:20
No, the matrices are not banded. Ratings are uniformly disperse in the matrix (actually some items tend to be more rated than others, but this is not relevant here). The other thing is to use this kind of dual or precomputed structures trading memory for speed. But I've noticed that sparse matrices are also pretty slow so I've thought to use two dictionary of arrays (one per axis). Do you have a faster solution in mind? — Salva Carrión
– Salva Carrión, Commented Jul 26, 2016 at 17:32

hpaulj · Accepted Answer · 2016-07-26 17:52:43Z

2

When you index a sparse matrix, especially just asking for a row or column, it not only has to select the values, but it also has to construct a new sparse matrix. np.ndarray construction is done in compiled code, but most of the sparse construction is pure Python. The nonzero()[1] construct requires converting the matrix to coo format and picking the row and col attributes (look at its code).

I think you could access your row columns faster by looking at the rows attribute of the lil format, or its transpose:

In [418]: sparse.lil_matrix(np.matrix('0,1,0;1,0,0;0,1,1'))
Out[418]: 
<3x3 sparse matrix of type '<class 'numpy.int32'>'
    with 4 stored elements in LInked List format>
In [419]: M=sparse.lil_matrix(np.matrix('0,1,0;1,0,0;0,1,1'))
In [420]: M.A
Out[420]: 
array([[0, 1, 0],
       [1, 0, 0],
       [0, 1, 1]], dtype=int32)
In [421]: M.rows
Out[421]: array([[1], [0], [1, 2]], dtype=object)
In [422]: M[1,:].nonzero()[1]
Out[422]: array([0], dtype=int32)
In [423]: M[2,:].nonzero()[1]
Out[423]: array([1, 2], dtype=int32)
In [424]: M.T.rows
Out[424]: array([[1], [0, 2], [2]], dtype=object)

You could also access these values in the csr format, but it's a bit more complicated

In [425]: M.tocsr().indices
Out[425]: array([1, 0, 1, 2], dtype=int32)

answered Jul 26, 2016 at 17:52

hpaulj

233k14 gold badges260 silver badges392 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Salva Carrión Over a year ago

Thank you very much. This speed-up the code around x2-2.5, so the final solution was this plus a caché per axis (users, items).

Collectives™ on Stack Overflow

Efficient accessing in sparse matrices

1 Answer 1

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related