I am trying to write item-item collaborative recommendation code. My full dataset can be found here. I want users as rows, items as columns, and ratings as the values.

My code is as follows:

import pandas as pd     
import numpy as np   
file = pd.read_csv("data.csv", names=['user', 'item', 'rating', 'timestamp'])
table = pd.pivot_table(file, values='rating', index=['user'], columns=['item'])

My data is as follows:

             user        item  rating   timestamp
0  A2EFCYXHNK06IS  5555991584       5   978480000  
1  A1WR23ER5HMAA9  5555991584       5   953424000
2  A2IR4Q0GPAFJKW  5555991584       4  1393545600
3  A2V0KUVAB9HSYO  5555991584       4   966124800
4  A1J0GL9HCA7ELW  5555991584       5  1007683200

And the error is:

Traceback (most recent call last):
  File "D:\python\reco.py", line 9, in <module>
    table=pd.pivot_table(file,values='rating',index=['user'],columns=['item'])
  File "C:\python35\lib\site-packages\pandas\tools\pivot.py", line 133, in pivot_table
    table = agged.unstack(to_unstack)
  File "C:\python35\lib\site-packages\pandas\core\frame.py", line 4047, in unstack
    return unstack(self, level, fill_value)
  File "C:\python35\lib\site-packages\pandas\core\reshape.py", line 402, in unstack
    return _unstack_multiple(obj, level)
  File "C:\python35\lib\site-packages\pandas\core\reshape.py", line 297, in _unstack_multiple
    unstacked = dummy.unstack('__placeholder__')
  File "C:\python35\lib\site-packages\pandas\core\frame.py", line 4047, in unstack
    return unstack(self, level, fill_value)
  File "C:\python35\lib\site-packages\pandas\core\reshape.py", line 406, in unstack
    return _unstack_frame(obj, level, fill_value=fill_value)
  File "C:\python35\lib\site-packages\pandas\core\reshape.py", line 449, in _unstack_frame
    fill_value=fill_value)
  File "C:\python35\lib\site-packages\pandas\core\reshape.py", line 103, in __init__
    self._make_selectors()
  File "C:\python35\lib\site-packages\pandas\core\reshape.py", line 137, in _make_selectors
    mask = np.zeros(np.prod(self.full_shape), dtype=bool)
ValueError: negative dimensions are not allowed
  • Possible duplicate of ValueError: negative dimensions are not allowed Commented Dec 13, 2016 at 23:33
  • @Hamms. Do not mark it as duplicate, I have already seen the link you provided. But none of the answers there is helpful to my situation. I am not doing any matrix multiplication. Commented Dec 13, 2016 at 23:55
  • please include a sample of your data: mcve. It is absolutely critical here, since this pivot_table call works for this sample data: df = pd.DataFrame(np.random.rand(10,4), columns=['user','item','rating','timestamp']). Commented Dec 14, 2016 at 0:04
  • @JulienMarrec I have added the data sample to question. Commented Dec 14, 2016 at 0:16
  • And I have absolutely no problem using your own pivot_table call with the data you provided... Try it yourself: copy the data you provided, load it with file = pd.read_clipboard() and then table=pd.pivot_table(file,values='rating',index=['user'],columns=['item']). You need to provide a MCVE: so post a sample of your data that is sufficient to replicate the error you're having. Commented Dec 14, 2016 at 0:20

1 Answer

I cannot guarantee that this will complete (I got tired of waiting for it to compute), but here's a way to build a sparse dataframe, which should keep memory use down.

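import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix

file = pd.read_csv("data.csv", names=['user', 'item', 'rating', 'timestamp'])

# Sorted unique users and items define the row and column order.
user_u = sorted(file.user.unique())
item_u = sorted(file.item.unique())

# Map each user/item to its integer position in those lists.
# (The categories= keyword and the Sparse* classes below belong to the pandas
# version current when this was written; later releases removed them.)
row = file.user.astype('category', categories=user_u).cat.codes
col = file.item.astype('category', categories=item_u).cat.codes

data = file['rating'].tolist()

# Users x items matrix that stores only the observed ratings.
sparse_matrix = csr_matrix((data, (row, col)), shape=(len(user_u), len(item_u)))

# Wrap each row in a SparseSeries so the DataFrame stays sparse.
df = pd.SparseDataFrame([pd.SparseSeries(sparse_matrix[i].toarray().ravel(), fill_value=0)
                         for i in np.arange(sparse_matrix.shape[0])],
                        index=user_u, columns=item_u, default_fill_value=0)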

See this question for more options.
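For readers on a recent pandas release, where `pd.SparseDataFrame`, `pd.SparseSeries`, and the `categories=` keyword of `astype` are no longer available, the same idea can be sketched with `pd.DataFrame.sparse.from_spmatrix`. This is only a sketch under assumptions: the helper name `ratings_to_sparse_frame` and the three-row sample are made up for illustration, missing ratings come out as 0 rather than NaN, and duplicate (user, item) pairs are summed by `csr_matrix` instead of averaged as `pivot_table` would do. Like the code above, it has not been confirmed to finish in reasonable time on the full dataset.

```python
import pandas as pd
from scipy.sparse import csr_matrix


def ratings_to_sparse_frame(ratings):
    """Build a users x items sparse DataFrame from long-format ratings.

    Hypothetical helper: expects 'user', 'item' and 'rating' columns.
    """
    # Categorical codes give each user/item a compact integer position;
    # the categories themselves come out sorted for sortable values.
    users = ratings["user"].astype("category")
    items = ratings["item"].astype("category")

    matrix = csr_matrix(
        (
            ratings["rating"].to_numpy(),
            (users.cat.codes.to_numpy(), items.cat.codes.to_numpy()),
        ),
        shape=(len(users.cat.categories), len(items.cat.categories)),
    )

    # Every column becomes a sparse column with fill value 0.
    return pd.DataFrame.sparse.from_spmatrix(
        matrix, index=users.cat.categories, columns=items.cat.categories
    )


# Made-up sample in the same layout as the question's data.
sample = pd.DataFrame({
    "user": ["A2EFCYXHNK06IS", "A1WR23ER5HMAA9", "A2IR4Q0GPAFJKW"],
    "item": [5555991584, 5555991584, 5555991585],
    "rating": [5, 5, 4],
})
sparse_df = ratings_to_sparse_frame(sample)
```

If only the similarity computation matters, it may be simpler to keep working with the `csr_matrix` directly and skip the DataFrame wrapper, since item-item similarity can be computed on the sparse matrix itself.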

3 Comments

+1, this is the only way to deal with this data. The full dense ratings matrix will have >127B entries, far too big to fit into memory. You can also use Series.cat.categories to index your sparse data frame, to avoid the list(sorted(...)) thing.
@julien I shall try this. Thanks a lot for your help. I have been stuck on this problem for the last two days.
Let it run and please let me know either way whether it worked. I'm curious.
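
The >127B figure from the first comment also suggests a plausible reason the error mentions negative dimensions: the traceback ends in `np.zeros(np.prod(self.full_shape), dtype=bool)`, and if that product is computed in a 32-bit integer (an assumption about the NumPy build on that Windows install), a value this large wraps around and can come out negative. The sketch below reproduces the effect with hypothetical user and item counts chosen to multiply to 127 billion; the real counts are not given in the question.

```python
import numpy as np


def dense_cell_count(n_rows, n_cols, dtype):
    """Multiply the two dimensions in a fixed-width integer type."""
    shape = np.array([n_rows, n_cols], dtype=dtype)
    # Passing dtype keeps the reduction in that width, so it can wrap.
    return int(np.prod(shape, dtype=dtype))


# Hypothetical counts, not taken from the dataset.
n_users, n_items = 500_000, 254_000

exact = n_users * n_items                                # Python int, no overflow
wrapped = dense_cell_count(n_users, n_items, np.int32)   # wraps around in 32 bits

# Asking NumPy for an array of the wrapped size fails the same way
# as the last frame of the traceback.
try:
    np.zeros(wrapped, dtype=bool)
    message = ""
except ValueError as exc:
    message = str(exc)

print(exact, wrapped, message)
```

Even without the overflow, a dense boolean mask of that size would need well over 100 GB, which is why the sparse representation is the practical route.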
