
I have user/item data for a recommender. I'm splitting it into train and test sets, and I need to make sure that any users or items that appear only in the test data are dropped before evaluating the recommender. My approach works for small datasets, but on large ones it takes forever. Is there a better way to do this?

import pandas as pd

# Test set for removing users or items not in train
te = pd.DataFrame({'user': [1,2,3,1,6,1], 'item':[16,12,19,15,13,12]})
tr = pd.DataFrame({'user': [1,2,3,4,5], 'item':[11,12,13,14,15]})
print("Training_______")
print(tr)
print("\nTesting_______")
print(te)

# Two inner joins (first on 'user', then on 'item') drop every test row whose
# user or item is missing from train; only the original columns are kept
b = pd.merge(pd.merge(te, tr, on='user', suffixes=['', '_d']), tr, on='item', suffixes=['', '_d'])[['user', 'item']]
print("\nSolution_______")
print(b)

Gives:

Training_______
   item  user
0    11     1
1    12     2
2    13     3
3    14     4
4    15     5

Testing_______
   item  user
0    16     1
1    12     2
2    19     3
3    15     1
4    13     6
5    12     1

Solution_______
   user  item
0     1    15
1     1    12
2     2    12

The solution is correct (any new user or item causes the whole row to be removed from test), but it is slow at scale.

Thanks in advance.

1 Answer


I think you can achieve what you want using the isin Series method on each of the columns:

In [11]: te['item'].isin(tr['item']) & te['user'].isin(tr['user'])
Out[11]:
0    False
1     True
2    False
3     True
4    False
5     True
dtype: bool

In [12]: te[te['item'].isin(tr['item']) & te['user'].isin(tr['user'])]
Out[12]:
   item  user
1    12     2
3    15     1
5    12     1
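For reuse, the same mask can be wrapped in a small helper. This is only a sketch: the function name keep_known_rows and its columns parameter are made up for illustration and are not part of pandas; only Series.isin and boolean indexing are doing the work.

```python
import pandas as pd


def keep_known_rows(test, train, columns=("user", "item")):
    """Keep only the test rows whose value in every listed column also occurs in train.

    Hypothetical helper name; it just combines one Series.isin mask per column.
    """
    mask = pd.Series(True, index=test.index)
    for col in columns:
        # isin is a vectorised membership test, so no Python-level row loop is needed
        mask &= test[col].isin(train[col])
    return test[mask]


te = pd.DataFrame({'user': [1, 2, 3, 1, 6, 1], 'item': [16, 12, 19, 15, 13, 12]})
tr = pd.DataFrame({'user': [1, 2, 3, 4, 5], 'item': [11, 12, 13, 14, 15]})

filtered = keep_known_rows(te, tr)
print(filtered)
```

On the sample frames this keeps the rows with index 1, 3 and 5, and unlike the double merge it preserves the original index and row order of te.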

In 0.13 you'll be able to use the new DataFrame isin method (on current master):

In [21]: te[te.isin(tr.to_dict(outtype='list')).all(1)]
Out[21]:
   item  user
1    12     2
3    15     1
5    12     1

Hopefully the syntax will be a bit cleaner by the time of the release:

te[te.isin(tr).all(1)]
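A note for readers on later pandas versions, offered as a sketch rather than a definitive statement of the API history: as far as I know, the to_dict keyword is now spelled orient rather than outtype, and passing a DataFrame to DataFrame.isin matches on index and column labels as well as values, so it is the dict form that performs the per-column membership test wanted here. The variable names allowed, result and aligned are just for illustration.

```python
import pandas as pd

te = pd.DataFrame({'user': [1, 2, 3, 1, 6, 1], 'item': [16, 12, 19, 15, 13, 12]})
tr = pd.DataFrame({'user': [1, 2, 3, 4, 5], 'item': [11, 12, 13, 14, 15]})

# Dict of column name -> list of allowed values (assumes a pandas version
# where the keyword is `orient`; older versions used `outtype`)
allowed = tr.to_dict(orient='list')

# Each column of te is checked against the list stored under the same key
result = te[te.isin(allowed).all(axis=1)]
print(result)

# Passing the DataFrame itself compares cell by cell after aligning on
# index and columns, which is a different question than "is this value in train?"
aligned = te[te.isin(tr).all(axis=1)]
print(aligned)
```

With the sample data, result contains the rows with index 1, 3 and 5, while aligned contains only the row with index 1, because that is the only position where both te and tr hold the same user and item.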

2 Comments

Incredible. Using iteration, I was over 10 minutes. Obviously the wrong approach. Using isin to process array of length 883,918, took less than 3s. Thanks, Andy
@zbinsd awesome, that's quite a difference! :)
