I have user/item data for a recommender. I'm splitting it into train and test sets, and before evaluating the recommender I need to drop any test rows whose user or item does not appear in the training set. My approach works for small datasets, but on large ones it takes forever. Is there a better way to do this?
# Remove test rows whose user or item is not present in train
import pandas as pd

te = pd.DataFrame({'user': [1, 2, 3, 1, 6, 1], 'item': [16, 12, 19, 15, 13, 12]})
tr = pd.DataFrame({'user': [1, 2, 3, 4, 5], 'item': [11, 12, 13, 14, 15]})

print("Training_______")
print(tr)
print("\nTesting_______")
print(te)

# Two successive inner joins: the first keeps only rows whose user is in
# train, the second keeps only rows whose item is in train
b = pd.merge(pd.merge(te, tr, on='user', suffixes=['', '_d']),
             tr, on='item', suffixes=['', '_d'])[['user', 'item']]
print("\nSolution_______")
print(b)
Gives:

Training_______
   item  user
0    11     1
1    12     2
2    13     3
3    14     4
4    15     5

Testing_______
   item  user
0    16     1
1    12     2
2    19     3
3    15     1
4    13     6
5    12     1

Solution_______
   user  item
0     1    15
1     1    12
2     2    12
The solution is correct (any new user or item causes the whole row to be dropped from the test set), but it is just too slow at scale.
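For reference, a minimal alternative sketch using boolean masks with Series.isin, which skips the intermediate merge frames entirely; it produces the same rows on the toy data above, though I haven't verified how it compares at scale:

```python
import pandas as pd

te = pd.DataFrame({'user': [1, 2, 3, 1, 6, 1], 'item': [16, 12, 19, 15, 13, 12]})
tr = pd.DataFrame({'user': [1, 2, 3, 4, 5], 'item': [11, 12, 13, 14, 15]})

# Keep a test row only if its user AND its item both occur in train;
# isin does a hash-based membership check per column, no join needed
mask = te['user'].isin(tr['user']) & te['item'].isin(tr['item'])
b = te[mask]
print(b)
```

This keeps rows (2, 12), (1, 15), and (1, 12), matching the merge-based result (row order differs, since the original test-set order is preserved).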
Thanks in advance.