Python Pandas - Remove values from first dataframe if not in second dataframe

Question

I have user/item data for a recommender. I'm splitting it into test and train data, and I need to be sure that any new users or items in the test data are omitted before evaluating the recommender. My approach works for small datasets, but on large ones it takes forever. Is there a better way to do this?

import pandas as pd

# Test set for removing users or items not in train
te = pd.DataFrame({'item': [16,12,19,15,13,12], 'user': [1,2,3,1,6,1]})
tr = pd.DataFrame({'item': [11,12,13,14,15], 'user': [1,2,3,4,5]})
print("Training_______")
print(tr)
print("\nTesting_______")
print(te)

# By using two joins and selecting the proper columns, all 'new' members of test set are removed
b = pd.merge(pd.merge(te, tr, on='user', suffixes=['', '_d']),
             tr, on='item', suffixes=['', '_d'])[['user', 'item']]
print("\nSolution_______")
print(b)

Gives:

Training_______
   item  user
0    11     1
1    12     2
2    13     3
3    14     4
4    15     5

Testing_______
   item  user
0    16     1
1    12     2
2    19     3
3    15     1
4    13     6
5    12     1

Solution_______
   user  item
0     1    15
1     1    12
2     2    12

The solution is correct (any new user or item causes the whole row to be removed from test), but it is slow at scale.

Thanks in advance.

Andy Hayden · Accepted Answer · 2013-08-07 20:37:17Z

I think you can achieve what you want using the isin Series method on each of the columns:

In [11]: te['item'].isin(tr['item']) & te['user'].isin(tr['user'])
Out[11]:
0    False
1     True
2    False
3     True
4    False
5     True
dtype: bool

In [12]: te[te['item'].isin(tr['item']) & te['user'].isin(tr['user'])]
Out[12]:
   item  user
1    12     2
3    15     1
5    12     1
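
The same filter can be wrapped in a small helper. This is only a minimal sketch: the function name drop_unseen and its cols parameter are hypothetical and not part of pandas.

```python
import pandas as pd


def drop_unseen(test, train, cols=("user", "item")):
    """Keep only rows of `test` whose value in every column of `cols`
    also appears somewhere in the same column of `train`.
    (Hypothetical helper, not a pandas function.)"""
    mask = pd.Series(True, index=test.index)
    for col in cols:
        # Vectorised membership test, one column at a time
        mask &= test[col].isin(train[col])
    return test[mask]


te = pd.DataFrame({'item': [16, 12, 19, 15, 13, 12], 'user': [1, 2, 3, 1, 6, 1]})
tr = pd.DataFrame({'item': [11, 12, 13, 14, 15], 'user': [1, 2, 3, 4, 5]})

filtered = drop_unseen(te, tr)
print(filtered)
```

Unlike the double merge, a boolean mask keeps the original index labels of te (1, 3 and 5 here) and cannot duplicate rows when train contains repeated users or items. Call reset_index(drop=True) on the result if a fresh index is wanted.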

In 0.13 you'll be able to use the new DataFrame isin method (on current master):

In [21]: te[te.isin(tr.to_dict(outtype='list')).all(1)]
Out[21]:
   item  user
1    12     2
3    15     1
5    12     1

Hopefully the syntax will be a bit nicer by the time of release:

te[te.isin(tr).all(1)]
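
For readers on a released pandas, here is a sketch of how the DataFrame-level isin behaves, as far as I understand its documented behaviour. Given a dict, each column is checked against the list stored under its own key. Given another DataFrame, values are compared element by element with both index and column labels aligned, so te.isin(tr) is not a membership test against tr's columns. The orient='list' spelling of to_dict is the one I know from current pandas; outtype above comes from the development version of the time.

```python
import pandas as pd

te = pd.DataFrame({'item': [16, 12, 19, 15, 13, 12], 'user': [1, 2, 3, 1, 6, 1]})
tr = pd.DataFrame({'item': [11, 12, 13, 14, 15], 'user': [1, 2, 3, 4, 5]})

# Dict form: te['item'] is tested against tr's items, te['user'] against tr's users
allowed = tr.to_dict(orient='list')
by_dict = te[te.isin(allowed).all(axis=1)]

# DataFrame form: cell (i, col) of te is compared with cell (i, col) of tr,
# so a row passes only if it equals the row of tr with the same index label
by_frame = te[te.isin(tr).all(axis=1)]

print(by_dict)
print(by_frame)
```

On this data the dict form keeps rows 1, 3 and 5, matching the Series-based filter, while the DataFrame form keeps only row 1, the one row that happens to be identical in both frames at the same index.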

2 Comments

zbinsd Over a year ago

Incredible. Using iteration, I was at over 10 minutes. Obviously the wrong approach. Using isin to process an array of length 883,918 took less than 3 s. Thanks, Andy

Andy Hayden Over a year ago

@zbinsd awesome, that's quite a difference! :)
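
To try a comparison like this on your own machine, a rough timing harness might look like the following. The data is synthetic, the sizes and the seed are arbitrary assumptions, and no timings are claimed here. The merge variant de-duplicates the training keys first so that the joins cannot multiply rows.

```python
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only
n = 200_000                     # arbitrary size, adjust to taste

tr = pd.DataFrame({'item': rng.integers(0, 20_000, n),
                   'user': rng.integers(0, 50_000, n)})
te = pd.DataFrame({'item': rng.integers(0, 25_000, n),
                   'user': rng.integers(0, 60_000, n)})


def with_isin(te, tr):
    # Boolean mask built from two column-wise membership tests
    return te[te['item'].isin(tr['item']) & te['user'].isin(tr['user'])]


def with_merge(te, tr):
    # Inner joins against the unique training keys
    users = tr[['user']].drop_duplicates()
    items = tr[['item']].drop_duplicates()
    return te.merge(users, on='user').merge(items, on='item')


for fn in (with_isin, with_merge):
    start = time.perf_counter()
    out = fn(te, tr)
    elapsed = time.perf_counter() - start
    print(fn.__name__, len(out), 'rows', round(elapsed, 3), 's')
```

Both functions should keep the same number of rows; only the index labels and possibly the row order differ.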
