I have two DataFrames structured like the following:

user_id movie_id    rating
438     588         5
758     588         5
913     588         5
1024    588         5
1214    588         5

user_id movie_id    rating
45      3578        3
321     3578        3
467     3578        3
758     3578        3
1024    3578        3
1381    3578        3

Is there a Pandas-native way to isolate in a list the values for user_id which appear in both DataFrames?

For the above example, the expected output would be:

[758, 1024]

--

Note: to help bootstrap Data Science Stack Exchange, this question has also been posted with more background on datascience.stackexchange.com. If you are also a DSSE user, please help that site grow by answering there directly.

  • Regarding the note: meta.stackexchange.com/questions/64068/… Commented Oct 29, 2016 at 15:19
  • @ayhan this is just a way to advertise for dsse :) -- if I just wanted good answers I would have posted it on SO only Commented Oct 29, 2016 at 15:40

2 Answers

You can use the numpy.intersect1d() function:

In [277]: np.intersect1d(a.user_id, b.user_id).tolist()
Out[277]: [758, 1024]
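For anyone who wants to run this without the original session, here is a minimal self-contained sketch. The frames `a` and `b` are rebuilt by hand from the sample data in the question, so their construction is an assumption about how the asker's data looks, not the asker's actual code.

```python
import numpy as np
import pandas as pd

# Sample frames rebuilt from the question's tables (assumed construction).
a = pd.DataFrame({"user_id": [438, 758, 913, 1024, 1214],
                  "movie_id": 588, "rating": 5})
b = pd.DataFrame({"user_id": [45, 321, 467, 758, 1024, 1381],
                  "movie_id": 3578, "rating": 3})

# intersect1d returns the sorted, unique values present in both inputs.
common = np.intersect1d(a["user_id"], b["user_id"]).tolist()
print(common)  # [758, 1024]
```

Because the result is sorted and de-duplicated, it stays a plain list of distinct ids even when either frame repeats a user_id.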

Or use pd.core.common.intersection() (an internal pandas helper, not public API), but it seems to be slow (at least on my notebook, for the aa and bb DataFrames; see the setup below):

In [307]: pd.core.common.intersection(a.user_id, b.user_id).tolist()
Out[307]: [1024, 758]

Timing for aa DF (50K rows) and bb DF (60K rows):

In [294]: aa = pd.concat([a] * 10**4, ignore_index=True)

In [295]: bb = pd.concat([b] * 10**4, ignore_index=True)

In [296]: aa.shape
Out[296]: (50000, 3)

In [297]: bb.shape
Out[297]: (60000, 3)

In [298]: %timeit aa.ix[aa.user_id.isin(bb.user_id),'user_id'].tolist()
10 loops, best of 3: 41.8 ms per loop

In [299]: %timeit np.intersect1d(aa.user_id, bb.user_id).tolist()
100 loops, best of 3: 5.36 ms per loop

In [300]: %timeit pd.merge(aa, bb, on='user_id').user_id.tolist()
...
skipped
...
MemoryError:

In [308]: %timeit pd.core.common.intersection(aa.user_id, bb.user_id).tolist()
10 loops, best of 3: 52.8 ms per loop
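The MemoryError from merge is probably explained by the duplicated keys: an inner merge pairs every left row with every right row that has the same user_id, so with each shared id repeated 10**4 times on both sides, each id alone would yield 10**8 rows. The sketch below uses a smaller repeat count (100, my choice, not the one timed above) to show the blow-up, and then one possible workaround, which is to drop duplicate ids before merging.

```python
import pandas as pd

# Sample frames rebuilt from the question's tables (assumed construction).
a = pd.DataFrame({"user_id": [438, 758, 913, 1024, 1214],
                  "movie_id": 588, "rating": 5})
b = pd.DataFrame({"user_id": [45, 321, 467, 758, 1024, 1381],
                  "movie_id": 3578, "rating": 3})

# Smaller repeat count than in the timings above, so the merge fits in memory.
aa = pd.concat([a] * 100, ignore_index=True)
bb = pd.concat([b] * 100, ignore_index=True)

# Two shared ids, each repeated 100 times per side: 2 * 100 * 100 rows.
merged_rows = len(pd.merge(aa, bb, on="user_id"))
print(merged_rows)  # 20000

# Workaround: keep only distinct ids on both sides before merging.
ua = aa[["user_id"]].drop_duplicates()
ub = bb[["user_id"]].drop_duplicates()
common = pd.merge(ua, ub, on="user_id")["user_id"].tolist()
print(sorted(common))  # [758, 1024]
```

With the duplicates dropped first, the merge output has at most one row per shared id, regardless of how many times each id repeats in the inputs.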

3 Comments

You omitted your first solution. Can you add timings?
@jezrael, it hangs on my notebook :)
No problem. Btw, it is interesting that merge fails for you :), and the numpy solution is obviously the fastest. ;)

You can use isin:

print (df1.user_id.isin(df2.user_id))
0    False
1     True
2    False
3     True
4    False
Name: user_id, dtype: bool

print (df1[df1.user_id.isin(df2.user_id)])
   user_id  movie_id  rating
1      758       588       5
3     1024       588       5

print (df1.loc[df1.user_id.isin(df2.user_id),'user_id'])
1     758
3    1024
Name: user_id, dtype: int64

print (df1.loc[df1.user_id.isin(df2.user_id),'user_id'].tolist())
[758, 1024]

Another solution with merge:

print (pd.merge(df1,df2, on='user_id').user_id.tolist())
[758, 1024]
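One caveat: both isin and merge return one entry per matching row, so a user_id that occurs several times in df1 shows up several times in the list. A minimal sketch of de-duplicating the isin result follows. The extra row appended to df1 is hypothetical, added only to show the effect, and the selection uses .loc because .ix was removed in pandas 1.0.

```python
import pandas as pd

# df1 rebuilt from the question, plus one hypothetical repeated user (758).
df1 = pd.DataFrame({"user_id": [438, 758, 913, 1024, 1214, 758],
                    "movie_id": 588, "rating": 5})
df2 = pd.DataFrame({"user_id": [45, 321, 467, 758, 1024, 1381],
                    "movie_id": 3578, "rating": 3})

# Boolean mask: True where df1's user_id also occurs in df2.
mask = df1["user_id"].isin(df2["user_id"])

# One entry per matching row, so 758 is listed twice.
with_dupes = df1.loc[mask, "user_id"].tolist()
print(with_dupes)  # [758, 1024, 758]

# unique() keeps the first occurrence of each value, in order of appearance.
common = df1.loc[mask, "user_id"].unique().tolist()
print(common)  # [758, 1024]
```

Unlike np.intersect1d, this keeps the ids in the order they first appear in df1 instead of sorting them.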
