Pandas - common values for a particular column in two distinct dataframes

Question

I have two DataFrames structured like the following:

user_id movie_id    rating
438     588         5
758     588         5
913     588         5
1024    588         5
1214    588         5

user_id movie_id    rating
45      3578        3
321     3578        3
467     3578        3
758     3578        3
1024    3578        3
1381    3578        3

Is there a Pandas-native way to isolate in a list the values for user_id which appear in both DataFrames?

For the above example, the expected output would be:

[758, 1024]

--

Note: to help bootstrap Data Science Stack Exchange, this question has also been posted with more background on datascience.stackexchange.com. If you are also a DSSE user, please help that site grow by answering directly there.

Regarding the note: meta.stackexchange.com/questions/64068/… – user2285236, Commented Oct 29, 2016 at 15:19
@ayhan this is just a way to advertise for dsse :) -- if I just wanted good answers I would have posted it on SO only – Jivan, Commented Oct 29, 2016 at 15:40

MaxU - stand with Ukraine · Accepted Answer · 2016-10-29 15:26:42Z

1

You can use the numpy.intersect1d() function, which returns the sorted, unique values present in both inputs:

In [277]: np.intersect1d(a.user_id, b.user_id).tolist()
Out[277]: [758, 1024]
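For anyone who wants to run this outside the IPython session, here is a minimal self-contained sketch. It assumes the two frames from the question are named a and b (as in this answer) and rebuilds them by hand from the tables in the question.

```python
import numpy as np
import pandas as pd

# Frames rebuilt from the tables in the question (names a and b as used in this answer).
a = pd.DataFrame({
    "user_id": [438, 758, 913, 1024, 1214],
    "movie_id": [588] * 5,
    "rating": [5] * 5,
})
b = pd.DataFrame({
    "user_id": [45, 321, 467, 758, 1024, 1381],
    "movie_id": [3578] * 6,
    "rating": [3] * 6,
})

# intersect1d returns the sorted, de-duplicated values found in both columns.
common = np.intersect1d(a["user_id"], b["user_id"]).tolist()
print(common)  # [758, 1024]
```

Because the result is de-duplicated, it stays two elements long even when the inputs contain repeated user_id values.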

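Alternatively, there is pd.core.common.intersection(), but it seems to be slow (at least on my notebook for the aa and bb DataFrames; see the setup below). Note that pd.core is internal rather than public pandas API, so this helper may not exist in other pandas versions, and the result is not sorted: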

In [307]: pd.core.common.intersection(a.user_id, b.user_id).tolist()
Out[307]: [1024, 758]

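Timing for the aa DataFrame (50K rows) and the bb DataFrame (60K rows). The .ix indexer in the first timing has since been removed from pandas; .loc is its replacement: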

In [294]: aa = pd.concat([a] * 10**4, ignore_index=True)

In [295]: bb = pd.concat([b] * 10**4, ignore_index=True)

In [296]: aa.shape
Out[296]: (50000, 3)

In [297]: bb.shape
Out[297]: (60000, 3)

In [298]: %timeit aa.ix[aa.user_id.isin(bb.user_id),'user_id'].tolist()
10 loops, best of 3: 41.8 ms per loop

In [299]: %timeit np.intersect1d(aa.user_id, bb.user_id).tolist()
100 loops, best of 3: 5.36 ms per loop

In [300]: %timeit pd.merge(aa, bb, on='user_id').user_id.tolist()
...
skipped
...
MemoryError:

In [308]: %timeit pd.core.common.intersection(aa.user_id, bb.user_id).tolist()
10 loops, best of 3: 52.8 ms per loop
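A plausible explanation for the MemoryError above (my reading, not something measured in this answer): merging on a key with repeated values is a many-to-many join. Each of the two shared ids appears 10**4 times in aa and 10**4 times in bb, so the merge would try to build 2 * 10**4 * 10**4 = 2 * 10**8 rows. The sketch below shows the same effect with a smaller, hypothetical repeat count n, and one way around it by de-duplicating before the merge.

```python
import pandas as pd

a = pd.DataFrame({"user_id": [438, 758, 913, 1024, 1214]})
b = pd.DataFrame({"user_id": [45, 321, 467, 758, 1024, 1381]})

n = 100  # hypothetical, much smaller than the 10**4 used in the timings
aa = pd.concat([a] * n, ignore_index=True)
bb = pd.concat([b] * n, ignore_index=True)

# Many-to-many merge: every copy of a shared id pairs with every copy on the other side.
naive_rows = len(pd.merge(aa, bb, on="user_id"))
print(naive_rows)  # 2 shared ids * n * n = 20000

# De-duplicate the key column first, so the merge is one-to-one.
ua = aa[["user_id"]].drop_duplicates()
ub = bb[["user_id"]].drop_duplicates()
common = pd.merge(ua, ub, on="user_id")["user_id"].tolist()
print(sorted(common))  # [758, 1024]
```

With n = 10**4 the naive merge grows quadratically to 2 * 10**8 rows, while the de-duplicated version still merges 5 rows against 6.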

edited Oct 29, 2016 at 15:26

answered Oct 29, 2016 at 14:56


3 Comments

jezrael Over a year ago

And you omitted your first solution. Can you add timings?

MaxU - stand with Ukraine Over a year ago

@jezrael, it hangs on my notebook :)

jezrael Over a year ago

No problem. Btw, it is interesting that merge fails for you :) , and the numpy solution is obviously the fastest. ;)

jezrael · Accepted Answer · 2016-10-29 14:54:42Z

1

You can use isin:

print (df1.user_id.isin(df2.user_id))
0    False
1     True
2    False
3     True
4    False
Name: user_id, dtype: bool

print (df1[df1.user_id.isin(df2.user_id)])
   user_id  movie_id  rating
1      758       588       5
3     1024       588       5

print (df1.loc[df1.user_id.isin(df2.user_id),'user_id'])
1     758
3    1024
Name: user_id, dtype: int64

print (df1.loc[df1.user_id.isin(df2.user_id),'user_id'].tolist())
[758, 1024]

Another solution with merge:

print (pd.merge(df1,df2, on='user_id').user_id.tolist())
[758, 1024]
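One caveat worth sketching: the isin filter keeps one entry per matching row of df1, so a user_id that is repeated in df1 shows up repeatedly in the list. The example below is a hedged illustration on a hypothetical df1 with one duplicated id (the question's data has none), showing two ways to get each shared id once.

```python
import pandas as pd

# Hypothetical variant of the question's data: 758 appears twice in df1.
df1 = pd.DataFrame({"user_id": [438, 758, 913, 1024, 1214, 758]})
df2 = pd.DataFrame({"user_id": [45, 321, 467, 758, 1024, 1381]})

mask = df1["user_id"].isin(df2["user_id"])

# One entry per matching row, duplicates included.
with_dupes = df1.loc[mask, "user_id"].tolist()
print(with_dupes)  # [758, 1024, 758]

# Drop repeats while keeping first-seen order.
unique_ids = df1.loc[mask, "user_id"].drop_duplicates().tolist()
print(unique_ids)  # [758, 1024]

# Plain Python set intersection, sorted for a stable order.
via_set = sorted(set(df1["user_id"]) & set(df2["user_id"]))
print(via_set)  # [758, 1024]
```

On the question's data, where every user_id is unique within a frame, all three lists are identical.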

answered Oct 29, 2016 at 14:54
