
I want to remove rows from one dataframe when another dataframe contains the same rows. However, I don't want to remove every matching row, only as many copies as appear in the other dataframe. Refer to this example:

df1

   col1  col2
0     1    10
1     1    10
2     2    11
3     3    12
4     1    10

df2

   col1  col2
0     1    10
1     2    11
2     1    10
3     3    12
4     3    12

Desired output:

df1

   col1  col2
      1    10

Because df1 has three rows of 1,10 while df2 has two rows of 1,10, you remove two from each, leaving one in df1. If df1 had four such rows, I would want two rows of 1,10 left in df1. The same applies to df2 below:

df2

   col1  col2
      3    12
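The counting-and-subtracting logic above can be sketched with `value_counts` (a minimal illustration using the example frames; this only computes the surplus counts, not the final dataframes):

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 1, 2, 3, 1], 'col2': [10, 10, 11, 12, 10]})
df2 = pd.DataFrame({'col1': [1, 2, 1, 3, 3], 'col2': [10, 11, 10, 12, 12]})

# Count each (col1, col2) pair in both frames and subtract;
# whatever is left over in df1 is what should survive there.
c1 = df1.value_counts()
c2 = df2.value_counts()
leftover = c1.sub(c2, fill_value=0).clip(lower=0).astype(int)
print(leftover[leftover > 0])
```

Here `leftover` shows one surplus row of 1,10 for df1, matching the desired output.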

My attempt:

I was thinking of counting how many duplicates are in each dataframe and creating new df1 and df2 by subtracting the dupe_count, but I'm wondering if there's a more efficient way.

df1g = df1.groupby(df1.columns.tolist()).size().reset_index(name='dupe_count')
df2g = df2.groupby(df2.columns.tolist()).size().reset_index(name='dupe_count')
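For what it's worth, the dupe_count idea can be carried through roughly like this (a sketch, not necessarily the most efficient route; the surplus counts are turned back into rows with `Index.repeat`):

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 1, 2, 3, 1], 'col2': [10, 10, 11, 12, 10]})
df2 = pd.DataFrame({'col1': [1, 2, 1, 3, 3], 'col2': [10, 11, 10, 12, 12]})

cols = df1.columns.tolist()
df1g = df1.groupby(cols).size().reset_index(name='dupe_count')
df2g = df2.groupby(cols).size().reset_index(name='dupe_count')

# Subtract df2's counts from df1's and rebuild df1 from the surplus.
merged = df1g.merge(df2g, on=cols, how='left', suffixes=('_1', '_2'))
surplus = (merged['dupe_count_1'] - merged['dupe_count_2'].fillna(0)).clip(lower=0).astype(int)
new_df1 = merged.loc[merged.index.repeat(surplus), cols].reset_index(drop=True)
print(new_df1)
```

Swapping the roles of df1 and df2 gives the reduced df2 the same way.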

2 Answers


This is a non-trivial problem, but merge is your friend:

a, b = (df.assign(count=df.groupby([*df]).cumcount()) for df in (df1, df2))    
df1[a.merge(b, on=[*a], indicator=True, how='left').eval('_merge == "left_only"')]

   col1  col2
4     1    10

The idea here is to add a cumcount column so duplicate rows become distinguishable (each copy gets an occurrence number). We can then see which rows aren't matched in a subsequent merge.

a
   col1  col2  count
0     1    10      0
1     1    10      1
2     2    11      0
3     3    12      0
4     1    10      2

b
   col1  col2  count
0     1    10      0
1     2    11      0
2     1    10      1
3     3    12      0
4     3    12      1

a.merge(b, on=[*a], indicator=True, how='left')
   col1  col2  count     _merge
0     1    10      0       both
1     1    10      1       both
2     2    11      0       both
3     3    12      0       both
4     1    10      2  left_only

_.eval('_merge == "left_only"')
0    False
1    False
2    False
3    False
4     True
dtype: bool

If you need to get unmatched rows from both df1 and df2, use an outer merge:

out = a.merge(b, on=[*a], indicator=True, how='outer')
df1_filter = (
    out.query('_merge == "left_only"').drop(['count','_merge'], axis=1))
df2_filter = (
    out.query('_merge == "right_only"').drop(['count','_merge'], axis=1))

df1_filter
   col1  col2
4     1    10

df2_filter
   col1  col2
5     3    12


Here's another approach which uses repeat:

# count of the rows
c1 = df1.groupby(['col1', 'col2']).size()
c2 = df2.groupby(['col1', 'col2']).size()

# repeat the rows by values
(c1.repeat((c1-c2).clip(0))
   .reset_index()
   .drop(0, axis=1)
)
#   col1    col2
# 0 1   10

(c2.repeat((c2-c1).clip(0))
   .reset_index()
   .drop(0, axis=1)
)
#   col1    col2
# 0 3   12

1 Comment

The only downside compared to the other answer is that it errors when there are NaN values in some columns: groupby().size() raises "ValueError: Length of passed values is x, index implies 0". Maybe it will get fixed in 0.25+.
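On modern pandas (1.1+), passing `dropna=False` to groupby keeps NaN keys as their own group, which sidesteps the mismatch (a minimal sketch with made-up NaN-containing data, just to show the counts line up):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a NaN key, to illustrate the workaround.
df1 = pd.DataFrame({'col1': [1, 1, np.nan], 'col2': [10, 10, 12]})

# dropna=False counts the (NaN, 12) row instead of silently dropping it,
# so the group sizes cover every row in the frame.
c1 = df1.groupby(['col1', 'col2'], dropna=False).size()
print(c1)
```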
