
I want to remove rows from one dataframe when another dataframe contains the same rows. However, I don't want to remove every matching row, only as many copies as appear in the other dataframe. Refer to this example:

df1

   col1  col2
0     1    10
1     1    10
2     2    11
3     3    12
4     1    10

df2

   col1  col2
0     1    10
1     2    11
2     1    10
3     3    12
4     3    12

Desired output:

df1

   col1  col2
      1    10

Because df1 has three rows of 1,10 while df2 has two rows of 1,10, you remove two from each, leaving one in df1. If df1 had four such rows, I would want two rows of 1,10 left in df1. The same applies to df2 below:

df2

   col1  col2
      3    12
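The counting-and-subtracting logic above can be sketched with `value_counts` (a minimal illustration using the example frames; this only computes the surplus counts, not the final dataframes):

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 1, 2, 3, 1], 'col2': [10, 10, 11, 12, 10]})
df2 = pd.DataFrame({'col1': [1, 2, 1, 3, 3], 'col2': [10, 11, 10, 12, 12]})

# Count each (col1, col2) pair in both frames and subtract;
# whatever is left over in df1 is what should survive there.
c1 = df1.value_counts()
c2 = df2.value_counts()
leftover = c1.sub(c2, fill_value=0).clip(lower=0).astype(int)
print(leftover[leftover > 0])
```

Here `leftover` shows one surplus row of 1,10 for df1, matching the desired output.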

My attempt:

I was thinking of counting how many duplicates are in each dataframe and creating new df1 and df2 by subtracting the dupe_count, but I'm wondering if there's a more efficient way.

df1g = df1.groupby(df1.columns.tolist()).size().reset_index(name='dupe_count')
df2g = df2.groupby(df2.columns.tolist()).size().reset_index(name='dupe_count')
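For what it's worth, the dupe_count idea can be carried through roughly like this (a sketch, not necessarily the most efficient route; the surplus counts are turned back into rows with `Index.repeat`):

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 1, 2, 3, 1], 'col2': [10, 10, 11, 12, 10]})
df2 = pd.DataFrame({'col1': [1, 2, 1, 3, 3], 'col2': [10, 11, 10, 12, 12]})

cols = df1.columns.tolist()
df1g = df1.groupby(cols).size().reset_index(name='dupe_count')
df2g = df2.groupby(cols).size().reset_index(name='dupe_count')

# Subtract df2's counts from df1's and rebuild df1 from the surplus.
merged = df1g.merge(df2g, on=cols, how='left', suffixes=('_1', '_2'))
surplus = (merged['dupe_count_1'] - merged['dupe_count_2'].fillna(0)).clip(lower=0).astype(int)
new_df1 = merged.loc[merged.index.repeat(surplus), cols].reset_index(drop=True)
print(new_df1)
```

Swapping the roles of df1 and df2 gives the reduced df2 the same way.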

2 Answers


This is a non-trivial problem, but merge is your friend:

a, b = (df.assign(count=df.groupby([*df]).cumcount()) for df in (df1, df2))    
df1[a.merge(b, on=[*a], indicator=True, how='left').eval('_merge == "left_only"')]

   col1  col2
4     1    10

The idea here is to add a cumcount column so duplicate rows become distinguishable (each copy gets an occurrence number). We can then see which rows aren't matched in a subsequent merge.

a
   col1  col2  count
0     1    10      0
1     1    10      1
2     2    11      0
3     3    12      0
4     1    10      2

b
   col1  col2  count
0     1    10      0
1     2    11      0
2     1    10      1
3     3    12      0
4     3    12      1

a.merge(b, on=[*a], indicator=True, how='left')
   col1  col2  count     _merge
0     1    10      0       both
1     1    10      1       both
2     2    11      0       both
3     3    12      0       both
4     1    10      2  left_only

_.eval('_merge == "left_only"')
0    False
1    False
2    False
3    False
4     True
dtype: bool

If you need to get unmatched rows from both df1 and df2, use an outer merge:

out = a.merge(b, on=[*a], indicator=True, how='outer')
df1_filter = (
    out.query('_merge == "left_only"').drop(['count','_merge'], axis=1))
df2_filter = (
    out.query('_merge == "right_only"').drop(['count','_merge'], axis=1))

df1_filter
   col1  col2
4     1    10

df2_filter
   col1  col2
5     3    12


Here's another approach which uses repeat:

# count of the rows
c1 = df1.groupby(['col1', 'col2']).size()
c2 = df2.groupby(['col1', 'col2']).size()

# repeat the rows by values
(c1.repeat((c1-c2).clip(0))
   .reset_index()
   .drop(0, axis=1)
)
#   col1    col2
# 0 1   10

(c2.repeat((c2-c1).clip(0))
   .reset_index()
   .drop(0, axis=1)
)
#   col1    col2
# 0 3   12

1 Comment

The only downside compared to the other answer is that it errors when there are NaN values in some columns: groupby().size() raises "ValueError: Length of passed values is x, index implies 0". Maybe it will get fixed in 0.25+.
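On modern pandas (1.1+), passing `dropna=False` to groupby keeps NaN keys as their own group, which sidesteps the mismatch (a minimal sketch with made-up NaN-containing data, just to show the counts line up):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a NaN key, to illustrate the workaround.
df1 = pd.DataFrame({'col1': [1, 1, np.nan], 'col2': [10, 10, 12]})

# dropna=False counts the (NaN, 12) row instead of silently dropping it,
# so the group sizes cover every row in the frame.
c1 = df1.groupby(['col1', 'col2'], dropna=False).size()
print(c1)
```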
