11

I have a dataframe like this:

    ID1    ID2
0   foo    bar
1   fizz   buzz

And another like this:

    ID1    ID2    Count    Code   
0   abc    def      1        A
1   fizz   buzz     5        A
2   fizz1  buzz2    3        C
3   foo    bar      6        Z
4   foo    bar      6        Z

What I would like to do is filter the second dataframe where ID1 and ID2 match a row in the first dataframe, and whenever there's a match I want to remove that row from the first dataframe to avoid duplicates. This would yield a dataframe that looks like this:

    ID1    ID2    Count    Code   
1   fizz   buzz     5        A
3   foo    bar      6        Z

I know I can do this by nesting for loops, stepping through all the rows, and manually removing a row from the first frame whenever I get a match but I am wondering if there is a more pythonic way to do this. I am not experienced in pandas so there may be a much cleaner way to do that I do not know about. I was previously using .isin() but had to scrap it. Each ID pair can exist in the dataframe up to N times and I need the filtered frame to contain between 0 and N instances of a pair.

5 Answers 5

9

Use merge with drop_duplicates, if only same columns for join in both df:

df = pd.merge(df1,df2.drop_duplicates())
print (df)
    ID1   ID2  Count Code
0   foo   bar      6    Z
1  fizz  buzz      5    A

If need check dupes only in ID columns:

df = pd.merge(df1,df2.drop_duplicates(subset=['ID1','ID2']))
print (df)
    ID1   ID2  Count Code
0   foo   bar      6    Z
1  fizz  buzz      5    A

If more columns are overlaping add parameter on:

df = pd.merge(df1, df2.drop_duplicates(), on=['ID1','ID2'])

If not remove dupe rows:

df = pd.merge(df1,df2)
print (df)
    ID1   ID2  Count Code
0   foo   bar      6    Z
1   foo   bar      6    Z
2  fizz  buzz      5    A
Sign up to request clarification or add additional context in comments.

4 Comments

Sorry I get different number of entries comparing with your method with df1.loc[(df1.ID1.isin(df1.ID1))&(df1.ID2.isin(df2.ID2)),:].drop_duplicates() with my real data. Just wonder why this issue happens?
@ahbon - yes, I think I know. Problem is if use double isin it testing each column separately, not pairs ID1, ID2
Understand, so your solution is more what I need to use. Anyway is it possible to use isin to realize same method as merge as in your solutions for pairs ID1, ID2?
@ahbon - theoretically yes, if create tuples of ID1, ID2 for both DataFrames and then compare by isin, but it is slow, beter here is merge
4

Using isin on a list of tuples

df2[
    pd.Series(
        list(zip(df2.ID1.values, df2.ID2.values))
    ).isin(list(zip(df1.ID1.values, df1.ID2.values)))
]

    ID1   ID2  Count Code
1  fizz  buzz      5    A
3   foo   bar      6    Z
4   foo   bar      6    Z

Comments

3

Try this:

df2.merge(df1[['ID1','ID2']])

Comments

3

Or maybe try this ?

df.loc[(df.ID1.isin(df1.ID1))&(df.ID2.isin(df1.ID2)),:].drop_duplicates()


Out[224]: 
    ID1   ID2  Count Code
1  fizz  buzz      5    A
3   foo   bar      6    Z

Comments

2

Merge was almost what I wanted, but didn't quite do the job because I have an odd set of requirements where I need to filter out some duplicates but not all the duplicates. A regular merge doesn't work because that keeps all the duplicates and drop_duplicates() doesn't work because I need to allow some duplicates. I ended up going with the method I described in the question and nested for loops.

temp_frame = pd.DataFrame(columns.df2.columns)
for i in xrange(len(df2)):
    for ii in xrange(len(df1)):
        if df2['ID1'].iloc[i] == df1['ID1'].iloc[ii] and df2['ID2'].iloc[i] == df1['ID2'].iloc[ii]:
            df1.drop(df1.index[ii], inplace=True)
            temp_frame = temp_frame.append(df2.iloc[i], ignore_index=True)
            break
df1 = temp_frame.copy()

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.