Split pandas dataframe into two dataframes based on another dataframe

Question

I have searched Stack Overflow for an answer to this. There are similar questions, but I have tried adapting their accepted answers and I'm struggling to get the result I want.

I have a dataframe:

import pandas as pd

df = pd.DataFrame({'Customer': ['A', 'B', 'C', 'D'],
                   'Sales': [100, 200, 300, 400],
                   'Cost': [2.25, 2.50, 2.10, 3.00]})

and another one:

split = pd.DataFrame({'Customer': ['B', 'D']})

I want to create two new dataframes from the original dataframe df: one containing the rows whose Customer appears in split, and the other containing the rows whose Customer does not. Both new dataframes need to keep the original structure of df.

I have explored isin, merge, drop and loops, but there must be an elegant way to do what appears to be a simple task?

How about using join? You should be able to do your separation with a join on "Customer". Try it and, if you don't succeed, update your question with your attempt.
– ma3oun, Commented Jun 24, 2019 at 12:02
Thanks for the steer. I have tried result = pd.merge(df, split, on='Customer', how='outer') but it returns the same results?
– Ron, Commented Jun 24, 2019 at 12:13

jezrael · Accepted Answer · 2019-06-24 12:13:01Z

Use Series.isin with boolean indexing for filtering; ~ inverts the boolean mask:

mask = df['Customer'].isin(split['Customer'])

df1 = df[mask]
print (df1)
  Customer  Sales  Cost
1        B    200   2.5
3        D    400   3.0

df2 = df[~mask]
print (df2)
  Customer  Sales  Cost
0        A    100  2.25
2        C    300  2.10
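
If the split is needed in more than one place, the mask approach above could be wrapped in a small helper. This is only a sketch: the function name split_by_membership and the names in_split and not_in_split are made up for illustration and are not part of pandas.

```python
import pandas as pd


def split_by_membership(frame, keys, column):
    """Return (matching, non_matching) rows of frame.

    A row matches when frame[column] is one of the values in keys.
    Hypothetical helper for illustration, not a pandas function.
    """
    mask = frame[column].isin(keys)
    # .copy() so later edits to the results do not touch the original frame
    return frame[mask].copy(), frame[~mask].copy()


df = pd.DataFrame({'Customer': ['A', 'B', 'C', 'D'],
                   'Sales': [100, 200, 300, 400],
                   'Cost': [2.25, 2.50, 2.10, 3.00]})
split = pd.DataFrame({'Customer': ['B', 'D']})

in_split, not_in_split = split_by_membership(df, split['Customer'], 'Customer')
```

Both results keep the columns and the original index labels of df; call reset_index(drop=True) on them if a fresh 0-based index is preferred.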

Another solution, which also works if you need to match on multiple columns, is DataFrame.merge (with no on parameter it joins on all common columns). Use an outer join with the indicator parameter:

df4 = df.merge(split, how='outer', indicator=True)
print (df4)
  Customer  Sales  Cost     _merge
0        A    100  2.25  left_only
1        B    200  2.50       both
2        C    300  2.10  left_only
3        D    400  3.00       both

And again filtering by different masks:

df11 = df4[df4['_merge'] == 'both']
print (df11)
  Customer  Sales  Cost _merge
1        B    200   2.5   both
3        D    400   3.0   both

df21 = df4[df4['_merge'] == 'left_only']
print (df21)
  Customer  Sales  Cost     _merge
0        A    100  2.25  left_only
2        C    300  2.10  left_only
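
To illustrate the multi-column case mentioned above, here is a sketch that matches on two columns and then drops the _merge helper column so both results keep the original structure. The Region column and its values are invented for the example. It uses how='left' rather than 'outer' so that rows present only in split cannot appear in the result, and it assumes duplicate key rows in split should be ignored.

```python
import pandas as pd

# Region is a made-up second key column, for illustration only
df = pd.DataFrame({'Customer': ['A', 'B', 'B', 'D'],
                   'Region': ['N', 'N', 'S', 'S'],
                   'Sales': [100, 200, 300, 400]})
split = pd.DataFrame({'Customer': ['B', 'D'],
                      'Region': ['S', 'S']})

keys = ['Customer', 'Region']

# drop_duplicates() stops repeated keys in split from duplicating rows of df
merged = df.merge(split[keys].drop_duplicates(), on=keys,
                  how='left', indicator=True)

# Filter on the indicator, then remove it to restore the original columns
in_split = merged[merged['_merge'] == 'both'].drop(columns='_merge')
not_in_split = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
```

Note that merge returns a new 0-based index, so the index labels of df are not preserved the way they are with the isin mask.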

edited Jun 24, 2019 at 12:13

answered Jun 24, 2019 at 12:06

jezrael

2 Comments

Ron Over a year ago

Slightly off topic, @jezrael, but rather than splitting out into two dataframes, could you use a similar approach to create a new column in df and have "in split" / "not in split" populated accordingly in the new column?

jezrael Over a year ago

@Ron Use df['new'] = np.where(mask, "in split", "not in split")
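
Put together with the data from the question, the np.where suggestion might look like the following sketch. The column name new comes from the comment above; everything else is taken from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Customer': ['A', 'B', 'C', 'D'],
                   'Sales': [100, 200, 300, 400],
                   'Cost': [2.25, 2.50, 2.10, 3.00]})
split = pd.DataFrame({'Customer': ['B', 'D']})

# True where the customer appears in split
mask = df['Customer'].isin(split['Customer'])

# Label each row instead of splitting df into two frames
df['new'] = np.where(mask, 'in split', 'not in split')
```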
