
I have searched Stack Overflow for an answer to this. There are similar questions, but I have tried adapting their accepted answers and am struggling to get the result I want.

I have a dataframe:

import pandas as pd

df = pd.DataFrame({'Customer': ['A', 'B', 'C', 'D'],
                   'Sales': [100, 200, 300, 400],
                   'Cost': [2.25, 2.50, 2.10, 3.00]})

and another one:

split = pd.DataFrame({'Customer': ['B', 'D']})

I want to create two new dataframes from the original dataframe df: one containing the rows whose Customer appears in split, and the other containing the rows whose Customer does not. Both new dataframes must keep the original column structure of df.

I have explored isin, merge, drop and loops, but there must be an elegant way to solve what appears to be a simple problem?

  • How about using join? You should be able to do your separation with a join on "Customer". Try it, and if you don't succeed, update your question with your attempt. Commented Jun 24, 2019 at 12:02
  • Thanks for the steer. I have tried result = pd.merge(df, split, on='Customer', how='outer') but it returns the same rows as df? Commented Jun 24, 2019 at 12:13

1 Answer


Use Series.isin with boolean indexing to filter; ~ inverts the boolean mask:

mask = df['Customer'].isin(split['Customer'])

df1 = df[mask]
print(df1)
  Customer  Sales  Cost
1        B    200   2.5
3        D    400   3.0

df2 = df[~mask]
print(df2)
  Customer  Sales  Cost
0        A    100  2.25
2        C    300  2.10
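
If the same split is needed in several places, the mask logic above can be wrapped in a small helper. This is only a sketch: the function name split_by_membership and its parameters are hypothetical and not part of pandas.

```python
import pandas as pd


def split_by_membership(frame, keys, column):
    """Return (rows whose `column` value is in `keys`, all remaining rows).

    Hypothetical helper, not a pandas API. Both results keep the
    original columns and the original index labels of `frame`.
    """
    mask = frame[column].isin(keys)
    return frame[mask], frame[~mask]


df = pd.DataFrame({'Customer': ['A', 'B', 'C', 'D'],
                   'Sales': [100, 200, 300, 400],
                   'Cost': [2.25, 2.50, 2.10, 3.00]})
split = pd.DataFrame({'Customer': ['B', 'D']})

in_split, not_in_split = split_by_membership(df, split['Customer'], 'Customer')
print(in_split)
print(not_in_split)
```

Both returned dataframes keep the index labels of df; call reset_index(drop=True) on them if a fresh 0-based index is preferred.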

Another solution, which also works when you need to match on multiple columns, uses DataFrame.merge (if the on parameter is omitted, it joins on all columns shared by both dataframes) with an outer join and the indicator parameter:

df4 = df.merge(split, how='outer', indicator=True)
print(df4)
  Customer  Sales  Cost     _merge
0        A    100  2.25  left_only
1        B    200  2.50       both
2        C    300  2.10  left_only
3        D    400  3.00       both

Then filter again, using masks built from the _merge column:

df11 = df4[df4['_merge'] == 'both']
print(df11)
  Customer  Sales  Cost _merge
1        B    200   2.5   both
3        D    400   3.0   both

df21 = df4[df4['_merge'] == 'left_only']
print(df21)
  Customer  Sales  Cost     _merge
0        A    100  2.25  left_only
2        C    300  2.10  left_only
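
The question asks for both results to keep the original structure of df, while the merge-based results above still carry the helper column _merge. A minimal sketch of one way to restore the original columns, assuming the same df and split as in the question, is to drop that column after filtering:

```python
import pandas as pd

df = pd.DataFrame({'Customer': ['A', 'B', 'C', 'D'],
                   'Sales': [100, 200, 300, 400],
                   'Cost': [2.25, 2.50, 2.10, 3.00]})
split = pd.DataFrame({'Customer': ['B', 'D']})

# Outer merge on the shared column(s); indicator=True adds the _merge column.
merged = df.merge(split, how='outer', indicator=True)

# Filter on _merge, then drop it so only the original columns of df remain.
df11 = merged[merged['_merge'] == 'both'].drop(columns='_merge')
df21 = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')

print(df11)
print(df21)
```

Note that merge builds a new index for its result, so the index labels match those of df here only because df already has a default 0-based index.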

2 Comments

Slightly off topic, @jezrael, but rather than splitting df into two dataframes, could you use a similar approach to create a new column in df populated with "in split" / "not in split" accordingly?
@Ron Use df['new'] = np.where(mask, "in split", "not in split")
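
For completeness, the np.where suggestion can be run end to end as follows. This is a sketch that assumes the df and split from the question and reuses the isin mask from the answer; the column name 'new' comes from the comment.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Customer': ['A', 'B', 'C', 'D'],
                   'Sales': [100, 200, 300, 400],
                   'Cost': [2.25, 2.50, 2.10, 3.00]})
split = pd.DataFrame({'Customer': ['B', 'D']})

# True where the customer appears in split, False otherwise.
mask = df['Customer'].isin(split['Customer'])

# Pick the first label where mask is True, the second label elsewhere.
df['new'] = np.where(mask, 'in split', 'not in split')
print(df)
```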
