Pandas python how to split a DataFrame using another DF as criteria

Question

In reality my DF is huge with a lot more columns & more complex masks, but here's the principle I'm after:

DF A: (all birds)

  name            size     location
1 bluebird        small    usa
2 cukoo           medium   germany
3 parrot          large    brazil

DF B: (new world birds)

  name            size     location
1 bluebird        small    usa
2 parrot          large    brazil

I would like to split like this:

    A
   / \
  B   C

df C should be A - B. Look in A, remove everything that's in B, and the result is C.

I wish this worked: C = A[~B] lolz it doesn't

df C should be the old world birds:

  name            size     location
1 cukoo           medium   germany

There will be no duplicate rows.

And my data is really complex (for a Sankey diagram!) So it's not practical to create df C by writing a filter like:
A.location != germany, belgium, egypt ... etc

Are you splitting off B yourself? Or are A and B arriving fully formed? Is there a subset of columns that will uniquely identify a row?
– exp1orer, Commented Jul 22, 2014 at 23:25
I am splitting off B myself, but through four different filters. Probably if I were a code ninja, I could combine them easily, but I'm a noob taking baby steps. There is no subset of columns that will uniquely identify a row.
– Maggie, Commented Jul 23, 2014 at 0:03
Okay, so if you're splitting off B yourself, why not just add an id column first? Something like A['id'] = range(len(A)).
– exp1orer, Commented Jul 23, 2014 at 0:11

chrisb · Accepted Answer · 2014-07-22 22:27:22Z

2

This should work in the generic case and be pretty quick.

First, add a dummy marker variable to B.

In [64]: B['found'] = 1.

Do a left merge of A and B, which by default merges on all common columns.

In [65]: C = A.merge(B, how='left')

Filter C to just those observations not found in B and drop the marker.

In [68]: C = C[pd.isnull(C['found'])].drop('found', axis=1)

In [69]: C
Out[69]: 
    name    size location
1  cukoo  medium  germany
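For anyone who wants to run the steps above end to end, here is a minimal sketch using the sample birds from the question. The frame construction is an assumption about the data, since the real frames are larger. The marker is added to a copy of B (via `assign`) so that B itself is left untouched, and a second variant swaps the hand-made marker for the `_merge` column that `merge(..., indicator=True)` adds.

```python
import pandas as pd

# Sample frames rebuilt from the question (assumed: plain string columns).
A = pd.DataFrame({
    "name": ["bluebird", "cukoo", "parrot"],
    "size": ["small", "medium", "large"],
    "location": ["usa", "germany", "brazil"],
})
B = pd.DataFrame({
    "name": ["bluebird", "parrot"],
    "size": ["small", "large"],
    "location": ["usa", "brazil"],
})

# Marker-column approach, applied to a copy so B keeps its original columns.
marked = B.assign(found=1.0)
merged = A.merge(marked, how="left")  # joins on name, size, location
C = merged[merged["found"].isna()].drop(columns="found")

# Variant: let merge add its own "_merge" column instead of a custom marker.
merged2 = A.merge(B, how="left", indicator=True)
C2 = merged2[merged2["_merge"] == "left_only"].drop(columns="_merge")

print(C)
print(C2)
```

Both variants rely on the question's guarantee that there are no duplicate rows; if B contained the same row twice, the left merge would repeat the matching row of A.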



1 Comment

Maggie Over a year ago

This looks good to me. I need to keep A, B and C and keep splitting until I get all the way to M! This looks good and generic enough.

exp1orer · Accepted Answer · 2014-07-23 00:15:25Z

0

Since you're splitting off B yourself, just add an id column then use that after you split B off. This is simple!

A['id'] = range(len(A))

#some code to create B

A_in_B_mask = A.id.isin(B.id)
C = A[~A_in_B_mask]

(Edited after OP's comments)
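A self-contained sketch of the id approach, in case it helps. The sample rows and the location filter that builds B are hypothetical stand-ins for "some code to create B"; the repeated parrot rows follow the kind of data mentioned in the comments. Note that the mask is inverted with `~`, the boolean NOT operator for a pandas Series.

```python
import pandas as pd

# Hypothetical sample data, including a repeated (name, size) pair.
A = pd.DataFrame({
    "name": ["bluebird", "cukoo", "parrot", "parrot"],
    "size": ["small", "medium", "large", "large"],
    "location": ["usa", "germany", "brazil", "peru"],
})

# Tag every row before splitting so each row keeps a stable identity.
A["id"] = range(len(A))

# Stand-in for the real filters that create B (assumed: a new-world filter).
B = A[A["location"].isin(["usa", "brazil", "peru"])]

# C is every row of A whose id does not appear in B.
A_in_B_mask = A["id"].isin(B["id"])
C = A[~A_in_B_mask]

print(C)
```

Because the id is assigned before any filtering, the same step can be repeated on B or C to keep splitting further without relying on any combination of the other columns being unique.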

edited Jul 23, 2014 at 0:15

answered Jul 22, 2014 at 22:00


5 Comments

furas Over a year ago

OP doesn't have id column. OP wants to use all columns (name, size, location, ...) (treated as one column) in place of id.

exp1orer Over a year ago

How do you know that there is no id column?

exp1orer Over a year ago

Moreover, if OP is the one splitting off B, it's trivial to create an id column before that.

furas Over a year ago

How do you add the same value in id column for (parrot,large,brazil) in both dataframes ? (OP = Original Poster = person who asked the question)

Maggie Over a year ago

Ah, I am splitting off B myself! This id thing may work. FWIW, my real data (name) contains repeats like: (parrot, large, brazil); (parrot, large, peru).
