
In reality my DF is huge with a lot more columns & more complex masks, but here's the principle I'm after:

DF A: (all birds)

  name            size     location
1 bluebird        small    usa
2 cuckoo          medium   germany
3 parrot          large    brazil

DF B: (new world birds)

  name            size     location
1 bluebird        small    usa
2 parrot          large    brazil

I would like to split like this:

  A
 / \
B   C

df C should be A - B. Look in A, remove everything that's in B, and the result is C.

I wish this worked: C = A[~B], but it doesn't.

df C should be the old world birds:

  name            size     location
1 cuckoo          medium   germany

There will be no duplicate rows.

And my data is really complex (for a Sankey diagram!) So it's not practical to create df C by writing a filter like:
A.location != germany, belgium, egypt ... etc

  • Are you splitting off B yourself? Or are A and B arriving fully formed? Is there a subset of columns that will uniquely identify a row? Commented Jul 22, 2014 at 23:25
  • I am splitting off B myself, but through four different filters. Probably if I were a code ninja, I could combine them easily, but I'm a noob taking baby steps. There is no subset of columns that will uniquely identify a row. Commented Jul 23, 2014 at 0:03
  • Okay, so if you're splitting off B yourself, why not just add an id column first? something like A['id'] = range(len(A)). Commented Jul 23, 2014 at 0:11
  • Then you can use my solution after you split off B. Commented Jul 23, 2014 at 0:12

2 Answers


This should work in the generic case and be pretty quick.

First, add a dummy marker variable to B.

In [64]: B['found'] = 1.

Do a left merge of A and B, which by default merges on all columns the two frames have in common.

In [65]: C = A.merge(B, how='left')

Filter C to just those observations not found in B and drop the marker.

In [68]: C = C[pd.isnull(C['found'])].drop('found', axis=1)

In [69]: C
Out[69]: 
     name    size location
1  cuckoo  medium  germany
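
A variant of the same left-merge idea, sketched here under the question's stated assumption that there are no duplicate rows: `merge` accepts `indicator=True`, which adds a `_merge` column and so avoids writing a marker column into B. The sample frames and the filter used to build B are illustrative stand-ins, not the OP's real data.

```python
import pandas as pd

# Illustrative data modelled on the question (not the OP's real frames)
A = pd.DataFrame({
    "name": ["bluebird", "cuckoo", "parrot"],
    "size": ["small", "medium", "large"],
    "location": ["usa", "germany", "brazil"],
})
# Stand-in for whatever filters produce B
B = A[A["location"].isin(["usa", "brazil"])]

# Left merge on all shared columns; "_merge" records where each row was found
merged = A.merge(B, how="left", indicator=True)

# Rows marked "left_only" exist in A but not in B; drop the helper column
C = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(C)
```

Here A and B are left unmodified, which may matter if the splitting is repeated several times on the resulting frames.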

1 Comment

This looks good to me. I need to keep A, B and C and keep splitting until I get all the way to M! This looks good and generic enough.

Since you're splitting off B yourself, just add an id column then use that after you split B off. This is simple!

A['id'] = range(len(A))

#some code to create B

A_in_B_mask = A.id.isin(B.id)
C = A[~A_in_B_mask]  # use ~ (not unary minus) to invert a boolean mask

(Edited after OP's comments)
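
For reference, a self-contained sketch of the id approach. The data and the `new_world` filter are made-up stand-ins for the OP's real filters; the repeated (parrot, large, ...) rows are there to show that the id, not the name, is what gets matched.

```python
import pandas as pd

# Illustrative data, including a repeated name as in the OP's real data
A = pd.DataFrame({
    "name": ["bluebird", "cuckoo", "parrot", "parrot"],
    "size": ["small", "medium", "large", "large"],
    "location": ["usa", "germany", "brazil", "peru"],
})

# Tag every row with a unique id before any splitting happens
A["id"] = range(len(A))

# Hypothetical stand-in for the real filters that create B
new_world = ["usa", "brazil", "peru"]
B = A[A["location"].isin(new_world)]

# C is every row of A whose id does not appear in B
C = A[~A["id"].isin(B["id"])]
print(C)
```

Because B is a filtered view of A and keeps A's index labels, `A.drop(B.index)` should give the same rows without the extra column, provided A's index is unique.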

5 Comments

OP doesn't have id column. OP wants to use all columns (name, size, location, ...) (treated as one column) in place of id.
How do you know that there is no id column?
Moreover, if OP is the one splitting off B, it's trivial to create an id column before that.
How do you add the same value in the id column for (parrot, large, brazil) in both dataframes? (OP = Original Poster = person who asked the question)
Ah, I am splitting off B myself! This id thing may work. FWIW, my real data (name) contains repeats like: (parrot, large, brazil); (parrot, large, peru).
