
Imagine you have a PySpark DataFrame df with three columns: A, B, C. I want to take the rows of the DataFrame where the value of B does not appear anywhere in column C.

Example:

A B C
a 1 2
b 2 4
c 3 6
d 4 8 

would return

A B C
a 1 2
c 3 6

What I tried

df.filter(~df.B.isin(df.C))

I also tried making the values of B into a list first, but that takes a significant amount of time.

1 Comment

@Chris loc doesn't work in pyspark, you are thinking of pandas (Dec 9, 2021)

1 Answer


The problem is how you're using isin. For better or worse, isin can't take another PySpark Column object as its input; it needs an actual in-memory collection. So one thing you could do is convert the column to a list:

col_values = df.select("C").rdd.flatMap(lambda x: x).collect()
df.filter(~df.B.isin(col_values))

Performance-wise, though, this is obviously not ideal, as your driver node is now in charge of the entire contents of the column you've just loaded into memory. You could use a left anti join to get the result you need without turning anything into a list, keeping the efficiency of Spark's distributed computing:

df0 = df[["C"]].withColumnRenamed("C", "B")
df.join(df0, "B", "leftanti").show()

Thanks to Emma in the comments for her contribution.


3 Comments

should that be left_anti instead of leftsemi?
And just like that, you figured it out :D Updating my answer now.
left_anti worked like a charm, thanks!
