
Imagine you have a PySpark DataFrame df with three columns: A, B, C. I want to take the rows of the DataFrame where the value of B does not appear anywhere in column C.

Example:

A B C
a 1 2
b 2 4
c 3 6
d 4 8 

would return

A B C
a 1 2
c 3 6

What I tried

df.filter(~df.B.isin(df.C))

I also tried making the values of B into a list first, but that takes a significant amount of time.

1 Comment

@Chris loc doesn't work in pyspark, you are thinking of pandas (Dec 9, 2021)

1 Answer


The problem is how you're using isin. For better or worse, isin can't take another PySpark Column object as its input; it needs an actual in-memory collection. So one thing you could do is convert the column to a list:

col_values = df.select("C").rdd.flatMap(lambda x: x).collect()
df.filter(~df.B.isin(col_values))

Performance-wise, though, this is obviously not ideal, as your driver node is now in charge of the entire contents of the column you've just loaded into memory. You could use a left anti join to get the result you need without turning anything into a list, keeping the efficiency of Spark's distributed computing:

df0 = df[["C"]].withColumnRenamed("C", "B")
df.join(df0, "B", "leftanti").show()

Thanks to Emma in the comments for her contribution.


3 Comments

should that be left_anti instead of leftsemi?
And just like that, you figured it out :D Updating my answer now.
left_anti worked like a charm, thanks!
