
I would like to rewrite this from R to PySpark. Any nice-looking suggestions?

array <- c(1, 2, 3)
dataset <- filter(dataset, !(column %in% array))

8 Answers


In PySpark you can do it like this:

array = [1, 2, 3]
dataframe.filter(dataframe.column.isin(*array) == False)

Or using the ~ (bitwise NOT) operator:

dataframe.filter(~dataframe.column.isin(*array))
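
For instance, a minimal self-contained sketch (the SparkSession setup, the sample rows, and the id column are illustrative assumptions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data: five rows with a single "id" column
dataframe = spark.createDataFrame([(i,) for i in range(1, 6)], ["id"])

array = [1, 2, 3]
# isin accepts either unpacked values or a single list; the list form is used here
dataframe.filter(~dataframe.id.isin(array)).show()
# +---+
# | id|
# +---+
# |  4|
# |  5|
# +---+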

8 Comments

What is the job of the * in *array?
*variable is Python syntax for unpacking a list, passing its elements into the function parameters one at a time, in order.
@rjurney No. What the == operator is doing here is calling the overloaded __eq__ method on the Column result returned by dataframe.column.isin(*array). That method is overloaded to return another Column, which tests for equality with the other argument (in this case, False). The is operator tests for object identity, that is, whether two objects are actually the same place in memory. If you used is here, it would always fail, because the constant False never lives in the same memory location as a Column. Additionally, you can't overload is. (See the sketch after these comments.)
List splatting with * does not make any difference here. You can just use isin(array) and it works just fine.
In my opinion it would have been a better design if column.not_in() or column.is_not_in() was implemented.
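
A minimal sketch of the point above, reusing the hypothetical dataframe with an id column from the earlier sketch:

cond = dataframe.id.isin([1, 2, 3]) == False

# Column.__eq__ is overloaded: rather than comparing values, it builds a new
# Column expression, so cond is a Column, not a Python bool
print(type(cond))  # <class 'pyspark.sql.column.Column'>

# `is` cannot be overloaded; it tests object identity and is always False here
print(cond is False)  # False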

Use the operator ~, which means negation:

df_filtered = df.filter(~df["column_name"].isin([1, 2, 3]))

2 Comments

Everyone here, shouldn't this be the accepted answer? Why use the not-so-evident-to-understand == False when we have ~ specifically for negation?
Also, the * was unnecessary.
Bracket indexing on the DataFrame also works:

df_result = df[df.column_name.isin([1, 2, 3]) == False]



A slightly different syntax, with a "date" data set:

toGetDates = {'2017-11-09', '2017-11-11', '2017-11-12'}
df = df.filter(df['DATE'].isin(toGetDates) == False)



You can also use the .subtract() method.

Example:

from pyspark.sql.functions import col

df1 = df.filter(col("column").isin([1, 2, 3]))  # the rows you want to exclude
df2 = df.subtract(df1)

This way, df2 is defined as every row of df that is not in df1.
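
A runnable sketch of the idea, with hypothetical sample data. One caveat worth knowing: subtract is a set difference (EXCEPT DISTINCT in SQL), so it also drops duplicate rows.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Note the duplicate 4: subtract deduplicates the surviving rows as well
df = spark.createDataFrame([(v,) for v in [1, 2, 3, 4, 4, 5]], ["column"])

df1 = df.filter(col("column").isin([1, 2, 3]))  # rows to exclude
df.subtract(df1).show()  # leaves 4 and 5, with the duplicate 4 collapsed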



The * is not needed. So:

values = [1, 2, 3]  # avoid naming this "list", which shadows the built-in
dataframe.filter(~dataframe.column.isin(values))



You can also use the SQL functions F.col + .isin():

import pyspark.sql.functions as F

array = [1, 2, 3]
df = df.filter(~F.col("column_name").isin(array))

This might be useful if you are already using SQL functions elsewhere and want consistency.



You can also loop over the array and filter:

array = [1, 2, 3]
for i in array:
    df = df.filter(df["column"] != i)

3 Comments

I wouldn't recommend this in big data applications... it means you need to go through the whole dataset three times, which is huge if you imagine you have a few terabytes to process.
No, because Spark internally optimizes these filters and applies them in a single pass (see the plan-comparison sketch below).
Then it should be OK... until a breaking Spark update or a framework switch. And three lines instead of one, plus a hidden optimization, still doesn't seem like a good pattern to me. No offense, but I would still recommend avoiding it.
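
A quick way to check this on your own Spark version is to compare the physical plans; a sketch, assuming a hypothetical df with a column named "column":

array = [1, 2, 3]

df_loop = df
for i in array:
    df_loop = df_loop.filter(df_loop["column"] != i)

# If Catalyst collapses the chained filters, both plans will show a single
# Filter node with a combined predicate
df_loop.explain()
df.filter(~df["column"].isin(array)).explain()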
