
I would like to rewrite this from R to PySpark. Any nice-looking suggestions?

array <- c(1, 2, 3)
dataset <- filter(dataset, !(column %in% array))

8 Answers


In PySpark you can do it like this:

array = [1, 2, 3]
dataframe.filter(dataframe.column.isin(*array) == False)

Or using the ~ (bitwise NOT) operator:

dataframe.filter(~dataframe.column.isin(*array))
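
For instance, a minimal self-contained sketch (the SparkSession setup, the sample rows, and the id column are illustrative assumptions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data: five rows with a single "id" column
dataframe = spark.createDataFrame([(i,) for i in range(1, 6)], ["id"])

array = [1, 2, 3]
# isin accepts either unpacked values or a single list; the list form is used here
dataframe.filter(~dataframe.id.isin(array)).show()
# +---+
# | id|
# +---+
# |  4|
# |  5|
# +---+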

8 Comments

What is the job of the * in *array?
*variable is Python syntax for unpacking a list, passing its elements into the function parameters one at a time, in order.
@rjurney No. What the == operator is doing here is calling the overloaded __eq__ method on the Column result returned by dataframe.column.isin(*array). That method is overloaded to return another Column, which tests for equality with the other argument (in this case, False). The is operator tests for object identity, that is, whether two objects are actually the same place in memory. If you used is here, it would always fail, because the constant False never lives in the same memory location as a Column. Additionally, you can't overload is. (See the sketch after these comments.)
List splatting with * does not make any difference here. You can just use isin(array) and it works just fine.
In my opinion it would have been a better design if column.not_in() or column.is_not_in() was implemented.
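
A minimal sketch of the point above, reusing the hypothetical dataframe with an id column from the earlier sketch:

cond = dataframe.id.isin([1, 2, 3]) == False

# Column.__eq__ is overloaded: rather than comparing values, it builds a new
# Column expression, so cond is a Column, not a Python bool
print(type(cond))  # <class 'pyspark.sql.column.Column'>

# `is` cannot be overloaded; it tests object identity and is always False here
print(cond is False)  # False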

Use the operator ~, which means negation:

df_filtered = df.filter(~df["column_name"].isin([1, 2, 3]))

2 Comments

Everyone here, shouldn't this be the accepted answer? Why use the not-so-evident-to-understand == False when we have ~ specifically for negation?
Also, the * was unnecessary.
Bracket indexing on the DataFrame also works:

df_result = df[df.column_name.isin([1, 2, 3]) == False]



A slightly different syntax, with a "date" data set:

toGetDates = {'2017-11-09', '2017-11-11', '2017-11-12'}
df = df.filter(df['DATE'].isin(toGetDates) == False)



You can also use the .subtract() method.

Example:

from pyspark.sql.functions import col

df1 = df.filter(col("column").isin([1, 2, 3]))  # the rows you want to exclude
df2 = df.subtract(df1)

This way, df2 is defined as every row of df that is not in df1.
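
A runnable sketch of the idea, with hypothetical sample data. One caveat worth knowing: subtract is a set difference (EXCEPT DISTINCT in SQL), so it also drops duplicate rows.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Note the duplicate 4: subtract deduplicates the surviving rows as well
df = spark.createDataFrame([(v,) for v in [1, 2, 3, 4, 4, 5]], ["column"])

df1 = df.filter(col("column").isin([1, 2, 3]))  # rows to exclude
df.subtract(df1).show()  # leaves 4 and 5, with the duplicate 4 collapsed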



The * is not needed. So:

values = [1, 2, 3]  # avoid naming this "list", which shadows the built-in
dataframe.filter(~dataframe.column.isin(values))



You can also use the SQL functions F.col + .isin():

import pyspark.sql.functions as F

array = [1, 2, 3]
df = df.filter(~F.col("column_name").isin(array))

This might be useful if you are already using SQL functions elsewhere and want consistency.



You can also loop over the array and filter:

array = [1, 2, 3]
for i in array:
    df = df.filter(df["column"] != i)

3 Comments

I wouldn't recommend this in big data applications... it means you need to go through the whole dataset three times, which is huge if you imagine you have a few terabytes to process.
No, because Spark internally optimizes these filters and applies them in a single pass (see the plan-comparison sketch below).
Then it should be OK... until a breaking Spark update or a framework switch. And three lines instead of one, plus a hidden optimization, still doesn't seem like a good pattern to me. No offense, but I would still recommend avoiding it.
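
A quick way to check this on your own Spark version is to compare the physical plans; a sketch, assuming a hypothetical df with a column named "column":

array = [1, 2, 3]

df_loop = df
for i in array:
    df_loop = df_loop.filter(df_loop["column"] != i)

# If Catalyst collapses the chained filters, both plans will show a single
# Filter node with a combined predicate
df_loop.explain()
df.filter(~df["column"].isin(array)).explain()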
