I would like to rewrite this from R to PySpark. Any nice-looking suggestions?
library(dplyr)

array <- c(1, 2, 3)
dataset <- filter(dataset, !(column %in% array))
In PySpark you can do it like this:
array = [1, 2, 3]
dataframe.filter(dataframe.column.isin(array) == False)
Or using the bitwise NOT operator ~, which PySpark overloads as logical negation:
dataframe.filter(~dataframe.column.isin(array))
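As a self-contained illustration, here is a minimal runnable sketch; the SparkSession setup and the toy DataFrame (with a column named column) are assumptions for the example, not part of the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Toy DataFrame, assumed for illustration only
dataframe = spark.createDataFrame([(1,), (2,), (3,), (4,), (5,)], ["column"])

array = [1, 2, 3]

# Both forms keep only the rows whose value is NOT in array (here: 4 and 5)
dataframe.filter(dataframe.column.isin(array) == False).show()
dataframe.filter(~dataframe.column.isin(array)).show()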
What does the * do in *array? What the == operator is doing here is calling the overloaded __eq__ method on the Column result returned by dataframe.column.isin(*array). That method is overloaded to return another Column, which tests for equality with the other argument (in this case, False). The is operator tests for object identity, that is, whether two objects are actually the same place in memory. If you used is here, it would always fail, because the constant False never lives at the same memory location as a Column. Additionally, you can't overload is. (Note that the * is unnecessary: .isin(array) works just fine.) It is a pity that no column.not_in() or column.is_not_in() was implemented.

You can also take the operator ~, which means negation:
df_filtered = df.filter(~df["column_name"].isin([1, 2, 3]))
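To see that overloading in action, here is a quick sketch (the column name c is hypothetical): comparing the isin result with == builds another Column expression rather than evaluating to a Python bool.

from pyspark.sql import functions as F

cond = F.col("c").isin([1, 2, 3]) == False
print(type(cond))  # <class 'pyspark.sql.column.Column'>, not bool
print(cond)        # the column expression, e.g. Column<'(c IN (1, 2, 3) = false)'> (exact form varies by Spark version)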
Why use == False at all when we have ~ specifically for negation? You can also loop over the array and filter out one value at a time:
array = [1, 2, 3]
for i in array:
    # Each pass adds another != predicate; the chained filters are ANDed together
    df = df.filter(df["column"] != i)
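Chaining filters in a loop produces the same predicate as NOT IN. If you prefer a single expression, here is a sketch of the same idea with functools.reduce (df and the column name are assumed from the snippet above):

from functools import reduce
from pyspark.sql import functions as F

array = [1, 2, 3]
# Fold the per-value conditions into one combined predicate,
# equivalent to the chained .filter() calls above
not_in = reduce(lambda acc, v: acc & (F.col("column") != v), array, F.lit(True))
df = df.filter(not_in)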