I have a function that takes two parameters: a PySpark DataFrame and a list of variable names read from a config file. I am trying to loop over the list, check whether each of those variables is null in the DataFrame, and add a new column named after the variable with the suffix "_NullCheck". Right now, only the column for the last variable in my list shows up in the output DataFrame. Can someone explain what I am doing wrong?
Here is my code so far.
def nullCheck(df, configfile2):
    nullList = getNullList(configfile2)
    for nullCol in nullList:
        subset_df = df.withColumn(f"{nullCol}_NullCheck",
                                  when(df[f"{nullCol}"].isNull(), "Y")
                                  .otherwise("N"))
    return subset_df
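Change subset_df to df and return df. Each pass through the loop calls withColumn on the original df, which never changes, so subset_df is overwritten every time and only the column from the last iteration survives. Assigning the result back to df lets each iteration build on the previous one.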