
I have a function that takes two parameters: a PySpark DataFrame and a list of variable names from a config file. I am trying to loop over the list and check whether each of those columns is null in the DataFrame, then add a new column named with the suffix "_NullCheck". Right now, only the last variable in my list shows up in the output DataFrame. Can someone explain what I am doing wrong?

Here is my code so far.

def nullCheck(df, configfile2):
    nullList = getNullList(configfile2)

    for nullCol in nullList:
        subset_df = df.withColumn(f"{nullCol}_NullCheck",
                        when(df[f"{nullCol}"].isNull(), "Y" )
                        .otherwise("N"))

    return subset_df
4 Comments
  • Well you are overwriting subset_df for every iteration of your loop, so it stands to reason only the last iteration would get returned. Commented Aug 9, 2022 at 13:58
  • @Chris I see what you mean. How would I append to the dataframe instead of just iterating on the same dataset? Commented Aug 9, 2022 at 14:07
  • why not just change subset_df to df and return df Commented Aug 9, 2022 at 14:31
  • Because I am running Data Quality checks and will be storing the output in an external file. This is just one of the functions that will run through the data frame. I do not want to append it to the original data frame. Commented Aug 9, 2022 at 14:35

3 Answers


The df inside the function is local to the function's scope, and Spark DataFrames are immutable: select() returns a new DataFrame rather than changing df. So you can select only the new columns, and the original df will remain unchanged.

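from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

# Reuse the active session (already available as `spark` in the pyspark shell)
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, 2, None),
                            (1, None, 3),
                            (None, 2, 3)], ['col1','col2','col3'])

nullList = ['col1','col2']

def nullCheck(df):
  # Build one Y/N flag column per name in nullList and keep only those flags
  return df.select([when(col(c).isNull(), 'Y').otherwise('N').alias(f'{c}_Null_Check') for c in nullList])

nulls = nullCheck(df)
nulls.show()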

Output

+---------------+---------------+
|col1_Null_Check|col2_Null_Check|
+---------------+---------------+
|              N|              N|
|              N|              Y|
|              Y|              N|
+---------------+---------------+

1 Comment

This worked for me, but the only thing I would change is the col(c). It should actually be df[f"{c}"].isNull().

def nullCheck(df, configfile2):
    nullList = getNullList(configfile2)
    subset_df = []
    for nullCol in nullList:
        subset_df.append(df.withColumn(f"{nullCol}_NullCheck",
                        when(df[f"{nullCol}"].isNull(), "Y" )
                        .otherwise("N")))

    return subset_df

1 Comment

This approach did not work unfortunately. I get the 'list' object has no attribute 'show' error.

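Chaining many withColumn() calls is generally a bad idea when there are a lot of columns (like a lot!): each call adds another projection to the query plan, and a long chain can produce a very large plan and hurt performance. Try using a list comprehension within select() instead.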

from pyspark.sql import functions as func

def nullCheck(df, configfile2):
    nullList = getNullList(configfile2)

    # Keep every original column ('*') and add one Y/N flag column per name
    subset_df = df.select(
        '*',
        *[func.when(func.col(nullCol).isNull(), func.lit('Y'))
              .otherwise(func.lit('N'))
              .alias(nullCol + '_NullCheck')
          for nullCol in nullList]
    )

    return subset_df
