Loop is not working as expected on the pyspark dataframe

Question

I have a function that takes two parameters: a PySpark DataFrame and a list of variable names from a config file. I am trying to loop over the list and check whether each of those variables is null in the DataFrame, then add a column named with the suffix "_NullCheck". Right now, only the last variable in my list shows up in the output DataFrame. Can someone explain what I am doing wrong?

Here is my code so far.

def nullCheck(df, configfile2):
    nullList = getNullList(configfile2)

    for nullCol in nullList:
        subset_df = df.withColumn(f"{nullCol}_NullCheck",
                        when(df[f"{nullCol}"].isNull(), "Y" )
                        .otherwise("N"))

    return subset_df

Well, you are overwriting subset_df on every iteration of your loop, so it stands to reason that only the last iteration gets returned.
– Chris, Commented Aug 9, 2022 at 13:58
@Chris I see what you mean. How would I append to the dataframe instead of just iterating on the same dataset?
– Murtaza Mohsin, Commented Aug 9, 2022 at 14:07
Because I am running Data Quality checks and will be storing the output in an external file. This is just one of the functions that will run through the data frame. I do not want to append it to the original data frame.
– Murtaza Mohsin, Commented Aug 9, 2022 at 14:35

Chris · Accepted Answer · 2022-08-09 15:17:37Z

1

The df you are modifying is local to the scope of the function, and Spark DataFrames are immutable, so you can transform it inside the function and select only the new columns. The original df will remain unchanged.

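from pyspark.sql.functions import col, when

# Assumes an active SparkSession named `spark` (as in the pyspark shell or a notebook).
df = spark.createDataFrame([(1, 2, None),
                            (1, None, 3),
                            (None, 2, 3)], ['col1','col2','col3'])

nullList = ['col1','col2']

def nullCheck(df):
  # select() returns a new DataFrame that holds only the Y/N flag columns
  return df.select([when(col(c).isNull(), 'Y').otherwise('N').alias(f'{c}_Null_Check') for c in nullList])

nulls = nullCheck(df)
nulls.show()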

Output

+---------------+---------------+
|col1_Null_Check|col2_Null_Check|
+---------------+---------------+
|              N|              N|
|              N|              Y|
|              Y|              N|
+---------------+---------------+


1 Comment

Murtaza Mohsin Over a year ago

This worked for me but the only thing I would change is the col(c). It should actually be df[f"{c}"].isNull()

Moshe Eichler · Accepted Answer · 2022-08-09 14:05:52Z

0

def nullCheck(df, configfile2):
    nullList = getNullList(configfile2)
    subset_df = []
    for nullCol in nullList:
        subset_df.append(df.withColumn(f"{nullCol}_NullCheck",
                        when(df[f"{nullCol}"].isNull(), "Y" )
                        .otherwise("N")))

    return subset_df


1 Comment

Murtaza Mohsin Over a year ago

This approach did not work unfortunately. I get the 'list' object has no attribute 'show' error.

samkart · Accepted Answer · 2022-08-09 15:05:03Z

0

Chaining many withColumn() calls is generally a bad idea when there are a lot of columns (like a lot!), because each call adds another projection to the query plan. Try using a list comprehension within select() instead.

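from pyspark.sql import functions as func

def nullCheck(df, configfile2):
    # getNullList is the asker's helper that reads the column names from the config
    nullList = getNullList(configfile2)

    # Keep every original column ('*') and add one Y/N flag per checked column
    # in a single select() instead of chained withColumn() calls.
    subset_df = df.select(
        '*',
        *[func.when(func.col(nullCol).isNull(), func.lit('Y'))
              .otherwise(func.lit('N'))
              .alias(nullCol + '_NullCheck')
          for nullCol in nullList]
    )

    return subset_df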

