
As an example, say I have a DataFrame:

from pyspark.sql import Row

row = Row("v", "x", "y", "z")
df = sc.parallelize([
    row("p", 1, 2, 3.0), row("NULL", 3, "NULL", 5.0),
    row("NA", None, 6, 7.0), row(float("NaN"), 8, "NULL", float("NaN"))
]).toDF()

Now I want to replace the strings NULL, NA, and NaN with PySpark's null (None) value. How do I achieve this for multiple columns at once?

from pyspark.sql.functions import when, lit, col
def replace(column, value):
    return when(column != value, column).otherwise(lit(None))

df = df.withColumn("v", replace(col("v"), "NULL"))
df = df.withColumn("v", replace(col("v"), "NA"))
df = df.withColumn("v", replace(col("v"), "NaN"))

Writing this out for every column is what I am trying to avoid, since my DataFrame can have any number of columns.

Appreciate your help. Thanks!

1 Answer


Loop through the columns, construct the column expressions that replace specific strings with null, then select the columns:

df.show()
+----+----+----+---+
|   v|   x|   y|  z|
+----+----+----+---+
|   p|   1|   2|3.0|
|NULL|   3|null|5.0|
|  NA|null|   6|7.0|
| NaN|   8|null|NaN|
+----+----+----+---+

import pyspark.sql.functions as F
cols = [F.when(~F.col(x).isin("NULL", "NA", "NaN"), F.col(x)).alias(x) for x in df.columns]
df.select(*cols).show()
+----+----+----+----+
|   v|   x|   y|   z|
+----+----+----+----+
|   p|   1|   2| 3.0|
|null|   3|null| 5.0|
|null|null|   6| 7.0|
|null|   8|null|null|
+----+----+----+----+

1 Comment

Could there be an explanation of how `cols = [F.when(~F.col(x).isin("NULL", "NA", "NaN"), F.col(x)).alias(x) for x in df.columns]` works?
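Breaking that comprehension down: for each column name `x`, `F.col(x).isin("NULL", "NA", "NaN")` is true when the value is one of the sentinel strings, `~` negates it, and `F.when(cond, F.col(x))` keeps the original value where the condition holds; since there is no `.otherwise`, every other row defaults to null. `.alias(x)` preserves the column name. The same per-value logic in plain Python, as an illustration only (the helper name here is hypothetical, not part of PySpark):

```python
SENTINELS = {"NULL", "NA", "NaN"}

def when_not_sentinel(value):
    # Mirrors F.when(~F.col(x).isin(...), F.col(x)): keep the value when it
    # is not a sentinel, otherwise fall through to null (None).
    return value if value not in SENTINELS else None

column = ["p", "NULL", "NA", "NaN"]
print([when_not_sentinel(v) for v in column])  # ['p', None, None, None]
```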
