PySpark replace() function does not replace integer with NULL value

Question

Note: this is for Spark version 2.1.1.2.6.1.0-129.

I have a Spark DataFrame (Python). I would like to replace every instance of 0 across the entire DataFrame with NULL, without specifying particular column names.

The following is the code that I have written:

my_df = my_df.na.replace(0, None)

The following is the error that I receive:

  File "<stdin>", line 1, in <module>
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/dataframe.py", line 1634, in replace
    return self.df.replace(to_replace, value, subset)
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/dataframe.py", line 1323, in replace
    raise ValueError("value should be a float, int, long, string, list, or tuple")
ValueError: value should be a float, int, long, string, list, or tuple

my_df.na.replace(0, None) works fine for me (Spark 3.x). What's your Spark version?
– pltc, Commented Oct 24, 2021 at 20:16
I have Spark version 2.1.1.2.6.1.0-129. I have also updated the question to include the version.
– Zaki Siyaji, Commented Oct 24, 2021 at 20:18

pltc · Accepted Answer · 2021-10-24 20:35:16Z


In Spark 2.1.1, df.na.replace does not support None as the replacement value. That option has only been available since Spark 2.3.0, so it does not apply in your case.
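As a rough illustration of that version boundary, the check below decides whether na.replace(0, None) can be expected to work. It is only a sketch: the helper name supports_none_replacement is made up for this example, and the 2.3 threshold is taken from the statement above rather than verified against every distribution build.

```python
def supports_none_replacement(spark_version):
    """Return True if the version string denotes Spark 2.3 or later.

    Only the first two dot-separated components are inspected, so vendor
    suffixes such as '2.1.1.2.6.1.0-129' are tolerated.
    """
    major, minor = (int(part) for part in spark_version.split(".")[:2])
    return (major, minor) >= (2, 3)


# Usage, assuming an active SparkSession named `spark`:
# if supports_none_replacement(spark.version):
#     my_df = my_df.na.replace(0, None)
# else:
#     ...  # fall back to a when/otherwise rewrite of each column
```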

To replace values dynamically (i.e., without typing column names manually), you can use either df.columns or df.dtypes. The latter also lets you check each column's data type.

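from pyspark.sql import functions as F

# df.dtypes yields (column_name, type_string) pairs, e.g. ('id', 'bigint')
for name, dtype in df.dtypes:
    # Only long columns are handled; extend this check for other numeric types (e.g. 'int', 'double')
    if dtype == 'bigint':
        # Replace 0 with NULL and keep every other value unchanged
        df = df.withColumn(name, F.when(F.col(name) == 0, F.lit(None)).otherwise(F.col(name)))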

# Input
# +---+---+
# | id|val|
# +---+---+
# |  0|  a|
# |  1|  b|
# |  2|  c|
# +---+---+

# Output
# +----+---+
# |  id|val|
# +----+---+
# |null|  a|
# |   1|  b|
# |   2|  c|
# +----+---+
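The loop above only touches bigint columns. A possible variation, sketched here under assumptions and not taken from the original answer, covers the other numeric type strings that df.dtypes reports and rewrites all columns in a single select. The names NUMERIC_TYPES, numeric_columns and zeros_to_null are hypothetical, and the list of type strings is my own choice of what counts as numeric.

```python
# Type strings as reported by df.dtypes; decimals look like 'decimal(10,2)'.
NUMERIC_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double"}


def numeric_columns(dtypes):
    """Return the names of columns whose type string is numeric."""
    return [
        name
        for name, dtype in dtypes
        if dtype in NUMERIC_TYPES or dtype.startswith("decimal")
    ]


def zeros_to_null(df):
    """Return a new DataFrame with 0 replaced by NULL in every numeric column."""
    # Imported here so numeric_columns stays usable without Spark installed.
    from pyspark.sql import functions as F

    targets = set(numeric_columns(df.dtypes))
    return df.select([
        F.when(F.col(c) == 0, F.lit(None)).otherwise(F.col(c)).alias(c)
        if c in targets
        else F.col(c)
        for c in df.columns
    ])


# Usage, assuming `df` is an existing DataFrame:
# df = zeros_to_null(df)
```

Building one select keeps the column order and avoids stacking a separate projection for each withColumn call, which may matter on wide DataFrames.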


