Say you have a dataframe that looks like the following:
df = spark.createDataFrame([
    ('test1', 7, ['', 'hi', ''], ['', '', '0']),
    ('', 14, ['', '', '6', ''], ['98', '0', '', '9'])
], ["a", "b", "c", "d"])
df.show()
| a | b | c | d |
|---|---|---|---|
| 'test1' | 7 | ['', 'hi', ''] | ['', '', '0'] |
| '' | 14 | ['', '', '6', ''] | ['98', '0', '', '9'] |
Is there a way to replace all of the empty strings inside the array columns (c and d) with None/null? The result would look like this:
| a | b | c | d |
|---|---|---|---|
| 'test1' | 7 | [null, 'hi', null] | [null, null, '0'] |
| null | 14 | [null, null, '6', null] | ['98', '0', null, '9'] |
The key is that I need to keep the position of every element in each array; I just want null values instead of empty strings inside the arrays.
I have been able to turn the empty strings in the non-array columns into null with the following code:
from pyspark.sql import functions as F
# only touch the plain string columns; comparing an array column to "" raises a type error
df = df.select([F.when(F.col(c) == "", None).otherwise(F.col(c)).alias(c)
                if t == "string" else F.col(c) for c, t in df.dtypes])
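For reference, after that select the scalar columns come out the way I want, but the empty strings inside the arrays are untouched (roughly what df.show() gives me with the sample data above):
| a | b | c | d |
|---|---|---|---|
| 'test1' | 7 | ['', 'hi', ''] | ['', '', '0'] |
| null | 14 | ['', '', '6', ''] | ['98', '0', '', '9'] |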
I have looked at the array functions in the Databricks documentation here: https://docs.databricks.com/en/sql/language-manual/sql-ref-functions-builtin-alpha.html
Using array_remove() I could strip all of the empty strings out of the arrays, but that runs into the same challenge: I need to keep every element's position, with a null where the empty string was. I can't just remove the strings without replacing them.
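To illustrate (a quick sketch; array_remove() has been in pyspark.sql.functions since Spark 2.4):
df.select(F.array_remove(F.col("c"), "").alias("c")).show()
# row 1 becomes ['hi'] and row 2 becomes ['6'] -- the surviving elements
# shift left, so the positional information I need is gone
Is there a way to do this?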