1

Say you have a dataframe that looks like the following:

df = spark.createDataFrame([
    ('test1', 7,['','hi',''], ['','',0]),
    ('', 14, ['','',6,''],[98,0,'',9])
], ["a", "b","c","d"])
df.show()
Column A Column B Column C Column D
'test1' 7 ['','hi',''] ['','',0]
'' 14 ['','',6,''] [98,0,'',9]

Is there a way to replace all of the empty strings inside the array columns (C and D) with None/null? It would end up looking like the following:

Column A Column B Column C Column D
'test1' 7 [Null,'hi',Null] [Null,Null,0]
Null 14 [Null,Null,6,Null] [98,0,Null,9]

They key is I need to keep the positional value in each array, but I would like to have Null values instead of empty strings inside the arrays.

I have been able to turn the columns that are not arrays to None with the following code:

from pyspark.sql import functions as F
df=df.select([F.when(F.col(c)=="",None).otherwise(F.col(c)).alias(c) for c in df.columns])

I have looked at the array functions in databricks documentation here: https://docs.databricks.com/en/sql/language-manual/sql-ref-functions-builtin-alpha.html

Using array_remove() I could remove all of the empty strings within the arrays, but again the challenge being I need to keep the positional values of each array with a Null value. I cannot just remove the strings without replacing the value. Is there a way to do this?

3
  • please edit your question with a reproducible constructor for your input Commented Sep 28, 2023 at 18:58
  • Try to use transform Commented Sep 28, 2023 at 19:21
  • Don't know if it's a duplicate, but it does look similar: stackoverflow.com/q/71193469/12846804 Commented Sep 28, 2023 at 20:49

1 Answer 1

0

You can simply use the transform function to apply a spark transformation to every element of an array:

df = spark.createDataFrame([
    (1, ["a", "", "b"]),
    (2, ["", "", "c"])
], ["id", "list"])
df.show()
+---+--------+
| id|    list|
+---+--------+
|  1|[a, , b]|
|  2| [, , c]|
+---+--------+
from pyspark.sql import functions as F
result = df.select(F.transform(F.col("list"),
    lambda x: F.when(x == "", F.lit(None)).otherwise(x)
  ).alias("list"))
result.show()
+---------------+                                                               
|           list|
+---------------+
|   [a, null, b]|
|[null, null, c]|
+---------------+
Sign up to request clarification or add additional context in comments.

1 Comment

Thanks! Didn't think to use transform that is exactly what I was looking for

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.