Say you have a dataframe that looks like the following:
df = spark.createDataFrame([
    ('test1', 7, ['', 'hi', ''], ['', '', '0']),
    ('', 14, ['', '', '6', ''], ['98', '0', '', '9'])
], ["a", "b", "c", "d"])
df.show()
| a | b | c | d |
|---|---|---|---|
| 'test1' | 7 | ['', 'hi', ''] | ['', '', '0'] |
| '' | 14 | ['', '', '6', ''] | ['98', '0', '', '9'] |
Is there a way to replace all of the empty strings inside the array columns (c and d) with None/null? The result would look like this:
| a | b | c | d |
|---|---|---|---|
| 'test1' | 7 | [null, 'hi', null] | [null, null, '0'] |
| null | 14 | [null, null, '6', null] | ['98', '0', null, '9'] |
The key is that I need to keep the position of every element in each array; I just want null values instead of empty strings inside the arrays.
I have been able to turn the empty strings in the non-array columns into null with the following code:
from pyspark.sql import functions as F
# only touch the plain string columns; comparing an array column to "" raises a type error
df = df.select([F.when(F.col(c) == "", None).otherwise(F.col(c)).alias(c)
                if t == "string" else F.col(c) for c, t in df.dtypes])
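For reference, after that select the scalar columns come out the way I want, but the empty strings inside the arrays are untouched (roughly what df.show() gives me with the sample data above):
| a | b | c | d |
|---|---|---|---|
| 'test1' | 7 | ['', 'hi', ''] | ['', '', '0'] |
| null | 14 | ['', '', '6', ''] | ['98', '0', '', '9'] |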
I have looked at the array functions in the Databricks documentation here: https://docs.databricks.com/en/sql/language-manual/sql-ref-functions-builtin-alpha.html
Using array_remove() I could strip all of the empty strings out of the arrays, but that runs into the same challenge: I need to keep every element's position, with a null where the empty string was. I can't just remove the strings without replacing them.
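To illustrate (a quick sketch; array_remove() has been in pyspark.sql.functions since Spark 2.4):
df.select(F.array_remove(F.col("c"), "").alias("c")).show()
# row 1 becomes ['hi'] and row 2 becomes ['6'] -- the surviving elements
# shift left, so the positional information I need is gone
Is there a way to do this?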