I am trying to create a dataframe that returns an empty array for a nested struct type if another column is false. I created a dummy dataframe to illustrate my problem.
import spark.implicits._
val newDf = spark.createDataFrame(Seq(
("user1","true", Some(8), Some("usd"), Some("tx1")),
("user1", "true", Some(9), Some("usd"), Some("tx2")),
("user2", "false", None, None, None))).toDF("userId","flag", "amount", "currency", "transactionId")
val amountStruct = struct("amount"
,"currency").alias("amount")
val transactionStruct = struct("transactionId"
, "amount").alias("transactions")
val dataStruct = struct("flag","transactions").alias("data")
val finalDf = newDf.
withColumn("amount", amountStruct).
withColumn("transactions", transactionStruct).
select("userId", "flag","transactions").
groupBy("userId", "flag").
agg(collect_list("transactions").alias("transactions")).
withColumn("data", dataStruct).
drop("transactions","flag")
This is the output:
+------+--------------------+
|userId| data|
+------+--------------------+
| user2| [false, [[, [,]]]]|
| user1|[true, [[tx1, [8,...|
+------+--------------------+
and schema:
root
|-- userId: string (nullable = true)
|-- data: struct (nullable = false)
| |-- flag: string (nullable = true)
| |-- transactions: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- transactionId: string (nullable = true)
| | | |-- amount: struct (nullable = false)
| | | | |-- amount: integer (nullable = true)
| | | | |-- currency: string (nullable = true)
The output I want:
+------+--------------------+
|userId| data|
+------+--------------------+
| user2| [false, []] |
| user1|[true, [[tx1, [8,...|
+------+--------------------+
I've tried doing this before doing collect_list but no luck.
import org.apache.spark.sql.functions.typedLit
val emptyArray = typedLit(Array.empty[(String, Array[(Int, String)])])
testDf.withColumn("transactions", when($"flag" === "false", emptyArray).otherwise($"transactions")).show()