You have to use the explode function on the map columns first to destructure the maps into key and value columns, union the resulting datasets, apply distinct to de-duplicate, and only then groupBy with some custom Scala code to aggregate the maps back together.
Enough talking, let's do some coding then...
Given the datasets:
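For reproducibility, here is a minimal sketch of how such datasets could be built in spark-shell (the names a and b come from the question; toDF is available because the shell imports spark.implicits._ for you):

val a = Seq(("one", Map("1" -> "one", "2" -> "two"))).toDF("id", "cMap")
val b = Seq(("one", Map("1" -> "one"))).toDF("id", "cMap")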
scala> a.show(false)
+---+-----------------------+
|id |cMap |
+---+-----------------------+
|one|Map(1 -> one, 2 -> two)|
+---+-----------------------+
scala> a.printSchema
root
|-- id: string (nullable = true)
|-- cMap: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
scala> b.show(false)
+---+-------------+
|id |cMap |
+---+-------------+
|one|Map(1 -> one)|
+---+-------------+
scala> b.printSchema
root
|-- id: string (nullable = true)
|-- cMap: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
You should first use the explode function on the map columns.
explode(e: Column): Column
Creates a new row for each element in the given array or map column.
val a_keyValues = a.select($"*", explode($"cMap"))
scala> a_keyValues.show(false)
+---+-----------------------+---+-----+
|id |cMap |key|value|
+---+-----------------------+---+-----+
|one|Map(1 -> one, 2 -> two)|1 |one |
|one|Map(1 -> one, 2 -> two)|2 |two |
+---+-----------------------+---+-----+
val b_keyValues = b.select($"*", explode($"cMap"))
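For completeness, b_keyValues carries the single pair from b:

scala> b_keyValues.show(false)
+---+-------------+---+-----+
|id |cMap         |key|value|
+---+-------------+---+-----+
|one|Map(1 -> one)|1  |one  |
+---+-------------+---+-----+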
The following gives you distinct key-value pairs, which is exactly the deduplication you asked for. Note that the select drops the original cMap column on purpose: the two datasets carry different maps, so distinct has to run on the (id, key, value) triples alone.
val distinctKeyValues = a_keyValues.
union(b_keyValues).
select("id", "key", "value").
distinct // <-- deduplicate
scala> distinctKeyValues.show(false)
+---+---+-----+
|id |key|value|
+---+---+-----+
|one|1 |one |
|one|2 |two |
+---+---+-----+
Time for groupBy and creating the final map column.
val result = distinctKeyValues.
withColumn("map", map($"key", $"value")).
groupBy("id").
agg(collect_list("map")).
as[(String, Seq[Map[String, String]])]. // <-- switch from Rows to typed pairs
map { case (id, list) => (id, list.reduce(_ ++ _)) }. // <-- collect all entries under one map
toDF("id", "cMap") // <-- give the columns their names
scala> result.show(truncate = false)
+---+-----------------------+
|id |cMap |
+---+-----------------------+
|one|Map(1 -> one, 2 -> two)|
+---+-----------------------+
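As a side note, if you happen to be on Spark 2.4 or later, the typed map/reduce step can be skipped entirely; a minimal sketch using map_from_entries (not available back in 2.0):

import org.apache.spark.sql.functions.{collect_list, map_from_entries, struct}

// collect the (key, value) structs per id and build one map from them
val result24 = distinctKeyValues.
  groupBy("id").
  agg(map_from_entries(collect_list(struct($"key", $"value"))) as "cMap")

Either way, keep in mind that distinct only removes identical (id, key, value) triples; if a and b map the same key to different values, both entries survive and which value wins in the merged map depends on the order collect_list happens to produce.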
Please note that as of Spark 2.0.0, unionAll has been deprecated and union is the proper union operator:
(Since version 2.0.0) use union()
One last thing: if you hit an error mentioning org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCollectList, the import you have to use is import org.apache.spark.sql.functions.collect_list and it should work then.
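If you run the snippets above outside spark-shell, here is a sketch of the boilerplate they assume (a SparkSession called spark with its implicits in scope):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode, map, collect_list}

val spark = SparkSession.builder().appName("merge-map-columns").getOrCreate()
import spark.implicits._ // enables $"...", toDF and the typed Dataset API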