I have a DataFrame with nested structure (originally an Avro output from a mapreduce job). I would like to flatten it. The schema of original DataFrame looks like this (simplified):
|-- key: struct
| |-- outcome: boolean
| |-- date: string
| |-- age: int
| |-- features: map
| | |-- key: string
| | |-- value: double
|-- value: struct (nullable = true)
| |-- nullString: string (nullable = true)
In Json representation one row of the data looks like this:
{"key":
{"outcome": false,
"date": "2015-01-01",
"age" : 20,
"features": {
{"f1": 10.0,
"f2": 11.0,
...
"f100": 20.1
}
},
"value": null
}
The features map has the same structure for all rows, i.e. the key set is the same (f1, f2, ..., f100). By "flatten" I mean the following.
+----------+----------+---+----+----+-...-+------+
| outcome| date|age| f1| f2| ... | f100|
+----------+----------+---+----+----+-...-+------+
| true|2015-01-01| 20|10.0|11.0| ... | 20.1|
...
(truncated)
I am using Spark 2.1.0 the spark-avro package from https://github.com/databricks/spark-avro.
The original dataframe is read in by
import com.databricks.spark.avro._
val df = spark.read.avro("path/to/my/file.avro")
// it's nested
df.show()
+--------------------+------+
| key| value|
+--------------------+------+
|[false,2015... |[null]|
|[false,2015... |[null]|
...
(truncated)
Any help is greatly appreciated!