I have a DataFrame with a nested structure (originally Avro output from a MapReduce job). I would like to flatten it. The schema of the original DataFrame looks like this (simplified):

|-- key: struct
|    |-- outcome: boolean
|    |-- date: string
|    |-- age: int
|    |-- features: map
|    |    |-- key: string
|    |    |-- value: double
|-- value: struct (nullable = true)
|    |-- nullString: string (nullable = true)

In JSON representation, one row of the data looks like this:

{"key": 
    {"outcome": false,
     "date": "2015-01-01",
     "age" : 20,
     "features": {
        {"f1": 10.0,
         "f2": 11.0,
         ...
         "f100": 20.1
        }
     },
  "value": null
 }

The features map has the same structure for all rows, i.e. the key set is the same (f1, f2, ..., f100). By "flatten" I mean the following.

+----------+----------+---+----+----+-...-+------+
|   outcome|      date|age|  f1|  f2| ... |  f100|
+----------+----------+---+----+----+-...-+------+
|     false|2015-01-01| 20|10.0|11.0| ... |  20.1|
...
(truncated)

I am using Spark 2.1.0 and the spark-avro package from https://github.com/databricks/spark-avro.

The original DataFrame is read in by:

import com.databricks.spark.avro._
val df = spark.read.avro("path/to/my/file.avro")
// it's nested
df.show()
+--------------------+------+
|                 key| value|
+--------------------+------+
|[false,2015...      |[null]|
|[false,2015...      |[null]|
...
(truncated)

Any help is greatly appreciated!

1 Answer

In Spark you can extract data from a nested Avro file. For example, the JSON you have provided:

{"key": 
    {"outcome": false,
     "date": "2015",
     "features": {
        {"f1": v1,
         "f2": v2,
         ...
        }
     },
  "value": null
 }

after being read from Avro:

import com.databricks.spark.avro._
val df = spark.read.avro("path/to/my/file.avro")

can be flattened by star-expanding the nested columns. For that, you can write something like this:

df.select("key.*").show
+----+------------+-------+
|date|  features  |outcome|
+----+------------+-------+
|2015| [v1,v2,...]|  false|
+----+------------+-------+
...
(truncated)

df.select("key.*").printSchema
root
 |-- date: string (nullable = true)
 |-- features: struct (nullable = true)
 |    |-- f1: string (nullable = true)
 |    |-- f2: string (nullable = true)
 |    |-- ...
 |-- outcome: boolean (nullable = true)

or something like this:

df.select("key.features.*").show
+---+---+---
| f1| f2|...
+---+---+---
| v1| v2|...
+---+---+---

...
(truncated)

df.select("key.features.*").printSchema
root
 |-- f1: string (nullable = true)
 |-- f2: string (nullable = true)
 |-- ...

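When features is a struct, as in this example, both levels can also be flattened in one select. This is only a small sketch using the column names from the sample schema above:

// Expand the scalar fields and the nested features struct in a single select.
// Note: this relies on features being a struct; a map cannot be star-expanded.
df.select("key.outcome", "key.date", "key.features.*").show
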
Hopefully this is the output you are expecting.

5 Comments

Thanks @himanshullTian! I didn't know you could use * as in select("key.*"). That brings things a lot closer to what I need. However, df.select("key.features.*") doesn't work, since features is a map, not a struct, and I got org.apache.spark.sql.AnalysisException: Can only star expand struct data types. Could you please advise? I also edited my question to make it clearer what I meant by flattening the structure.
@RandomCertainty I tried to flatten the map type in Spark DataFrames but had no success :( I think * expansion does not work on map types.
Thanks @himanshullTian for looking into this. Some people suggested the "explode" method of DataFrame, but I don't think it applies to my situation directly. It seems one other option is to treat the original DataFrame as an RDD of Rows, manually map each row to the desired format with a new schema, and then convert back to a DataFrame. I hope there's a more elegant way of doing this (see the sketch after these comments).
You are right @RandomCertainty! explode won't work for this use case. Maybe mapping each row into a different format would work as expected.
I spent hours trying to figure out how to explode a nested Avro element until I found your answer. Is there any further documentation you recommend?
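
Following up on the comments above: since the feature key set is fixed (f1, ..., f100), one way to get the flattened layout from the question without dropping to the RDD API is to pull each map entry out with getItem. This is only a minimal sketch under that assumption; the key list could also be collected from the data first.

import org.apache.spark.sql.functions.col

// Assumed fixed key set f1 .. f100, as described in the question.
val featureKeys = (1 to 100).map(i => s"f$i")

// Columns for the plain struct fields, plus one column per map key,
// each named after the key it extracts.
val structCols  = Seq(col("key.outcome"), col("key.date"), col("key.age"))
val featureCols = featureKeys.map(k => col("key.features").getItem(k).alias(k))

val flatDf = df.select((structCols ++ featureCols): _*)
flatDf.show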
