I have a DataFrame with a nested structure (originally Avro output from a MapReduce job). I would like to flatten it. The schema of the original DataFrame looks like this (simplified):

|-- key: struct
|    |-- outcome: boolean
|    |-- date: string
|    |-- age: int
|    |-- features: map
|    |    |-- key: string
|    |    |-- value: double
|-- value: struct (nullable = true)
|    |-- nullString: string (nullable = true)

In JSON representation, one row of the data looks like this:

{"key": 
    {"outcome": false,
     "date": "2015-01-01",
     "age" : 20,
     "features": {
        {"f1": 10.0,
         "f2": 11.0,
         ...
         "f100": 20.1
        }
     },
  "value": null
 }

The features map has the same structure for all rows, i.e. the key set is the same (f1, f2, ..., f100). By "flatten" I mean the following.

+----------+----------+---+----+----+-...-+------+
|   outcome|      date|age|  f1|  f2| ... |  f100|
+----------+----------+---+----+----+-...-+------+
|     false|2015-01-01| 20|10.0|11.0| ... |  20.1|
...
(truncated)

I am using Spark 2.1.0 and the spark-avro package from https://github.com/databricks/spark-avro.

The original DataFrame is read in by:

import com.databricks.spark.avro._
val df = spark.read.avro("path/to/my/file.avro")
// it's nested
df.show()
+--------------------+------+
|                 key| value|
+--------------------+------+
|[false,2015...      |[null]|
|[false,2015...      |[null]|
...
(truncated)

Any help is greatly appreciated!

1 Answer

In Spark you can extract data from a nested Avro file. For example, the JSON you have provided:

{"key": 
    {"outcome": false,
     "date": "2015",
     "features": {
        {"f1": v1,
         "f2": v2,
         ...
        }
     },
  "value": null
 }

after being read from Avro:

import com.databricks.spark.avro._
val df = spark.read.avro("path/to/my/file.avro")

can be flattened by star-expanding the nested columns. For that, you can write something like this:

df.select("key.*").show
+----+------------+-------+
|date|  features  |outcome|
+----+------------+-------+
|2015| [v1,v2,...]|  false|
+----+------------+-------+
...
(truncated)

df.select("key.*").printSchema
root
 |-- date: string (nullable = true)
 |-- features: struct (nullable = true)
 |    |-- f1: string (nullable = true)
 |    |-- f2: string (nullable = true)
 |    |-- ...
 |-- outcome: boolean (nullable = true)

or something like this:

df.select("key.features.*").show
+---+---+---
| f1| f2|...
+---+---+---
| v1| v2|...
+---+---+---

...
(truncated)

df.select("key.features.*").printSchema
root
 |-- f1: string (nullable = true)
 |-- f2: string (nullable = true)
 |-- ...

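When features is a struct, as in this example, both levels can also be flattened in one select. This is only a small sketch using the column names from the sample schema above:

// Expand the scalar fields and the nested features struct in a single select.
// Note: this relies on features being a struct; a map cannot be star-expanded.
df.select("key.outcome", "key.date", "key.features.*").show
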
Hopefully this is the output you are expecting.

5 Comments

Thanks @himanshullTian! I didn't know you could use * as in select("key.*"). That brings things a lot closer to what I need. However, df.select("key.features.*") doesn't work, since features is a map, not a struct, and I got org.apache.spark.sql.AnalysisException: Can only star expand struct data types. Could you please advise? I also edited my question to make it clearer what I meant by flattening the structure.
@RandomCertainty I tried to flatten the map type in Spark DataFrames but had no success :( I think * expansion does not work on map types.
Thanks @himanshullTian for looking into this. Some people suggested the "explode" method of DataFrame, but I don't think it applies to my situation directly. It seems one other option is to treat the original DataFrame as an RDD of Rows, manually map each row to the desired format with a new schema, and then convert back to a DataFrame. I hope there's a more elegant way of doing this (see the sketch after these comments).
You are right @RandomCertainty! explode won't work for this use case. Maybe mapping each row into a different format would work as expected.
I spent hours trying to figure out how to explode a nested Avro element until I found your answer. Is there any further documentation you recommend?
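
Following up on the comments above: since the feature key set is fixed (f1, ..., f100), one way to get the flattened layout from the question without dropping to the RDD API is to pull each map entry out with getItem. This is only a minimal sketch under that assumption; the key list could also be collected from the data first.

import org.apache.spark.sql.functions.col

// Assumed fixed key set f1 .. f100, as described in the question.
val featureKeys = (1 to 100).map(i => s"f$i")

// Columns for the plain struct fields, plus one column per map key,
// each named after the key it extracts.
val structCols  = Seq(col("key.outcome"), col("key.date"), col("key.age"))
val featureCols = featureKeys.map(k => col("key.features").getItem(k).alias(k))

val flatDf = df.select((structCols ++ featureCols): _*)
flatDf.show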
