I have multiple JSON payloads coming from various REST APIs, and I don't know their schema in advance. I am unable to use the DataFrame explode function, because I don't know the column names that the Spark API generates.
1. Can we recover the keys of the nested array elements by decoding dataframe.schema.fields? Spark only exposes the value part in the DataFrame rows and uses the top-level key as the column name.
DataFrame:
+--------------------+
| stackoverflow|
+--------------------+
|[[[Martin Odersky...|
+--------------------+
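Without a Spark session handy, here is a sketch of how the nested keys could be recovered. It assumes the schema is available as the dict that PySpark's `df.schema.jsonValue()` returns (a real `jsonValue()` dict also carries `nullable`/`metadata` entries, which the walk simply ignores); the field names below mirror the sample JSON further down.

```python
def collect_keys(field_type, prefix=""):
    """Recursively collect dotted key paths from a schema dict
    shaped like PySpark's df.schema.jsonValue()."""
    keys = []
    if isinstance(field_type, dict):
        if field_type.get("type") == "struct":
            for f in field_type["fields"]:
                path = f"{prefix}.{f['name']}" if prefix else f["name"]
                keys.append(path)
                keys.extend(collect_keys(f["type"], path))
        elif field_type.get("type") == "array":
            # array elements keep the parent's path
            keys.extend(collect_keys(field_type["elementType"], prefix))
    return keys

# Schema dict shaped like df.schema.jsonValue() for the sample payload
schema = {
    "type": "struct",
    "fields": [{
        "name": "stackoverflow",
        "type": {"type": "array", "elementType": {"type": "struct", "fields": [{
            "name": "tag",
            "type": {"type": "struct", "fields": [
                {"name": "id", "type": "long"},
                {"name": "name", "type": "string"},
                {"name": "author", "type": "string"},
                {"name": "frameworks", "type": {
                    "type": "array",
                    "elementType": {"type": "struct", "fields": [
                        {"name": "id", "type": "long"},
                        {"name": "name", "type": "string"},
                    ]}}},
            ]}}]}},
    }],
}

print(collect_keys(schema))
# -> ['stackoverflow', 'stackoverflow.tag', 'stackoverflow.tag.id', ...]
```

This way the nested array-element keys are known before touching any rows, so the follow-up explode/select calls can be built from strings at run time.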
2. Is there an optimal way to flatten the JSON using DataFrame methods, determining the schema at run time?
Sample JSON:
{
  "stackoverflow": [{
      "tag": {
        "id": 1,
        "name": "scala",
        "author": "Martin Odersky",
        "frameworks": [
          {
            "id": 1,
            "name": "Play Framework"
          },
          {
            "id": 2,
            "name": "Akka Framework"
          }
        ]
      }
    },
    {
      "tag": {
        "id": 2,
        "name": "java",
        "author": "James Gosling",
        "frameworks": [
          {
            "id": 1,
            "name": "Apache Tomcat"
          },
          {
            "id": 2,
            "name": "Spring Boot"
          }
        ]
      }
    }
  ]
}
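As a sketch of the run-time flattening I have in mind (again without a Spark session, so the DataFrame steps are recorded as strings rather than executed): walk a schema dict shaped like `df.schema.jsonValue()`, `explode()` every ArrayType column, replace every StructType column with its children, and repeat until only atomic columns remain. The field names are taken from the sample payload; in a real job each recorded step would be a `withColumn`/`select` call on the DataFrame instead.

```python
def flatten_plan(fields):
    """Return (steps, final_columns) for a list of schema-dict fields."""
    steps = []
    cols = {f["name"]: f["type"] for f in fields}
    while True:
        nested = {n: t for n, t in cols.items()
                  if isinstance(t, dict) and t["type"] in ("array", "struct")}
        if not nested:
            return steps, sorted(cols)
        name, t = next(iter(nested.items()))
        if t["type"] == "array":
            # real job: df = df.withColumn(name, explode(col(name)))
            steps.append(f"explode({name})")
            cols[name] = t["elementType"]
        else:
            # real job: select all other columns plus
            # col(name + "." + child).alias(name + "_" + child) for each child
            steps.append(f"expand({name})")
            del cols[name]
            for child in t["fields"]:
                cols[f"{name}_{child['name']}"] = child["type"]

# Fields list shaped like df.schema.jsonValue()["fields"] for the sample
sample_fields = [{
    "name": "stackoverflow",
    "type": {"type": "array", "elementType": {"type": "struct", "fields": [{
        "name": "tag",
        "type": {"type": "struct", "fields": [
            {"name": "id", "type": "long"},
            {"name": "name", "type": "string"},
            {"name": "author", "type": "string"},
            {"name": "frameworks", "type": {
                "type": "array",
                "elementType": {"type": "struct", "fields": [
                    {"name": "id", "type": "long"},
                    {"name": "name", "type": "string"},
                ]}}},
        ]}}]}},
}]

steps, final_cols = flatten_plan(sample_fields)
print(steps)
print(final_cols)
```

The point is that the whole plan is derived from the schema alone, so no per-record JSON parsing is needed outside Spark.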
Note: we need to do all the operations in DataFrames, because a huge amount of data is coming in and we cannot parse each JSON individually.