
I have the following JSON:

{
    "value":[
            {"C1":"val1","C2":"val2"},
            {"C1":"val1","C2":"val2"},
            {"C1":"val1","C2":"val2"}
        ]
}

That I am trying to read like this:

spark.read
  .option("multiLine", true).option("mode", "PERMISSIVE")
  .json("/Projects.json")
  .show(10)

But it does not show my records properly in the DataFrame. How do I get around that "value" nesting so my rows end up in the DataFrame correctly?

Current result: (screenshot of the incorrect output omitted)

The result I am trying to get is:

    C1   | C2
    -----|-----
    VAL1 | VAL2
    VAL1 | VAL2
    ...etc
  • What do you want it to look like? Commented Mar 11, 2021 at 13:50
  • I want a DataFrame that shows the columns C1 and C2. I added a sample to my question :) Commented Mar 11, 2021 at 13:54
  • @mike any ideas? ^^ I'm stuck as hell Commented Mar 11, 2021 at 14:17
  • I had some time to look more closely at your question. I guess it is easier to just use Spark's built-in SQL functions. Commented Mar 11, 2021 at 16:02

3 Answers


Looking at the schema of the Dataframe (jsonDf) returned by spark.read:

jsonDf.printSchema()
root
 |-- value: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- C1: string (nullable = true)
 |    |    |-- C2: string (nullable = true)

you could use the SQL function explode and then select the two fields C1 and C2, as shown below:

  import org.apache.spark.sql.functions.{col, explode}

  // explode turns each element of the "value" array into its own row.
  val df = jsonDf
    .withColumn("parsedJson", explode(col("value")))
    .withColumn("C1", col("parsedJson.C1"))
    .withColumn("C2", col("parsedJson.C2"))
    .select(col("C1"), col("C2"))

  df.show(false)

This leads to the required outcome:

+----+----+
|C1  |C2  |
+----+----+
|val1|val2|
|val1|val2|
|val1|val2|
+----+----+

1 Comment

My column has a string value like [{"C1":"val1","C2":"val2"},{"C1":"val1","C2":"val2"},{"C1":"val1","C2":"val2"}]. How do I convert this into the above format?
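If the column holds the array as a plain JSON string rather than a parsed array (as this comment describes), one option is to parse it first with from_json and a matching schema, then explode as in the answer above. A minimal sketch, assuming the string lives in a hypothetical column named raw:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

val spark = SparkSession.builder().master("local[*]").appName("from-json-sketch").getOrCreate()
import spark.implicits._

// Hypothetical input: the JSON array arrives as a string column called "raw".
val raw = Seq("""[{"C1":"val1","C2":"val2"},{"C1":"val1","C2":"val2"}]""").toDF("raw")

// Schema describing the array elements, mirroring the question's data.
val schema = ArrayType(StructType(Seq(
  StructField("C1", StringType),
  StructField("C2", StringType)
)))

// Parse the string into an array of structs, then explode and flatten.
val parsed = raw
  .withColumn("value", from_json(col("raw"), schema))
  .select(explode(col("value")).as("row"))
  .select(col("row.C1").as("C1"), col("row.C2").as("C2"))

parsed.show(false)
```

The key step is that from_json needs a schema up front; if the structure is not known ahead of time, schema_of_json can help infer one.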

I finally managed to find a solution to my problem using the following function:

  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.functions.col
  import org.apache.spark.sql.types.{ArrayType, StructType}

  def flattenDataframe(df: DataFrame): DataFrame = {
    val fields = df.schema.fields
    val fieldNames = fields.map(_.name)

    for (i <- fields.indices) {
      val field = fields(i)
      val fieldName = field.name
      field.dataType match {
        case _: ArrayType =>
          // Explode the array column in place, then recurse on the result.
          val fieldNamesExcludingArray = fieldNames.filter(_ != fieldName)
          val fieldNamesAndExplode = fieldNamesExcludingArray ++
            Array(s"explode_outer($fieldName) as $fieldName")
          val explodedDf = df.selectExpr(fieldNamesAndExplode: _*)
          return flattenDataframe(explodedDf)
        case structType: StructType =>
          // Promote struct fields to top-level columns (dots become underscores), then recurse.
          val childFieldNames = structType.fieldNames.map(childName => fieldName + "." + childName)
          val newFieldNames = fieldNames.filter(_ != fieldName) ++ childFieldNames
          val renamedCols = newFieldNames.map(x => col(x).as(x.replace(".", "_")))
          val flattenedDf = df.select(renamedCols: _*)
          return flattenDataframe(flattenedDf)
        case _ =>
      }
    }
    df
  }

Source: https://medium.com/@saikrishna_55717/flattening-nested-data-json-xml-using-apache-spark-75fa4c8ea2a7



Using inline will do the job:

val df = spark.read
  .option("multiLine", true).option("mode", "PERMISSIVE")
  .json("/Projects.json")

val df2 = df.selectExpr("inline(value)")
df2.show
+----+----+
|  C1|  C2|
+----+----+
|val1|val2|
|val1|val2|
|val1|val2|
+----+----+
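inline(value) expands an array of structs into one row per element, with one column per struct field, in a single step; it is equivalent to exploding the array and then selecting the struct's fields. A self-contained sketch with the sample data inlined (since the /Projects.json path from the question is not available here):

```scala
import org.apache.spark.sql.{Encoders, SparkSession}

val spark = SparkSession.builder().master("local[*]").appName("inline-sketch").getOrCreate()

// Inline the question's sample JSON instead of reading it from a file.
val json = Seq("""{"value":[{"C1":"val1","C2":"val2"},{"C1":"val1","C2":"val2"},{"C1":"val1","C2":"val2"}]}""")
val df = spark.read.json(spark.createDataset(json)(Encoders.STRING))

// inline(value) replaces explode + a select of every struct field.
val df2 = df.selectExpr("inline(value)")
df2.show()
```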
