
I have the following JSON:

{
    "value":[
            {"C1":"val1","C2":"val2"},
            {"C1":"val1","C2":"val2"},
            {"C1":"val1","C2":"val2"}
        ]
}

That I am trying to read like this:

spark.read
  .option("multiLine", true).option("mode", "PERMISSIVE")
  .json("/Projects.json")
  .show(10)

But it does not show my records properly in the DataFrame. How do I get around that "value" nesting so my rows end up in the DataFrame correctly?

Current result: (screenshot of the incorrect output omitted)

The result I am trying to get is:

    C1   | C2
    -----|-----
    VAL1 | VAL2
    VAL1 | VAL2
    ...etc
  • What do you want it to look like? Commented Mar 11, 2021 at 13:50
  • I want a DataFrame that shows the columns C1 and C2. I added a sample to my question :) Commented Mar 11, 2021 at 13:54
  • @mike any ideas? ^^ I'm stuck as hell Commented Mar 11, 2021 at 14:17
  • I had some time to look more closely at your question. I guess it is easier to just use Spark's built-in SQL functions. Commented Mar 11, 2021 at 16:02

3 Answers


Looking at the schema of the Dataframe (jsonDf) returned by spark.read:

jsonDf.printSchema()
root
 |-- value: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- C1: string (nullable = true)
 |    |    |-- C2: string (nullable = true)

you could use the SQL function explode and then select the two fields C1 and C2, as shown below:

  import org.apache.spark.sql.functions.{col, explode}

  // explode turns each element of the "value" array into its own row.
  val df = jsonDf
    .withColumn("parsedJson", explode(col("value")))
    .withColumn("C1", col("parsedJson.C1"))
    .withColumn("C2", col("parsedJson.C2"))
    .select(col("C1"), col("C2"))

  df.show(false)

This leads to the required outcome:

+----+----+
|C1  |C2  |
+----+----+
|val1|val2|
|val1|val2|
|val1|val2|
+----+----+

1 Comment

My column has a string value like [{"C1":"val1","C2":"val2"},{"C1":"val1","C2":"val2"},{"C1":"val1","C2":"val2"}]. How do I convert this into the above format?
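If the column holds the array as a plain JSON string rather than a parsed array (as this comment describes), one option is to parse it first with from_json and a matching schema, then explode as in the answer above. A minimal sketch, assuming the string lives in a hypothetical column named raw:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

val spark = SparkSession.builder().master("local[*]").appName("from-json-sketch").getOrCreate()
import spark.implicits._

// Hypothetical input: the JSON array arrives as a string column called "raw".
val raw = Seq("""[{"C1":"val1","C2":"val2"},{"C1":"val1","C2":"val2"}]""").toDF("raw")

// Schema describing the array elements, mirroring the question's data.
val schema = ArrayType(StructType(Seq(
  StructField("C1", StringType),
  StructField("C2", StringType)
)))

// Parse the string into an array of structs, then explode and flatten.
val parsed = raw
  .withColumn("value", from_json(col("raw"), schema))
  .select(explode(col("value")).as("row"))
  .select(col("row.C1").as("C1"), col("row.C2").as("C2"))

parsed.show(false)
```

The key step is that from_json needs a schema up front; if the structure is not known ahead of time, schema_of_json can help infer one.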

I finally managed to find a solution to my problem using the following function:

  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.functions.col
  import org.apache.spark.sql.types.{ArrayType, StructType}

  def flattenDataframe(df: DataFrame): DataFrame = {
    val fields = df.schema.fields
    val fieldNames = fields.map(_.name)

    for (i <- fields.indices) {
      val field = fields(i)
      val fieldName = field.name
      field.dataType match {
        case _: ArrayType =>
          // Explode the array column in place, then recurse on the result.
          val fieldNamesExcludingArray = fieldNames.filter(_ != fieldName)
          val fieldNamesAndExplode = fieldNamesExcludingArray ++
            Array(s"explode_outer($fieldName) as $fieldName")
          val explodedDf = df.selectExpr(fieldNamesAndExplode: _*)
          return flattenDataframe(explodedDf)
        case structType: StructType =>
          // Promote struct fields to top-level columns (dots become underscores), then recurse.
          val childFieldNames = structType.fieldNames.map(childName => fieldName + "." + childName)
          val newFieldNames = fieldNames.filter(_ != fieldName) ++ childFieldNames
          val renamedCols = newFieldNames.map(x => col(x).as(x.replace(".", "_")))
          val flattenedDf = df.select(renamedCols: _*)
          return flattenDataframe(flattenedDf)
        case _ =>
      }
    }
    df
  }

Source: https://medium.com/@saikrishna_55717/flattening-nested-data-json-xml-using-apache-spark-75fa4c8ea2a7



Using inline will do the job:

val df = spark.read
  .option("multiLine", true).option("mode", "PERMISSIVE")
  .json("/Projects.json")

val df2 = df.selectExpr("inline(value)")
df2.show
+----+----+
|  C1|  C2|
+----+----+
|val1|val2|
|val1|val2|
|val1|val2|
+----+----+
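inline(value) expands an array of structs into one row per element, with one column per struct field, in a single step; it is equivalent to exploding the array and then selecting the struct's fields. A self-contained sketch with the sample data inlined (since the /Projects.json path from the question is not available here):

```scala
import org.apache.spark.sql.{Encoders, SparkSession}

val spark = SparkSession.builder().master("local[*]").appName("inline-sketch").getOrCreate()

// Inline the question's sample JSON instead of reading it from a file.
val json = Seq("""{"value":[{"C1":"val1","C2":"val2"},{"C1":"val1","C2":"val2"},{"C1":"val1","C2":"val2"}]}""")
val df = spark.read.json(spark.createDataset(json)(Encoders.STRING))

// inline(value) replaces explode + a select of every struct field.
val df2 = df.selectExpr("inline(value)")
df2.show()
```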
