I have a DataFrame with the schema below. I want all columns, including nested fields, to be sorted alphabetically, using Scala Spark.

root
 |-- metadata2: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- attribute2: string (nullable = true)
 |    |    |-- attribute1: string (nullable = true)
 |-- metadata3: string (nullable = true)
 |-- metadata1: struct (nullable = true)
 |    |-- attribute2: string (nullable = true)
 |    |-- attribute1: string (nullable = true)

When I sort using schema.sortBy(_.name), I get the schema below (the fields nested inside the array and struct types are not sorted):

root
 |-- metadata1: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- attribute2: string (nullable = true)
 |    |    |-- attribute1: string (nullable = true)
 |-- metadata2: struct (nullable = true)
 |    |-- attribute2: string (nullable = true)
 |    |-- attribute1: string (nullable = true)
 |-- metadata3: string (nullable = true)

The schema I want is shown below. (Even the fields inside metadata1 (ArrayType) and metadata2 (StructType) should be sorted.)

root
 |-- metadata1: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- attribute1: string (nullable = true)
 |    |    |-- attribute2: string (nullable = true)
 |-- metadata2: struct (nullable = true)
 |    |-- attribute1: string (nullable = true)
 |    |-- attribute2: string (nullable = true)
 |-- metadata3: string (nullable = true)

Thanks in advance.

1 Answer

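A version that works on the StructType directly. Note that it only sorts the top-level columns and the fields of structs nested one level deep: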

import spark.implicits._
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

val schema = StructType(Seq(
  StructField("metadata2",       StructType(
    Seq(StructField("attribute2", StringType),
      StructField("attribute1", StringType)))),
  StructField("metadata3", StringType),
  StructField("metadata1", ArrayType(StringType)
  )
))

schema.foreach(println _)
//  StructField(metadata2,StructType(StructField(attribute2,StringType,true), StructField(attribute1,StringType,true)),true)
//  StructField(metadata3,StringType,true)
//  StructField(metadata1,ArrayType(StringType,true),true)


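// Sort the top-level fields by name, then sort the fields of any directly nested struct.
val schemaResult = schema.sortBy(_.name).map { c =>
  c.dataType match {
    // copy() keeps the field's original nullability and metadata
    case structType: StructType => c.copy(dataType = StructType(structType.fields.sortBy(_.name)))
    // arrays, maps and atomic types are left unchanged
    case _ => c
  }
}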

schemaResult.foreach(println _)
//  StructField(metadata1,ArrayType(StringType,true),true)
//  StructField(metadata2,StructType(StructField(attribute1,StringType,true), StructField(attribute2,StringType,true)),true)
//  StructField(metadata3,StringType,true)
println(schemaResult)
//  List(StructField(metadata1,ArrayType(StringType,true),true), StructField(metadata2,StructType(StructField(attribute1,StringType,true), StructField(attribute2,StringType,true)),true), StructField(metadata3,StringType,true))
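
The version above stops at one level of struct nesting and leaves array elements untouched. For schemas with structs inside arrays (or structs inside structs inside arrays), a recursive variant may help. The following is only a minimal sketch: the helper names sortDataType and sortSchema are hypothetical, and the example schema is an assumption modelled on the one in the question. It can be pasted into spark-shell.

```scala
import org.apache.spark.sql.types.{ArrayType, DataType, MapType, StringType, StructField, StructType}

// Hypothetical helper: walks a DataType and sorts struct fields by name at every level.
def sortDataType(dataType: DataType): DataType = dataType match {
  case st: StructType =>
    // Sort this struct's fields, then recurse into each field's own type.
    // copy() keeps each field's nullability and metadata.
    StructType(st.fields.sortBy(_.name).map(f => f.copy(dataType = sortDataType(f.dataType))))
  case ArrayType(elementType, containsNull) =>
    // Recurse into the element type so structs inside arrays are sorted too.
    ArrayType(sortDataType(elementType), containsNull)
  case MapType(keyType, valueType, valueContainsNull) =>
    MapType(sortDataType(keyType), sortDataType(valueType), valueContainsNull)
  case other =>
    // Atomic types have nothing to sort.
    other
}

// Hypothetical convenience wrapper for a top-level schema.
def sortSchema(schema: StructType): StructType =
  sortDataType(schema).asInstanceOf[StructType]

// Example schema (an assumption, shaped like the one in the question).
val nested = StructType(Seq(
  StructField("metadata2", ArrayType(StructType(Seq(
    StructField("attribute2", StringType),
    StructField("attribute1", StringType))))),
  StructField("metadata3", StringType),
  StructField("metadata1", StructType(Seq(
    StructField("attribute2", StringType),
    StructField("attribute1", StringType))))
))

val sorted = sortSchema(nested)
sorted.printTreeString()
```

This only reorders the schema object. The DataFrame's data would still have to be selected or rebuilt in the same field order, which is not covered here.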
3 Comments

Thanks for your solution. In my case metadata2 is an array of structs, and I have several deeply nested fields, such as a struct of arrays of structs and an array of structs of structs. The solution should sort fields at every nesting level.
The linked solution does not work for structs that are nested in arrays.
