Spark SQL: Transforming some rows into columns

Question

Is there a way to transform a org.apache.spark.sql.DataFrame like this

Predictor  icaoCode  num1  num2
P1         OTHH       1.1   1.2
P1         ZGGG       2.1   2.2
P2         OTHH       3.1   3.2
P2         ZGGG       4.1   4.2
P3         OTHH       5.1   5.2
P3         ZGGG       6.1   6.2
.           .          .     .
.           .          .     .
.           .          .     .

into a DataFrame like this?

icaoCode  P1.num1  P1.num2  P2.num1  P2.num2  P3.num1  P3.num2  ...
OTHH          1.1      1.2      3.1      3.2      5.1      5.2  ...
ZGGG          2.1      2.2      4.1      4.2      6.1      6.2  ...
 .             .        .        .        .        .        .   ...    
 .             .        .        .        .        .        .   ...    
 .             .        .        .        .        .        .   ...

There can be an arbitrary number of values for Predictor and for icaoCode.

Sascha Vetter · Accepted Answer · 2016-10-17 21:31:58Z

With Spark 1.6.0, there is a pivot function to transform/transpose your data. In your case it requires some preprocessing to get the data ready for pivot. Here an example how I'd do it:

def doPivot(): Unit = {
  val sqlContext: SQLContext = new org.apache.spark.sql.SQLContext(sc)

  // dummy data
  val r1 = Input("P1", "OTHH", 1.1, 1.2)
  val r2 = Input("P1", "ZGGG", 2.1, 2.2)
  val r3 = Input("P2", "OTHH", 3.1, 3.2)

  val records = Seq(r1, r2, r3)
  val df = sqlContext.createDataFrame(records)

  // prepare data for pivot
  val fullName: ((String, String) => String) = (predictor: String, num: String) => {
    predictor + "." + num
  }
  val udfFullName = udf(fullName)
  val dfFullName = df.withColumn("num1-complete", udfFullName(col("predictor"), lit("num1")))
    .withColumn("num2-complete", udfFullName(col("predictor"), lit("num2")))

  val dfPrepared = dfFullName.select(col("icaoCode"), col("num1") as "num", col("num1-complete") as "value")
    .unionAll(dfFullName.select(col("icaoCode"), col("num2") as "num", col("num2-complete") as "value"))

  // transpose/pivot dataframe
  val dfPivoted = dfPrepared.groupBy(col("icaoCode")).pivot("value").mean("num")
  dfPivoted.show()
}

case class Input(predictor: String, icaoCode: String, num1: Double, num2: Double)

The final dataframe should work for you:

+--------+-------+-------+-------+-------+
|icaoCode|P1.num1|P1.num2|P2.num1|P2.num2|
+--------+-------+-------+-------+-------+
|    OTHH|    1.1|    1.2|    3.1|    3.2|
|    ZGGG|    2.1|    2.2|   null|   null|
+--------+-------+-------+-------+-------+

Collectives™ on Stack Overflow

Spark SQL: Transforming some rows into columns

1 Answer 1

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Related