
Creating multiple columns from an array column

Dataframe

Car name |  details
Toyota   | [[year,2000],[price,20000]]
Audi     | [[mpg,22]]

Expected dataframe

Car name | year | price | mpg
Toyota   | 2000 | 20000 | null
Audi     | null | null | 22
  • Is (year, 2000) of type tuple or map? Can you print the schema of the dataframe? Commented Sep 29, 2019 at 9:04
  • Since your question is not related to spark-streaming, I am discarding that tag. Commented Sep 29, 2019 at 9:31

2 Answers


You can try this.

Let's define the data

scala> val carsDF = Seq(("toyota",Array(("year", 2000), ("price", 100000))), ("Audi", Array(("mpg", 22)))).toDF("car", "details")
carsDF: org.apache.spark.sql.DataFrame = [car: string, details: array<struct<_1:string,_2:int>>]

scala> carsDF.show(false)
+------+-----------------------------+
|car   |details                      |
+------+-----------------------------+
|toyota|[[year,2000], [price,100000]]|
|Audi  |[[mpg,22]]                   |
+------+-----------------------------+

Split the array and access the key/value pairs in the data

scala> val weDF = carsDF.withColumn("split", explode($"details")).withColumn("col", $"split"("_1")).withColumn("val", $"split"("_2")).select("car", "col", "val")
weDF: org.apache.spark.sql.DataFrame = [car: string, col: string ... 1 more field]

scala> weDF.show
+------+-----+------+
|   car|  col|   val|
+------+-----+------+
|toyota| year|  2000|
|toyota|price|100000|
|  Audi|  mpg|    22|
+------+-----+------+

Define the list of columns that are required

scala> val colNames = Seq("mpg", "price", "year", "dummy")
colNames: Seq[String] = List(mpg, price, year, dummy)

Pivoting on the column names defined above gives the required output. Supplying the new column names as a sequence makes them a single point of input:

scala> weDF.groupBy("car").pivot("col", colNames).agg(avg($"val")).show
+------+----+--------+------+-----+
|   car| mpg|   price|  year|dummy|
+------+----+--------+------+-----+
|toyota|null|100000.0|2000.0| null|
|  Audi|22.0|    null|  null| null|
+------+----+--------+------+-----+

This seems a more elegant and easier way to achieve the output.
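If the required keys are not known in advance, the hard-coded colNames sequence can instead be collected from the data itself before pivoting. A minimal self-contained sketch of that variation (assuming a local SparkSession; note the extra "dummy" column disappears, since only keys actually present in the data are collected):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, explode}

object DynamicPivot extends App {
  val spark = SparkSession.builder().master("local[*]").appName("dynamic-pivot").getOrCreate()
  import spark.implicits._

  val carsDF = Seq(
    ("toyota", Array(("year", 2000), ("price", 100000))),
    ("Audi",   Array(("mpg", 22)))
  ).toDF("car", "details")

  // One row per (key, value) pair, as in the answer above
  val weDF = carsDF
    .withColumn("split", explode($"details"))
    .select($"car", $"split"("_1").as("col"), $"split"("_2").as("val"))

  // Collect the distinct keys from the data instead of hard-coding them
  val colNames = weDF.select("col").distinct().as[String].collect().toSeq

  weDF.groupBy("car").pivot("col", colNames).agg(avg($"val")).show()

  spark.stop()
}
```

Collecting the keys adds one extra Spark job, which is usually acceptable for a small, bounded set of keys.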



You can do it like this:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit, when}
val df: DataFrame = Seq(
  ("toyota",Array(("year", 2000), ("price", 100000))),
  ("toyota",Array(("year", 2001)))
).toDF("car", "details")

 +------+-------------------------------+
 |car   |details                        |
 +------+-------------------------------+
 |toyota|[[year, 2000], [price, 100000]]|
 |toyota|[[year, 2001]]                 |
 +------+-------------------------------+

val newdf = df
  .withColumn("year", when(col("details")(0)("_1") === lit("year"), col("details")(0)("_2")).otherwise(col("details")(1)("_2")))
  .withColumn("price", when(col("details")(0)("_1") === lit("price"), col("details")(0)("_2")).otherwise(col("details")(1)("_2")))
  .drop("details")

newdf.show()
  +------+----+------+
  |   car|year| price|
  +------+----+------+
  |toyota|2000|100000|     
  |toyota|2001|  null|
  +------+----+------+

4 Comments

How can I do it if the year and price elements are not in order?
You can add a when/otherwise test inside the withColumn to check whether the value is year or price.
Hi @firsni, how can I make it dynamic? I updated the question.
You need to use a Map instead of an array to make it dynamic.
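As the last comment suggests, turning the array of pairs into a map makes the lookup independent of element order and of the array index. A minimal sketch of that idea (assuming Spark 2.4+, where map_from_entries is available; a key missing from a row's map simply yields null):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.map_from_entries

object MapLookup extends App {
  val spark = SparkSession.builder().master("local[*]").appName("map-lookup").getOrCreate()
  import spark.implicits._

  val df = Seq(
    ("toyota", Array(("price", 100000), ("year", 2000))), // order deliberately swapped
    ("toyota", Array(("year", 2001)))                     // price missing entirely
  ).toDF("car", "details")

  // array<struct<_1,_2>> -> map<string,int>, then look keys up by name
  val withMap = df.withColumn("m", map_from_entries($"details"))

  withMap
    .withColumn("year",  $"m"("year"))
    .withColumn("price", $"m"("price"))
    .drop("details", "m")
    .show()

  spark.stop()
}
```

Unlike the index-based when/otherwise version, this needs no extra branch per possible position, so it scales to any number of keys.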
