
Creating multiple columns from an array column

Dataframe

Car name |  details
Toyota   | [[year,2000],[price,20000]]
Audi     | [[mpg,22]]

Expected dataframe

Car name | year | price | mpg
Toyota   | 2000 | 20000 | null
Audi     | null | null | 22
  • Is (year, 2000) of type tuple or map? Can you print the schema of the dataframe? Commented Sep 29, 2019 at 9:04
  • Since your question is not related to spark-streaming, I am discarding that tag. Commented Sep 29, 2019 at 9:31

2 Answers


You can try this.

Let's define the data

scala> val carsDF = Seq(("toyota",Array(("year", 2000), ("price", 100000))), ("Audi", Array(("mpg", 22)))).toDF("car", "details")
carsDF: org.apache.spark.sql.DataFrame = [car: string, details: array<struct<_1:string,_2:int>>]

scala> carsDF.show(false)
+------+-----------------------------+
|car   |details                      |
+------+-----------------------------+
|toyota|[[year,2000], [price,100000]]|
|Audi  |[[mpg,22]]                   |
+------+-----------------------------+

Split the array and access the key/value pairs in the data

scala> val weDF = carsDF.withColumn("split", explode($"details")).withColumn("col", $"split"("_1")).withColumn("val", $"split"("_2")).select("car", "col", "val")
weDF: org.apache.spark.sql.DataFrame = [car: string, col: string ... 1 more field]

scala> weDF.show
+------+-----+------+
|   car|  col|   val|
+------+-----+------+
|toyota| year|  2000|
|toyota|price|100000|
|  Audi|  mpg|    22|
+------+-----+------+

Define the list of columns that are required

scala> val colNames = Seq("mpg", "price", "year", "dummy")
colNames: Seq[String] = List(mpg, price, year, dummy)

Pivoting on the column names defined above gives the required output. Supplying the new column names as a sequence makes them a single point of input:

scala> weDF.groupBy("car").pivot("col", colNames).agg(avg($"val")).show
+------+----+--------+------+-----+
|   car| mpg|   price|  year|dummy|
+------+----+--------+------+-----+
|toyota|null|100000.0|2000.0| null|
|  Audi|22.0|    null|  null| null|
+------+----+--------+------+-----+

This seems a more elegant and easier way to achieve the output.
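If the required keys are not known in advance, the hard-coded colNames sequence can instead be collected from the data itself before pivoting. A minimal self-contained sketch of that variation (assuming a local SparkSession; note the extra "dummy" column disappears, since only keys actually present in the data are collected):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, explode}

object DynamicPivot extends App {
  val spark = SparkSession.builder().master("local[*]").appName("dynamic-pivot").getOrCreate()
  import spark.implicits._

  val carsDF = Seq(
    ("toyota", Array(("year", 2000), ("price", 100000))),
    ("Audi",   Array(("mpg", 22)))
  ).toDF("car", "details")

  // One row per (key, value) pair, as in the answer above
  val weDF = carsDF
    .withColumn("split", explode($"details"))
    .select($"car", $"split"("_1").as("col"), $"split"("_2").as("val"))

  // Collect the distinct keys from the data instead of hard-coding them
  val colNames = weDF.select("col").distinct().as[String].collect().toSeq

  weDF.groupBy("car").pivot("col", colNames).agg(avg($"val")).show()

  spark.stop()
}
```

Collecting the keys adds one extra Spark job, which is usually acceptable for a small, bounded set of keys.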



You can do it like this:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit, when}
val df: DataFrame = Seq(
  ("toyota",Array(("year", 2000), ("price", 100000))),
  ("toyota",Array(("year", 2001)))
).toDF("car", "details")

 +------+-------------------------------+
 |car   |details                        |
 +------+-------------------------------+
 |toyota|[[year, 2000], [price, 100000]]|
 |toyota|[[year, 2001]]                 |
 +------+-------------------------------+

val newdf = df
  .withColumn("year", when(col("details")(0)("_1") === lit("year"), col("details")(0)("_2")).otherwise(col("details")(1)("_2")))
  .withColumn("price", when(col("details")(0)("_1") === lit("price"), col("details")(0)("_2")).otherwise(col("details")(1)("_2")))
  .drop("details")

newdf.show()
  +------+----+------+
  |   car|year| price|
  +------+----+------+
  |toyota|2000|100000|     
  |toyota|2001|  null|
  +------+----+------+

4 Comments

How can I do it if the year and price elements are not in order?
You can add a when/otherwise test inside the withColumn to check whether the value is year or price.
Hi @firsni, how can I make it dynamic? I updated the question.
You need to use a Map instead of an array to make it dynamic.
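As the last comment suggests, turning the array of pairs into a map makes the lookup independent of element order and of the array index. A minimal sketch of that idea (assuming Spark 2.4+, where map_from_entries is available; a key missing from a row's map simply yields null):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.map_from_entries

object MapLookup extends App {
  val spark = SparkSession.builder().master("local[*]").appName("map-lookup").getOrCreate()
  import spark.implicits._

  val df = Seq(
    ("toyota", Array(("price", 100000), ("year", 2000))), // order deliberately swapped
    ("toyota", Array(("year", 2001)))                     // price missing entirely
  ).toDF("car", "details")

  // array<struct<_1,_2>> -> map<string,int>, then look keys up by name
  val withMap = df.withColumn("m", map_from_entries($"details"))

  withMap
    .withColumn("year",  $"m"("year"))
    .withColumn("price", $"m"("price"))
    .drop("details", "m")
    .show()

  spark.stop()
}
```

Unlike the index-based when/otherwise version, this needs no extra branch per possible position, so it scales to any number of keys.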
