Suppose I have the following DataFrame:

 id | col1 | col2
----+------+------
 x  |  p1  |  a1
 x  |  p2  |  b1
 y  |  p2  |  b2
 y  |  p2  |  b3
 y  |  p3  |  c1
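
For reference, the input DataFrame can be built like this (a minimal sketch, assuming a SparkSession named spark is in scope):

import spark.implicits._

val df = Seq(
  ("x", "p1", "a1"),
  ("x", "p2", "b1"),
  ("y", "p2", "b2"),
  ("y", "p2", "b3"),
  ("y", "p3", "c1")
).toDF("id", "col1", "col2")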

The distinct values from col1 (p1, p2, p3), along with id, will be used as the columns of the final DataFrame. Here, the id y has two col2 values (b2 and b3) for the same col1 value p2, so p2 will be treated as an array-type column. Therefore, the final DataFrame will be

 id |  p1  |    p2    |  p3
----+------+----------+------
 x  |  a1  |   [b1]   | null
 y  | null | [b2, b3] |  c1

How can I efficiently obtain the second DataFrame from the first?

1 Answer

You are basically looking for table pivoting; for your case, group by id, pivot col1 into headers, and aggregate col2 into lists with the collect_list function:

df.groupBy("id").pivot("col1").agg(collect_list("col2")).show
+---+----+--------+----+
| id|  p1|      p2|  p3|
+---+----+--------+----+
|  x|[a1]|    [b1]|  []|
|  y|  []|[b2, b3]|[c1]|
+---+----+--------+----+

If it's guaranteed that there is at most one value in p1 and p3 for each id, you can convert those columns to string type by taking the first element of each array:

df.groupBy("id").pivot("col1").agg(collect_list("col2"))
  .withColumn("p1", $"p1"(0)).withColumn("p3", $"p3"(0))
  .show
+---+----+--------+----+
| id|  p1|      p2|  p3|
+---+----+--------+----+
|  x|  a1|    [b1]|null|
|  y|null|[b2, b3]|  c1|
+---+----+--------+----+
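
On Spark 2.4+, element_at is an equivalent way to extract the single element; note that it uses a 1-based index and, with ANSI mode off (the default), returns null for an empty array:

import org.apache.spark.sql.functions.{collect_list, element_at}

df.groupBy("id").pivot("col1").agg(collect_list("col2"))
  .withColumn("p1", element_at($"p1", 1))
  .withColumn("p3", element_at($"p3", 1))
  .show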

If you need to decide the column types dynamically, i.e. use an array type only for the columns that actually need it:

import org.apache.spark.sql.functions.{col, count}

// get the array-type columns
val arrayColumns = df.groupBy("id", "col1").agg(count("*").as("N"))
    .where($"N" > 1).select("col1").distinct.collect.map(row => row.getString(0))
// arrayColumns: Array[String] = Array(p2)

// aggregate / pivot data frame
val aggDf = df.groupBy("id").pivot("col1").agg(collect_list("col2"))
// aggDf: org.apache.spark.sql.DataFrame = [id: string, p1: array<string> ... 2 more fields]

// get string columns
val stringColumns = aggDf.columns.filter(x => x != "id" && !arrayColumns.contains(x))

// use foldLeft on string columns to convert the columns to string type
stringColumns.foldLeft(aggDf)((df, x) => df.withColumn(x, col(x)(0))).show
+---+----+--------+----+
| id|  p1|      p2|  p3|
+---+----+--------+----+
|  x|  a1|    [b1]|null|
|  y|null|[b2, b3]|  c1|
+---+----+--------+----+
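
If you need this in more than one place, the steps above can be wrapped in a small helper. This is only a sketch under the same assumptions; the name pivotDynamic is illustrative:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, collect_list, count}

// pivot keyCol into headers; keep a pivoted column as array<string> only
// when some id has more than one value for it, otherwise unwrap the
// single element into a plain string column
def pivotDynamic(df: DataFrame, idCol: String, keyCol: String, valCol: String): DataFrame = {
  val arrayColumns = df.groupBy(idCol, keyCol).agg(count("*").as("N"))
    .where(col("N") > 1).select(keyCol).distinct.collect.map(_.getString(0))
  val aggDf = df.groupBy(idCol).pivot(keyCol).agg(collect_list(valCol))
  val stringColumns = aggDf.columns.filter(x => x != idCol && !arrayColumns.contains(x))
  stringColumns.foldLeft(aggDf)((acc, x) => acc.withColumn(x, col(x)(0)))
}

pivotDynamic(df, "id", "col1", "col2").show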

Comments

I want p1 and p3 to be string type, not array type.
I'm not really sure that makes sense unless you are certain there is at most one value in p1 and p3 for each id. In that case, you can extract the first element from p1 and p3: df.groupBy("id").pivot("col1").agg(collect_list("col2")).withColumn("p1", $"p1"(0)).withColumn("p3", $"p3"(0)).show should give what you need.
The thing is, the distinct values from col1 are not fixed; there can be any number of them.
In that case I would keep all the pivoted columns as array type instead of string type. At the end of the day, if you are not even sure how many columns you have, how do you know what type each of them should be?
I can find the names of the array-type columns from the first DataFrame using SELECT DISTINCT col1 FROM (SELECT id, col1, COUNT(*) AS rowCount FROM df GROUP BY id, col1 HAVING rowCount > 1) t. The main problem is creating the final DataFrame dynamically from the names of those array-type columns.
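
For reference, that detection query runs as-is through Spark SQL; a sketch, assuming df has been registered as a temporary view (the view name is illustrative):

df.createOrReplaceTempView("df")

val arrayColumns = spark.sql(
  """SELECT DISTINCT col1
    |FROM (
    |  SELECT id, col1, COUNT(*) AS rowCount
    |  FROM df
    |  GROUP BY id, col1
    |  HAVING rowCount > 1
    |) t""".stripMargin
).collect.map(_.getString(0))
// arrayColumns: Array[String] = Array(p2)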
