
I'm looking for a way to do this without a UDF and am wondering if it's possible. Let's say I have a DataFrame as follows:

Buyer_name  Buyer_state  CoBuyer_name  CoBuyer_state  Price  Date
Bob         CA           Joe           CA             20     010119
Stacy       IL           Jamie         IL             50     020419
... about 3 million more rows ...

And I want to turn it to:

Buyer_name Buyer_state Price Date
Bob        CA          20    010119
Joe        CA          20    010119
Stacy      IL          50    020419
Jamie      IL          50    020419
...

Edit: I could also (roughly sketched below):

1. Create two dataframes, removing the "Buyer" columns from one and the "CoBuyer" columns from the other.

2. Rename the "CoBuyer" columns to the "Buyer" column names.

3. Concatenate both dataframes.
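
A rough sketch of that three-step approach, assuming a DataFrame df with the columns shown above (withColumnRenamed and union are standard DataFrame methods):

// Keep only the buyer columns
val buyers = df.select("Buyer_name", "Buyer_state", "Price", "Date")

// Keep only the co-buyer columns and rename them to match the buyer columns
val coBuyers = df.
  select("CoBuyer_name", "CoBuyer_state", "Price", "Date").
  withColumnRenamed("CoBuyer_name", "Buyer_name").
  withColumnRenamed("CoBuyer_state", "Buyer_state")

// Concatenate the two dataframes
val combined = buyers.union(coBuyers)
combined.show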

2 Answers


You can combine struct(Buyer_name, Buyer_state) and struct(CoBuyer_name, CoBuyer_state) into an array column, which is then expanded using explode, as shown below:

import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  ("Bob", "CA", "Joe", "CA", 20, "010119"),
  ("Stacy", "IL", "Jamie", "IL", 50, "020419")
).toDF("Buyer_name", "Buyer_state", "CoBuyer_name", "CoBuyer_state", "Price", "Date")

df.
  // Pack the buyer and co-buyer (name, state) pairs into a single array column
  withColumn("Buyers", array(
    struct($"Buyer_name".as("_1"), $"Buyer_state".as("_2")),
    struct($"CoBuyer_name".as("_1"), $"CoBuyer_state".as("_2"))
  )).
  // Explode the array so each buyer/co-buyer becomes its own row
  withColumn("Buyer", explode($"Buyers")).
  // Unpack the struct fields back into named columns
  select(
    $"Buyer._1".as("Buyer_name"), $"Buyer._2".as("Buyer_state"), $"Price", $"Date"
  ).show
// +----------+-----------+-----+------+
// |Buyer_name|Buyer_state|Price|  Date|
// +----------+-----------+-----+------+
// |       Bob|         CA|   20|010119|
// |       Joe|         CA|   20|010119|
// |     Stacy|         IL|   50|020419|
// |     Jamie|         IL|   50|020419|
// +----------+-----------+-----+------+

1 Comment

Great answer, Leo. I have added an alternative approach.

This sounds like an unpivot operation to me, which can be accomplished with the union method on DataFrames:

val df = Seq(
  ("Bob", "CA", "Joe", "CA", 20, "010119"),
  ("Stacy", "IL", "Jamie", "IL", 50, "020419")
).toDF("Buyer_name", "Buyer_state", "CoBuyer_name", "CoBuyer_state", "Price", "Date")

// Stack the co-buyer columns under the buyer columns (union matches columns by position)
val df_new = df.select("Buyer_name", "Buyer_state", "Price", "Date").
  union(df.select("CoBuyer_name", "CoBuyer_state", "Price", "Date"))

df_new.show

Thanks to Leo for providing the dataframe definition which I've re-used.
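
One thing worth noting about this approach: union matches columns by position rather than by name, so the two select calls must list their columns in the same order. If you'd rather match by name, rename the CoBuyer columns to the Buyer column names first and use unionByName instead.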

