To keep it simple, let's assume we have a DataFrame containing the following data:

+----------+---------+----------+----------+
|firstName |lastName |Phone     |Address   |
+----------+---------+----------+----------+
|firstName1|lastName1|info1     |info2     |
|firstName1|lastName1|myInfo1   |dummyInfo2|
|firstName1|lastName1|dummyInfo1|myInfo2   |
+----------+---------+----------+----------+

How can I merge all rows, grouping by (firstName, lastName), and keep in the Phone and Address columns only the data starting with "my", to get the following:

+----------+---------+----------+----------+
|firstName |lastName |Phone     |Address   |
+----------+---------+----------+----------+
|firstName1|lastName1|myInfo1   |myInfo2   |
+----------+---------+----------+----------+

Should I perhaps use the agg function with a custom UDAF? And if so, how would I implement it?

Note: I'm using Spark 2.2 along with Scala 2.11.

2 Answers

You can use groupBy with the collect_set aggregation function, then a udf function to pick the first string that starts with "my":

import org.apache.spark.sql.functions._

// picks the first collected value that starts with "my"
val myudf = udf((array: Seq[String]) => array.filter(_.startsWith("my")).head)

df.groupBy("firstName", "lastName")
  .agg(myudf(collect_set("Phone")).as("Phone"), myudf(collect_set("Address")).as("Address"))
  .show(false)

which should give you

+----------+---------+-------+-------+
|firstName |lastName |Phone  |Address|
+----------+---------+-------+-------+
|firstName1|lastName1|myInfo1|myInfo2|
+----------+---------+-------+-------+
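
Note that .head throws a NoSuchElementException when a group has no value starting with "my". A minimal sketch of a safer variant (the name myudfSafe is mine, not part of the answer), assuming a null result is acceptable in that case:

// returns null instead of throwing when nothing starts with "my"
val myudfSafe = udf((array: Seq[String]) => array.find(_.startsWith("my")).orNull)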

I hope the answer is helpful


If only two columns are involved, filtering and a join can be used instead of a UDF:

import spark.implicits._ // for toDF and the $"col" syntax (spark is your SparkSession)

val df = List(
  ("firstName1", "lastName1", "info1", "info2"),
  ("firstName1", "lastName1", "myInfo1", "dummyInfo2"),
  ("firstName1", "lastName1", "dummyInfo1", "myInfo2")
).toDF("firstName", "lastName", "Phone", "Address")

// keep only the rows whose Phone / Address starts with "my"
val myPhonesDF = df.filter($"Phone".startsWith("my"))
val myAddressDF = df.filter($"Address".startsWith("my"))

// join the two filtered sets back together on the name key
val result = myPhonesDF.alias("Phones")
  .join(myAddressDF.alias("Addresses"), Seq("firstName", "lastName"))
  .select("firstName", "lastName", "Phones.Phone", "Addresses.Address")
result.show(false)

Output:

+----------+---------+-------+-------+
|firstName |lastName |Phone  |Address|
+----------+---------+-------+-------+
|firstName1|lastName1|myInfo1|myInfo2|
+----------+---------+-------+-------+
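
Note that this is an inner join, so a (firstName, lastName) pair that has no "my" Phone or no "my" Address is dropped entirely. A sketch of an outer-join variant that keeps such pairs with nulls instead:

val resultOuter = myPhonesDF.alias("Phones")
  .join(myAddressDF.alias("Addresses"), Seq("firstName", "lastName"), "full_outer")
  .select("firstName", "lastName", "Phones.Phone", "Addresses.Address")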

For many columns, when only one row per group is expected, a construction like this can be used:

val columnsForSearch = List("Phone", "Address")
// per column: keep the value that starts with "my" (non-matching values become null and are ignored by min)
val minExpressions = columnsForSearch.map(c => min(when(col(c).startsWith("my"), col(c)).otherwise(null)).alias(c))
df.groupBy("firstName", "lastName").agg(minExpressions.head, minExpressions.tail: _*)

The output is the same.
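
The column list doesn't have to be written by hand; a minimal sketch (keyColumns and searchColumns are my names), assuming every column except the grouping keys should be scanned:

val keyColumns = Seq("firstName", "lastName")
val searchColumns = df.columns.filterNot(keyColumns.contains).toList
val expressions = searchColumns.map(c => min(when(col(c).startsWith("my"), col(c))).alias(c))
df.groupBy(keyColumns.map(col): _*).agg(expressions.head, expressions.tail: _*)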

An example of a UDF with two parameters:

// concatenates two column values into a single string
val twoParamFunc = (firstName: String, phone: String) => firstName + ": " + phone
val twoParamUDF = udf(twoParamFunc)
df.select(twoParamUDF($"firstName", $"Phone")).show(false)

3 Comments

It was just an example to keep things simple, but in reality my DataFrame contains more than 40 columns ... But thank you anyway
Here we apply the min function to get only one value. What if I have multiple values for the same column starting with "my", and I'm supposed to throw an exception or apply some logic to choose which one to keep? (The logic is the same for all the columns.)
For custom choice logic a UDF is required, if "min" or similar ("max", etc.) are not applicable.
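
A minimal sketch of such a UDF (strictPick is a hypothetical name), assuming the rule is to throw when more than one distinct value in a group starts with "my":

// fails on ambiguity, returns null when nothing matches
val strictPick = udf { (values: Seq[String]) =>
  val matches = values.filter(_.startsWith("my"))
  if (matches.size > 1)
    throw new IllegalArgumentException(s"Ambiguous values: ${matches.mkString(", ")}")
  matches.headOption.orNull
}

df.groupBy("firstName", "lastName")
  .agg(strictPick(collect_set("Phone")).as("Phone"), strictPick(collect_set("Address")).as("Address"))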
