3

We have two dataframes (note Scala syntax for illustrating),

val df1 = sc.parallelize(1 to 4).map(i => (i,i*10)).toDF("id","x")

val df2 = sc.parallelize(2 to 4).map(i => (i,i*100)).toDF("id","y") 

How to sum up one column from each frame so that we obtain this new dataframe,

+---+---------+
| id| x_plus_y|
+---+---------+
|  1|       10|
|  2|      220|
|  3|      330|
|  4|      440|
+---+---------+

Note Tried this, but it nullifies the first row,

sqlContext.sql("select df1.id, x+y as x_plus_y from df1 left join df2 on df1.id=df2.id").show
+---+--------+
| id|x_plus_y|
+---+--------+
|  1|    null|
|  2|     220|
|  3|     330|
|  4|     440|
+---+--------+

3 Answers 3

4

You don't even need to use an UDF for that :

val df3 = df1.as('a).join(df2.as('b), $"a.id" === $"b.id","left").
               select(df1("id"),'x,'y,(coalesce('x, lit(0)) + coalesce('y, lit(0))).alias("x_plus_y")).na.fill(0)

df3.show
// df3: org.apache.spark.sql.DataFrame = [id: int, x: int, y: int, x_plus_y: int]
// +---+---+---+--------+
// | id|  x|  y|x_plus_y|
// +---+---+---+--------+
// |  1| 10|  0|      10|
// |  2| 20|200|     220|
// |  3| 30|300|     330|
// |  4| 40|400|     440|
// +---+---+---+--------+
Sign up to request clarification or add additional context in comments.

Comments

4
df3 = df1.join(df2, df1.id == df2.id, "left_outer").select(df1.id, df1.x, df2.y).fillna(0)
df3.select("id", (df3.x + df3.y).alias("x_plus_y")).show()

This works in Python.

Comments

0

In Scala noticed this solution,

val d = sqlContext.sql("""
  select df1.id, x, y from df1 left join df2 on df1.id=df2.id""").na.fill(0)

to join the frames and replace non available values with zeroes, and then define this UDF,

import org.apache.spark.sql.functions
import org.apache.spark.sql.functions._

val plus: (Int,Int) => Int = (x:Int,y:Int) => x+y
val plus_udf = udf(plus)

d.withColumn("x_plus_y", plus_udf($"x", $"y")).show
+---+---+---+--------+
| id|  x|  y|x_plus_y|
+---+---+---+--------+
|  1| 10|  0|      10|
|  2| 20|200|     220|
|  3| 30|300|     330|
|  4| 40|400|     440|
+---+---+---+--------+

1 Comment

what if x or y column have null value?

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.