I have a DataFrame:

+------------+------------+-------------+
|          id|     column1|      column2|
+------------+------------+-------------+
|           1|           1|            5|
|           1|           2|            5|
|           1|           3|            5|
|           2|           1|           15|
|           2|           2|            5|
|           2|           6|            5|
+------------+------------+-------------+

How do I get the maximum value of column1 and the sum of the values in column2 for each id, to get this result:

+------------+------------+-------------+
|          id|     column1|      column2|
+------------+------------+-------------+
|           1|           3|           15|
|           2|           6|           25|
+------------+------------+-------------+

3 Answers

Use .groupBy followed by agg with max("column1") and sum("column2") for this case:

#sample data
df = spark.createDataFrame(
    [(1, 1, 5), (1, 2, 5), (1, 3, 5), (2, 1, 15), (2, 2, 5), (2, 6, 5)],
    ["id", "column1", "column2"],
)

#import the aggregate functions under a namespace so the Python
#built-ins max and sum are not shadowed by a star import
from pyspark.sql import functions as F

#group by id, take the max of column1 and the sum of column2,
#and alias the aggregates back to the original column names
df.groupBy("id") \
  .agg(F.max("column1").alias("column1"), F.sum("column2").alias("column2")) \
  .show()
#+---+-------+-------+
#| id|column1|column2|
#+---+-------+-------+
#|  1|      3|     15|
#|  2|      6|     25|
#+---+-------+-------+
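
If the exact output column names don't matter, the dictionary form of agg is a more compact alternative. A minimal sketch, reusing the df defined above; note that the result columns come back named max(column1) and sum(column2):

#dict form: map each input column to the name of an aggregate function
df.groupBy("id").agg({"column1": "max", "column2": "sum"}).show()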

All you need is groupBy to group the rows by id, plus the aggregate functions max and sum applied through agg.

The functions come from the org.apache.spark.sql.functions package.

import spark.implicits._
import org.apache.spark.sql.functions._

val input = Seq(
  (1, 1, 5),
  (1, 2, 5),
  (1, 3, 5),
  (2, 1, 15),
  (2, 2, 5),
  (2, 6, 5)
).toDF("id", "col1", "col2")

// group by id, then compute the max of col1 and the sum of col2
val result = input
  .groupBy("id")
  .agg(max(col("col1")), sum(col("col2")))

result.show()
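
Without aliases, the aggregated columns come back named max(col1) and sum(col2), exactly as in the SQL answer below; chain .as("...") onto each aggregate expression if you want friendlier names.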


If you are familiar with SQL, below is the SQL version using group by together with the max and sum functions.

import spark.implicits._

val input = Seq(
  (1, 1, 5),
  (1, 2, 5),
  (1, 3, 5),
  (2, 1, 15),
  (2, 2, 5),
  (2, 6, 5)
).toDF("id", "col1", "col2")

// register the DataFrame as a temporary view so it can be queried with SQL
input.createTempView("mytable")

spark.sql("select id, max(col1), sum(col2) from mytable group by id").show()

Result:

+---+---------+---------+
| id|max(col1)|sum(col2)|
+---+---------+---------+
|  1|        3|       15|
|  2|        6|       25|
+---+---------+---------+
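
The same query runs unchanged from PySpark. A minimal sketch, assuming the sample data from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 1, 5), (1, 2, 5), (1, 3, 5), (2, 1, 15), (2, 2, 5), (2, 6, 5)],
    ["id", "col1", "col2"],
)
#createOrReplaceTempView avoids "view already exists" errors on re-runs
df.createOrReplaceTempView("mytable")

spark.sql("select id, max(col1), sum(col2) from mytable group by id").show()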
