I have a DataFrame:

+------------+------------+-------------+
|          id|     column1|      column2|
+------------+------------+-------------+
|           1|           1|            5|
|           1|           2|            5|
|           1|           3|            5|
|           2|           1|           15|
|           2|           2|            5|
|           2|           6|            5|
+------------+------------+-------------+

How do I get the maximum value of column1 and the sum of the values in column2 for each id, to get this result:

+------------+------------+-------------+
|          id|     column1|      column2|
+------------+------------+-------------+
|           1|           3|           15|
|           2|           6|           25|
+------------+------------+-------------+

3 Answers

Use .groupBy followed by agg with max("column1") and sum("column2") for this case:

#sample data
df = spark.createDataFrame(
    [(1, 1, 5), (1, 2, 5), (1, 3, 5), (2, 1, 15), (2, 2, 5), (2, 6, 5)],
    ["id", "column1", "column2"],
)

#import the aggregate functions under a namespace so the Python
#built-ins max and sum are not shadowed by a star import
from pyspark.sql import functions as F

#group by id, take the max of column1 and the sum of column2,
#and alias the aggregates back to the original column names
df.groupBy("id") \
  .agg(F.max("column1").alias("column1"), F.sum("column2").alias("column2")) \
  .show()
#+---+-------+-------+
#| id|column1|column2|
#+---+-------+-------+
#|  1|      3|     15|
#|  2|      6|     25|
#+---+-------+-------+
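
If the exact output column names don't matter, the dictionary form of agg is a more compact alternative. A minimal sketch, reusing the df defined above; note that the result columns come back named max(column1) and sum(column2):

#dict form: map each input column to the name of an aggregate function
df.groupBy("id").agg({"column1": "max", "column2": "sum"}).show()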

All you need is groupBy to group the rows by id, plus the aggregate functions max and sum applied through agg.

The functions come from the org.apache.spark.sql.functions package.

import spark.implicits._
import org.apache.spark.sql.functions._

val input = Seq(
  (1, 1, 5),
  (1, 2, 5),
  (1, 3, 5),
  (2, 1, 15),
  (2, 2, 5),
  (2, 6, 5)
).toDF("id", "col1", "col2")

// group by id, then compute the max of col1 and the sum of col2
val result = input
  .groupBy("id")
  .agg(max(col("col1")), sum(col("col2")))

result.show()
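
Without aliases, the aggregated columns come back named max(col1) and sum(col2), exactly as in the SQL answer below; chain .as("...") onto each aggregate expression if you want friendlier names.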


If you are familiar with SQL, below is the SQL version using group by together with the max and sum functions.

import spark.implicits._

val input = Seq(
  (1, 1, 5),
  (1, 2, 5),
  (1, 3, 5),
  (2, 1, 15),
  (2, 2, 5),
  (2, 6, 5)
).toDF("id", "col1", "col2")

// register the DataFrame as a temporary view so it can be queried with SQL
input.createTempView("mytable")

spark.sql("select id, max(col1), sum(col2) from mytable group by id").show()

Result:

+---+---------+---------+
| id|max(col1)|sum(col2)|
+---+---------+---------+
|  1|        3|       15|
|  2|        6|       25|
+---+---------+---------+
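
The same query runs unchanged from PySpark. A minimal sketch, assuming the sample data from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 1, 5), (1, 2, 5), (1, 3, 5), (2, 1, 15), (2, 2, 5), (2, 6, 5)],
    ["id", "col1", "col2"],
)
#createOrReplaceTempView avoids "view already exists" errors on re-runs
df.createOrReplaceTempView("mytable")

spark.sql("select id, max(col1), sum(col2) from mytable group by id").show()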
