
I'm using pyspark, and I have data like this:

col1 col2 col3
1 0 1
1 1 0
1 1 0
1 0 0

My desired output is:

col sum
col1 4
col2 2
col3 1

My first thought was to put the column names in a list, loop through it, and each time sum that column and union the results to a new df.
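
Roughly like this (untested sketch):

from pyspark.sql import functions as F

result = None
for c in df.columns:
    # one aggregation per column, each tagged with the column name
    summed = df.agg(F.sum(c).alias("sum")).select(F.lit(c).alias("col"), "sum")
    result = summed if result is None else result.union(summed)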

Then I thought, maybe it's possible to do multiple aggregations, e.g.:

df.agg(sum('col1'), sum('col2'))

... and then figure out a way to unpivot.

Is there an easier way?

3 Answers


There is no easier way as far as I know. You can unpivot after aggregating, either by converting to a Pandas dataframe and calling transpose on it, or by creating a map and then exploding the map to get the result as col and sum columns.

# Assuming the initial dataframe is df
from pyspark.sql import functions as F

aggDF = df.agg(*[F.sum(F.col(col_name)).alias(col_name) for col_name in df.columns])

# Using pandas
aggDF.toPandas().transpose().reset_index().rename({'index' : 'col', 0: 'sum'}, axis=1)

# Going Spark all the way: build a map of {column name -> sum}, then explode it
aggDF.withColumn(
    "col",
    F.create_map([e for c in aggDF.columns for e in (F.lit(c), F.col(c))])
).selectExpr("explode(col) as (col, sum)").show()

# Both return
"""
+----+---+
| col|sum|
+----+---+
|col1|  4|
|col2|  2|
|col3|  1|
+----+---+
"""

This works for more than 3 columns, if required.




You can use the stack SQL function to unpivot a dataframe, as described here. With input as your input dataframe, your code would become:

from pyspark.sql import functions as F

output = input.agg(
  F.sum("col1").alias("col1"),
  F.sum("col2").alias("col2"),
  F.sum("col3").alias("col3")
).select(
  F.expr("stack(3, 'col1', col1, 'col2', col2, 'col3', col3) as (col,sum)")
)
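
If the column list is long or not known ahead of time, the stack expression can also be built dynamically. A sketch, assuming every column being summed is numeric:

cols = input.columns
stack_expr = "stack({}, {}) as (col, sum)".format(
    len(cols), ", ".join("'{0}', {0}".format(c) for c in cols)
)
output = input.agg(*[F.sum(c).alias(c) for c in cols]).select(F.expr(stack_expr))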

If you have the following input dataframe:

+----+----+----+
|col1|col2|col3|
+----+----+----+
|1   |0   |1   |
|1   |1   |0   |
|1   |1   |0   |
|1   |0   |0   |
+----+----+----+

You will get the following output dataframe:

+----+---+
|col |sum|
+----+---+
|col1|4  |
|col2|2  |
|col3|1  |
+----+---+
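
As a side note, if you are on Spark 3.4 or later, DataFrame.unpivot (also available as melt) does the same unpivot without a hand-written stack expression. Roughly:

agg = input.agg(*[F.sum(c).alias(c) for c in input.columns])
# no id columns; unpivot every aggregated column into (col, sum) pairs
output = agg.unpivot([], agg.columns, "col", "sum")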



You can first sum each column:

// input (in spark-shell; in a standalone app, toDF needs spark.implicits._ in scope)
import org.apache.spark.sql.functions.sum

val df = List((1,0,1),(1,1,0),(1,1,0),(1,0,0)).toDF("col1", "col2", "col3")
df.show

// sum each column
val sums = df.agg(sum("col1").as("col1"), sum("col2").as("col2"),
  sum("col3").as("col3"))
sums.show

+----+----+----+
|col1|col2|col3|
+----+----+----+
|   4|   2|   1|
+----+----+----+

This gives you a Dataset with one row and three columns, which you can easily collect. If that's what you want, you can then build a new dataset:

val sumRow = sums.first

val sumDS = List(
  "col1" -> sumRow.getAs[Long]("col1"),
  "col2" -> sumRow.getAs[Long]("col2"),
  "col3" -> sumRow.getAs[Long]("col3")
).toDF("col", "sum")
sumDS.show

+----+---+
| col|sum|
+----+---+
|col1|  4|
|col2|  2|
|col3|  1|
+----+---+
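
Since the question is about PySpark, the same collect-then-rebuild approach translates roughly as follows (a sketch; assumes the data is in a PySpark dataframe df and spark is an active SparkSession):

from pyspark.sql import functions as F

sums = df.agg(*[F.sum(c).alias(c) for c in df.columns])
row = sums.first()
sumDF = spark.createDataFrame([(c, row[c]) for c in df.columns], ["col", "sum"])
sumDF.show()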

