16

How can I sum multiple columns in Spark? For example, in SparkR the following code works to get the sum of one column, but if I try to get the sum of both columns in df, I get an error.

# Create SparkDataFrame
df <- createDataFrame(faithful)

# Use agg to sum total waiting times
head(agg(df, totalWaiting = sum(df$waiting)))
##This works

# Use agg to sum total of waiting and eruptions
head(agg(df, total = sum(df$waiting, df$eruptions)))
##This doesn't work

Either SparkR or PySpark code will work.

7 Answers

62

For PySpark, if you don't want to explicitly type out the columns:

from operator import add
from functools import reduce
from pyspark.sql import functions as F

# numeric_col_list holds the names of the columns to add together
new_df = df.withColumn('total', reduce(add, [F.col(x) for x in numeric_col_list]))

3 Comments

Why is this not in the Spark API?
That is a useful technique, and surely will help many people who google this question, but not what the original question asked about :) (it asked about an aggregation, not a row operation)
The original question was confusing aggregation (summing rows) with calculated fields (in this case summing columns).
9

You can do something like the below in PySpark:

>>> from pyspark.sql import functions as F
>>> df = spark.createDataFrame([("a",1,10), ("b",2,20), ("c",3,30), ("d",4,40)], ["col1", "col2", "col3"])
>>> df.groupBy("col1").agg(F.sum(df.col2+df.col3)).show()
+----+------------------+
|col1|sum((col2 + col3))|
+----+------------------+
|   d|                44|
|   c|                33|
|   b|                22|
|   a|                11|
+----+------------------+

1 Comment

Yes but... 1 + NULL = NULL.
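
If that matters for your data, one workaround is to treat NULL as zero before adding, for example with coalesce. A minimal sketch against the example DataFrame above:

from pyspark.sql import functions as F

# Null-safe variant of the aggregation above: col2 + col3 is NULL whenever
# either operand is NULL, so replace missing values with 0 first.
df.groupBy("col1").agg(
    F.sum(F.coalesce(df.col2, F.lit(0)) + F.coalesce(df.col3, F.lit(0))).alias("total")
).show()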
5
org.apache.spark.sql.functions.sum(Column e)

Aggregate function: returns the sum of all values in the expression.

As you can see, sum takes just one column as input, so sum(df$waiting, df$eruptions) won't work. Since you want to sum the numeric fields together, you can do sum(df("waiting") + df("eruptions")). If you instead want to sum up the values of the individual columns, you can do df.agg(sum(df$waiting), sum(df$eruptions)).show().
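
In PySpark terms, the two options described above look roughly like this; a sketch that assumes df holds the question's waiting and eruptions columns:

from pyspark.sql import functions as F

# One grand total: add the columns row by row, then aggregate the single expression
df.agg(F.sum(F.col("waiting") + F.col("eruptions")).alias("total")).show()

# Separate totals: pass one sum() per column to agg()
df.agg(F.sum("waiting").alias("total_waiting"),
       F.sum("eruptions").alias("total_eruptions")).show()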

3 Comments

For me, this one worked: df.withColumn("newCol", col("col1") + col("col2"))
@Ali yes that is also an alternative.
The original question as I understood it is about aggregation: summing columns "vertically" (for each column, sum all the rows), not a row operation: summing rows "horizontally" (for each row, sum the values in columns on that row).
4

You can use expr():

import pyspark.sql.functions as f

numeric_cols = ['col_a','col_b','col_c']
df = df.withColumn('total', f.expr('+'.join(numeric_cols)))

PySpark's expr() parses and executes SQL-like expression strings.
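
To make that concrete, the join simply builds one SQL arithmetic string for expr() to parse; a small sketch with the hypothetical columns above:

import pyspark.sql.functions as f

numeric_cols = ['col_a', 'col_b', 'col_c']
# '+'.join(numeric_cols) evaluates to the string 'col_a+col_b+col_c'
df = df.withColumn('total', f.expr('col_a+col_b+col_c'))  # same as the join above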

1 Comment

it should be df = df.withColumn('total', f.expr('+'.join(numeric_cols)))
3

SparkR code:

library(SparkR)
df <- createDataFrame(sqlContext, faithful)

# Compute each total as its own SparkDataFrame, then pull the results locally
w <- list(agg(df, sum(df$waiting)), agg(df, sum(df$eruptions)))
head(w[[1]])
head(w[[2]])


1

The accepted answer was helpful for me, but I found the one below simpler, and it does not use an external API.

from pyspark.sql.functions import col, lit

# col_list holds the names of the columns to add together
sum_df = df.withColumn('total', lit(0))
for c in col_list:
    sum_df = sum_df.withColumn('total', col('total') + col(c))


0

You can do

from pyspark.sql.types import StructType, StructField, FloatType
from pyspark.sql import functions as F

schema = StructType([
    StructField("col1", FloatType(), True),
    StructField("col2", FloatType(), True),
    StructField("col3", FloatType(), True),
    StructField("col4", FloatType(), True),
    StructField("col5", FloatType(), True)
])

data = [
    (1.1, 2.2, 3.3, 4.4, 5.5),
    (6.6, 7.7, 8.8, 9.9, 10.0)
]
df = spark.createDataFrame(data, schema)
numeric_col_list = df.columns

df = df.withColumn(
            'total',
            sum(
                [F.col(x) for x in numeric_col_list], 
                start=F.lit(0)
            )
        )
df.show()

or simply

sum(F.col(x) for x in numeric_col_list)
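
A brief usage sketch of that shorter form, assuming the same df and numeric_col_list as above; it works because Python's built-in sum() starts from 0, and 0 + Column still yields a Column:

import pyspark.sql.functions as F

# Same result as the start=F.lit(0) version above
df = df.withColumn('total', sum(F.col(x) for x in numeric_col_list))
df.show()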

