python, pyspark : get sum of a pyspark dataframe column values

Question

Say I have a DataFrame like this:

name age city
abc   20  A
def   30  B

I want to add a summary row at the end of the DataFrame, so the result will look like this:

name age city
abc   20  A
def   30  B
All   50  All

I can easily fill in the string 'All', but how do I get the sum of the age column? sum(df['age']) fails because a Column object is not iterable.

data = spark.createDataFrame([("abc", 20, "A"), ("def", 30, "B")],["name", "age", "city"])
data.printSchema()
#root
 #|-- name: string (nullable = true)
 #|-- age: long (nullable = true)
 #|-- city: string (nullable = true)
res = data.union(spark.createDataFrame([('All',sum(data['age']),'All')], data.columns))  ## TypeError: Column is not iterable
# I also tried data['age'].sum() and got an error. Using [('All', 50, 'All')] works fine.

I usually work with pandas DataFrames and am new to Spark, so my understanding of Spark DataFrames may not be that mature yet.

Please suggest how to get the sum over a DataFrame column in PySpark, and whether there is a better way to append a row to the end of a DataFrame. Thanks.

Currently I am solving the above requirement with sum_value = int(data.agg({'age':'sum'}).toPandas()['sum(age)'].sum()), i.e. applying an agg sum on the column, converting the result to a pandas DataFrame, and calling sum on the resulting Series. But I don't want to involve pandas here.
– Satya, Commented Sep 15, 2016 at 17:38
That's the way I would choose: df.limit(20).agg(F.sum('count')).show()
– pabloverd, Commented Sep 21, 2020 at 19:34

swenzel · Accepted Answer · 2016-09-16 12:19:00Z

15

Spark SQL has a dedicated module for column functions, pyspark.sql.functions.
The way it works is:

from pyspark.sql import functions as F
data = spark.createDataFrame([("abc", 20, "A"), ("def", 30, "B")],["name", "age", "city"])

# union() is used here because unionAll() was deprecated in Spark 2.0
res = data.union(
    data.select([
        F.lit('All').alias('name'), # create a column named 'name' and filled with 'All'
        F.sum(data.age).alias('age'), # get the sum of 'age'
        F.lit('All').alias('city') # create a column named 'city' and filled with 'All'
    ]))
res.show()

Prints:

+----+---+----+
|name|age|city|
+----+---+----+
| abc| 20|   A|
| def| 30|   B|
| All| 50| All|
+----+---+----+

GwydionFR · Accepted Answer · 2016-09-16 11:54:01Z

4

A DataFrame is immutable, so you need to create a new one. To get the sum of the age column, you can use this expression: data.rdd.map(lambda x: float(x["age"])).reduce(lambda x, y: x+y)

The way you add a row is fine, but why would you do such a thing? Your DataFrame will be hard to manipulate, and you won't be able to use aggregation functions unless you drop the last line.

2 Comments

Satya Over a year ago

@GwydionFR Actually, the DataFrame above is the final DataFrame for a report, and I intended to add a summary as its last line, so I am not going to do anything else with that result df. Thanks for the answer.

Satya Over a year ago

Noted your advice. Thanks.
