4

Say I have a dataframe like this:

name age city
abc   20  A
def   30  B

I want to add a summary row at the end of the dataframe, so the result looks like this:

name age city
abc   20  A
def   30  B
All   50  All

The string 'All' is easy to put in, but how do I get the sum of the age column? sum(df['age']) fails because a Column object is not iterable.

data = spark.createDataFrame([("abc", 20, "A"), ("def", 30, "B")],["name", "age", "city"])
data.printSchema()
#root
 #|-- name: string (nullable = true)
 #|-- age: long (nullable = true)
 #|-- city: string (nullable = true)
res = data.union(spark.createDataFrame([('All',sum(data['age']),'All')], data.columns))  ## TypeError: Column is not iterable
# data['age'].sum() also raises an error. With a literal row, [('All',50,'All')], the union works fine.

I usually work with Pandas dataframes and am new to Spark, so my understanding of Spark dataframes may not be very mature yet.

Please suggest how to get the sum over a dataframe column in PySpark, and whether there is a better way to append a row to the end of a dataframe. Thanks.

2
  • Currently I am solving the above requirement with "sum_value = int(data.agg({'age':'sum'}).toPandas()['sum(age)'].sum())", i.e. applying agg sum on the column -> converting to a pandas df -> applying the sum function on the column/series. But I don't want to involve Pandas here. Commented Sep 15, 2016 at 17:38
  • That's the way I would choose: df.limit(20).agg(F.sum('count')).show() Commented Sep 21, 2020 at 19:34

2 Answers

15

Spark SQL has a dedicated module for column functions, pyspark.sql.functions.
Here is how to use it:

from pyspark.sql import functions as F
data = spark.createDataFrame([("abc", 20, "A"), ("def", 30, "B")],["name", "age", "city"])

res = data.union(  # union() matches columns by position; unionAll() is the older name for the same operation
    data.select([
        F.lit('All').alias('name'), # create a column named 'name' filled with 'All'
        F.sum(data.age).alias('age'), # get the sum of 'age'
        F.lit('All').alias('city') # create a column named 'city' filled with 'All'
    ]))
res.show()

Prints:

+----+---+----+
|name|age|city|
+----+---+----+
| abc| 20|   A|
| def| 30|   B|
| All| 50| All|
+----+---+----+


4

A dataframe is immutable, so you need to create a new one. To get the sum of the age column, you can go through the underlying RDD: data.rdd.map(lambda x: float(x["age"])).reduce(lambda x, y: x+y)

The way you add a row is fine, but why would you do such a thing? Your dataframe will be hard to manipulate, and you won't be able to use aggregation functions unless you drop the last line.

2 Comments

@GwydionFR - Actually, the dataframe above is the final dataframe for a report, and I intended to add a summary as its last line. So I am not going to do anything further with that result df. Thanks for the answer.
Noted your advice. Thanks.
