48

I have a Dataframe that I read from a CSV file with many columns like: timestamp, steps, heartrate etc.

I want to sum the values of each column, for instance the total number of steps on "steps" column.

As far as I see I want to use these kind of functions: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$

But I can understand how to use the function sum.

When I write the following:

val df = CSV.load(args(0))
val sumSteps = df.sum("steps") 

the function sum cannot be resolved.

Do I use the function sum wrongly? Do Ι need to use first the function map? and if yes how?

A simple example would be very helpful! I started writing Scala recently.

5 Answers 5

116

You must first import the functions:

import org.apache.spark.sql.functions._

Then you can use them like this:

val df = CSV.load(args(0))
val sumSteps =  df.agg(sum("steps")).first.get(0)

You can also cast the result if needed:

val sumSteps: Long = df.agg(sum("steps").cast("long")).first.getLong(0)

Edit:

For multiple columns (e.g. "col1", "col2", ...), you could get all aggregations at once:

val sums = df.agg(sum("col1").as("sum_col1"), sum("col2").as("sum_col2"), ...).first

Edit2:

For dynamically applying the aggregations, the following options are available:

  • Applying to all numeric columns at once:
df.groupBy().sum()
  • Applying to a list of numeric column names:
val columnNames = List("col1", "col2")
df.groupBy().sum(columnNames: _*)
  • Applying to a list of numeric column names with aliases and/or casts:
val cols = List("col1", "col2")
val sums = cols.map(colName => sum(colName).cast("double").as("sum_" + colName))
df.groupBy().agg(sums.head, sums.tail:_*).show()
Sign up to request clarification or add additional context in comments.

Comments

27

If you want to sum all values of one column, it's more efficient to use DataFrame's internal RDD and reduce.

import sqlContext.implicits._
import org.apache.spark.sql.functions._

val df = sc.parallelize(Array(10,2,3,4)).toDF("steps")
df.select(col("steps")).rdd.map(_(0).asInstanceOf[Int]).reduce(_+_)

//res1 Int = 19

7 Comments

Nice option! Is it still more efficient if he wants the sum of many columns? In a dataframe i know it would be like df.agg(sum("col1"), sum("col2"), ...)
@DanieldePaula I know but he said one column
Oh, I read "I want to sum the values of each column (...)" and I thought he meant many columns. Anyway, my question was more out of curiosity, to help improving our answers.
@DanieldePaula Indeed your answer is the correct one, mine is just an alternative (for only one column), so I will vote for yours.
I set as the correct answer the second one, because I wanted the sum of the values of one column. However later I need the mean and other statistical methods, so I think I will use something similar in syntax based on answer 1.
|
10

Simply apply aggregation function, Sum on your column

df.groupby('steps').sum().show()

Follow the Documentation http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html

Check out this link also https://www.analyticsvidhya.com/blog/2016/10/spark-dataframe-and-operations/

Comments

4

Not sure this was around when this question was asked but:

df.describe().show("columnName")

gives mean, count, stdtev stats on a column. I think it returns on all columns if you just do .show()

Comments

0

Using spark sql query..just incase if it helps anyone!

import org.apache.spark.sql.SparkSession 
import org.apache.spark.SparkConf 
import org.apache.spark.sql.functions._ 
import org.apache.spark.SparkContext 
import java.util.stream.Collectors

val conf = new SparkConf().setMaster("local[2]").setAppName("test")
val spark = SparkSession.builder.config(conf).getOrCreate()
val df = spark.sparkContext.parallelize(Seq(1, 2, 3, 4, 5, 6, 7)).toDF()

df.createOrReplaceTempView("steps")
val sum = spark.sql("select  sum(steps) as stepsSum from steps").map(row => row.getAs("stepsSum").asInstanceOf[Long]).collect()(0)
println("steps sum = " + sum) //prints 28

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.