I ran a quick experiment on this question, and the results show that the mean, max, and min functions ignore null values. The code and its output are below.
Environment: Scala, Spark 1.6.1, Hadoop 2.6.0
// Run in spark-shell, which provides sc and sqlContext (and the $"..." column syntax).
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{max, mean, min}
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}
// Use null for the missing values; the "value" column is declared nullable in the schema below.
val row1 = Row("1", 2.4, "2016-12-21")
val row2 = Row("1", null, "2016-12-22")
val row3 = Row("2", null, "2016-12-23")
val row4 = Row("2", null, "2016-12-23")
val row5 = Row("3", 3.0, "2016-12-22")
val row6 = Row("3", 2.0, "2016-12-22")
val theRdd = sc.makeRDD(Array(row1, row2, row3, row4, row5, row6))
val schema = StructType(
  StructField("key", StringType, false) ::
  StructField("value", DoubleType, true) ::
  StructField("d", StringType, false) :: Nil)
val df = sqlContext.createDataFrame(theRdd, schema)
df.show()
df.agg(mean($"value"), max($"value"), min($"value")).show()
df.groupBy("key").agg(mean($"value"), max($"value"), min($"value")).show()
Output:
+---+-----+----------+
|key|value|         d|
+---+-----+----------+
|  1|  2.4|2016-12-21|
|  1| null|2016-12-22|
|  2| null|2016-12-23|
|  2| null|2016-12-23|
|  3|  3.0|2016-12-22|
|  3|  2.0|2016-12-22|
+---+-----+----------+

+-----------------+----------+----------+
|       avg(value)|max(value)|min(value)|
+-----------------+----------+----------+
|2.466666666666667|       3.0|       2.0|
+-----------------+----------+----------+

+---+----------+----------+----------+
|key|avg(value)|max(value)|min(value)|
+---+----------+----------+----------+
|  1|       2.4|       2.4|       2.4|
|  2|      null|      null|      null|
|  3|       2.5|       3.0|       2.0|
+---+----------+----------+----------+
From the output you can see that mean, max, and min on the column 'value' for the group key='1' all return 2.4 rather than null, which shows that these functions ignore null values. The overall mean confirms this: it is (2.4 + 3.0 + 2.0) / 3 ≈ 2.467, so the three null rows are not counted in the denominator. However, if a group contains only null values, as for key='2', these functions return null.
.cast("double") on them if necessary.
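To make the rule from the experiment concrete without needing a Spark installation, here is a minimal plain-Scala sketch that models it. It is only an illustration of the observed semantics, not Spark's implementation: None stands in for a SQL null, and the names NullSkippingAggregates, overNonNull, meanOf, maxOf, and minOf are hypothetical.

```scala
// Plain-Scala model of null-skipping aggregates; None stands in for SQL null.
// Illustration only: this is not how Spark implements avg/max/min.
object NullSkippingAggregates {

  // Applies `f` to the non-null values only.
  // Returns None (i.e. null) when there are no non-null values at all.
  private def overNonNull(values: Seq[Option[Double]])(f: Seq[Double] => Double): Option[Double] = {
    val present = values.flatten // drops every None
    if (present.isEmpty) None else Some(f(present))
  }

  // The denominator is the number of non-null values, not the number of rows.
  def meanOf(values: Seq[Option[Double]]): Option[Double] =
    overNonNull(values)(xs => xs.sum / xs.size)

  def maxOf(values: Seq[Option[Double]]): Option[Double] =
    overNonNull(values)(_.max)

  def minOf(values: Seq[Option[Double]]): Option[Double] =
    overNonNull(values)(_.min)

  def main(args: Array[String]): Unit = {
    // Same values as the experiment, grouped by key.
    val groups = Seq(
      "1" -> Seq(Some(2.4), None),
      "2" -> Seq(None, None),
      "3" -> Seq(Some(3.0), Some(2.0))
    )
    for ((key, values) <- groups) {
      println(s"key=$key mean=${meanOf(values)} max=${maxOf(values)} min=${minOf(values)}")
    }
    // Whole column: six rows, but only three contribute to the mean.
    println(s"all mean=${meanOf(groups.flatMap(_._2))}")
  }
}
```

In this model, key "2" yields None for all three aggregates, matching the null row in the Spark output. In Spark itself, count($"value") counts only the non-null values in a group, and coalesce(mean($"value"), lit(0.0)) is one way to substitute a default (0.0 here is just an example) when a group is entirely null.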