I want to convert this basic SQL Query in Spark
select Grade, count(*) * 100.0 / sum(count(*)) over()
from StudentGrades
group by Grade
I have tried using windowing functions in spark like this
val windowSpec = Window.rangeBetween(Window.unboundedPreceding,Window.unboundedFollowing)
df1.select(
$"Arrest"
).groupBy($"Arrest").agg(sum(count("*")) over windowSpec,count("*")).show()
+------+--------------------------------------------------------------------
----------+--------+
|Arrest|sum(count(1)) OVER (RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED
FOLLOWING)|count(1)|
+------+--------------------------------------------------------------------
----------+--------+
| true|
665517| 184964|
| false|
665517| 480553|
+------+------------------------------------------------------------------------------+--------+
But when I try dividing by count(*) it through's error
df1.select(
$"Arrest"
).groupBy($"Arrest").agg(count("*")/sum(count("*")) over
windowSpec,count("*")).show()
It is not allowed to use an aggregate function in the argument of another aggregate function. Please use the inner aggregate function in a sub-query.;;
My Question is when I'm already using count() inside sum() in the first query I'm not receiving any errors of using an aggregate function inside another aggregate function but why get error in the second one?
count(*) * 100.0 / sum(count(*)) over()basically will calculate percentage in SQL, that is what I wanted to achieve. I have already got the output by first getting the total count in a variable and then using that to divide the individual count, but I just wanted to know how to write a similar query in spark SQL.count | percentage 184964 | 27.79 480553 | 72.21