
I want to convert this basic SQL query to Spark:

select Grade, count(*) * 100.0 / sum(count(*)) over()
from StudentGrades
group by Grade

I have tried using window functions in Spark like this:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val windowSpec = Window.rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)

df1.select($"Arrest")
  .groupBy($"Arrest")
  .agg(sum(count("*")) over windowSpec, count("*"))
  .show()


+------+------------------------------------------------------------------------------+--------+
|Arrest|sum(count(1)) OVER (RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)|count(1)|
+------+------------------------------------------------------------------------------+--------+
|  true|                                                                        665517|  184964|
| false|                                                                        665517|  480553|
+------+------------------------------------------------------------------------------+--------+

But when I try dividing by count(*) it throws an error:

df1.select($"Arrest")
  .groupBy($"Arrest")
  .agg(count("*") / sum(count("*")) over windowSpec, count("*"))
  .show()

It is not allowed to use an aggregate function in the argument of another aggregate function. Please use the inner aggregate function in a sub-query.;;

My question is: since I'm already using count() inside sum() in the first query and get no error about using an aggregate function inside another aggregate function, why do I get that error in the second one?

  • What do you want, functionally? Please provide input and expected output; a small example will suffice. Commented Sep 5, 2018 at 10:51
  • The line count(*) * 100.0 / sum(count(*)) over() calculates a percentage in SQL; that is what I want to achieve. I have already got the output by first getting the total count into a variable and then using it to divide the individual counts, but I wanted to know how to write a similar query in Spark SQL. Commented Sep 5, 2018 at 11:02
  • You mean percentage of total? Commented Sep 5, 2018 at 11:03
  • Yes, the output should be like: count | percentage: 184964 | 27.79, 480553 | 72.21 Commented Sep 5, 2018 at 11:04
  • Updated answer. Commented Sep 5, 2018 at 12:12
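For reference, the percentages in that expected output are just each group's count divided by the grand total; a quick plain-Python check of the arithmetic (not Spark, counts taken from the output above):

```python
# Per-group counts from the question's output; the windowed sum is their total.
counts = {"true": 184964, "false": 480553}
total = sum(counts.values())  # 665517

# count(*) * 100.0 / sum(count(*)) over(), rounded to two decimals
percentages = {k: round(100.0 * v / total, 2) for k, v in counts.items()}
print(percentages)  # {'true': 27.79, 'false': 72.21}
```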

1 Answer


An example:

import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._

val df = sc.parallelize(Seq(
  ("A", "X", 2, 100), ("A", "X", 7, 100), ("B", "X", 10, 100),
  ("C", "X", 1, 100), ("D", "X", 50, 100), ("E", "X", 30, 100)
)).toDF("c1", "c2", "Val1", "Val2")

val df2 = df
  .groupBy("c1")
  .agg(sum("Val1").alias("sum"))
  .withColumn("fraction", col("sum") / sum("sum").over())

df2.show

You will need to tailor this to your own situation, e.g. count instead of sum:

val df2 = df
  .groupBy("c1")
  .agg(count("*"))
  .withColumn("fraction", col("count(1)") / sum("count(1)").over())

returning:

+---+--------+-------------------+
| c1|count(1)|           fraction|
+---+--------+-------------------+
|  E|       1|0.16666666666666666|
|  B|       1|0.16666666666666666|
|  D|       1|0.16666666666666666|
|  C|       1|0.16666666666666666|
|  A|       2| 0.3333333333333333|
+---+--------+-------------------+

You can multiply by 100 to get a percentage. I note the alias does not seem to work here as it did with the sum, so I worked around it and left the comparison above. Again, you will need to tailor this to your specifics; it is part of my general modules for research and such.
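As for the why in the question: this is most likely Scala operator precedence. Symbolic operators like / bind tighter than alphanumeric infix methods like over, so count("*")/sum(count("*")) over windowSpec is parsed as (count("*")/sum(count("*"))).over(windowSpec). Spark then sees sum applied directly to count with no window of its own, hence the nested-aggregate error, whereas in the first query sum(count("*")) over windowSpec windows the outer sum as intended. Writing it as count("*") / (sum(count("*")).over(windowSpec)) should avoid the error. Setting Spark aside, the percentage-of-total computation itself is just each group's count divided by the grand total; a plain-Python sketch of the same arithmetic (sample data hypothetical):

```python
from collections import Counter

# Toy stand-in for a grouping column such as Arrest or c1 above.
rows = ["A", "A", "B", "C", "D", "E"]

counts = Counter(rows)                        # like groupBy(...).agg(count("*"))
total = sum(counts.values())                  # like sum(count(*)) over ()
fraction = {k: n / total for k, n in counts.items()}

print(fraction["A"])  # 0.3333333333333333
```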
