I have the following Spark SQL query:
val subquery =
"( select garment_group_name , prod_name, " +
"row_number() over (partition by garment_group_name order by count(prod_name) desc) as seqnum " +
"from articles a1 " +
"group by garment_group_name, prod_name )"
val query = "SELECT garment_group_name, prod_name " +
"FROM " + subquery +
" WHERE seqnum = 1 "
val query3 = spark.sql(query)
I am trying to do exactly the same thing with the DataFrame API. I wanted to concentrate on the subquery part first, so I tried something like this:
import org.apache.spark.sql.expressions.Window // imports the needed Window object
import org.apache.spark.sql.functions.row_number
val windowSpec = Window.partitionBy("garment_group_name")
articlesDF.withColumn("row_number", row_number.over(windowSpec))
.show()
However, I get the following error:
org.apache.spark.sql.AnalysisException: Window function row_number() requires window to be ordered, please add ORDER BY clause. For example SELECT row_number()(value_expr) OVER (PARTITION BY window_partition ORDER BY window_ordering) from table;
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:42)
at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:95)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowOrder$$anonfun$apply$33.applyOrElse(Analyzer.scala:2207)......... and so on.
I see that I need to add an orderBy clause to the window, but how can I do that when the ordering column is a count that first has to be computed by a group by on two columns?
The error message gives the example: SELECT row_number()(value_expr) OVER (PARTITION BY window_partition ORDER BY window_ordering) from table, but I do not know how to express this with the DataFrame API, and I have not found an example online.
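For reference, here is a minimal sketch of how the full SQL query above might be expressed with the DataFrame API. The idea is to compute the count with groupBy first, then order the window by that count column. The sample rows, the local SparkSession settings, and the names counted and topPerGroup are assumptions made for illustration only; the sketch is meant to be pasted into spark-shell or run as a script. Note that row_number breaks ties arbitrarily, so groups with several equally frequent products may return any one of them.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Assumption: a local session for illustration; in spark-shell this reuses the existing one.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("top-product-per-group")
  .getOrCreate()

// Hypothetical sample data standing in for the real articles table.
val articlesDF = spark.createDataFrame(Seq(
  ("Jersey", "Tee"),
  ("Jersey", "Tee"),
  ("Jersey", "Tank"),
  ("Knitwear", "Cardigan")
)).toDF("garment_group_name", "prod_name")

// Step 1: the GROUP BY part. count() adds a column named "count".
val counted = articlesDF
  .groupBy("garment_group_name", "prod_name")
  .count()

// Step 2: the window must be ordered; order each partition by the count, descending.
val windowSpec = Window
  .partitionBy("garment_group_name")
  .orderBy(col("count").desc)

// Step 3: number the rows within each partition and keep only the first one.
val topPerGroup = counted
  .withColumn("seqnum", row_number().over(windowSpec))
  .filter(col("seqnum") === 1)
  .select("garment_group_name", "prod_name")

topPerGroup.show()
```

Splitting the aggregation and the window into two steps mirrors what the SQL does implicitly: there, count(prod_name) inside the OVER clause is evaluated after the GROUP BY, so the window only ever sees the already-aggregated rows.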