I have a DataFrame generated as follows:
df.groupBy($"Hour", $"Category")
.agg(sum($"value").alias("TotalValue"))
.sort($"Hour".asc,$"TotalValue".desc))
The results look like:
+----+--------+----------+
|Hour|Category|TotalValue|
+----+--------+----------+
| 0| cat26| 30.9|
| 0| cat13| 22.1|
| 0| cat95| 19.6|
| 0| cat105| 1.3|
| 1| cat67| 28.5|
| 1| cat4| 26.8|
| 1| cat13| 12.6|
| 1| cat23| 5.3|
| 2| cat56| 39.6|
| 2| cat40| 29.7|
| 2| cat187| 27.9|
| 2| cat68| 9.8|
| 3| cat8| 35.6|
| ...| ....| ....|
+----+--------+----------+
I would like to make new DataFrames based on every unique value of col("Hour"), i.e.
- for the group of Hour==0
- for the group of Hour==1
- for the group of Hour==2 and so on...
So the desired output would be:
df0 as:
+----+--------+----------+
|Hour|Category|TotalValue|
+----+--------+----------+
| 0| cat26| 30.9|
| 0| cat13| 22.1|
| 0| cat95| 19.6|
| 0| cat105| 1.3|
+----+--------+----------+
df1 as:
+----+--------+----------+
|Hour|Category|TotalValue|
+----+--------+----------+
| 1| cat67| 28.5|
| 1| cat4| 26.8|
| 1| cat13| 12.6|
| 1| cat23| 5.3|
+----+--------+----------+
and similarly,
df2 as:
+----+--------+----------+
|Hour|Category|TotalValue|
+----+--------+----------+
| 2| cat56| 39.6|
| 2| cat40| 29.7|
| 2| cat187| 27.9|
| 2| cat68| 9.8|
+----+--------+----------+
Any help is highly appreciated.
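For illustration, one straightforward way to obtain such per-hour DataFrames is to collect the distinct hours and filter on each; this is only a minimal sketch, assuming Hour is a Long and sparkSession is the active SparkSession (as in the attempt under EDIT 1 below):

import org.apache.spark.sql.DataFrame
import sparkSession.implicits._  // for $-notation and the Long encoder

// Collect the (small) set of distinct hours to the driver,
// then derive one filtered DataFrame per hour.
val hours: Array[Long] = df.select($"Hour").distinct().as[Long].collect()

val hourlyDFs: Map[Long, DataFrame] =
  hours.map(h => h -> df.filter($"Hour" === h)).toMap

hourlyDFs(0).show()  // df0 from above
hourlyDFs(1).show()  // df1, and so on

Each filter is a plain scan with a predicate, so no join is involved; whether this is fast enough at scale is exactly the open question.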
EDIT 1:
What I have tried:
df.foreach(
  row => splitHour(row)
)

def splitHour(row: Row) = {
  val Hour = row.getAs[Long]("Hour")
  // Build a single-row DataFrame holding just this row's hour value
  val HourDF = sparkSession.createDataFrame(List((s"$Hour", 1)))
  val hdf = HourDF.withColumnRenamed("_1", "Hour_unique").drop("_2")
  // Join back against the full DataFrame to keep only this hour's rows
  val mydf: DataFrame = df.join(hdf, df("Hour") === hdf("Hour_unique"))
  mydf.write.mode("overwrite").parquet(s"/home/dev/shaishave/etc/myparquet/$Hour/")
}
PROBLEM WITH THIS STRATEGY:
It took 8 hours when run on a DataFrame df with over 1 million rows, with the Spark job given around 10 GB of RAM on a single node. So the join is turning out to be highly inefficient.
Caveat: I have to write each DataFrame mydf as Parquet with a nested schema that must be maintained (not flattened).
EDIT 2:
Would it be possible to use df.write.partitionBy("hour").saveAsTable("myparquet") to do this? The output folders are created as hour=0, hour=1, etc., whereas I want the files to be saved as 0, 1, etc. Could you please give your insights on how to achieve this?
Reply: There is hive.dynamic.partitioning.custom.pattern, but one of the advantages of keeping it as hour=0, hour=1, etc. is that when you're running spark.read.parquet(...) it will automatically understand the underlying dynamic partitions. Another potential approach would be to rename the folders afterwards (i.e. use the mv command), but you would still run into the same issue that read.parquet will not automatically understand the dynamic partitions.
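For reference, a minimal sketch of the partitioned write being discussed, using the actual column name Hour rather than the lowercase hour from the comments, and the base path from the original code; treat it as an illustration rather than the accepted answer:

import org.apache.spark.sql.SaveMode

// One sub-directory per hour (Hour=0, Hour=1, ...) is created under the root path;
// the (possibly nested) schema of the remaining columns is kept as-is in the Parquet files.
df.write
  .mode(SaveMode.Overwrite)
  .partitionBy("Hour")
  .parquet("/home/dev/shaishave/etc/myparquet")

// Reading the root path back lets Spark discover the Hour partitions automatically.
val restored = sparkSession.read.parquet("/home/dev/shaishave/etc/myparquet")
restored.filter("Hour = 0").show()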