Using spark dataframe i need to convert the row values into column and partition by user id and create a csv file.
val someDF = Seq(
("user1", "math","algebra-1","90"),
("user1", "physics","gravity","70"),
("user3", "biology","health","50"),
("user2", "biology","health","100"),
("user1", "math","algebra-1","40"),
("user2", "physics","gravity-2","20")
).toDF("user_id", "course_id","lesson_name","score")
someDF.show(false)
+-------+---------+-----------+-----+
|user_id|course_id|lesson_name|score|
+-------+---------+-----------+-----+
| user1| math| algebra-1| 90|
| user1| physics| gravity| 70|
| user3| biology| health| 50|
| user2| biology| health| 100|
| user1| math| algebra-1| 40|
| user2| physics| gravity-2| 20|
+-------+---------+-----------+-----+
val result = someDF.groupBy("user_id", "course_id").pivot("lesson_name").agg(first("score"))
result.show(false)
+-------+---------+---------+-------+---------+------+
|user_id|course_id|algebra-1|gravity|gravity-2|health|
+-------+---------+---------+-------+---------+------+
| user3| biology| null| null| null| 50|
| user1| math| 90| null| null| null|
| user2| biology| null| null| null| 100|
| user2| physics| null| null| 20| null|
| user1| physics| null| 70| null| null|
+-------+---------+---------+-------+---------+------+
With the above code i'm able to convert row value(lesson_name) to column name.
But I need to save the out in csv in a course_wise
Expected out in csv should be like this below formate.
biology.csv // Expected Output
+-------+---------+------+
|user_id|course_id|health|
+-------+---------+------+
| user3| biology| 50 |
| user2| biology| 100 |
+-------+---------+-------
physics.csv // Expected Output
+-------+---------+---------+-------
|user_id|course_id|gravity-2|gravity|
+-------+---------+---------+-------+
| user2| physics| 50 | null |
| user1| physics| 100 | 70 |
+-------+---------+---------+-------+
**Note: Each course in a csv it should contain only it's specifi lesson names and it should not contain any non relevant course lesson names.
Actually in csv i'm able to in below formate**
result.write
.partitionBy("course_id")
.mode("overwrite")
.format("com.databricks.spark.csv")
.option("header", "true")
.save(somepath)
eg:
biology.csv // Wrong output, Due to it is containing non-relevant course lesson's(algebra-1,gravity-2,algebra-1)
+-------+---------+---------+-------+---------+------+
|user_id|course_id|algebra-1|gravity|gravity-2|health|
+-------+---------+---------+-------+---------+------+
| user3| biology| null| null| null| 50|
| user2| biology| null| null| null| 100|
+-------+---------+---------+-------+---------+------+
Anyone can help to solve this problem ?