I have the following CSV file:
LID,Name,age,CID
122,David,29,ECB4
122,Frank,31,ECB4
567,David,29,ECB4
567,Daniel,35,ECB4
I want to group the data first by CID and then by LID, and save the result as JSON with roughly this structure:
{
  "CID": "ECB4",
  "logs": [
    {
      "LID": 122,
      "body": [
        { "name": "David", "age": 29 },
        { "name": "Frank", "age": 31 }
      ]
    },
    {
      "LID": 567,
      "body": [
        { "name": "David", "age": 29 },
        { "name": "Daniel", "age": 35 }
      ]
    }
  ]
}
I have already defined a schema and loaded the data into a DataFrame:
val df = sparkSession.read.format("csv")
  .option("delimiter", ",")
  .schema(someSchema)
  .load("...")
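For reference, someSchema is essentially the following (types inferred from the sample rows above; exact nullability doesn't matter here):

import org.apache.spark.sql.types._

// Schema matching the CSV columns shown above
val someSchema = StructType(Seq(
  StructField("LID", IntegerType),
  StructField("Name", StringType),
  StructField("age", IntegerType),
  StructField("CID", StringType)
))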
But I have no idea how to group the DataFrame in the desired way. The groupBy function returns a RelationalGroupedDataset, which I cannot save as JSON, and a SQL query requires an aggregation after the grouping.
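My best guess is that each grouping level needs a collect_list over a struct, roughly like the sketch below (df is the DataFrame loaded above, the output path is a placeholder, and I'm not certain this produces exactly the nesting I want):

import org.apache.spark.sql.functions.{col, collect_list, struct}

// Inner grouping: one "body" array of {name, age} structs per (CID, LID) pair
val logs = df
  .groupBy("CID", "LID")
  .agg(collect_list(struct(col("Name").as("name"), col("age"))).as("body"))

// Outer grouping: one "logs" array of {LID, body} structs per CID
val result = logs
  .groupBy("CID")
  .agg(collect_list(struct(col("LID"), col("body"))).as("logs"))

// write.json emits one JSON object per CID row
result.write.json("...")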
I would appreciate any help.