I have a Spark DataFrame that looks like this:
root
|-- employeeName: string (nullable = true)
|-- employeeId: string (nullable = true)
|-- employeeEmail: string (nullable = true)
|-- company: struct (nullable = true)
| |-- companyName: string (nullable = true)
| |-- companyId: string (nullable = true)
| |-- details: struct (nullable = true)
| | |-- founded: string (nullable = true)
| | |-- address: string (nullable = true)
| | |-- industry: string (nullable = true)
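For a reproducible setup, here is roughly how I'm building the sample data (the case classes mirror the schema above; the actual values are just made up for illustration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("example").master("local[*]").getOrCreate()
import spark.implicits._

// Case classes mirroring the nested schema (field names taken from the schema).
case class Details(founded: String, address: String, industry: String)
case class Company(companyName: String, companyId: String, details: Details)
case class Employee(employeeName: String, employeeId: String,
                    employeeEmail: String, company: Company)

val df = Seq(
  Employee("Alice", "e1", "alice@acme.com",
    Company("Acme", "c1", Details("1990", "1 Main St", "Manufacturing"))),
  Employee("Bob", "e2", "bob@acme.com",
    Company("Acme", "c1", Details("1990", "1 Main St", "Manufacturing")))
).toDF()

df.printSchema()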
What I want to do is group by companyId and get an array of employees per company, like this:
root
|-- company: struct (nullable = true)
| |-- companyName: string (nullable = true)
| |-- companyId: string (nullable = true)
| |-- details: struct (nullable = true)
| | |-- founded: string (nullable = true)
| | |-- address: string (nullable = true)
| | |-- industry: string (nullable = true)
|-- employees: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- employeeName: string (nullable = true)
| | |-- employeeId: string (nullable = true)
| | |-- employeeEmail: string (nullable = true)
Of course, I could easily do this if I just had pairs of (company, employee): (String, String), using map and reduceByKey. But with all this nested structure, I'm not sure what approach to take.
Should I flatten everything first? Any example of doing something similar would be very helpful.
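To show what I mean, here is the kind of thing I've been sketching with the DataFrame API, grouping by the whole company struct and collecting the employee fields into an array (I'm not sure this is the right approach):

import org.apache.spark.sql.functions.{col, collect_list, struct}

// Group by the entire company struct (it carries companyId) and collect
// each row's employee fields into a struct, gathered into one array per company.
val grouped = df
  .groupBy(col("company"))
  .agg(
    collect_list(
      struct(col("employeeName"), col("employeeId"), col("employeeEmail"))
    ).as("employees")
  )

grouped.printSchema()

Grouping by the whole company struct assumes it is identical for every row with the same companyId; otherwise I suppose I would group by col("company.companyId") and keep, say, first(col("company")) for the rest. Is something like this the idiomatic way, or is flattening first still better?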