I have a table and need to do the following:
- Group by DepartmentID and EmployeeID.
- Within each group, order the rows by (ArrivalDate, ArrivalTime), newest first, and pick the first one. That is, if two dates differ, pick the newer date; if the dates are the same, pick the newer time.
I am trying something like this:
input.select("DepartmenId", "EmolyeeID", "ArrivalDate", "ArrivalTime", "Word")
  .agg(/* the function that implements the second step above goes here */)
  .show()
What is the right syntax for the aggregation here?
Thank you in advance.
// input
// +-----------+---------+-----------+-----------+--------+
// |DepartmenId|EmolyeeID|ArrivalDate|ArrivalTime| Word |
// +-----------+---------+-----------+-----------+--------+
// | D1 | E1 | 20170101 | 0730 | "YES" |
// +-----------+---------+-----------+-----------+--------+
// | D1 | E1 | 20170102 | 1530 | "NO" |
// +-----------+---------+-----------+-----------+--------+
// | D1 | E2 | 20170101 | 0730 | "ZOO" |
// +-----------+---------+-----------+-----------+--------+
// | D1 | E2 | 20170102 | 0330 | "BOO" |
// +-----------+---------+-----------+-----------+--------+
// | D2 | E1 | 20170101 | 0730 | "LOL" |
// +-----------+---------+-----------+-----------+--------+
// | D2 | E1 | 20170101 | 1830 | "ATT" |
// +-----------+---------+-----------+-----------+--------+
// | D2 | E2 | 20170105 | 1430 | "UNI" |
// +-----------+---------+-----------+-----------+--------+
// output should be
// +-----------+---------+-----------+-----------+--------+
// |DepartmenId|EmolyeeID|ArrivalDate|ArrivalTime| Word |
// +-----------+---------+-----------+-----------+--------+
// | D1 | E1 | 20170102 | 1530 | "NO" |
// +-----------+---------+-----------+-----------+--------+
// | D1 | E2 | 20170102 | 0330 | "BOO" |
// +-----------+---------+-----------+-----------+--------+
// | D2 | E1 | 20170101 | 1830 | "ATT" |
// +-----------+---------+-----------+-----------+--------+
// | D2 | E2 | 20170105 | 1430 | "UNI" |
// +-----------+---------+-----------+-----------+--------+
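
For reference, the two steps described above (group, then take the newest row per group) can be written with a window function instead of `agg`. This is only a minimal sketch under some assumptions: Spark 2.x or later, `ArrivalDate` and `ArrivalTime` stored as zero-padded strings (`yyyyMMdd` and `HHmm`) so that string order matches chronological order, and `Word` values stored without the surrounding quotes. The local `SparkSession`, the inline sample data, and the names `byEmployee`, `latest` and `rn` are made up here to keep the example self-contained.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Local session for illustration only; in spark-shell this returns the existing one.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("latest-arrival")
  .getOrCreate()
import spark.implicits._

// Sample rows from the question (column names spelled as in the question).
val input = Seq(
  ("D1", "E1", "20170101", "0730", "YES"),
  ("D1", "E1", "20170102", "1530", "NO"),
  ("D1", "E2", "20170101", "0730", "ZOO"),
  ("D1", "E2", "20170102", "0330", "BOO"),
  ("D2", "E1", "20170101", "0730", "LOL"),
  ("D2", "E1", "20170101", "1830", "ATT"),
  ("D2", "E2", "20170105", "1430", "UNI")
).toDF("DepartmenId", "EmolyeeID", "ArrivalDate", "ArrivalTime", "Word")

// One window per (department, employee), newest date first, then newest time.
// Assumption: both columns are zero-padded strings, so descending string
// order is also descending chronological order.
val byEmployee = Window
  .partitionBy("DepartmenId", "EmolyeeID")
  .orderBy(col("ArrivalDate").desc, col("ArrivalTime").desc)

// Number the rows inside each window and keep only the first one.
val latest = input
  .withColumn("rn", row_number().over(byEmployee))
  .where(col("rn") === 1)
  .drop("rn")

latest.orderBy("DepartmenId", "EmolyeeID").show()
```

If two rows in the same group share both date and time, `row_number` keeps one of them without a guaranteed choice; adding another column to the `orderBy` would make that deterministic. An aggregation-style alternative that should give the same rows is `groupBy("DepartmenId", "EmolyeeID").agg(max(struct("ArrivalDate", "ArrivalTime", "Word")))`, since structs compare field by field, but the result then has to be unpacked from the struct column.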