I have the following scenario: a DataFrame with payment information such as bill ID, client ID, bill value, payment date, etc. In this DataFrame, the same client can have N records with the same bill_id. So, I want to:
- Check whether a bill_id appears more than once for the same client;
- If yes, keep only the most recent record, based on its timestamp;
- If not, keep the single record for that bill_id;
- Store the result in a new DataFrame built with a df.where clause.
I tried the following code with no success:
df_clients_bills = df_clients_bills.where(
when(countDistinct(df_clients_bills.bill_id) > 1, max(df_clients_bills.payment_date)).otherwise(df_clients_bills)
)
I don't know whether this is the best approach to the problem. Any tip that leads to a solution would be appreciated.