I have a Spark DataFrame as below:
INPUT
+----------+-------------+------------------+-----------+------------+-----------+--------------+------------------+--------------+------+---------+------+--------+----------+----------+
| accountId|accountNumber|acctNumberTypeCode|cisDivision|currencyCode|priceItemCd|priceItemParam|priceItemParamCode|processingDate|txnAmt| txnDttm|txnVol|udfChar1| udfChar2| udfChar3|
+----------+-------------+------------------+-----------+------------+-----------+--------------+------------------+--------------+------+---------+------+--------+----------+----------+
|2032000000| 2032000000| C1_F_ANO| CA| USD| PRICEITEM2| UK| Country| 2018-06-06| 100|28-MAY-18| 100| TYPE1|PRICEITEM1|PRICEITEM2|
|2032000000| 2032000000| C1_F_ANO| CA| USD| PRICEITEM2| UK| Country| 2018-06-06| 100|28-MAY-18| 100| TYPE1|PRICEITEM1|PRICEITEM2|
|1322000000| 1322000000| C1_F_ANO| CA| USD| PRICEITEM1| US| Country| 2018-06-06| 100|28-MAY-18| 100| TYPE1|PRICEITEM1|PRICEITEM2|
|1322000000|   1322000000|          C1_F_ANO|         CA|         USD| PRICEITEM1|            US|           Country|    2018-06-06|   100|28-MAY-18|   100|   TYPE1|PRICEITEM1|PRICEITEM2|
+----------+-------------+------------------+-----------+------------+-----------+--------------+------------------+--------------+------+---------+------+--------+----------+----------+
Now I want to:
- Sum the "txnAmt" column for records having the same accountId and accountNumber.
- Drop the duplicate records.
OUTPUT
+----------+-------------+------------------+-----------+------------+-----------+--------------+------------------+--------------+------+---------+------+--------+----------+----------+
| accountId|accountNumber|acctNumberTypeCode|cisDivision|currencyCode|priceItemCd|priceItemParam|priceItemParamCode|processingDate|txnAmt| txnDttm|txnVol|udfChar1| udfChar2| udfChar3|
+----------+-------------+------------------+-----------+------------+-----------+--------------+------------------+--------------+------+---------+------+--------+----------+----------+
|2032000000| 2032000000| C1_F_ANO| CA| USD| PRICEITEM2| UK| Country| 2018-06-06| 200|28-MAY-18| 100| TYPE1|PRICEITEM1|PRICEITEM2|
|1322000000|   1322000000|          C1_F_ANO|         CA|         USD| PRICEITEM1|            US|           Country|    2018-06-06|   200|28-MAY-18|   100|   TYPE1|PRICEITEM1|PRICEITEM2|
+----------+-------------+------------------+-----------+------------+-----------+--------------+------------------+--------------+------+---------+------+--------+----------+----------+
I am not sure how to perform step 1.
I have written code to perform step 2, dropping the duplicates based on accountId and accountNumber:
String[] colNames = {"accountId", "accountNumber"};
Dataset<RuleOutputParams> finalDs = rulesParamDS.dropDuplicates(colNames);
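For step 1, one idea I had (I am not sure whether this is the right approach) is a window aggregate partitioned by accountId and accountNumber, so that txnAmt is replaced by the group sum while every other column is kept, and then the same dropDuplicates call removes the now-identical rows. A rough sketch using the existing rulesParamDS and colNames:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.sum;

// Step 1: replace txnAmt with the sum over all rows sharing the same
// accountId and accountNumber; the window aggregate keeps every other column.
WindowSpec byAccount = Window.partitionBy("accountId", "accountNumber");
Dataset<Row> summed = rulesParamDS.withColumn("txnAmt", sum(col("txnAmt")).over(byAccount));

// Step 2: the duplicate rows are now identical, so drop them on the same key columns.
Dataset<Row> outputDs = summed.dropDuplicates(colNames);

I think withColumn gives back a Dataset<Row> rather than a Dataset<RuleOutputParams>, so I suppose I would have to map it back to the bean type afterwards if the typed dataset is still needed.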
Can anyone help?
What if records have the same accountId and accountNumber but different values in other columns? txnDttm has different values while accountId and accountNumber are the same. How do you know which value of txnDttm to keep? Both?