
I have a Spark DataFrame as below:

INPUT

+----------+-------------+------------------+-----------+------------+-----------+--------------+------------------+--------------+------+---------+------+--------+----------+----------+
| accountId|accountNumber|acctNumberTypeCode|cisDivision|currencyCode|priceItemCd|priceItemParam|priceItemParamCode|processingDate|txnAmt|  txnDttm|txnVol|udfChar1|  udfChar2|  udfChar3|
+----------+-------------+------------------+-----------+------------+-----------+--------------+------------------+--------------+------+---------+------+--------+----------+----------+
|2032000000|   2032000000|          C1_F_ANO|         CA|         USD| PRICEITEM2|            UK|           Country|    2018-06-06|   100|28-MAY-18|   100|   TYPE1|PRICEITEM1|PRICEITEM2|
|2032000000|   2032000000|          C1_F_ANO|         CA|         USD| PRICEITEM2|            UK|           Country|    2018-06-06|   100|28-MAY-18|   100|   TYPE1|PRICEITEM1|PRICEITEM2|
|1322000000|   1322000000|          C1_F_ANO|         CA|         USD| PRICEITEM1|            US|           Country|    2018-06-06|   100|28-MAY-18|   100|   TYPE1|PRICEITEM1|PRICEITEM2|
|1322000000|   1322000000|          C1_F_ANO|         CA|         USD| PRICEITEM1|            US|           Country|    2018-06-06|   100|28-MAY-18|   100|   TYPE1|PRICEITEM1|PRICEITEM2|

Now I want to:

  1. Sum the "txnAmt" column for records having the same accountId and accountNumber.
  2. Drop the duplicate records.

Output

+----------+-------------+------------------+-----------+------------+-----------+--------------+------------------+--------------+------+---------+------+--------+----------+----------+
| accountId|accountNumber|acctNumberTypeCode|cisDivision|currencyCode|priceItemCd|priceItemParam|priceItemParamCode|processingDate|txnAmt|  txnDttm|txnVol|udfChar1|  udfChar2|  udfChar3|
+----------+-------------+------------------+-----------+------------+-----------+--------------+------------------+--------------+------+---------+------+--------+----------+----------+
|2032000000|   2032000000|          C1_F_ANO|         CA|         USD| PRICEITEM2|            UK|           Country|    2018-06-06|   200|28-MAY-18|   100|   TYPE1|PRICEITEM1|PRICEITEM2|
|1322000000|   1322000000|          C1_F_ANO|         CA|         USD| PRICEITEM1|            US|           Country|    2018-06-06|   200|28-MAY-18|   100|   TYPE1|PRICEITEM1|PRICEITEM2|

I am not sure how to perform step 1.

I have written code to perform step 2, dropping the duplicates based on accountId and accountNumber:

// keeps one (arbitrary) row per accountId/accountNumber combination
String[] colNames = {"accountId", "accountNumber"};
Dataset<RuleOutputParams> finalDs = rulesParamDS.dropDuplicates(colNames);

Can anyone help?

  • Can two rows have the same accountId and accountNumber but different values in other columns? Commented Jun 6, 2018 at 6:51
  • You're not dropping duplicates... You are aggregating them Commented Jun 6, 2018 at 6:55
  • @Shaido yes it is possible. Commented Jun 6, 2018 at 6:59
  • So when you do the aggregation (or dropping), if for example the txnDttm has different values while accountId and accountNumber are the same. How do you know which value of txnDttm to keep? Both? Commented Jun 6, 2018 at 7:03
  • @cricket_007 finalDS has only unique rows based on the values of accountId and accountNumber, so I thought it had dropped the remaining rows. Can you please point out something that can lead me to complete the requirement? Commented Jun 6, 2018 at 7:03

1 Answer


Load the data and register it as a temporary view so it can be queried with SQL:

val df = spark.read.format("csv").option("header", true).load("data.csv")
df.createOrReplaceTempView("t")

Then, what you need are window aggregation functions, plus a trick with row_number() to remove the duplicates:

val df2 = spark.sql("""SELECT * FROM (
  SELECT *, 
    sum(txnAmt) OVER (PARTITION BY accountId, accountNumber) s, 
    row_number() OVER (PARTITION BY accountId, accountNumber ORDER BY processingDate) r FROM t) 
  WHERE r=1""")
  .drop("txnAmt", "r")
  .withColumnRenamed("s", "txnAmt")

And if you show that, you'll see

+----------+-------------+------------------+-----------+------------+-----------+--------------+------------------+--------------+---------+------+--------+----------+----------+------+
| accountId|accountNumber|acctNumberTypeCode|cisDivision|currencyCode|priceItemCd|priceItemParam|priceItemParamCode|processingDate|  txnDttm|txnVol|udfChar1|  udfChar2|  udfChar3|txnAmt|
+----------+-------------+------------------+-----------+------------+-----------+--------------+------------------+--------------+---------+------+--------+----------+----------+------+
|2032000000|   2032000000|          C1_F_ANO|         CA|         USD| PRICEITEM2|            UK|           Country|    2018-06-06|28-MAY-18|   100|   TYPE1|PRICEITEM1|PRICEITEM2| 200.0|
|1322000000|   1322000000|          C1_F_ANO|         CA|         USD| PRICEITEM1|            US|           Country|    2018-06-06|28-MAY-18|   100|   TYPE1|PRICEITEM1|PRICEITEM2| 200.0|
+----------+-------------+------------------+-----------+------------+-----------+--------------+------------------+--------------+---------+------+--------+----------+----------+------+
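Since the question itself uses the Java Dataset API, the same window logic can also be written without a SQL string. This is only a minimal sketch, assuming df is the loaded Dataset<Row> (the variable names here are illustrative, not from the question):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;
import static org.apache.spark.sql.functions.*;

// Sum txnAmt over each accountId/accountNumber group, then keep one row per group.
WindowSpec byAccount = Window.partitionBy("accountId", "accountNumber");

Dataset<Row> result = df
    .withColumn("s", sum("txnAmt").over(byAccount))                           // group-wide sum
    .withColumn("r", row_number().over(byAccount.orderBy("processingDate")))  // 1, 2, ... within each group
    .filter(col("r").equalTo(1))                                              // keep only the first row per group
    .drop("txnAmt", "r")
    .withColumnRenamed("s", "txnAmt");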

As a side note, one might also add more columns to the following query, but they would then need to be added to the GROUP BY clause as well (an example follows the output below).

spark.sql("SELECT accountId, accountNumber, SUM(txnAmt) txnAmt FROM t GROUP BY accountId, accountNumber").show
+----------+-------------+------+
| accountId|accountNumber|txnAmt|
+----------+-------------+------+
|2032000000|   2032000000| 200.0|
|1322000000|   1322000000| 200.0|
+----------+-------------+------+
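For example, if you also wanted to keep processingDate (chosen here purely for illustration), it would have to appear in both the select list and the GROUP BY. In Java, that could look like:

// Every non-aggregated column in the SELECT must also be listed in the GROUP BY.
spark.sql("SELECT accountId, accountNumber, processingDate, SUM(txnAmt) AS txnAmt "
        + "FROM t GROUP BY accountId, accountNumber, processingDate").show();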

Comments

Thanks a lot for the explanation. However, when I wrote val df2 = spark.sql("""SELECT * FROM ( SELECT *, sum(txnAmt) OVER (PARTITION BY accountId, accountNumber) s, row_number() OVER (PARTITION BY accountId, accountNumber ORDER BY processingDate) r FROM t) WHERE r=1""") .drop("txnAmt", "r") .withColumnRenamed("s", "txnAmt") in my Java code, it gave a compiler error.
Java doesn't use triple quotes like Scala does, or the val keyword.
Yes, I tried with a single pair of quotes and the compiler error was resolved. Thanks.
When I write the final DF to a CSV file, instead of creating one CSV file it creates different CSV files for different rows (records). Why is that?
Spark is distributed over many CPUs and computers. When you do write("file.csv"), it makes a directory. One file per Spark partition. forums.databricks.com/questions/2848/…
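If you really need a single CSV file and the result is small enough, one workaround is to reduce the output to one partition before writing. A minimal Java sketch, with an illustrative output path:

// coalesce(1) pulls everything into a single partition, so the output
// directory will contain just one part-*.csv file.
finalDs.coalesce(1)
       .write()
       .option("header", "true")
       .csv("/tmp/output");  // path is illustrative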
