
I need help with an aggregate function on a PySpark DataFrame. I need to calculate the expenses made by a customer based on 'buy' or 'sell' transactions.

If the transaction is a 'buy', I should subtract the amount from the credit limit; if it is a 'sell', I should add the amount to the credit limit.

Below is my table

+----------+-----------------+------+----------+----------------+
|account_id|credit_card_limit|amount|      date|transaction_code|
+----------+-----------------+------+----------+----------------+
|     12345|             1000|   400|01/06/2020|             buy|
|     12345|             1000|   100|02/06/2020|             buy|
|     12345|             1000|   500|02/06/2020|            sell|
|     12345|             1000|   200|03/06/2020|             buy|
|     22332|             2000|  1000|02/06/2020|             buy|
|     22332|             2000|   200|03/06/2020|             buy|
+----------+-----------------+------+----------+----------------+

I tried the code below, but it didn't give me correct results:

from pyspark.sql import Window
import pyspark.sql.functions as f

w = Window.partitionBy(f.lit(0)).orderBy('date')
finaldf = (df.groupBy('account_id', 'credit_card_limit', 'date')
             .agg(f.sum(f.when(f.col('transaction_code') == 'buy', -f.col('amount'))
                         .otherwise(f.col('amount'))).alias('expenses'))
             .select('*', (f.col('credit_card_limit') + f.sum(f.col('expenses')).over(w))
                     .alias('credit_left')))

The output I got:

    +----------+-----------------+----------+--------+-----------+
    |account_id|credit_card_limit|      date|expenses|credit_left|
    +----------+-----------------+----------+--------+-----------+
    |     12345|             1000|01/06/2020|    -400|        600|
    |     12345|             1000|02/06/2020|     400|          0|
    |     12345|             1000|03/06/2020|    -200|       -400|
    |     22332|             2000|02/06/2020|   -1000|       1000|
    |     22332|             2000|03/06/2020|    -200|        800|
    +----------+-----------------+----------+--------+-----------+

As you can see, the credit_left column has wrong values.

Expected Output:

    +----------+-----------------+----------+--------+-----------+
    |account_id|credit_card_limit|      date|expenses|credit_left|
    +----------+-----------------+----------+--------+-----------+
    |     12345|             1000|01/06/2020|    -400|        600|
    |     12345|             1000|02/06/2020|     400|       1000|
    |     12345|             1000|03/06/2020|    -200|        800|
    |     22332|             2000|02/06/2020|   -1000|       1000|
    |     22332|             2000|03/06/2020|    -200|        800|
    +----------+-----------------+----------+--------+-----------+

I also need to cap credit_left at credit_card_limit in case the value exceeds the limit. Please help me solve this problem. Thanks a lot!

  • There are 2 purchases made by the customer on the same date, so I have grouped them to get a single row for 02/06/2020. That is why the output has 5 rows. Commented Jul 3, 2020 at 14:49
  • Suppose 'A' has a 'buy' of 400 on 01/06/2020; the credit left at the end of that day is 600 (1000 - 400). If he has a 'sell' of 500 and a 'buy' of 100 on 02/06/2020, his net change for that day is 500 - 100 = 400, which I add to his previous credit_left, i.e. 600 + 400 = 1000 (see the sketch after these comments). Commented Jul 3, 2020 at 14:54
  • Yes, I need to group a customer's transactions together. Commented Jul 3, 2020 at 15:38
  • No, as long as a customer's transactions are grouped together, any group can come first. Commented Jul 3, 2020 at 15:57
  • In that case, can you change the partition in the window to w = Window.partitionBy(f.col("account_id")).orderBy('date') and try the same code? I think it works. Commented Jul 3, 2020 at 16:05
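
A minimal plain-Python sketch of the running-balance arithmetic described in the comments above (hypothetical values, mirroring account 12345 from the question):

limit = 1000
daily_net = {'01/06/2020': -400,   # buy 400
             '02/06/2020': 400,    # sell 500, buy 100 -> net +400
             '03/06/2020': -200}   # buy 200

credit_left = limit
for day in sorted(daily_net):
    # cap the balance at the card limit, as the question requires
    credit_left = min(limit, credit_left + daily_net[day])
    print(day, credit_left)
# 01/06/2020 600
# 02/06/2020 1000
# 03/06/2020 800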

2 Answers


I have assumed that for account 22332 on 03/06/2020 the credit_card_limit is 1000, as per the logic and the expected answer. Please try this and let me know if it works.

df = spark.sql("""
select 12345 as account_id, 1000 as credit_card_limit, 400 as amount, '01/06/2020' as date, 'buy' as transaction_code
union
select 12345 as account_id, 1000 as credit_card_limit, 100 as amount, '02/06/2020' as date, 'buy' as transaction_code
union
select 12345 as account_id, 1000 as credit_card_limit, 500 as amount, '02/06/2020' as date, 'sell' as transaction_code
union
select 12345 as account_id, 1000 as credit_card_limit, 200 as amount, '03/06/2020' as date, 'buy' as transaction_code
union
select 22332 as account_id, 2000 as credit_card_limit, 1000 as amount, '02/06/2020' as date, 'buy' as transaction_code
union
select 22332 as account_id, 1000 as credit_card_limit, 200 as amount, '03/06/2020' as date, 'buy' as transaction_code
""").orderBy("account_id", "date")

df.show()
# source data
# +----------+-----------------+------+----------+----------------+
# |account_id|credit_card_limit|amount|      date|transaction_code|
# +----------+-----------------+------+----------+----------------+
# |     12345|             1000|   400|01/06/2020|             buy|
# |     12345|             1000|   100|02/06/2020|             buy|
# |     12345|             1000|   500|02/06/2020|            sell|
# |     12345|             1000|   200|03/06/2020|             buy|
# |     22332|             2000|  1000|02/06/2020|             buy|
# |     22332|             1000|   200|03/06/2020|             buy|
# +----------+-----------------+------+----------+----------------+

df.createOrReplaceTempView("tmp1")

data1 = spark.sql("""select  account_id,
        credit_card_limit,
        amount, 
        date,
        transaction_code,
        lead(amount) over(partition by account_id order by date) as lead_amt, -- computed for reference; not used in the final step
        case when transaction_code = 'buy' then -1 * amount else amount end as amount_modified 
from tmp1
order by account_id,date
""")
data1.show()
# +----------+-----------------+------+----------+----------------+--------+---------------+
# |account_id|credit_card_limit|amount|      date|transaction_code|lead_amt|amount_modified|
# +----------+-----------------+------+----------+----------------+--------+---------------+
# |     12345|             1000|   400|01/06/2020|             buy|     100|           -400|
# |     12345|             1000|   100|02/06/2020|             buy|     500|           -100|
# |     12345|             1000|   500|02/06/2020|            sell|     200|            500|
# |     12345|             1000|   200|03/06/2020|             buy|    null|           -200|
# |     22332|             2000|  1000|02/06/2020|             buy|     200|          -1000|
# |     22332|             1000|   200|03/06/2020|             buy|    null|           -200|
# +----------+-----------------+------+----------+----------------+--------+---------------+

data1.createOrReplaceTempView("tmp2")

data2 = spark.sql("""
select account_id,
        credit_card_limit,
        date,
        sum(amount_modified) as expenses,
        case when (credit_card_limit + sum(amount_modified)) > credit_card_limit 
             then credit_card_limit else (credit_card_limit + sum(amount_modified)) 
        end as credit_left
from tmp2
group by account_id, credit_card_limit, date 
order by account_id, date
""")

data2.show()

# +----------+-----------------+----------+--------+-----------+
# |account_id|credit_card_limit|      date|expenses|credit_left|
# +----------+-----------------+----------+--------+-----------+
# |     12345|             1000|01/06/2020|    -400|        600|
# |     12345|             1000|02/06/2020|     400|       1000|
# |     12345|             1000|03/06/2020|    -200|        800|
# |     22332|             2000|02/06/2020|   -1000|       1000|
# |     22332|             1000|03/06/2020|    -200|        800|
# +----------+-----------------+----------+--------+-----------+
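
For reference, the data2 step can also be written with the DataFrame API instead of SQL. A minimal sketch, assuming the data1 DataFrame from the previous step; f.least takes the smaller of the limit and the computed balance, which is the same cap as the case expression above:

import pyspark.sql.functions as f

data2_df = (data1.groupBy("account_id", "credit_card_limit", "date")
                 .agg(f.sum("amount_modified").alias("expenses"))
                 .withColumn("credit_left",
                             f.least(f.col("credit_card_limit"),
                                     f.col("credit_card_limit") + f.col("expenses")))
                 .orderBy("account_id", "date"))
data2_df.show()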

2 Comments

Thanks a lot, Manish! I got the output using this code. It was very crisp and clear! Thanks for your efforts and support!
@keerthi007 Happy to help!

I think you need to change the window to:

w = Window.partitionBy(f.col("account_id")).orderBy('date')

Then your code works:

w = Window.partitionBy(f.col("account_id")).orderBy('date')

finaldf = (df.groupBy('account_id', 'credit_card_limit', 'date')
             .agg(f.sum(f.when(f.col('transaction_code') == 'buy', -f.col('amount'))
                         .otherwise(f.col('amount'))).alias('expenses'))
             .select('*', (f.col('credit_card_limit') + f.sum(f.col('expenses')).over(w))
                     .alias('credit_left')))

finaldf.show()

+----------+-----------------+----------+--------+-----------+
|account_id|credit_card_limit|      date|expenses|credit_left|
+----------+-----------------+----------+--------+-----------+
|     12345|             1000|01/06/2020|    -400|        600|
|     12345|             1000|02/06/2020|     400|       1000|
|     12345|             1000|03/06/2020|    -200|        800|
|     22332|             2000|02/06/2020|   -1000|       1000|
|     22332|             2000|03/06/2020|    -200|        800|
+----------+-----------------+----------+--------+-----------+
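
The question also asks to cap credit_left at credit_card_limit when the balance would exceed the limit. A sketch of one way to add that here (not part of the original answer), using f.least on the windowed sum; note this caps only the reported value for each day, it does not reset the running sum itself:

finaldf_capped = (df.groupBy('account_id', 'credit_card_limit', 'date')
                    .agg(f.sum(f.when(f.col('transaction_code') == 'buy', -f.col('amount'))
                                .otherwise(f.col('amount'))).alias('expenses'))
                    .select('*',
                            f.least(f.col('credit_card_limit'),
                                    f.col('credit_card_limit') + f.sum(f.col('expenses')).over(w))
                             .alias('credit_left')))
finaldf_capped.show()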

2 Comments

Thanks a lot Anky ! I got the output. Thanks a lot for your efforts and support !
@keerthi007 Glad that helped you, happy coding..!!
