I need help with an aggregate function on a PySpark DataFrame. I need to calculate the expenses made by each customer, depending on whether the transaction is a 'buy' or a 'sell'.
For a buy, I should subtract the amount from the credit limit; for a sell, I should add the amount to the credit limit.
Below is my table:
+----------+-----------------+------+----------+----------------+
|account_id|credit_card_limit|amount|      date|transaction_code|
+----------+-----------------+------+----------+----------------+
|     12345|             1000|   400|01/06/2020|             buy|
|     12345|             1000|   100|02/06/2020|             buy|
|     12345|             1000|   500|02/06/2020|            sell|
|     12345|             1000|   200|03/06/2020|             buy|
|     22332|             2000|  1000|02/06/2020|             buy|
|     22332|             2000|   200|03/06/2020|             buy|
+----------+-----------------+------+----------+----------------+
I tried the following code, but it didn't give me the correct results:
from pyspark.sql import Window, functions as f
w = Window.partitionBy(f.lit(0)).orderBy('date')
finaldf = (
    df.groupBy('account_id', 'credit_card_limit', 'date')
    .agg(f.sum(f.when(f.col('transaction_code') == 'buy', -f.col('amount'))
               .otherwise(f.col('amount'))).alias('expenses'))
    .select('*', (f.col('credit_card_limit') + f.sum(f.col('expenses')).over(w)).alias('credit_left'))
)
The output I got:
+----------+-----------------+----------+--------+-----------+
|account_id|credit_card_limit|      date|expenses|credit_left|
+----------+-----------------+----------+--------+-----------+
|     12345|             1000|01/06/2020|    -400|        600|
|     12345|             1000|02/06/2020|     400|          0|
|     12345|             1000|03/06/2020|    -200|       -400|
|     22332|             2000|02/06/2020|   -1000|       1000|
|     22332|             2000|03/06/2020|    -200|        800|
+----------+-----------------+----------+--------+-----------+
As you can see, the credit_left column has the wrong values.
Expected output:
+----------+-----------------+----------+--------+-----------+
|account_id|credit_card_limit|      date|expenses|credit_left|
+----------+-----------------+----------+--------+-----------+
|     12345|             1000|01/06/2020|    -400|        600|
|     12345|             1000|02/06/2020|     400|       1000|
|     12345|             1000|03/06/2020|    -200|        800|
|     22332|             2000|02/06/2020|   -1000|       1000|
|     22332|             2000|03/06/2020|    -200|        800|
+----------+-----------------+----------+--------+-----------+
I also need to set credit_left to credit_card_limit in case the value exceeds the credit limit. Please help me solve this problem. Thanks a lot!
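Define the window as
w = Window.partitionBy(f.col("account_id")).orderBy('date')
and try the same code; I think it works. Partitioning by f.lit(0) puts every row into a single window, so the running sum of expenses carries over from one account to the next instead of restarting per account.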
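To check the arithmetic independently of Spark, here is a minimal plain-Python sketch of the same logic: net expenses per account and date, a running total that restarts for each account, and a cap at the credit limit. The function name credit_left_per_account is hypothetical, the rows are copied from the question, and the dd/MM/yyyy date format is an assumption (the sample data does not disambiguate it).

```python
from collections import defaultdict
from datetime import datetime

# Rows copied from the question:
# (account_id, credit_card_limit, amount, date, transaction_code)
rows = [
    (12345, 1000, 400, "01/06/2020", "buy"),
    (12345, 1000, 100, "02/06/2020", "buy"),
    (12345, 1000, 500, "02/06/2020", "sell"),
    (12345, 1000, 200, "03/06/2020", "buy"),
    (22332, 2000, 1000, "02/06/2020", "buy"),
    (22332, 2000, 200, "03/06/2020", "buy"),
]


def credit_left_per_account(rows):
    """Hypothetical helper mirroring groupBy + a per-account running sum."""
    # Step 1: net expenses per (account, limit, date); buy is negative, sell positive.
    daily = defaultdict(int)
    for account_id, limit, amount, date, code in rows:
        daily[(account_id, limit, date)] += -amount if code == "buy" else amount

    # Step 2: order by account, then by the parsed date (dd/MM/yyyy is an assumption).
    ordered = sorted(
        daily.items(),
        key=lambda kv: (kv[0][0], datetime.strptime(kv[0][2], "%d/%m/%Y")),
    )

    # Step 3: the running total is kept per account, like partitionBy("account_id").
    running = {}
    result = []
    for (account_id, limit, date), expenses in ordered:
        running[account_id] = running.get(account_id, 0) + expenses
        # Cap the reported value at the limit; the cap is not fed back into the total.
        credit_left = min(limit, limit + running[account_id])
        result.append((account_id, limit, date, expenses, credit_left))
    return result


for row in credit_left_per_account(rows):
    print(row)
```

On the sample data this reproduces the expected credit_left column (600, 1000, 800 for account 12345 and 1000, 800 for account 22332). In PySpark, the same cap could presumably be written by wrapping the existing credit_left expression in f.least(f.col('credit_card_limit'), ...), and ordering the window by f.to_date('date', 'dd/MM/yyyy') would make the order chronological rather than lexicographic on the date string; both suggestions are untested against the original DataFrame. Note that this caps only the displayed value. If a capped balance should also affect later rows, a plain cumulative sum is not enough.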