I have seen many similar questions but none of them solves my problem.
I have a very large dataset in which I want to compute the difference from the previous row for only a few selected rows. In the following example, I would like to get diff() on pVal only where calc is True, as shown:
   pVal   calc  pDiff
1  0.17  False    NaN
2  0.31  False    NaN
3  0.46  False    NaN
4  0.39   True  -0.07
5  0.26  False    NaN
6  0.60   True   0.34
Note: pDiff gets NaN by default
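To make the example reproducible, the frame above can be built as follows. This is a minimal sketch: the 1-based index is taken from the table, and the real dataset is of course much larger.

```python
import pandas as pd

# Example data copied from the table above; the index 1..6 mirrors the row labels.
df = pd.DataFrame(
    {
        "pVal": [0.17, 0.31, 0.46, 0.39, 0.26, 0.60],
        "calc": [False, False, False, True, False, True],
    },
    index=range(1, 7),
)
print(df)
```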
One could simply compute the difference for all rows and then replace pDiff with NaN wherever calc is False. But as stated earlier, the dataset is very large and the calc column contains very few True values, so computing diff() over every row would be mostly wasted work.
I have tried the following:
df['pDiff'] = df[df['calc']==True]['pVal'].diff()
But it gives incorrect results, because the difference is calculated between consecutive rows with calc==True rather than against the immediately preceding row. In our example, the difference for row 6 is computed between rows 6 and 4 (0.6 - 0.39 = 0.21), instead of the expected 0.34 between rows 6 and 5. The difference for row 4 remains NaN, since it is the first row with calc==True.
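The behaviour described above can be reproduced on the example data. The sketch below uses the equivalent `.loc` form of the filter. The variable names `filtered` and `attempt` are only for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "pVal": [0.17, 0.31, 0.46, 0.39, 0.26, 0.60],
        "calc": [False, False, False, True, False, True],
    },
    index=range(1, 7),
)

# Filtering first leaves only rows 4 and 6, so diff() compares those two
# rows with each other instead of with their real predecessors (3 and 5).
filtered = df.loc[df["calc"], "pVal"]
attempt = filtered.diff()
print(attempt)
```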
I have the option to iterate through all the rows but that is too slow for me.
I need a solution that calculates and changes values for only those rows where calc contains True.
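For reference, here is a minimal sketch of one possible way to meet this requirement. It is not a tested solution for the real data, and it assumes that the rows are already in the desired order and that going through NumPy arrays is acceptable. The names `vals`, `pos` and `pdiff` are illustrative. The subtraction touches only the rows where calc is True, although allocating the NaN-filled result column is still proportional to the number of rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "pVal": [0.17, 0.31, 0.46, 0.39, 0.26, 0.60],
        "calc": [False, False, False, True, False, True],
    },
    index=range(1, 7),
)

vals = df["pVal"].to_numpy()

# Integer positions of the rows where calc is True.
pos = np.flatnonzero(df["calc"].to_numpy())
# Position 0 has no previous row, so it must stay NaN.
pos = pos[pos > 0]

# Start from an all-NaN column and fill in only the selected positions.
pdiff = np.full(len(df), np.nan)
pdiff[pos] = vals[pos] - vals[pos - 1]
df["pDiff"] = pdiff
print(df)
```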