
I have a time series DataFrame composed of daily rates in column A and the relative change from one day to the next in column B.

The DataFrame looks something like this:

                   IR       Shift
May/24/2019        5.9%     -
May/25/2019        6%       1.67%
May/26/2019        5.9%    -1.67%
May/27/2019        20.2%    292%
May/28/2019        20.5%    1.4%
May/29/2019        20%     -1.6%
May/30/2019        5.1%    -292%
May/31/2019        5.1%     0%

I would like to delete all values in column A that occur between large relative shifts (greater than +/- 50%).

So the above DataFrame should look like this:

                   IR       Shift
May/24/2019        5.9%     -
May/25/2019        6%       1.67%
May/26/2019        5.9%    -1.67%
May/27/2019        np.nan   292%
May/28/2019        np.nan   1.4%
May/29/2019        np.nan  -1.6%
May/30/2019        5.1%    -292%
May/31/2019        5.1%     0%

This is where I've got to so far. I would appreciate some help:

for i, j in df1.iterrows():
    if df1['Shift'][i] > .50:
        x = df1['IR'][i]
    if df1['Shift'][j] < -.50:
        y = df1['IR'][j]
    df1['IR'] = np.where(df1['Shift'].between(x, y), df1['Shift'], np.nan)

This raises: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

  • OK, thanks for the tip; I'm new to programming. How would I go about trying that? Commented Sep 27, 2019 at 16:17
  • @yatu, I am not clear on what you mean by "keep a temp variable, and set to NaN whenever a change relative to the last valid sample is > 50%. Compare current sample to the last valid value". Could you provide an example? Commented Sep 30, 2019 at 1:20
  • What is relative shift? Commented Sep 30, 2019 at 13:58
  • @rprakash, the change of IR (column A) from one day to the next. Commented Sep 30, 2019 at 13:59
  • @ALollz, indeed, you are correct. In fact, in my timeseries data, spikes (> 50%) occur throughout, so all the values in between those big shifts will need to be deleted. I will surely try out your code once I'm back at the house, but it seems like you understand the problem. Commented Oct 4, 2019 at 18:17
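The approach quoted in the comments above (keep a temp variable and compare each sample to the last valid value, not the previous row) could be sketched roughly like this, on hypothetical data mirroring the question's IR column:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data mirroring the question's IR column
ir = pd.Series([5.9, 6.0, 5.9, 20.2, 20.5, 20.0, 5.1, 5.1])

last_valid = ir.iloc[0]   # the last sample we decided to keep
out = [last_valid]
for cur in ir.iloc[1:]:
    # change relative to the last *valid* value, not simply the previous row
    if abs(cur - last_valid) / last_valid > 0.5:
        out.append(np.nan)    # inside a spike: mask it
    else:
        out.append(cur)
        last_valid = cur      # only update on samples we keep
# out is now [5.9, 6.0, 5.9, nan, nan, nan, 5.1, 5.1]
```

Because `last_valid` is frozen during the spike, the series "recovers" as soon as a sample returns close to the pre-spike level, without needing to pair up positive and negative shifts.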

7 Answers


We can locate the rows between pairs ([1st-2nd), [3rd-4th), ...) of outlier values and then mask the entire DataFrame at once.

Setup

import pandas as pd
import numpy as np

df = pd.read_clipboard()
df = df.apply(lambda x: pd.to_numeric(x.str.replace('%', ''), errors='coerce'))

               IR   Shift
May/24/2019   5.9     NaN
May/25/2019   6.0    1.67
May/26/2019   5.9   -1.67
May/27/2019  20.2  292.00
May/28/2019  20.5    1.40
May/29/2019  20.0   -1.60
May/30/2019   5.1 -292.00
May/31/2019   5.1    0.00

Code

# Locate the extremal values
s = df.Shift.lt(-50) | df.Shift.gt(50)

# Get the indices between consecutive pairs. 
# This doesn't mask 2nd outlier, which matches your output
m = s.cumsum() % 2 == 1

df.loc[m, 'IR'] = np.nan
#              IR   Shift
#May/24/2019  5.9     NaN
#May/25/2019  6.0    1.67
#May/26/2019  5.9   -1.67
#May/27/2019  NaN  292.00
#May/28/2019  NaN    1.40
#May/29/2019  NaN   -1.60
#May/30/2019  5.1 -292.00
#May/31/2019  5.1    0.00

Here I've added a few more rows to show how this will behave in the case of multiple spikes. IR_modified is how IR will be masked with the above logic.

               IR   Shift  IR_modified
May/24/2019   5.9     NaN          5.9
May/25/2019   6.0    1.67          6.0
May/26/2019   5.9   -1.67          5.9
May/27/2019  20.2  292.00          NaN
May/28/2019  20.5    1.40          NaN
May/29/2019  20.0   -1.60          NaN
May/30/2019   5.1 -292.00          5.1
May/31/2019   5.1    0.00          5.1
June/1/2019   7.0  415.00          NaN
June/2/2019  17.0   15.00          NaN
June/3/2019  27.0   12.00          NaN
June/4/2019  17.0  315.00         17.0
June/5/2019   7.0  -12.00          7.0



You can also use the np.where function from NumPy as follows:

import pandas as pd
import numpy as np
from datetime import datetime

df = pd.DataFrame({'Date': [datetime(2019, 5, 24), datetime(2019, 5, 25), datetime(2019, 5, 26), datetime(2019, 5, 27), datetime(2019, 5, 28), datetime(2019, 5, 29), datetime(2019, 5, 30)],
                   'IR': [0.059, 0.06, 0.059, 0.202, 0.205, 0.2, 0.051],
                   'Shift': [np.nan, 0.0167, -0.0167, 2.92, 0.014, -0.016, -2.92]})

df['IR'] = np.where(df['Shift'].between(df['Shift']*0.5, df['Shift']*1.5), df['Shift'], np.nan)

In [8]: df                                                                                                                                                                                                                               
Out[8]: 
        Date      IR   Shift
0 2019-05-24     NaN     NaN
1 2019-05-25  0.0167  0.0167
2 2019-05-26     NaN -0.0167
3 2019-05-27  2.9200  2.9200
4 2019-05-28  0.0140  0.0140
5 2019-05-29     NaN -0.0160
6 2019-05-30     NaN -2.9200

1 Comment

Thanks Olel. No dice. For one, the above code removes data points with relative changes below .50. Also, the main goal is to remove a cluster of data in the timeseries where rates spike up for some time and then spike down, in almost the same relative fashion as the initial spike up.
0

Here's an attempt. There could be more "proper" ways to do it but I'm not familiar with all the pandas built-in functions.

import numpy as np
import pandas as pd
from datetime import datetime

df = pd.DataFrame({'Date': [datetime(2019, 5, 24), datetime(2019, 5, 25), datetime(2019, 5, 26), datetime(2019, 5, 27), datetime(2019, 5, 28), datetime(2019, 5, 29), datetime(2019, 5, 30)],
                   'IR': [0.059, 0.06, 0.059, 0.202, 0.205, 0.2, 0.051],
                   'Shift': [np.nan, 0.0167, -0.0167, 2.92, 0.014, -0.016, -2.92]})

>>>df
        Date     IR   Shift
0 2019-05-24  0.059     NaN
1 2019-05-25  0.060  0.0167
2 2019-05-26  0.059 -0.0167
3 2019-05-27  0.202  2.9200
4 2019-05-28  0.205  0.0140
5 2019-05-29  0.200 -0.0160
6 2019-05-30  0.051 -2.9200

df['IR'] = [np.nan if abs(y - z) > 0.5 else x for x, y, z in zip(df['IR'], df['Shift'], df['Shift'].shift(1))]
>>>df
        Date     IR   Shift
0 2019-05-24  0.059     NaN
1 2019-05-25  0.060  0.0167
2 2019-05-26  0.059 -0.0167
3 2019-05-27    NaN  2.9200
4 2019-05-28    NaN  0.0140
5 2019-05-29  0.200 -0.0160
6 2019-05-30    NaN -2.9200

2 Comments

Thank you, let me digest the above. The first run didn't seem to have an effect.
Perhaps someone new to coding might find a regular loop easier than this unreadable list comprehension?
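For readers newer to Python, the comprehension above can be unrolled into a plain loop with the same logic (a sketch on hypothetical data matching this answer's):

```python
import numpy as np
import pandas as pd

# Hypothetical data matching the answer above
ir = pd.Series([0.059, 0.06, 0.059, 0.202, 0.205, 0.2, 0.051])
shift = pd.Series([np.nan, 0.0167, -0.0167, 2.92, 0.014, -0.016, -2.92])

prev_shift = shift.shift(1)  # previous day's Shift, aligned by position
result = []
for x, y, z in zip(ir, shift, prev_shift):
    # mask IR when Shift moved by more than 0.5 versus the previous day
    if abs(y - z) > 0.5:
        result.append(np.nan)
    else:
        result.append(x)
# result: [0.059, 0.06, 0.059, nan, nan, 0.2, nan]
```

Note that the first two rows are kept because `abs(nan - z) > 0.5` evaluates to False, the same behavior as the comprehension.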
0

Using df.at to access a single value for a row/column label pair.

import numpy as np
import pandas as pd
from datetime import datetime

df = pd.DataFrame({'Date': [datetime(2019, 5, 24), datetime(2019, 5, 25), datetime(2019, 5, 26), datetime(2019, 5, 27), datetime(2019, 5, 28), datetime(2019, 5, 29), datetime(2019, 5, 30), datetime(2019, 5, 31)],
                   'IR': [5.9, 6, 5.9, 20.2, 20.5, 20, 5.1, 5.1],
                   'Shift': [np.nan, 1.67, -1.67, 292, 1.4, -1.6, -292, 0]})

print("DataFrame Before :")
print(df)

count = 1
while (count < len(df.index)):
    if (abs(df.at[count-1, 'Shift'] - df.at[count, 'Shift']) >= 50):
        df.at[count, 'IR'] = np.nan
    count = count + 1

print("DataFrame After :")
print(df)

Output of program:

DataFrame Before :
        Date    IR   Shift
0 2019-05-24   5.9     NaN
1 2019-05-25   6.0    1.67
2 2019-05-26   5.9   -1.67
3 2019-05-27  20.2  292.00
4 2019-05-28  20.5    1.40
5 2019-05-29  20.0   -1.60
6 2019-05-30   5.1 -292.00
7 2019-05-31   5.1    0.00

DataFrame After :
        Date    IR   Shift
0 2019-05-24   5.9     NaN
1 2019-05-25   6.0    1.67
2 2019-05-26   5.9   -1.67
3 2019-05-27   NaN  292.00
4 2019-05-28   NaN    1.40
5 2019-05-29  20.0   -1.60
6 2019-05-30   NaN -292.00
7 2019-05-31   NaN    0.00

3 Comments

Row 5's IR in your "DataFrame After" should also be NaN, whereas rows 6 and 7 should not be NaN. Imagine a stock that spikes after an announcement and then drops back down to its normal level a few months later. The objective here is to remove the prices that occurred during that elevated period... so rows 6 and 7 represent the normal period, and rows 3/4/5 are the elevated period. The price hike in IR and the return to "normal" both occur with similar magnitude, 292%, although in opposite directions.
Therefore, how do we set the code to trigger once >50% is breached, and stop when a similar but opposite movement of -50% occurs, converting to NaN all values in between?
Also, I'm not sure why you're taking the difference to determine if it's above 50%, given that the Shift column already tells us how much the change from one day to the next is... and following the "spike", the day-to-day Shift will be minuscule except when IR drops back down.
0

As per your description of triggering this on any large shift, positive or negative, you could do this:

import pandas as pd
import numpy as np
from datetime import datetime

df = pd.DataFrame({'Date':[datetime(2019,5,24), datetime(2019,5,25), datetime(2019,5,26), datetime(2019,5,27), datetime(2019,5,28),datetime(2019,5,29),datetime(2019,5,30)], 'IR':[0.059,0.06,0.059,0.202, 0.205, 0.2, 0.051], 'Shift':[np.nan, 0.0167, -0.0167, 2.92, 0.014, -0.016, -2.92]})

df.loc[(abs(df.Shift) > .5).cumsum() % 2 == 1, 'IR'] = np.nan

        Date     IR   Shift
0 2019-05-24  0.059     NaN
1 2019-05-25  0.060  0.0167
2 2019-05-26  0.059 -0.0167
3 2019-05-27    NaN  2.9200
4 2019-05-28    NaN  0.0140
5 2019-05-29    NaN -0.0160
6 2019-05-30  0.051 -2.9200

Steps:

  • abs(df.Shift) > .5: Find shifts above +/- 50%

  • .cumsum(): Gives unique values to each period, where the odd numbered periods are the ones we want to omit.

  • % 2 == 1: Checks which rows have odd numbers for cumsum().

Note: This does not work if what you want is to constrain this so that every positive spike needs to be followed by a negative spike, or vice versa.
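To make the steps concrete, here is a small sketch (on a Shift series matching this answer's data) showing the intermediate values:

```python
import numpy as np
import pandas as pd

shift = pd.Series([np.nan, 0.0167, -0.0167, 2.92, 0.014, -0.016, -2.92])

spike = abs(shift) > .5    # True at each large move (NaN compares False)
period = spike.cumsum()    # 0 before the first spike, 1 during, 2 after
mask = period % 2 == 1     # odd period numbers are inside a spike

# period: 0 0 0 1 1 1 2  ->  mask: F F F T T T F
```

Each True in `spike` increments the counter, so the rows from one spike up to (but not including) the next land in the same period number, and odd periods are exactly the elevated stretches.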


0

I was not sure about your Shift column, so I calculated it again. Does this work for you?

import pandas as pd
import numpy as np

df.drop(columns=['Shift'], inplace=True)  ## recalculated via the method below
df['nextval'] = df['IR'].shift(periods=1)  ## previous day's IR

def shift(current, previous):
    return (current - previous) / previous * 100

indexlist = []  ## to save indices that will be set to null
prior = 0       ## temporary variable to store the value prior to a peak
flag = False

for index, row in df.iterrows():
    if index == 0:  ## skip the first row of data
        continue

    if flag == False and shift(row['IR'], row['nextval']) > 50:  ## check for the start of a peak
        prior = row['nextval']
        indexlist.append(index)
        flag = True
        continue

    if flag == True:  ## checking how long the peak lasts
        if shift(row['IR'], prior) > 50:
            indexlist.append(index)

df.loc[df.index.isin(indexlist), 'IR'] = np.nan  ## replacing with NaN

Output on print(df)

          date   IR  nextval
0  May/24/2019  5.9      NaN
1  May/25/2019  6.0      5.9
2  May/26/2019  5.9      6.0
3  May/27/2019  NaN      5.9
4  May/28/2019  NaN     20.2
5  May/29/2019  NaN     20.5
6  May/30/2019  5.1     20.0
7  May/31/2019  5.1      5.1


0

df.loc[df['Shift']>0.5,'IR'] = np.nan

1 Comment

Please provide some explanation of your answer.
