I have a data frame that looks like the following (the last column, first_final, shows the result I want to get to):

timestamp                 first_actual  first_required  location    first_initial_pass  first_final
2019-05-03T06:00:00.000Z    3.125       0.000           10B          1.0                1.0 
2019-05-03T18:00:00.000Z    2.975       0.000           10B          1.0                1.0 
2019-05-04T06:00:00.000Z    2.825       0.000           10B          **0.5              1.0**   
2019-05-04T18:00:00.000Z    2.675       0.000           10B          0.0                0.0 
2019-05-05T06:00:00.000Z    2.525       0.000           10B          **0.5              0.0**   

It's sorted by location and timestamp. The column 'first_initial_pass' takes one of three possible values (0, 0.5, 1) based on some rules using columns 'first_actual' and 'first_required'. I am trying to generate a new column (shown here as first_final) that copies the value from column 'first_initial_pass', except where that value is 0.5.

Where the value of first_initial_pass is 0.5, it needs to change to either 0 or 1 in column 'first_final'. It should change to 1 iff both of the two rows above the current row have a value of 1; otherwise it should change to 0 (the changes I want to see are marked with asterisks in the data frame).

I am trying to use the shift function to specify these conditions as follows:

data_sorted.loc[( (data_sorted[data_sorted['first_initial_pass'] == 0.5]) &
                  (data_sorted['first_initial_pass'].shift(1) == 1) &
                  (data_sorted['first_initial_pass'].shift(2) == 1) ), 'first_final'] = 1

However, I get the following error: "TypeError: cannot compare a dtyped [float64] array with a scalar of type [bool]". So I then tried leaving the boolean piece out, like this:

data_sorted.loc[( (data_sorted['first_initial_pass'].shift(1) == 1) &
                  (data_sorted['first_initial_pass'].shift(2) == 1) ), 'first_final'] = 1

However, the rows then do not change the way I need them to (that is, only the rows that have 0.5 as the value in the first_initial_pass column).

I would appreciate any insight into what corrections I can make.

  • Check out my answer below and let me know if it works for you Commented Mar 10, 2020 at 14:00

1 Answer

You could use np.where() and assign first_final as 0 or 1 by putting df.shift() in the np.where() condition.

np.where takes the condition as its first argument, the value to use where the condition is true as the second, and the value to use where it is false as the third. Something like this:

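import numpy as np

# df is the sorted frame (data_sorted in the question).
# Keep every value other than 0.5; turn 0.5 into 1 only if both preceding rows are 1.
df['first_final'] = np.where((df['first_initial_pass']!=0.5), df['first_initial_pass'],
                             np.where((df['first_initial_pass'].shift(1)==1.0)&
                                      (df['first_initial_pass'].shift(2)==1.0),
                                      1, 0))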

Output:

                  timestamp  first_actual  ...  first_initial_pass first_final
0  2019-05-03T06:00:00.000Z         3.125  ...                 1.0         1.0
1  2019-05-03T18:00:00.000Z         2.975  ...                 1.0         1.0
2  2019-05-04T06:00:00.000Z         2.825  ...                 0.5         1.0
3  2019-05-04T18:00:00.000Z         2.675  ...                 0.0         0.0
4  2019-05-05T06:00:00.000Z         2.525  ...                 0.5         0.0
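
For comparison, the .loc attempt from the question can be made to work too. The TypeError appears to come from the first term: data_sorted[data_sorted['first_initial_pass'] == 0.5] selects rows and returns a DataFrame, not a boolean mask, so it cannot be combined with the other conditions using &. A minimal sketch, assuming a small frame rebuilt by hand from the question's sample (only the columns needed here) and the helper names is_half and prev_two_ok, which are made up for illustration:

```python
import pandas as pd

# Sample rebuilt by hand from the question; other columns omitted.
data_sorted = pd.DataFrame({
    "location": ["10B"] * 5,
    "first_initial_pass": [1.0, 1.0, 0.5, 0.0, 0.5],
})

s = data_sorted["first_initial_pass"]

# Boolean masks: compare the Series itself, do not index the frame with it.
is_half = s == 0.5
prev_two_ok = (s.shift(1) == 1) & (s.shift(2) == 1)

# Start from a copy, default every 0.5 to 0, then promote qualifying rows to 1.
data_sorted["first_final"] = s.copy()
data_sorted.loc[is_half, "first_final"] = 0.0
data_sorted.loc[is_half & prev_two_ok, "first_final"] = 1.0

print(data_sorted)
```

On this sample the result should match the np.where version: only the rows holding 0.5 are touched.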

Note that you have to be careful about the first two rows: if the value there is 0.5, first_final will be 0, because df.shift() yields NaN where there are not two preceding rows.
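
A related edge case: the frame is sorted by location and then timestamp, so a plain shift() also looks back into the previous location's rows at each location boundary. If the look-back is meant to stay within one location (an assumption, the question does not say so explicitly), the shifts can be taken per group instead. A sketch using a made-up second location "11A" and invented values:

```python
import numpy as np
import pandas as pd

# Invented data: "11A" starts with 0.5 right after two 1.0 rows of "10B".
df = pd.DataFrame({
    "location": ["10B", "10B", "11A", "11A", "11A"],
    "first_initial_pass": [1.0, 1.0, 0.5, 1.0, 0.5],
})

col = df["first_initial_pass"]

# Plain shift: the look-back crosses the 10B -> 11A boundary.
plain = np.where(col != 0.5, col,
                 np.where((col.shift(1) == 1.0) & (col.shift(2) == 1.0), 1.0, 0.0))

# Grouped shift: the first rows of each location see NaN instead.
grouped = df.groupby("location")["first_initial_pass"]
prev1, prev2 = grouped.shift(1), grouped.shift(2)
df["first_final"] = np.where(col != 0.5, col,
                             np.where((prev1 == 1.0) & (prev2 == 1.0), 1.0, 0.0))

print(plain.tolist())
print(df["first_final"].tolist())
```

The two versions differ only on the first row of "11A": the plain shift promotes it to 1 based on rows from "10B", while the grouped shift leaves it at 0.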


1 Comment

Very helpful! Appreciate your solution @davidbilla
