I am trying to add a new column to my dataframe that depends on values that may or may not exist in previous rows. My dataframe looks like this:

index  id  timestamp  sequence_index value  prev_seq_index
0      10  1          0              5      0
1      10  1          1              1      2
2      10  1          2              2      0
3      10  2          0              9      0
4      10  2          1              10     1
5      10  2          2              3      1
6      11  2          0              42     1
7      11  2          1              13     0

Note: there is no relation between index and sequence_index, index is just a counter.

What I want to do is add a column prev_value that holds the value of the most recent earlier row (strictly smaller timestamp) with the same id and with sequence_index == prev_seq_index. If no such previous row exists, a default value should be used; for the purpose of this question I will use a default of -1.

index  id  timestamp  sequence_index value  prev_seq_index  prev_value
0      10  1          0              5      0               -1
1      10  1          1              1      2               -1
2      10  1          2              2      0               -1
3      10  2          0              9      0               5  # value from df[index == 0]
4      10  2          1              10     1               1  # value from df[index == 1]
5      10  2          2              3      1               1  # value from df[index == 1]
6      11  2          0              42     1               -1
7      11  2          1              13     0               -1

My current solution is a brute-force loop, which is very slow, and I was wondering if there is a faster way:

import numpy as np

prev_values = np.zeros(len(df))
for i, (index, row) in enumerate(df.iterrows()):
    # filter for previous rows with the same id and desired sequence index
    tmp_df = df[(df.id == row.id) & (df.timestamp < row.timestamp)
                & (df.sequence_index == row.prev_seq_index)]
    if len(tmp_df) > 0:
        # get the scalar value from the most recent row
        prev_value = tmp_df.loc[tmp_df.timestamp.idxmax(), 'value']
    else:
        prev_value = -1
    prev_values[i] = prev_value

df['prev_value'] = prev_values
  • Off the top of my head, I cannot think of a faster algorithm. But, you can try using itertuples instead of iterrows for a pretty decent speed boost! Commented Sep 2, 2020 at 4:14
  • For row label 5, shouldn't the prev_seq_index be 2? Or did I misread? Commented Sep 2, 2020 at 4:25
  • prev_seq_index indicates which previous sequence matches the current according to info not displayed here, it does not have to match the same index. Commented Sep 2, 2020 at 19:03

1 Answer


I would suggest tackling this via a left join. First, however, you'll need to make sure that your data doesn't contain duplicates. Then create a dataframe of the most recent timestamps and grab the corresponding values.

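agg = df.groupby(['id', 'sequence_index'], as_index=False).agg({'timestamp': 'max'})

agg = pd.merge(agg, df[['id', 'timestamp', 'sequence_index', 'value']], how='inner', on=['id', 'timestamp', 'sequence_index'])

agg = agg.rename(columns={'value': 'prev_value'})

Now you can join the data back onto itself, matching prev_seq_index against sequence_index within the same id:

df = pd.merge(df, agg[['id', 'sequence_index', 'prev_value']], how='left', left_on=['id', 'prev_seq_index'], right_on=['id', 'sequence_index'], suffixes=('', '_agg'))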

Now you can deal with the NaN values:

df['prev_value'] = df['prev_value'].fillna(-1)
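
Note that the join above looks up the latest timestamp per group across the whole dataframe, so it does not by itself restrict the match to rows that come strictly before the current one. One possible way to get that behaviour, offered here as a sketch and not as the approach of the answer above, is pd.merge_asof with allow_exact_matches=False. The names lookup, left and result are illustrative, and the sample data is copied from the question:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id":             [10, 10, 10, 10, 10, 10, 11, 11],
    "timestamp":      [1, 1, 1, 2, 2, 2, 2, 2],
    "sequence_index": [0, 1, 2, 0, 1, 2, 0, 1],
    "value":          [5, 1, 2, 9, 10, 3, 42, 13],
    "prev_seq_index": [0, 2, 0, 0, 1, 1, 1, 0],
})

# Lookup table: each row's value, keyed by the sequence_index it can be
# referenced under. merge_asof requires both sides sorted by the "on" key.
lookup = (
    df[["id", "sequence_index", "timestamp", "value"]]
    .rename(columns={"sequence_index": "prev_seq_index", "value": "prev_value"})
    .sort_values("timestamp")
)

# Keep the original row labels in an "index" column so the order can be restored.
left = df.reset_index().sort_values("timestamp")

# For each row, take the latest lookup row with the same id and
# prev_seq_index whose timestamp is strictly smaller.
merged = pd.merge_asof(
    left,
    lookup,
    on="timestamp",
    by=["id", "prev_seq_index"],
    direction="backward",
    allow_exact_matches=False,
)

result = merged.sort_values("index").set_index("index")
# Rows without an earlier match get the default of -1.
result["prev_value"] = result["prev_value"].fillna(-1).astype(int)
print(result)
```

On the sample data this should reproduce the prev_value column from the question. It assumes timestamp is numeric or datetime-like and that id and prev_seq_index have the same dtype on both sides of the merge.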