
I am building a backtesting project in Python using Pandas.

I have:

  • A large tick / 1-minute level DataFrame (df) with full market data.

  • A 15-minute interval DataFrame (df_15) created from it using resampling.

  • In my strategy, I calculate entry points from the 15-minute candles, but I need to check the stop loss trigger from the underlying 1-minute data.

    So during the backtest, I keep "going back" to the original DataFrame to find the exact stop loss hit.

This works, but is it efficient to repeatedly go back to the full df for each row of df_15 to check stop loss conditions?

Would it be better to pre-compute stop loss conditions when generating the 15-minute candles (e.g., by merging/joining df with df_15 beforehand)?

For large datasets (millions of rows), what is the most efficient pattern in Pandas:

  • per-row lookups into the original df; or
  • vectorized/pre-merged stop loss computations?

Here's a simplified version of my code:

import pandas as pd

df_15 = df.resample('15min').agg({
    'Open': 'first',
    'High': 'max',
    'Low': 'min',
    'Close': 'last'
})

results = []

for idx, row in df_15.iterrows():
    entry_price = row['Open']
    stop_loss = entry_price - 10   # example fixed SL

    # Go back to the original df to check if the SL was hit in this 15-min interval
    interval_slice = df[(df.index >= idx) & (df.index < idx + pd.Timedelta('15min'))]

    sl_hit = interval_slice[interval_slice['Low'] <= stop_loss]

    if not sl_hit.empty:
        results.append({"time": idx, "stop_loss_hit": True})
    else:
        results.append({"time": idx, "stop_loss_hit": False})

out_df = pd.DataFrame(results)
print(out_df)

What I tried:

The above iterative approach works fine for small data.

But when running on large datasets, performance slows down noticeably because of repeated slicing into the original df.

I want to know the best practice: keep the logic as-is, or restructure my data so stop loss checks are pre-computed / vectorized.
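For reference, here is a minimal sketch of the second option (fully vectorized, no per-row lookups). The data is made up, since the real df isn't shown; the Open/Low column names and the fixed entry-minus-10 stop are assumptions carried over from the code above:

```python
import numpy as np
import pandas as pd

# Hypothetical 1-minute data; the question's real df is not shown
idx = pd.date_range("2024-01-01", periods=60, freq="min")
rng = np.random.default_rng(0)
close = 100 + rng.normal(0, 1, len(idx)).cumsum()
df = pd.DataFrame({"Open": close, "Low": close - 5}, index=idx)

# Label every 1-minute row with its 15-minute bucket
bucket = df.index.floor("15min").rename("time")

# Entry = first Open of each bucket; SL = entry - 10 (the fixed offset from above)
entry = df.groupby(bucket)["Open"].transform("first")
hit_per_row = df["Low"] <= entry - 10

# One flag per 15-minute candle: was the SL hit anywhere inside it?
out_df = hit_per_row.groupby(bucket).any().rename("stop_loss_hit").reset_index()
print(out_df)
```

This replaces the per-interval slicing with two grouped passes over the 1-minute frame, which is the usual way to avoid quadratic-ish repeated boolean masking on large data.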

  • I would avoid looping over DataFrames if at all possible. While it might not be immediately obvious how to vectorize the function, there is almost always a way, and it will almost always improve performance. Commented Oct 2 at 14:15
  • It's hard to say without a reproducible example, but my guess is that you will either need to merge the two DataFrames using merge_asof, at which point you can perform your checks on the primary DataFrame, or you may be able to use pandas.DataFrame.rolling to just do it all at once. Either way, if you provide a reproducible example on a small subset of data, it would be possible to help you solve the problem. Commented Oct 2 at 14:20
  • Shorter (without if/else): {"stop_loss_hit": not sl_hit.empty} Commented Oct 2 at 14:28
  • The stocks example in How to make good reproducible pandas examples contains useful advice for sharing an example that we can run. Commented Oct 2 at 17:14
  • Actually the main thing is that at first I filtered the DataFrame, because I wanted an entry from those records only; but I also wanted to keep the unfiltered df for comparisons. It works for me now that I keep both DataFrames, but I think that takes time, so I wanted to confirm whether that is a good way or not. Commented Oct 3 at 3:13
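The merge_asof route suggested in the comments above can be sketched like this (all data here is made up, since no reproducible example was posted; the column names are assumptions):

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for the question's df (1-minute) and df_15 (15-minute)
idx = pd.date_range("2024-01-01", periods=60, freq="min")
df = pd.DataFrame({"Low": 100 - 0.3 * np.arange(60)}, index=idx)
df_15 = pd.DataFrame(
    {"Open": [100.0, 99.0, 98.0, 97.0]},
    index=pd.date_range("2024-01-01", periods=4, freq="15min"),
)

# Attach each 1-minute row to the most recent 15-minute entry (backward match),
# pre-computing the stop level on the 15-minute side
merged = pd.merge_asof(
    df.reset_index(names="time"),
    df_15.assign(stop_loss=df_15["Open"] - 10).reset_index(names="time"),
    on="time",
    direction="backward",
)

# Vectorized per-row check, then one flag per 15-minute candle
merged["sl_hit"] = merged["Low"] <= merged["stop_loss"]
print(merged.groupby(merged["time"].dt.floor("15min"))["sl_hit"].any())
```

After the merge, every stop-loss check is a single boolean column operation on the 1-minute frame, which matches the "pre-merged" pattern the question asks about.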

1 Answer


Always try to avoid iterative loops when working with DataFrames: doing so throws away all the advantages of a DataFrame.

Instead, you could use something like the code below. I don't create an additional DataFrame; instead, I add the resample results as columns to the existing DataFrame.

import datetime

import numpy as np
import pandas as pd

starttime = datetime.datetime.now()
print(starttime)

index = pd.date_range('1/1/2000', periods=99000, freq='min')
# Create a random DataFrame, since there's no example to work with
df = pd.DataFrame(np.random.randint(0, 100, 99000), index=index, columns=['Values'])

# Add the first/max/min/last of each 15-minute bucket as columns
df[['Open', 'High', 'Low', 'Close']] = df.resample('15min').agg([
    'first',
    'max',
    'min',
    'last'
])
df = df.ffill()   # fill the minutes between the 15-minute labels

# Check where the condition holds and store the result in an additional column
df['stop_loss_hit'] = df['Values'] <= (df['Open'] - 10)

endtime = datetime.datetime.now()
print(endtime - starttime)
print(df.to_string())
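Note that stop_loss_hit above is a flag per minute, while the question's results list holds one flag per 15-minute candle. One way to collapse the per-minute flags (sketched on a made-up frame with the same column names) is to group them by their 15-minute bucket:

```python
import numpy as np
import pandas as pd

# Made-up per-minute frame with the same columns as above
idx = pd.date_range("2000-01-01", periods=45, freq="min")
df = pd.DataFrame({"Values": np.arange(45.0), "Open": 30.0}, index=idx)
df["stop_loss_hit"] = df["Values"] <= df["Open"] - 10

# One flag per 15-minute candle: True if any minute inside it triggered
per_candle = df["stop_loss_hit"].groupby(df.index.floor("15min")).any()
print(per_candle)
```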

Btw., next time build a minimal reproducible pandas example, as mentioned above. It's always kind of a hassle to understand these non-reproducible snippets, and furthermore to verify that the proposed solution is what you are looking for.

Greetings Tetris


4 Comments

For everyone coming to this comment: I have one doubt. I always feel stuck in pandas; I think of everything from a Python point of view, and I feel like I can't write pandas code by myself. Can anyone help me?
I like the former approach with forward fill. It looks promising, and I also agree that iterrows() is not optimal. Perhaps another approach would be to merge the two columns (one from each DataFrame: 'open' and the condition) to create a new DataFrame, then use .unstack() to form a hierarchical index, then .ffill('opens') and return df.loc[df['open'] <= df[val] + 10].
@its m, with pandas you have to get away from the usual programming mindset and think more in a SQL way. So if you want to improve in pandas, one way would be to learn some basic SQL. At least, that's what helped me a lot.
Thanks for your concern! Yeah, I'm aware of basic SQL; I'll try to get better at it.
