3

Overview

The code below contains a numpy array clusters with values that are compared against each row of a pandas Dataframe using np.where. The SoFunc function returns rows where all conditions are True and takes the clusters array as input.

Question

I can loop through this array to compare each array element against the respective np.where conditions. How do I remove the requirement to loop but still get the same output?

I appreciate looping though numpy arrays is inefficient and want to improve this code. The actual dataset will be much larger.

Prepare the reproducible mock data

def genMockDataFrame(days,startPrice,colName,startDate,seed=None): 

    periods = days*24
    np.random.seed(seed)
    steps = np.random.normal(loc=0, scale=0.0018, size=periods)
    steps[0]=0
    P = startPrice+np.cumsum(steps)
    P = [round(i,4) for i in P]

    fxDF = pd.DataFrame({ 
        'ticker':np.repeat( [colName], periods ),
        'date':np.tile( pd.date_range(startDate, periods=periods, freq='H'), 1 ),
        'price':(P)})
    fxDF.index = pd.to_datetime(fxDF.date)
    fxDF = fxDF.price.resample('D').ohlc()
    fxDF.columns = [i.title() for i in fxDF.columns]
    return fxDF


def SoFunc(clust):
    #generate mock data
    df = genMockDataFrame(10,1.1904,'eurusd','19/3/2020',seed=157)
    df["Upper_Band"] = 1.1928
    df.loc["2020-03-27", "Upper_Band"] = 1.2118
    df.loc["2020-03-26", "Upper_Band"] = 1.2200
    df["Level"] = np.where((df["High"] >= clust)
                                      & (df["Low"] <= clust)
                                     & (df["High"] >= df["Upper_Band"] ),1,np.NaN
                                      )
    return df.dropna()

Loop through the clusters array

clusters = np.array([1.1929   , 1.2118 ])

l = []

for i in range(len(clusters)):
    l.append(SoFunc(clusters[i]))
    
pd.concat(l)

Output

              Open  High    Low    Close    Upper_Band  Level
date                        
2020-03-19  1.1904  1.1937  1.1832  1.1832  1.1928      1.0
2020-03-25  1.1939  1.1939  1.1864  1.1936  1.1928      1.0
2020-03-27  1.2118  1.2144  1.2039  1.2089  1.2118      1.0
4
  • I can not reproduce it, can you give minimal reproducible code? Commented Aug 28, 2021 at 14:13
  • 1
    You should be able to copy and paste and run this as is. It's data generated with numpy and pandas. What error are you getting? I'm away from my laptop but will check your errors when home. Commented Aug 28, 2021 at 14:20
  • SOrry you are correct, I mistook warnings for errors Commented Aug 28, 2021 at 14:22
  • Replace lines with df.loc["2020-03-27", "Upper_Band"] = 1.2118 Commented Aug 28, 2021 at 14:26

1 Answer 1

2
+100

(Edited based on @tdy's comment below)

pandas.merge allows you to make len(clusters) copies of your dataframe and then pare it down to according to the conditions in your SoFunc function.

The cross merge creates a dataframe with a copy of df for each record in clusters_df. The overall result ought to be faster for large dataframes than the loop-based approach, provided you have enough memory to temporarily accommodate the merged dataframe (if not, the operation may spill over onto page / swap and slow down drastically).

import numpy as np
import pandas as pd

def genMockDataFrame(days,startPrice,colName,startDate,seed=None): 
    ''' identical to the example provided '''

    periods = days*24
    np.random.seed(seed)
    steps = np.random.normal(loc=0, scale=0.0018, size=periods)
    steps[0]=0
    P = startPrice+np.cumsum(steps)
    P = [round(i,4) for i in P]

    fxDF = pd.DataFrame({ 
        'ticker':np.repeat( [colName], periods ),
        'date':np.tile( pd.date_range(startDate, periods=periods, freq='H'), 1 ),
        'price':(P)})
    fxDF.index = pd.to_datetime(fxDF.date)
    fxDF = fxDF.price.resample('D').ohlc()
    fxDF.columns = [i.title() for i in fxDF.columns]
    return fxDF
    
# create the base dataframe according to the former SoFunc
df = genMockDataFrame(10,1.1904,'eurusd','19/3/2020',seed=157)
df["Upper_Band"] = 1.1928
df.loc["2020-03-27"]["Upper_Band"] = 1.2118
df.loc["2020-03-26"]["Upper_Band"] = 1.2200

# create a df out of the cluster array
clusters = np.array([1.1929   , 1.2118 ])
clusters_df = pd.DataFrame({"clust": clusters})

# perform the merge, then filter and finally clean up
result_df = (
    pd
    .merge(df.reset_index(), clusters_df, how="cross") # for each entry in cluster, make a copy of df
    .loc[lambda z: (z.Low <= z.clust) & (z.High >= z.clust) & (z.High >= z.Upper_Band), :] # filter the copies down
    .drop(columns=["clust"]) # not needed in result
    .assign(Level=1.0) # to match your result; not really needed
    .set_index("date") # bring back the old index
)

print(result_df)

I recommend inspecting just the result of pd.merge(df.reset_index(), clusters_df, how="cross") to see how it works.

Sign up to request clarification or add additional context in comments.

2 Comments

+1, but can avoid creating/dropping the temp columns by "cross" merging: pd.merge(df.reset_index(), clusters_df, how="cross").loc[...
Ah, it's great that they introduced the cross merge. Thanks for the heads up -- I've updated the answer accordingly.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.