Remove the requirement to loop through numpy array

Question

Overview

The code below contains a numpy array clusters with values that are compared against each row of a pandas Dataframe using np.where. The SoFunc function returns rows where all conditions are True and takes the clusters array as input.

Question

I can loop through this array to compare each array element against the respective np.where conditions. How do I remove the requirement to loop but still get the same output?

I appreciate looping though numpy arrays is inefficient and want to improve this code. The actual dataset will be much larger.

Prepare the reproducible mock data

def genMockDataFrame(days,startPrice,colName,startDate,seed=None): 

    periods = days*24
    np.random.seed(seed)
    steps = np.random.normal(loc=0, scale=0.0018, size=periods)
    steps[0]=0
    P = startPrice+np.cumsum(steps)
    P = [round(i,4) for i in P]

    fxDF = pd.DataFrame({ 
        'ticker':np.repeat( [colName], periods ),
        'date':np.tile( pd.date_range(startDate, periods=periods, freq='H'), 1 ),
        'price':(P)})
    fxDF.index = pd.to_datetime(fxDF.date)
    fxDF = fxDF.price.resample('D').ohlc()
    fxDF.columns = [i.title() for i in fxDF.columns]
    return fxDF


def SoFunc(clust):
    #generate mock data
    df = genMockDataFrame(10,1.1904,'eurusd','19/3/2020',seed=157)
    df["Upper_Band"] = 1.1928
    df.loc["2020-03-27", "Upper_Band"] = 1.2118
    df.loc["2020-03-26", "Upper_Band"] = 1.2200
    df["Level"] = np.where((df["High"] >= clust)
                                      & (df["Low"] <= clust)
                                     & (df["High"] >= df["Upper_Band"] ),1,np.NaN
                                      )
    return df.dropna()

Loop through the clusters array

clusters = np.array([1.1929   , 1.2118 ])

l = []

for i in range(len(clusters)):
    l.append(SoFunc(clusters[i]))
    
pd.concat(l)

Output

              Open  High    Low    Close    Upper_Band  Level
date                        
2020-03-19  1.1904  1.1937  1.1832  1.1832  1.1928      1.0
2020-03-25  1.1939  1.1939  1.1864  1.1936  1.1928      1.0
2020-03-27  1.2118  1.2144  1.2039  1.2089  1.2118      1.0

I can not reproduce it, can you give minimal reproducible code? — eroot163pi
– eroot163pi, Commented Aug 28, 2021 at 14:13
You should be able to copy and paste and run this as is. It's data generated with numpy and pandas. What error are you getting? I'm away from my laptop but will check your errors when home. — nipy
– nipy, Commented Aug 28, 2021 at 14:20
Replace lines with df.loc["2020-03-27", "Upper_Band"] = 1.2118 — eroot163pi
– eroot163pi, Commented Aug 28, 2021 at 14:26

score 2 · Accepted Answer · 2021-08-31 07:57:26Z

2

+100

(Edited based on @tdy's comment below)

pandas.merge allows you to make len(clusters) copies of your dataframe and then pare it down to according to the conditions in your SoFunc function.

The cross merge creates a dataframe with a copy of df for each record in clusters_df. The overall result ought to be faster for large dataframes than the loop-based approach, provided you have enough memory to temporarily accommodate the merged dataframe (if not, the operation may spill over onto page / swap and slow down drastically).

import numpy as np
import pandas as pd

def genMockDataFrame(days,startPrice,colName,startDate,seed=None): 
    ''' identical to the example provided '''

    periods = days*24
    np.random.seed(seed)
    steps = np.random.normal(loc=0, scale=0.0018, size=periods)
    steps[0]=0
    P = startPrice+np.cumsum(steps)
    P = [round(i,4) for i in P]

    fxDF = pd.DataFrame({ 
        'ticker':np.repeat( [colName], periods ),
        'date':np.tile( pd.date_range(startDate, periods=periods, freq='H'), 1 ),
        'price':(P)})
    fxDF.index = pd.to_datetime(fxDF.date)
    fxDF = fxDF.price.resample('D').ohlc()
    fxDF.columns = [i.title() for i in fxDF.columns]
    return fxDF
    
# create the base dataframe according to the former SoFunc
df = genMockDataFrame(10,1.1904,'eurusd','19/3/2020',seed=157)
df["Upper_Band"] = 1.1928
df.loc["2020-03-27"]["Upper_Band"] = 1.2118
df.loc["2020-03-26"]["Upper_Band"] = 1.2200

# create a df out of the cluster array
clusters = np.array([1.1929   , 1.2118 ])
clusters_df = pd.DataFrame({"clust": clusters})

# perform the merge, then filter and finally clean up
result_df = (
    pd
    .merge(df.reset_index(), clusters_df, how="cross") # for each entry in cluster, make a copy of df
    .loc[lambda z: (z.Low <= z.clust) & (z.High >= z.clust) & (z.High >= z.Upper_Band), :] # filter the copies down
    .drop(columns=["clust"]) # not needed in result
    .assign(Level=1.0) # to match your result; not really needed
    .set_index("date") # bring back the old index
)

print(result_df)

I recommend inspecting just the result of pd.merge(df.reset_index(), clusters_df, how="cross") to see how it works.

edited Aug 31, 2021 at 7:57

answered Aug 30, 2021 at 21:50

user1706918

Sign up to request clarification or add additional context in comments.

2 Comments

tdy Over a year ago

+1, but can avoid creating/dropping the temp columns by "cross" merging: pd.merge(df.reset_index(), clusters_df, how="cross").loc[...

user1706918 Over a year ago

Ah, it's great that they introduced the cross merge. Thanks for the heads up -- I've updated the answer accordingly.

Collectives™ on Stack Overflow

Remove the requirement to loop through numpy array

1 Answer 1

2 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

2 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related