I would like to speed up a loop over a python Pandas Dataframe. Unfortunately, decades of using low-level languages mean I often struggle to find prepackaged solutions. Note: data is private, but I will see if I can fabricate something and add it into an edit if it helps.
The code has three pandas dataframes: drugUseDF, tempDF, which holds the data, and tempDrugUse, which stores what's been retrieved. I look over every row of tempDF (there will be several million rows), retrieving the prodcode identified from each row and then using that to retrieve the corresponding value from use1 column in the drugUseDF. I've added comments to help navigate.
This is the structure of the dataframes:
tempDF
patid eventdate consid prodcode issueseq
0 20001 21/04/2005 2728 85 0
1 25001 21/10/2000 3939 40 0
2 25001 21/02/2001 3950 37 0
drugUseDF
index prodcode ... use1 use2
0 171 479 ... diabetes NaN
1 172 9105 ... diabetes NaN
2 173 5174 ... diabetes NaN
tempDrugUse
use1
0 NaN
1 NaN
2 NaN
This is the code:
dfList = []
# if the drug dataframe contains the use1 column. Can this be improved?
if sum(drugUseDF.columns.isin(["use1"])) == 1:
#predine dataframe where we will store the results to be the same length as the main data dataframe.
tempDrugUse = DataFrame(data=None, index=range(len(tempDF.index)), dtype=np.str, columns=["use1"])
#go through each row of the main data dataframe.
for ind in range(len(tempDF)):
#retrieve the prodcode from the *ind* row of the main data dataframe
prodcodeStr = tempDF.iloc[ind]["prodcode"]
#get the corresponding value from the use1 column matching the prodcode column
useStr = drugUseDF[drugUseDF.loc[:, "prodcode"] == prodcodeStr]["use1"].values[0]
#update the storing dataframe
tempDrugUse.iloc[ind]["use1"] = useStr
print("[DEBUG] End of loop for use1")
dfList.append(tempDrugUse)
The order of the data matters. I can't retrieve multiple rows by matching the prodcode because each row has a date column. Retrieving multiple rows and adding them to the tempDrugUse dataframe could mean that the rows are no longer in chronological date order.