I would like to speed up a loop over a pandas DataFrame. Unfortunately, after decades of using low-level languages, I often struggle to find prepackaged solutions. Note: the data is private, but I will try to fabricate an example and add it in an edit if it helps.

The code uses three pandas dataframes: drugUseDF, which holds the use1 value for each prodcode; tempDF, which holds the data; and tempDrugUse, which stores what has been retrieved. I loop over every row of tempDF (there will be several million rows), read the prodcode from each row, and use it to retrieve the corresponding value from the use1 column of drugUseDF. I've added comments to help navigate.

This is the structure of the dataframes:

tempDF

   patid   eventdate consid prodcode issueseq
0  20001  21/04/2005   2728       85        0
1  25001  21/10/2000   3939       40        0
2  25001  21/02/2001   3950       37        0

drugUseDF

   index prodcode  ...      use1 use2
0    171      479  ...  diabetes  NaN
1    172     9105  ...  diabetes  NaN
2    173     5174  ...  diabetes  NaN

tempDrugUse

  use1
0  NaN
1  NaN
2  NaN

This is the code:

import pandas as pd

dfList = []

# Check that the drug dataframe contains the use1 column. Can this be improved?
if sum(drugUseDF.columns.isin(["use1"])) == 1:

    # Predefine the dataframe where we will store the results, with the same length as the main data dataframe.
    # (np.str was removed in NumPy 1.24; the builtin str is equivalent.)
    tempDrugUse = pd.DataFrame(data=None, index=range(len(tempDF.index)), dtype=str, columns=["use1"])

    # Go through each row of the main data dataframe.
    for ind in range(len(tempDF)):

        # Retrieve the prodcode from the *ind* row of the main data dataframe.
        prodcodeStr = tempDF.iloc[ind]["prodcode"]

        # Get the value from the use1 column of the row whose prodcode matches.
        useStr = drugUseDF.loc[drugUseDF["prodcode"] == prodcodeStr, "use1"].values[0]

        # Update the storing dataframe (a single .loc call avoids chained assignment).
        tempDrugUse.loc[ind, "use1"] = useStr

    print("[DEBUG] End of loop for use1")
    dfList.append(tempDrugUse)

The order of the data matters. I can't retrieve multiple rows by matching the prodcode because each row has a date column. Retrieving multiple rows and adding them to the tempDrugUse dataframe could mean that the rows are no longer in chronological date order.

  • When a pandas question is of the "how can I speedup […]" kind, I can almost certainly bet that the current code is looping over the rows. Please provide a minimal reproducible example of your datasets, explain the logic of the calculation in detail (not how you would do it in a loop!), and provide the expected output. Commented Aug 25, 2022 at 12:22

2 Answers

When combining data from two dataframes, you should use merge (similar to JOIN in SQL-like languages). Performance-wise, you should never loop over the rows; use the pandas built-in methods whenever possible. Ordering can be achieved with the sort_values method.
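As a rough sketch of the merge-then-sort idea, assuming fabricated data shaped like the question's dataframes (the values, the date format, and the names merged and by_date are made up for illustration), a left merge keeps every row of tempDF in its original order, and sort_values can restore chronological order if it is ever needed:

```python
import pandas as pd

# Fabricated data shaped like the question's dataframes (not the real data).
tempDF = pd.DataFrame({
    "patid": [20001, 25001, 25001, 25001],
    "eventdate": ["21/04/2005", "21/02/2001", "21/10/2000", "05/03/2001"],
    "prodcode": [85, 37, 40, 85],
})
drugUseDF = pd.DataFrame({
    "prodcode": [37, 40, 85],
    "use1": ["gout", "hypertension", "diabetes"],
})

# how="left" keeps every row of tempDF, in tempDF's original order.
# validate="many_to_one" raises if a prodcode appears more than once in
# drugUseDF, which would otherwise duplicate rows in the result.
merged = tempDF.merge(
    drugUseDF[["prodcode", "use1"]],
    on="prodcode",
    how="left",
    validate="many_to_one",
)

# If chronological order is needed, parse the dates and sort explicitly.
# The day/month/year format is an assumption based on the sample rows.
merged["eventdate"] = pd.to_datetime(merged["eventdate"], format="%d/%m/%Y")
by_date = merged.sort_values(["patid", "eventdate"])
```

Rows of tempDF whose prodcode has no match in drugUseDF get NaN in use1 rather than raising, which differs from the original loop.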

1 Comment

Thank you. That was a huge help. Merge did the trick.

If I understand you correctly, you want to match rows on prodcode across both tables. You can do this via pd.merge (please note that the example data in the code below differs from yours):

import pandas as pd

tempDF = pd.DataFrame({'patid': [20001, 25001, 25001],
                       'prodcode': [101,102,103]})
drugUseDF = pd.DataFrame({'prodcode': [101,102,103],
                          'use1': ['diabetes', 'hypertonia', 'gout']})
# A left merge keeps every row of tempDF and adds the matching use1 value.
merged_df = pd.merge(tempDF, drugUseDF, on='prodcode', how='left')
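If only the use1 column is needed, as with tempDrugUse in the question, a Series.map lookup is a possible alternative to the merge. This is a sketch rather than a drop-in replacement: the extra row with prodcode 999 and the name lookup are made up for illustration, and keeping the first match when a prodcode repeats is an assumption about how duplicates should be handled.

```python
import pandas as pd

# Made-up example data; prodcode 999 has no match, to show the NaN case.
tempDF = pd.DataFrame({"patid": [20001, 25001, 25001, 30001],
                       "prodcode": [103, 101, 102, 999]})
drugUseDF = pd.DataFrame({"prodcode": [101, 102, 103],
                          "use1": ["diabetes", "hypertonia", "gout"]})

# Build a prodcode -> use1 lookup Series. drop_duplicates keeps the first
# row per prodcode (assumption), since map needs a unique index.
lookup = drugUseDF.drop_duplicates("prodcode").set_index("prodcode")["use1"]

# map keeps tempDF's index and row order; unmatched codes become NaN.
tempDrugUse = tempDF["prodcode"].map(lookup).to_frame("use1")
```

Because the result shares tempDF's index, the rows stay aligned with the original data and no re-sorting is needed.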
