Speed up operations over Python Pandas dataframes

Question

I would like to speed up a loop over a python Pandas Dataframe. Unfortunately, decades of using low-level languages mean I often struggle to find prepackaged solutions. Note: data is private, but I will see if I can fabricate something and add it into an edit if it helps.

The code has three pandas dataframes: drugUseDF, tempDF, which holds the data, and tempDrugUse, which stores what's been retrieved. I look over every row of tempDF (there will be several million rows), retrieving the prodcode identified from each row and then using that to retrieve the corresponding value from use1 column in the drugUseDF. I've added comments to help navigate.

This is the structure of the dataframes:

tempDF

   patid   eventdate consid prodcode issueseq
0  20001  21/04/2005   2728       85        0
1  25001  21/10/2000   3939       40        0
2  25001  21/02/2001   3950       37        0

drugUseDF

   index prodcode  ...      use1 use2
0    171      479  ...  diabetes  NaN
1    172     9105  ...  diabetes  NaN
2    173     5174  ...  diabetes  NaN

tempDrugUse

  use1
0  NaN
1  NaN
2  NaN

This is the code:

dfList = []                

# if the drug dataframe contains the use1 column. Can this be improved?
if sum(drugUseDF.columns.isin(["use1"])) == 1:
         
    #predine dataframe where we will store the results to be the same length as the main data dataframe.     
    tempDrugUse = DataFrame(data=None, index=range(len(tempDF.index)), dtype=np.str, columns=["use1"])    

    #go through each row of the main data dataframe.
    for ind in range(len(tempDF)): 

        #retrieve the prodcode from the *ind* row of the main data dataframe
        prodcodeStr = tempDF.iloc[ind]["prodcode"]

        #get the corresponding value from the use1 column matching the prodcode column 
        useStr = drugUseDF[drugUseDF.loc[:, "prodcode"] == prodcodeStr]["use1"].values[0]

        #update the storing dataframe
        tempDrugUse.iloc[ind]["use1"] = useStr

    print("[DEBUG] End of loop for use1")
    dfList.append(tempDrugUse)

The order of the data matters. I can't retrieve multiple rows by matching the prodcode because each row has a date column. Retrieving multiple rows and adding them to the tempDrugUse dataframe could mean that the rows are no longer in chronological date order.

When a pandas question is of the "how can I speedup […]" kind, I can almost certainly bet that the current code is looping over the rows. Please provide a minimal reproducible example of your datasets, explain the logic of the calculation in detail (not how you would do it in a loop!), and provide the expected output. — mozway
– mozway, Commented Aug 25, 2022 at 12:22

Paul Bendevis · Accepted Answer · 2022-08-25 13:23:56Z

2

When trying to combine data in two dataframes you should use the merge (similar to JOIN in sql-like languages). Performance wise, you should never loop over the rows - you should use the pandas built-in methods whenever possible. Ordering can be achieved with the sort_values method.

answered Aug 25, 2022 at 13:23

Paul Bendevis

2,6412 gold badges37 silver badges47 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Anthony Nash Over a year ago

Thank you. That was a huge help. Merge did the trick.

max_jump · Accepted Answer · 2022-08-25 13:26:44Z

2

If I understand you correctly, you want to map the prodcode from both tables. You can do this via pd.merge (please note the example in the code below differs from your data):

tempDF = pd.DataFrame({'patid': [20001, 25001, 25001],
                       'prodcode': [101,102,103]})
drugUseDF = pd.DataFrame({'prodcode': [101,102,103],
                          'use1': ['diabetes', 'hypertonia', 'gout']})
merged_df = pd.merge(tempDF, drugUseDF, on='prodcode', how='left')

answered Aug 25, 2022 at 13:26

max_jump

3211 silver badge7 bronze badges

Collectives™ on Stack Overflow

Speed up operations over Python Pandas dataframes

2 Answers 2

1 Comment

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

1 Comment

Comments

Your Answer

Sign up or log in

Post as a guest

Related