
This is my current function:

def partnerTransaction(main_df, ptn_code, intent, retail_unique):

    if intent == 'Frequency':
        return main_df.query('csp_code == @retail_unique & partner_code == @ptn_code')['tx_amount'].count()

    elif intent == 'Total_value':
        return main_df.query('csp_code == @retail_unique & partner_code == @ptn_code')['tx_amount'].sum()

This function accepts a pandas DataFrame (DF 1) and three search parameters. retail_unique is a string that comes from another DataFrame (DF 2). Currently, I iterate over the rows of DF 2 using itertuples and call around 200 such functions, writing the results to a third DF; the above is just one example. DF 2 has around 16,000 rows, so it's very slow. What I want is to vectorize this function so that it returns a pandas Series with the count of tx_amount per retail unique. So the Series would be

34 # retail a
54 # retail b
23 # retail c

I would then map this series to the 3rd DF.
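For context, the slow per-row pattern looks roughly like this (the toy data and the simplified helper are illustrative, not my real code, which also filters on partner_code):

```python
import pandas as pd

# toy stand-ins for the real frames (names and values are illustrative)
df1 = pd.DataFrame({
    'Retail': ['retail_a', 'retail_b', 'retail_a', 'retail_c', 'retail_a'],
    'tx_amount': [50, 100, 70, 20, 10],
})
df2 = pd.DataFrame({'Retail': ['retail_a', 'retail_b', 'retail_c']})

def count_tx(main_df, retail_unique):
    # simplified stand-in for partnerTransaction(..., intent='Frequency')
    return main_df.query('Retail == @retail_unique')['tx_amount'].count()

# the slow part: one full-DataFrame query per row of df2
frequencies = [count_tx(df1, row.Retail) for row in df2.itertuples()]
```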

Does anyone have an idea of how I might approach this?

EDIT: The first DF contains time-based data, with each retail appearing multiple times in one column and the tx_amount in another column, like so:

Retail  tx_amount
retail_a  50
retail_b  100
retail_a  70
retail_c  20
retail_a  10

The second DF is arranged per retailer:

Retail
retail_a
retail_b
retail_c

2 Answers


First use merge with a left join.

Then groupby column Retail and aggregate column tx_amount with the agg functions size and sum, either together or, in the second solution, separately.

Last, reset_index converts the Series to a 2-column DataFrame.

If you need both outputs together:

import pandas as pd

def partnerTransaction_together(df1, df2):
    # align the two frames on Retail (left join keeps all rows of df1)
    df = pd.merge(df1, df2, on='Retail', how='left')
    # size counts rows per group, sum totals tx_amount per group
    d = {'size':'Frequency','sum':'Total_value'}
    return df.groupby('Retail')['tx_amount'].agg(['size','sum']).rename(columns=d)

print (partnerTransaction_together(df1, df2))
          Frequency  Total_value
Retail                          
retail_a          3          130
retail_b          1          100
retail_c          1           20

But if you need to use conditions:

def partnerTransaction(df1, df2, intent):
    df = pd.merge(df1, df2, on='Retail', how='left')
    g = df.groupby('Retail')['tx_amount']

    if intent == 'Frequency':
        # reset_index converts the Series to a 2-column DataFrame
        return g.size().reset_index(name='Frequency')
    elif intent == 'Total_value':
        return g.sum().reset_index(name='Total_value')

print (partnerTransaction(df1, df2, 'Frequency'))
     Retail  Frequency
0  retail_a          3
1  retail_b          1
2  retail_c          1

print (partnerTransaction(df1, df2, 'Total_value'))
     Retail  Total_value
0  retail_a          130
1  retail_b          100
2  retail_c           20
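To map the result onto the third DataFrame mentioned in the question, a Series indexed by Retail works directly with map — a minimal sketch, assuming df3 has a Retail column (df3 and its layout are my assumption):

```python
import pandas as pd

df1 = pd.DataFrame({
    'Retail': ['retail_a', 'retail_b', 'retail_a', 'retail_c', 'retail_a'],
    'tx_amount': [50, 100, 70, 20, 10],
})
# df3 stands in for the third DataFrame from the question
df3 = pd.DataFrame({'Retail': ['retail_a', 'retail_b', 'retail_c']})

# groupby size returns a Series indexed by Retail, so no reset_index is needed here
freq = df1.groupby('Retail')['tx_amount'].size()

# map looks up each Retail value in the Series index
df3['Frequency'] = df3['Retail'].map(freq)
```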

2 Comments

Could you explain how this works? I'm new to Pandas, and I understand that you're grouping by Retail and accessing the tx_amount series from it. Why are you resetting the index?
@NeevParikh, jezrael's agg solution is idiomatic pandas at its best.

If you want speed, here is a NumPy solution using bincount:

from collections import OrderedDict

import numpy as np
import pandas as pd

# factorize maps each retailer to an integer code; u holds the unique labels
f, u = pd.factorize(df1.Retail.values)

# bincount of the codes gives per-retailer row counts
c = np.bincount(f)
# with weights, bincount gives per-retailer sums of tx_amount
s = np.bincount(f, df1.tx_amount.values).astype(df1.tx_amount.dtype)

pd.DataFrame(OrderedDict(Frequency=c, Total_value=s), u)

          Frequency  Total_value
retail_a          3          130
retail_b          1          100
retail_c          1           20
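To see why this works, here is a small worked example on the sample data from the question (the variable names are mine):

```python
import numpy as np
import pandas as pd

retail = np.array(['retail_a', 'retail_b', 'retail_a', 'retail_c', 'retail_a'])
amounts = np.array([50, 100, 70, 20, 10])

# factorize assigns each label an integer code in order of first appearance
codes, uniques = pd.factorize(retail)
# codes   -> [0, 1, 0, 2, 0]
# uniques -> ['retail_a', 'retail_b', 'retail_c']

# bincount counts occurrences of each code: the per-retailer Frequency
counts = np.bincount(codes)                 # [3, 1, 1]

# with weights, each occurrence contributes its tx_amount: the per-retailer sum
sums = np.bincount(codes, weights=amounts)  # [130., 100., 20.]
```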

Timing

df1 = pd.DataFrame(dict(
        Retail=np.random.choice(list('abcdefghijklmnopqrstuvwxyz'), 10000),
        tx_amount=np.random.randint(1000, size=10000)
    ))


%%timeit
f, u = pd.factorize(df1.Retail.values)

c = np.bincount(f)
s = np.bincount(f, df1.tx_amount.values).astype(df1.tx_amount.dtype)

pd.DataFrame(OrderedDict(Frequency=c, Total_value=s), u)

1000 loops, best of 3: 607 µs per loop


%%timeit
d = {'size':'Frequency','sum':'Total_value'}
df1.groupby('Retail')['tx_amount'].agg(['size','sum']).rename(columns=d)

1000 loops, best of 3: 1.53 ms per loop

