
Consider the two data frames below (of unequal length).

df1 = pd.DataFrame({'date': ['2016-10-08', '2016-11-08','2016-12-08','2017-01-08'], 'qty': [1,8,2,4]})
df2 = pd.DataFrame({'date': ['2016-11-12', '2017-01-12'], 'factor': [2,3]})

>>> df1
         date  qty
0  2016-10-08    1
1  2016-11-08    8
2  2016-12-08    2
3  2017-01-08    4

>>> df2
         date  factor
0  2016-11-12       2
1  2017-01-12       3

I want to calculate a new column called factored_qty in df1, defined as follows:

Take all the factors in df2 whose dates are greater than the date in df1 and multiply the qty by their product to arrive at factored_qty.

So my final dataframe would look like

>>> df1
         date  qty  factored_qty
0  2016-10-08    1             6
1  2016-11-08    8            48
2  2016-12-08    2             6
3  2017-01-08    4            12

Explanation:

  1. 2016-10-08 in df1 is less than both 2016-11-12 and 2017-01-12 of df2, so multiply by both factors: qty * 2 * 3.
  2. However, 2016-12-08 in df1 is greater than 2016-11-12 but less than 2017-01-12 of df2, so multiply only by the second factor: qty * 3.

Most of what I have found relates to:
1. Merging two dataframes.
2. Comparing two dataframes of equal length.
3. Comparing two dataframes of unequal length.

However, the issue here is computing a value from a collected (multiplied together) set of factors in another dataframe, where the join keys are not equal.

  • What happens if the date in df1 is greater than all dates in df2? Commented Jun 15, 2020 at 0:04
  • The default factor would be 1, so the qty would be multiplied by 1. Commented Jun 15, 2020 at 7:30

3 Answers


Make sure your date columns have a datetime dtype; otherwise convert them with pd.to_datetime. Then use pd.Series.to_numpy and broadcasting to compare the two date arrays and build a boolean array for boolean indexing. Finally use pd.Series.map with np.prod to get the product of each selected set of factors.

mask = df1.date.to_numpy()[:,None] < df2.date.to_numpy()  # `_.values` should be avoided; use `_.to_numpy()` instead
it = iter(mask)

def mul(x):
    val = np.prod(df2.loc[next(it),'factor'])
    return x*val

df1['factored_qty'] = df1['qty'].map(mul)
df1
        date  qty  factored_qty
0 2016-10-08    1             6
1 2016-11-08    8            48
2 2016-12-08    2             6
3 2017-01-08    4            12
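
For the example frames above, the broadcast comparison yields the following mask; each row corresponds to a df1 date, each column to a df2 date, and True marks a df2 date that lies after the df1 date:

>>> mask
array([[ True,  True],
       [ True,  True],
       [False,  True],
       [False,  True]])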

OR

mask = df1.date.to_numpy()[:,None] < df2.date.to_numpy()
l = [np.prod(df2.loc[idx,'factor']) for idx in mask]  # or: df2.loc[idx,'factor'].prod()
df1['factored_qty'] = df1.qty.mul(l)
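
A fully vectorized sketch of the same idea (my own variation, not part of the answer above and not benchmarked here): replace the unselected factors with the neutral element 1 and take the row-wise product.

mask = df1.date.to_numpy()[:,None] < df2.date.to_numpy()
row_factors = np.where(mask, df2.factor.to_numpy(), 1).prod(axis=1)  # product of the selected factors per df1 row
df1['factored_qty'] = df1['qty'] * row_factors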

Timeit analysis:

# My answer
In [163]: %%timeit
     ...: mask = df1.date.to_numpy()[:,None] < df2.date.to_numpy()
     ...: it = iter(mask)
     ...: 
     ...: def mul(x):
     ...:     val = np.prod(df2.loc[next(it),'factor'])
     ...:     return x*val
     ...: df1['factored_qty'] = df1['qty'].map(mul)
     ...:
     ...:
1.31 ms ± 28.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# Stef's answer.
In [164]: %%timeit
     ...: df1['factored_qty'] = df1.apply(lambda x: df2[df2.date>x.date].factor.cumprod().values[-1] * x.qty,axis=1)
     ...:
     ...:
3.65 ms ± 194 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# sammywemmy's answer
In [180]: %%timeit
     ...: d = defaultdict(list)
     ...: #iterate through data and append factors that meet criteria
     ...: for (date1, qty), (date2, factor) in product(df1.to_numpy(),df2.to_numpy()) : 
     ...:     if date1 < date2 :
     ...:         d[(date1, qty)].append(factor)
     ...: outcome = {k:[s,np.prod((s,*v))] for (k,s),v in d.items()}
     ...: pd.DataFrame.from_dict(outcome, orient='index', columns=['qty','factored_qty']).reset_index()
     ...:
     ...:
1.49 ms ± 47.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


Probably not the fastest solution for large dataframes, but it works: we take prod over the factor column of all rows of df2 that meet the condition.

df1['factored_qty'] = df1.apply(lambda x: df2[df2.date>x.date].factor.prod() * x.qty,axis=1)

Result:

         date  qty  factored_qty
0  2016-10-08    1             6
1  2016-11-08    8            48
2  2016-12-08    2             6
3  2017-01-08    4            12


Update
For larger dataframes we can use merge_asof. We calculate the reverse cumprod, i.e. the cumulative product from the last row to the first. Unfortunately it becomes a bit convoluted if the last date in df2 is less than the last date in df1, because in that case we have to append a sentinel row to df2 (the maximum date of df1 with factor 1).
This method is significantly faster than Ch3steR's and sammywemmy's solutions.

df3 = pd.merge_asof(df1.assign(date=pd.to_datetime(df1.date)),
                    df2.assign(date=pd.to_datetime(df2.date), factor=df2.factor.iloc[::-1].cumprod().iloc[::-1]) if(df1.date.max()<df2.date.max())
                        else df2.assign(date=pd.to_datetime(df2.date), factor=df2.factor.iloc[::-1].cumprod().iloc[::-1]).append({'date': pd.to_datetime(df1.date.max()), 'factor': 1}, ignore_index=True),
                    'date',
                    direction='forward')
df3.factor *= df3.qty
df3.rename(columns={'factor': 'factored_qty'}, inplace=True)
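
For reference, this is the reverse cumprod on the example df2 (a small illustration): each row ends up holding the product of its own factor and all later factors, which is the multiplier merge_asof (direction='forward') then attaches to each df1 date.

df2.factor.iloc[::-1].cumprod().iloc[::-1].tolist()   # [6, 3]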


TIMING for larger dataframes (df1: 200 rows, df2: 100 rows):

import pandas as pd
import numpy as np
n = 100
np.random.seed(0)
df1_ = pd.DataFrame({'date': [(pd.Timestamp('2020-06-01') - pd.Timedelta(x,'D')).strftime('%Y-%m-%d') for x in np.sort(np.random.choice(200*n, 2*n, False))[::-1]],
                    'qty': np.random.randint(1, 20, 2*n)})
df2_ = pd.DataFrame({'date': [(pd.Timestamp('2020-06-01') - pd.Timedelta(x,'D')).strftime('%Y-%m-%d') for x in np.sort(np.random.choice(100*n, n, False))[::-1]],
                    'factor': np.random.randint(1, 10, n)})

def setup():
    global df1, df2
    df1 = df1_.copy(True)
    df2 = df2_.copy(True)

def method_apply():
    df1['factored_qty'] = df1.apply(lambda x: df2[df2.date>x.date].factor.prod() * x.qty,axis=1)
    return df1

def method_merge():
    df3 = pd.merge_asof(df1.assign(date=pd.to_datetime(df1.date)),
                        df2.assign(date=pd.to_datetime(df2.date), factor=df2.factor.iloc[::-1].cumprod().iloc[::-1]) if(df1.date.max()<df2.date.max())
                        else df2.assign(date=pd.to_datetime(df2.date), factor=df2.factor.iloc[::-1].cumprod().iloc[::-1]).append({'date': pd.to_datetime(df1.date.max()), 'factor': 1}, ignore_index=True),
                        'date',
                        direction='forward')
    df3.factor *= df3.qty
    df3.rename(columns={'factor': 'factored_qty'}, inplace=True)
    return df3

from itertools import product
from collections import defaultdict
def method_dict():
    d = defaultdict(list)
    df1['date'] = pd.to_datetime(df1['date'])
    df2['date'] = pd.to_datetime(df2['date'])
    for (date1, qty), (date2, factor) in product(df1.to_numpy(),df2.to_numpy()) : 
        if date1 < date2 : 
            d[(date1, qty)].append(factor)
    outcome = {k:[s,np.prod((s,*v))] for (k,s),v in d.items()}
    return pd.DataFrame.from_dict(outcome, orient='index', columns=['qty','factored_qty']).reset_index()

def method_numpy():
    mask = df1.date.to_numpy()[:,None] < df2.date.to_numpy()
    it = iter(mask)
    def mul(x):
        val = np.prod(df2.loc[next(it),'factor'])
        return x*val

    df1['factored_qty'] = df1['qty'].map(mul)
    return df1

Results:

method_apply    220   ms ± 5.99 ms per loop
method_numpy     86.7 ms ± 2.51 ms per loop
method_dict      80.7 ms ± 436 µs per loop
method_merge      8.87 ms ± 68.1 µs per loop

Depending on the random factors in df2, their product may overflow; this was ignored here. method_dict only works correctly if the last date in df2 is greater than the last date in df1; this was also ignored for the timings.
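
If overflow is a concern, one possible guard (an assumption on my part, not something the benchmark tested) is to cast the factor column to object dtype so products are computed with Python's arbitrary-precision integers, at some speed cost:

df2['factor'] = df2['factor'].astype(object)  # Python ints do not overflow; expect slower products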



Convert to datetimes:

df1['date'] = pd.to_datetime(df1['date'])
df2['date'] = pd.to_datetime(df2['date'])

Move the computation to dictionaries; I'd like to believe computations such as this are faster within dictionaries. That is an assumption; hopefully someone runs tests and proves or debunks it.

from itertools import product
from collections import defaultdict
d = defaultdict(list)
#iterate through data and append factors that meet criteria
for (date1, qty), (date2, factor) in product(df1.to_numpy(),df2.to_numpy()) : 
    if date1 <= date2 : 
        d[(date1, qty)].append(factor)
    else:
        d[(date1, qty)].append(1)

Let's see the contents of d:

print(d)

defaultdict(list,
            {(Timestamp('2016-10-08 00:00:00'), 1): [2, 3],
             (Timestamp('2016-11-08 00:00:00'), 8): [2, 3],
             (Timestamp('2016-12-08 00:00:00'), 2): [1, 3],
             (Timestamp('2017-01-08 00:00:00'), 4): [1, 3]})

Get the product of the filtered data with the quantity:

outcome = {k:[s,np.prod((s,*v))] for (k,s),v in d.items()}
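
To see how one entry unpacks (an illustration using the first key above):

k, s = (pd.Timestamp('2016-10-08'), 1)   # dictionary key: (date, qty)
v = [2, 3]                               # collected factors for that date
np.prod((s, *v))                         # 1 * 2 * 3 = 6 -> factored_qty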

Create the dataframe:

pd.DataFrame.from_dict(outcome, orient='index', columns=['qty','factored_qty'])

           qty  factored_qty
2016-10-08  1       6
2016-11-08  8       48
2016-12-08  2       6
2017-01-08  4       12

Comments

Added timeit analysis of your answer: it took 1.49 ms, Stef's 3.65 ms and mine 1.31 ms. Nice answer though, +1.
Thanks for the timing... that was useful information.
Thank you. I added _.reset_index at the end to give a fair comparison, as OP's output also has 3 columns.
There seems to be a bug in this method in the general case: for instance, with n = 100, np.random.seed(1) and the same df1/df2 construction as in the timing section above, this method returns 194 rows instead of 200.
This error depends on the actual data: it occurs if the last date of df2 is less than the last date of df1.
