
Consider the two data frames below (of unequal length).

df1 = pd.DataFrame({'date': ['2016-10-08', '2016-11-08','2016-12-08','2017-01-08'], 'qty': [1,8,2,4]})
df2 = pd.DataFrame({'date': ['2016-11-12', '2017-01-12'], 'factor': [2,3]})

>>> df1
         date  qty
0  2016-10-08    1
1  2016-11-08    8
2  2016-12-08    2
3  2017-01-08    4

>>> df2
         date  factor
0  2016-11-12       2
1  2017-01-12       3

I want to calculate a new column called factored_qty in df1, defined as follows:

Take all the factors in df2 whose dates are greater than the date in df1 and multiply the qty by their product to arrive at factored_qty.

So my final dataframe would look like

>>> df1
         date  qty  factored_qty
0  2016-10-08    1             6
1  2016-11-08    8            48
2  2016-12-08    2             6
3  2017-01-08    4            12

Explanation:

  1. 2016-10-08 in df1 is less than both 2016-11-12 and 2017-01-12 of df2, so multiply by both factors: qty * 2 * 3.
  2. However, 2016-12-08 in df1 is greater than 2016-11-12 but less than 2017-01-12 of df2, so multiply only by the second factor: qty * 3.

Most of what I have found relates to:
1. Merging two dataframes.
2. Comparing two dataframes of equal length.
3. Comparing two dataframes of unequal length.

However, the issue here is computing a value from a collected (multiplied together) set of factors in another dataframe, where the join keys are not equal.

  • What happens if the date in df1 is greater than all dates in df2? Commented Jun 15, 2020 at 0:04
  • The default factor would be 1, so the qty would be multiplied by 1. Commented Jun 15, 2020 at 7:30

3 Answers


Make sure your date columns have a datetime dtype; otherwise convert them with pd.to_datetime. Then use pd.Series.to_numpy and broadcasting to compare the two date arrays and build a boolean array for boolean indexing. Finally use pd.Series.map with np.prod to get the product of each selected set of factors.

mask = df1.date.to_numpy()[:,None] < df2.date.to_numpy()  # `_.values` should be avoided; use `_.to_numpy()` instead
it = iter(mask)

def mul(x):
    val = np.prod(df2.loc[next(it),'factor'])
    return x*val

df1['factored_qty'] = df1['qty'].map(mul)
df1
        date  qty  factored_qty
0 2016-10-08    1             6
1 2016-11-08    8            48
2 2016-12-08    2             6
3 2017-01-08    4            12
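
For the example frames above, the broadcast comparison yields the following mask; each row corresponds to a df1 date, each column to a df2 date, and True marks a df2 date that lies after the df1 date:

>>> mask
array([[ True,  True],
       [ True,  True],
       [False,  True],
       [False,  True]])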

OR

mask = df1.date.to_numpy()[:,None] < df2.date.to_numpy()
l = [np.prod(df2.loc[idx,'factor']) for idx in mask]  # or: df2.loc[idx,'factor'].prod()
df1['factored_qty'] = df1.qty.mul(l)
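
A fully vectorized sketch of the same idea (my own variation, not part of the answer above and not benchmarked here): replace the unselected factors with the neutral element 1 and take the row-wise product.

mask = df1.date.to_numpy()[:,None] < df2.date.to_numpy()
row_factors = np.where(mask, df2.factor.to_numpy(), 1).prod(axis=1)  # product of the selected factors per df1 row
df1['factored_qty'] = df1['qty'] * row_factors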

Timeit analysis:

# My answer
In [163]: %%timeit
     ...: mask = df1.date.to_numpy()[:,None] < df2.date.to_numpy()
     ...: it = iter(mask)
     ...: 
     ...: def mul(x):
     ...:     val = np.prod(df2.loc[next(it),'factor'])
     ...:     return x*val
     ...: df1['factored_qty'] = df1['qty'].map(mul)
     ...:
     ...:
1.31 ms ± 28.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# Stef's answer.
In [164]: %%timeit
     ...: df1['factored_qty'] = df1.apply(lambda x: df2[df2.date>x.date].factor.cumprod().values[-1] * x.qty,axis=1)
     ...:
     ...:
3.65 ms ± 194 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# sammywemmy's answer
In [180]: %%timeit
     ...: d = defaultdict(list)
     ...: #iterate through data and append factors that meet criteria
     ...: for (date1, qty), (date2, factor) in product(df1.to_numpy(),df2.to_numpy()) : 
     ...:     if date1 < date2 :
     ...:         d[(date1, qty)].append(factor)
     ...: outcome = {k:[s,np.prod((s,*v))] for (k,s),v in d.items()}
     ...: pd.DataFrame.from_dict(outcome, orient='index', columns=['qty','factored_qty']).reset_index()
     ...:
     ...:
1.49 ms ± 47.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


Probably not the fastest solution for large dataframes, but it works: we take prod over the factor column of all rows of df2 that meet the condition.

df1['factored_qty'] = df1.apply(lambda x: df2[df2.date>x.date].factor.prod() * x.qty,axis=1)

Result:

         date  qty  factored_qty
0  2016-10-08    1             6
1  2016-11-08    8            48
2  2016-12-08    2             6
3  2017-01-08    4            12


Update
For larger dataframes we can use merge_asof. We calculate the reverse cumprod, i.e. the cumulative product from the last row to the first. Unfortunately it becomes a bit convoluted if the last date in df2 is less than the last date in df1, because in that case we have to append a sentinel row to df2 (the maximum date of df1 with factor 1).
This method is significantly faster than Ch3steR's and sammywemmy's solutions.

df3 = pd.merge_asof(df1.assign(date=pd.to_datetime(df1.date)),
                    df2.assign(date=pd.to_datetime(df2.date), factor=df2.factor.iloc[::-1].cumprod().iloc[::-1]) if(df1.date.max()<df2.date.max())
                        else df2.assign(date=pd.to_datetime(df2.date), factor=df2.factor.iloc[::-1].cumprod().iloc[::-1]).append({'date': pd.to_datetime(df1.date.max()), 'factor': 1}, ignore_index=True),
                    'date',
                    direction='forward')
df3.factor *= df3.qty
df3.rename(columns={'factor': 'factored_qty'}, inplace=True)
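
For reference, this is the reverse cumprod on the example df2 (a small illustration): each row ends up holding the product of its own factor and all later factors, which is the multiplier merge_asof (direction='forward') then attaches to each df1 date.

df2.factor.iloc[::-1].cumprod().iloc[::-1].tolist()   # [6, 3]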


TIMING for larger dataframes (df1: 200 rows, df2: 100 rows):

import pandas as pd
import numpy as np
n = 100
np.random.seed(0)
df1_ = pd.DataFrame({'date': [(pd.Timestamp('2020-06-01') - pd.Timedelta(x,'D')).strftime('%Y-%m-%d') for x in np.sort(np.random.choice(200*n, 2*n, False))[::-1]],
                    'qty': np.random.randint(1, 20, 2*n)})
df2_ = pd.DataFrame({'date': [(pd.Timestamp('2020-06-01') - pd.Timedelta(x,'D')).strftime('%Y-%m-%d') for x in np.sort(np.random.choice(100*n, n, False))[::-1]],
                    'factor': np.random.randint(1, 10, n)})

def setup():
    global df1, df2
    df1 = df1_.copy(True)
    df2 = df2_.copy(True)

def method_apply():
    df1['factored_qty'] = df1.apply(lambda x: df2[df2.date>x.date].factor.prod() * x.qty,axis=1)
    return df1

def method_merge():
    df3 = pd.merge_asof(df1.assign(date=pd.to_datetime(df1.date)),
                        df2.assign(date=pd.to_datetime(df2.date), factor=df2.factor.iloc[::-1].cumprod().iloc[::-1]) if(df1.date.max()<df2.date.max())
                        else df2.assign(date=pd.to_datetime(df2.date), factor=df2.factor.iloc[::-1].cumprod().iloc[::-1]).append({'date': pd.to_datetime(df1.date.max()), 'factor': 1}, ignore_index=True),
                        'date',
                        direction='forward')
    df3.factor *= df3.qty
    df3.rename(columns={'factor': 'factored_qty'}, inplace=True)
    return df3

from itertools import product
from collections import defaultdict
def method_dict():
    d = defaultdict(list)
    df1['date'] = pd.to_datetime(df1['date'])
    df2['date'] = pd.to_datetime(df2['date'])
    for (date1, qty), (date2, factor) in product(df1.to_numpy(),df2.to_numpy()) : 
        if date1 < date2 : 
            d[(date1, qty)].append(factor)
    outcome = {k:[s,np.prod((s,*v))] for (k,s),v in d.items()}
    return pd.DataFrame.from_dict(outcome, orient='index', columns=['qty','factored_qty']).reset_index()

def method_numpy():
    mask = df1.date.to_numpy()[:,None] < df2.date.to_numpy()
    it = iter(mask)
    def mul(x):
        val = np.prod(df2.loc[next(it),'factor'])
        return x*val

    df1['factored_qty'] = df1['qty'].map(mul)
    return df1

Results:

method_apply    220   ms ± 5.99 ms per loop
method_numpy     86.7 ms ± 2.51 ms per loop
method_dict      80.7 ms ± 436 µs per loop
method_merge      8.87 ms ± 68.1 µs per loop

Depending on the random factors in df2, their product may overflow; this was ignored here. method_dict only works correctly if the last date in df2 is greater than the last date in df1; this was also ignored for the timings.
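
If overflow is a concern, one possible guard (an assumption on my part, not something the benchmark tested) is to cast the factor column to object dtype so products are computed with Python's arbitrary-precision integers, at some speed cost:

df2['factor'] = df2['factor'].astype(object)  # Python ints do not overflow; expect slower products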



Convert to datetimes:

df1['date'] = pd.to_datetime(df1['date'])
df2['date'] = pd.to_datetime(df2['date'])

Move the computation to dictionaries; I'd like to believe computations such as this are faster within dictionaries. That is an assumption; hopefully someone runs tests and proves or debunks it.

from itertools import product
from collections import defaultdict
d = defaultdict(list)
#iterate through data and append factors that meet criteria
for (date1, qty), (date2, factor) in product(df1.to_numpy(),df2.to_numpy()) : 
    if date1 <= date2 : 
        d[(date1, qty)].append(factor)
    else:
        d[(date1, qty)].append(1)

Let's see the contents of d:

print(d)

defaultdict(list,
            {(Timestamp('2016-10-08 00:00:00'), 1): [2, 3],
             (Timestamp('2016-11-08 00:00:00'), 8): [2, 3],
             (Timestamp('2016-12-08 00:00:00'), 2): [1, 3],
             (Timestamp('2017-01-08 00:00:00'), 4): [1, 3]})

Get the product of the filtered data with the quantity:

outcome = {k:[s,np.prod((s,*v))] for (k,s),v in d.items()}
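
To see how one entry unpacks (an illustration using the first key above):

k, s = (pd.Timestamp('2016-10-08'), 1)   # dictionary key: (date, qty)
v = [2, 3]                               # collected factors for that date
np.prod((s, *v))                         # 1 * 2 * 3 = 6 -> factored_qty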

Create the dataframe:

pd.DataFrame.from_dict(outcome, orient='index', columns=['qty','factored_qty'])

           qty  factored_qty
2016-10-08  1       6
2016-11-08  8       48
2016-12-08  2       6
2017-01-08  4       12

Comments

Added timeit analysis of your answer: it took 1.49 ms, Stef's 3.65 ms and mine 1.31 ms. Nice answer though, +1.
Thanks for the timing... that was useful information.
Thank you. I added _.reset_index at the end to give a fair comparison, as OP's output also has 3 columns.
There seems to be a bug in this method in the general case: for instance, with n = 100, np.random.seed(1) and the same df1/df2 construction as in the timing section above, this method returns 194 rows instead of 200.
This error depends on the actual data: it occurs if the last date of df2 is less than the last date of df1.
