2

I have two dataframes (simplified examples below). One contains a series of dates and values (df1), the second contains a date range (df2). I would like to identify/select/mask the date range from df2 in df1, sum the associated df1 values and add them to a new column in df2.

I'm a novice and all the techniques I have tried have been unsuccessful--a combination of wrong method, combining incompatible methods, syntax errors and so on. I have searched the Q&As here, but none have quite addressed this issue.

import pandas as pd
#********** df1: dates and values ***********
rng = pd.date_range('2012-02-24', periods=12, freq='D')
df1 = pd.DataFrame({ 'STATCON': ['C00028', 'C00489', 'C00038', 'C00589', 'C10028', 'C00499', 'C00238', 'C00729',
                            'C10044', 'C00299', 'C00288', 'C00771'],
'Date': rng,
'Val': [0.96, 0.57, 0.39, 0.17, 0.93, 0.86, 0.54, 0.58, 0.43, 0.19, 0.40, 0.32]
                 })

#********** df2: date range ***********
df2 = pd.DataFrame({
                'BCON': ['B002', 'B004', 'B005'],
                'Start': ['2012-02-25', '2012-02-28', '2012-03-01'],
                'End': ['2012-02-29', '2012-03-04', '2012-03-06']
               })
df2[['Start','End']] = df2[['Start','End']].apply(pd.to_datetime)

#********** Desired Output:  df2 -- date range with summed values ***********
df3 = pd.DataFrame({
                'BCON': ['B002', 'B004', 'B005'],
                'Start': ['2012-02-25', '2012-02-28', '2012-03-01'],
                'End': ['2012-02-29', '2012-03-04', '2012-03-06'],
                'Sum_Val': [2.92, 3.53, 2.46]
               })

2 Answers 2

2

You can solve this with the Dataframe.apply function as follow:

def to_value(row):
  return df1[(row['Start'] <= df1['Date']) & (df1['Date'] <= row['End'])]['Val'].sum()

df3 = df2.copy()
df3['Sum_Val'] = df3.apply(to_value, axis=1)

The to_value function is called on every row of the df3 dataframe.

See here for a live implementation of the solution: https://1000words-hq.com/n/TcYN1Fz6Izp

Sign up to request clarification or add additional context in comments.

2 Comments

I had been trying various forms of your return statement before I asked for help in the forum. In my iterations I thought I would need to reference df2, as in df2['Start'] <= etc., but you do it with (row) in the function. How come df2 doesn't have to be explicitly stated?
During the d3.apply call, the to_value function is called sequentially with every row of d3. And since d3 is initialized with d2.copy(), it's essentially the rows in df2. You can test how the to_value function works by calling: to_value(df2.iloc[0,:])
0

One option is with conditional_join from pyjanitor - it uses a binary search, which is usually better than a naive sequential scan (which can be memory consuming, depending on the data size):

# pip install pyjanitor
import pandas as pd
import numpy as np

df1['Date'] = pd.to_datetime(df1.Date)

(df1
.conditional_join(
    df2, 
    ('Date', 'Start', '>='), 
    ('Date', 'End', '<='), 
    how = 'right')
.groupby(df2.columns.tolist(), sort = False)
.Val
.sum()
)
BCON  Start       End       
B002  2012-02-25  2012-02-29    2.92
B004  2012-02-28  2012-03-04    3.53
B005  2012-03-01  2012-03-06    2.46
Name: Val, dtype: float64

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.