I need to sum the values in maindata that fall within each start/end window listed in mst_df (built from master_records). Many rows in maindata should not be summed, even though they have timestamps and values, because they fall outside every window.

import pandas as pd
 
#Proxy reference dataframe
master_records = [['site a', '2021-03-05 02:00:00', '2021-03-05 03:00:00'], 
        ['site a', '2021-03-05 06:00:00', '2021-03-05 08:00:00'], 
        ['site b', '2021-04-08 10:00:00', '2021-04-08 13:00:00']] 
 
mst_df = pd.DataFrame(master_records, columns = ['id', 'start', 'end'])
mst_df['start'] = pd.to_datetime(mst_df['start'])
mst_df['end'] = pd.to_datetime(mst_df['end'])


#Proxy main high frequency dataframe
main_data = [['id a','2021-03-05 00:00:00', 10], #not aggregated
        ['id a','2021-03-05 01:00:00', 19], #not aggregated
        ['id a','2021-03-05 02:00:00', 9], 
        ['id a','2021-03-05 03:00:00', 16],
        ['id a','2021-03-05 04:00:00', 16], #not aggregated
        ['id a','2021-03-05 05:00:00', 11], #not aggregated
        ['id a','2021-03-05 06:00:00', 16], 
        ['id a','2021-03-05 07:00:00', 12], 
        ['id a','2021-03-05 08:00:00', 9], 
        ['id b','2021-04-08 10:00:00', 11], 
        ['id b','2021-04-08 11:00:00', 10], 
        ['id b','2021-04-08 12:00:00', 19], 
        ['id b','2021-04-08 13:00:00', 10], 
        ['id b','2021-04-08 14:00:00', 16]] #not aggregated
 
# Create the pandas DataFrame
maindata = pd.DataFrame(main_data, columns = ['id', 'timestamp', 'value'])
maindata['timestamp'] = pd.to_datetime(maindata['timestamp'])

The desired DataFrame looks like:

print(mst_df)
       id               start                 end  sum(value)
0  site a 2021-03-05 02:00:00 2021-03-05 03:00:00          25
1  site a 2021-03-05 06:00:00 2021-03-05 08:00:00          37
2  site b 2021-04-08 10:00:00 2021-04-08 13:00:00          50

1 Answer

The "id" values don't match between the two DataFrames ("site a" vs. "id a"), so first create a column in both that holds the shared key (the last word of each id). Then merge on that key, keep only the merged rows whose timestamp lies between "start" and "end" (both ends inclusive), and finally use groupby + sum to get the desired result:

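# Shared key: the last word of each id ('id a' -> 'a', 'site a' -> 'a')
maindata['id_letter'] = maindata['id'].str.split().str[-1]
mst_df['id_letter'] = mst_df['id'].str.split().str[-1]
# Pair every window with every reading for the same key; maindata's 'id' becomes 'id_'
merged = mst_df.merge(maindata, on='id_letter', suffixes=('','_'))
# Keep readings inside their window, then sum per window
out = (merged[merged['timestamp'].between(merged['start'], merged['end'])]
       .groupby(['id','start','end'], as_index=False)['value'].sum())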

Output:

       id               start                 end  value
0  site a 2021-03-05 02:00:00 2021-03-05 03:00:00     25
1  site a 2021-03-05 06:00:00 2021-03-05 08:00:00     37
2  site b 2021-04-08 10:00:00 2021-04-08 13:00:00     50
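
One thing to be aware of: because the filter drops rows before the groupby, a window with no readings inside it disappears from the result entirely. If such windows should show up with a sum of 0, a possible variation is sketched below. The function name sum_in_windows, the 'key' column, and the extra 'site c' window are made up for this illustration, and it assumes, as above, that the last word of each id is the shared key.

```python
import pandas as pd

def sum_in_windows(windows, readings):
    """Sum readings['value'] inside each [start, end] window, keeping empty windows."""
    w = windows.copy()
    r = readings.copy()
    # Assumption: the last whitespace-separated word of 'id' is the shared key.
    w['key'] = w['id'].str.split().str[-1]
    r['key'] = r['id'].str.split().str[-1]
    # Left merge so windows without any readings for their key survive (as NaN rows).
    merged = w.merge(r[['key', 'timestamp', 'value']], on='key', how='left')
    inside = merged['timestamp'].between(merged['start'], merged['end'])
    # Zero out readings outside the window instead of dropping their rows.
    merged['value'] = merged['value'].where(inside, 0)
    return merged.groupby(['id', 'start', 'end'], as_index=False)['value'].sum()

# Same data as in the question, plus a hypothetical 'site c' window with no readings.
windows = pd.DataFrame({
    'id': ['site a', 'site a', 'site b', 'site c'],
    'start': pd.to_datetime(['2021-03-05 02:00', '2021-03-05 06:00',
                             '2021-04-08 10:00', '2021-05-01 00:00']),
    'end': pd.to_datetime(['2021-03-05 03:00', '2021-03-05 08:00',
                           '2021-04-08 13:00', '2021-05-01 01:00']),
})
readings = pd.DataFrame({
    'id': ['id a'] * 9 + ['id b'] * 5,
    'timestamp': pd.to_datetime(
        [f'2021-03-05 {h:02d}:00' for h in range(9)]
        + [f'2021-04-08 {h:02d}:00' for h in range(10, 15)]),
    'value': [10, 19, 9, 16, 16, 11, 16, 12, 9, 11, 10, 19, 10, 16],
})

result = sum_in_windows(windows, readings)
print(result)
```

The three original windows get the same sums as before, and 'site c' is kept with a sum of 0. Note that the left merge introduces NaN, so the value column may come back as float rather than int.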