Grouping and Aggregating on multiple time series

Question

I'm new to Python and pandas and have a basic question about how to write a short function that takes a pd.DataFrame and returns relative values grouped by month.

Example data:

import pandas as pd
from datetime import datetime
import numpy as np

date_rng = pd.date_range(start='2019-01-01', end='2019-03-31', freq='D')
df = pd.DataFrame(date_rng, columns=['date'])
df['value_in_question'] = np.random.randint(0,100,size=(len(date_rng)))
df.set_index('date',inplace=True)
df.head()

            value_in_question
date
2019-01-01                 40
2019-01-02                 86
2019-01-03                 46
2019-01-04                 75
2019-01-05                 35

def absolute_to_relative(df):
    """
    set_index before using
    """
    return df.div(df.sum(), axis=1).mul(100)

relative_df = absolute_to_relative(df)      

relative_df.head()

            value_in_question
date
2019-01-01           0.895055
2019-01-02           1.924368
2019-01-03           1.029313
2019-01-04           1.678228
2019-01-05           0.783173

Rather than dividing each row by the column sum, I would like to divide it by the sum of its month. The final df should have the same shape and index, but each row value should be relative to the sum of its month.

old:

             value_in_question
date
"2019-01-01" value/column_sum * 100

new:

            value_in_question
date
"2019-01-01" value/month_sum * 100

So I tried the following, which returns NA for value_in_question:

def absolute_to_relative_agg(df, agg):
    """
    set_index before using
    """
    return df.div(df.groupby([pd.Grouper(freq=agg)]).sum(), axis=1)

relative_df = absolute_to_relative_agg(df, 'M')

            value_in_question
date
2019-01-01                NaN
2019-01-02                NaN
2019-01-03                NaN
2019-01-04                NaN
2019-01-05                NaN

jezrael · Accepted Answer · 2019-11-30 16:58:23Z

3

Use GroupBy.transform instead of an aggregation. It returns a Series/DataFrame with the same DatetimeIndex as the original, so the division is possible:

def absolute_to_relative_agg(df, agg):
    """
    set_index before using
    """
    return df.div(df.groupby([pd.Grouper(freq=agg)]).transform('sum'))

relative_df = absolute_to_relative_agg(df, 'M')

Another way to call the function is DataFrame.pipe:

relative_df = df.pipe(absolute_to_relative_agg, 'M')

print (relative_df)
           value_in_question
date                         
2019-01-01           0.032901
2019-01-02           0.045862
2019-01-03           0.048853
2019-01-04           0.008475
2019-01-05           0.041376
                      ...
2019-03-27           0.062049
2019-03-28           0.002165
2019-03-29           0.048341
2019-03-30           0.007937
2019-03-31           0.015152

[90 rows x 1 columns]
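
To make the difference between aggregating and transforming concrete, here is a minimal sketch. It assumes deterministic sample data (the question uses random values) and, as a substitution for pd.Grouper, groups on `df.index.to_period("M")`; the names `month_key`, `aggregated` and `broadcast` are only illustrative. The `.mul(100)` is added to get the percentages the question asks for.

```python
import numpy as np
import pandas as pd

# Deterministic sample data (assumed for illustration; the question uses random values).
idx = pd.date_range("2019-01-01", "2019-03-31", freq="D")
df = pd.DataFrame({"value_in_question": np.arange(1, len(idx) + 1)}, index=idx)

# Assumption: group on the calendar month of each timestamp instead of pd.Grouper.
month_key = df.index.to_period("M")

aggregated = df.groupby(month_key).sum()            # one row per month
broadcast = df.groupby(month_key).transform("sum")  # one row per original date

print(aggregated.shape)  # (3, 1)
print(broadcast.shape)   # (90, 1)

# Both sides share the same index, so the division lines up row by row.
relative = df.div(broadcast).mul(100)
```

The aggregated frame has only one label per month, so dividing the daily frame by it leaves most rows without a matching label; the transformed frame repeats each month's sum on every date of that month.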

edited Nov 30, 2019 at 16:58

answered Nov 30, 2019 at 16:44


Randy · Accepted Answer · 2019-11-30 16:50:39Z

For the sums, you can group by the month of the index:

In [31]: month_sum = df.groupby(df.index.strftime('%Y%m')).sum()
    ...: month_sum
    ...:
Out[31]:
        value_in_question
201901               1386
201902               1440
201903               1358

You can then use .map to align the month with the correct rows of your DataFrame:

In [32]: map_sum = df.index.strftime('%Y%m').map(month_sum['value_in_question'])
    ...: map_sum
    ...:
Out[32]:
Int64Index([1386, 1386, 1386, 1386, 1386, 1386, 1386, 1386, 1386, 1386, 1386,
            1386, 1386, 1386, 1386, 1386, 1386, 1386, 1386, 1386, 1386, 1386,
            1386, 1386, 1386, 1386, 1386, 1386, 1386, 1386, 1386, 1440, 1440,
            1440, 1440, 1440, 1440, 1440, 1440, 1440, 1440, 1440, 1440, 1440,
            1440, 1440, 1440, 1440, 1440, 1440, 1440, 1440, 1440, 1440, 1440,
            1440, 1440, 1440, 1440, 1358, 1358, 1358, 1358, 1358, 1358, 1358,
            1358, 1358, 1358, 1358, 1358, 1358, 1358, 1358, 1358, 1358, 1358,
            1358, 1358, 1358, 1358, 1358, 1358, 1358, 1358, 1358, 1358, 1358,
            1358, 1358],
           dtype='int64')

Then you just need to do the division:

In [33]: df['value_in_question'].div(map_sum)
Out[33]:
date
2019-01-01    0.012987
2019-01-02    0.018759
2019-01-03    0.000000
2019-01-04    0.056277
2019-01-05    0.019481
                ...
2019-03-27    0.031664
2019-03-28    0.007364
2019-03-29    0.050074
2019-03-30    0.033873
2019-03-31    0.005155
Name: value_in_question, Length: 90, dtype: float64
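
The three steps above could be wrapped into one function, roughly as follows. This is only a sketch: the function name `relative_by_month_map` and its `column` parameter are hypothetical, the sample data is deterministic rather than random, and the mapped sums are converted to a NumPy array so the division is purely positional.

```python
import numpy as np
import pandas as pd

def relative_by_month_map(df, column):
    """Sketch of the strftime/map approach; `column` is an illustrative parameter."""
    keys = df.index.strftime("%Y%m")               # e.g. '201901' for every January row
    month_sum = df[column].groupby(keys).sum()     # one total per month key
    per_row_sum = np.asarray(keys.map(month_sum))  # month total repeated for each row
    return df[column].div(per_row_sum).mul(100)    # percentage of the month total

# Deterministic sample data (assumed for illustration).
idx = pd.date_range("2019-01-01", "2019-03-31", freq="D")
df = pd.DataFrame({"value_in_question": np.arange(1, len(idx) + 1)}, index=idx)

relative = relative_by_month_map(df, "value_in_question")
```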

Valdi_Bo · Accepted Answer · 2019-11-30 17:42:40Z

0

Use Grouper with freq='M'.

The code is:

relative_df = df.groupby(pd.Grouper(freq='M'))\
    .value_in_question.apply(lambda x: x.div(x.sum()).mul(100))

It returns a Series with the same index as the original DataFrame, whose values are value_in_question relative to the sum of the current month, in percent.
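
A hedged variant of the same idea: as far as I know, newer pandas versions (2.0 and later) prepend the group key to the index of an apply result by default, so this sketch passes `group_keys=False` to keep the original index. It also groups on `df.index.to_period("M")` instead of `pd.Grouper(freq='M')`, which is a substitution on my part, and uses deterministic sample data.

```python
import numpy as np
import pandas as pd

# Deterministic sample data (assumed for illustration).
idx = pd.date_range("2019-01-01", "2019-03-31", freq="D")
df = pd.DataFrame({"value_in_question": np.arange(1, len(idx) + 1)}, index=idx)

# Assumption: calendar-month key derived from the index instead of pd.Grouper.
month_key = df.index.to_period("M")

relative = (
    df.groupby(month_key, group_keys=False)["value_in_question"]
      .apply(lambda x: x.div(x.sum()).mul(100))  # share of the month total, in percent
)
```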

edited Nov 30, 2019 at 17:42

answered Nov 30, 2019 at 17:37
