I'm new to Python and pandas and have a basic question about how to write a short function that takes a pd.DataFrame and returns relative values grouped by month.

Example data:

import pandas as pd
from datetime import datetime
import numpy as np

date_rng = pd.date_range(start='2019-01-01', end='2019-03-31', freq='D')
df = pd.DataFrame(date_rng, columns=['date'])
df['value_in_question'] = np.random.randint(0,100,size=(len(date_rng)))
df.set_index('date',inplace=True)
df.head()

       value_in_question
date    
2019-01-01  40
2019-01-02  86
2019-01-03  46
2019-01-04  75
2019-01-05  35

def absolute_to_relative(df):
    """
    set_index before using
    """
    return df.div(df.sum(), axis=1).mul(100)

relative_df = absolute_to_relative(df)      

relative_df.head()

       value_in_question
date    
2019-01-01  0.895055
2019-01-02  1.924368
2019-01-03  1.029313
2019-01-04  1.678228
2019-01-05  0.783173

Rather than taking the column sum and dividing each row by it, I would like to divide by the sum of each month. The final df should have the same shape and form, but each row value should be relative to the sum of its month.

old:

             value_in_question
date
"2019-01-01" value/column_sum * 100

new:

            value_in_question
date
"2019-01-01" value/month_sum * 100

So I tried the following, which returns NaN for value_in_question:

def absolute_to_relative_agg(df, agg):
    """
    set_index before using
    """
    return df.div(df.groupby([pd.Grouper(freq=agg)]).sum(), axis=1)

relative_df = absolute_to_relative_agg(df, 'M')

      value_in_question
date    
2019-01-01  NaN
2019-01-02  NaN
2019-01-03  NaN
2019-01-04  NaN
2019-01-05  NaN

3 Answers

Use GroupBy.transform instead of an aggregation. It returns a Series/DataFrame with the same DatetimeIndex as the original, so the division aligns row by row:

def absolute_to_relative_agg(df, agg):
    """
    set_index before using
    """
    return df.div(df.groupby([pd.Grouper(freq=agg)]).transform('sum'))

relative_df = absolute_to_relative_agg(df, 'M')

Another way to call the function is DataFrame.pipe:

relative_df = df.pipe(absolute_to_relative_agg, 'M')

print(relative_df)
           value_in_question
date                         
2019-01-01           0.032901
2019-01-02           0.045862
2019-01-03           0.048853
2019-01-04           0.008475
2019-01-05           0.041376
                      ...
2019-03-27           0.062049
2019-03-28           0.002165
2019-03-29           0.048341
2019-03-30           0.007937
2019-03-31           0.015152

[90 rows x 1 columns]
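
To make the difference between the aggregation and transform concrete, here is a minimal sketch. The fixed seed and the names grouped, monthly, misaligned and shares are mine, not from the question. I also use the month-start alias 'MS' instead of 'M', because recent pandas releases warn about 'M' (renamed to 'ME'); that is an assumption about your pandas version, and the idea is the same either way.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, only so the run is reproducible
idx = pd.date_range("2019-01-01", "2019-03-31", freq="D", name="date")
df = pd.DataFrame({"value_in_question": rng.integers(1, 100, size=len(idx))}, index=idx)

grouped = df.groupby(pd.Grouper(freq="MS"))  # 'MS' = month start (my choice, see above)

# Aggregation: one row per month, labelled 2019-01-01, 2019-02-01, 2019-03-01.
monthly = grouped.sum()
misaligned = df.div(monthly)  # aligns on the index, so only 3 dates find a partner
print(misaligned["value_in_question"].notna().sum())  # 3

# transform: the month sum is repeated on every original row, so all rows align.
shares = df.div(grouped.transform("sum"))
print(shares.index.equals(df.index))  # True
print(shares.groupby(shares.index.to_period("M")).sum())  # approximately 1.0 per month
```

With freq='M' the monthly labels are the month ends instead, which is why the first rows in the question all come out as NaN.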

For the sums, you can group by the month of the index:

In [31]: month_sum = df.groupby(df.index.strftime('%Y%m')).sum()
    ...: month_sum
    ...:
Out[31]:
        value_in_question
201901               1386
201902               1440
201903               1358

You can then use .map to align the month with the correct rows of your DataFrame:

In [32]: map_sum = df.index.strftime('%Y%m').map(month_sum['value_in_question'])
    ...: map_sum
    ...:
Out[32]:
Int64Index([1386, 1386, 1386, 1386, 1386, 1386, 1386, 1386, 1386, 1386, 1386,
            1386, 1386, 1386, 1386, 1386, 1386, 1386, 1386, 1386, 1386, 1386,
            1386, 1386, 1386, 1386, 1386, 1386, 1386, 1386, 1386, 1440, 1440,
            1440, 1440, 1440, 1440, 1440, 1440, 1440, 1440, 1440, 1440, 1440,
            1440, 1440, 1440, 1440, 1440, 1440, 1440, 1440, 1440, 1440, 1440,
            1440, 1440, 1440, 1440, 1358, 1358, 1358, 1358, 1358, 1358, 1358,
            1358, 1358, 1358, 1358, 1358, 1358, 1358, 1358, 1358, 1358, 1358,
            1358, 1358, 1358, 1358, 1358, 1358, 1358, 1358, 1358, 1358, 1358,
            1358, 1358],
           dtype='int64')

Then you just need to do the division:

In [33]: df['value_in_question'].div(map_sum)
Out[33]:
date
2019-01-01    0.012987
2019-01-02    0.018759
2019-01-03    0.000000
2019-01-04    0.056277
2019-01-05    0.019481
                ...
2019-03-27    0.031664
2019-03-28    0.007364
2019-03-29    0.050074
2019-03-30    0.033873
2019-03-31    0.005155
Name: value_in_question, Length: 90, dtype: float64
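
The three steps above can be folded into one function. This is only a sketch: relative_by_month_map is a hypothetical helper name, the seed is arbitrary, and the final mul(100) is added because the question asks for percentages rather than fractions.

```python
import numpy as np
import pandas as pd

def relative_by_month_map(df, column):
    """Return `column` as a percentage of its calendar-month sum (hypothetical helper)."""
    keys = df.index.strftime("%Y%m")             # e.g. '201901' for every January row
    month_sum = df[column].groupby(keys).sum()   # one total per month key
    per_row_sum = keys.map(month_sum)            # month total repeated for each row
    # Divide positionally by the per-row totals, then scale to percent.
    return df[column].div(per_row_sum.to_numpy()).mul(100)

rng = np.random.default_rng(0)  # fixed seed, only so the run is reproducible
idx = pd.date_range("2019-01-01", "2019-03-31", freq="D", name="date")
df = pd.DataFrame({"value_in_question": rng.integers(1, 100, size=len(idx))}, index=idx)

pct = relative_by_month_map(df, "value_in_question")
print(pct.head())
```

The result is a Series; wrap it with to_frame() if you need the same DataFrame shape as the input.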

Use Grouper with freq='M'.

The code is:

relative_df = df.groupby(pd.Grouper(freq='M'), group_keys=False)\
    .value_in_question.apply(lambda x: x.div(x.sum()).mul(100))

It returns a Series with the same index as the original DataFrame, whose values are value_in_question as a percentage of the sum for its month. group_keys=False keeps pandas 2.0 and later from prepending the month to the index.
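
As a cross-check, the apply-based result should agree with a transform-based one. The sketch below rests on a few assumptions of mine: a fixed seed, the 'MS' (month start) alias in place of 'M' since newer pandas versions warn about 'M', and group_keys=False so that pandas 2.x does not prepend the group key to the index. The names via_apply and via_transform are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, only so the run is reproducible
idx = pd.date_range("2019-01-01", "2019-03-31", freq="D", name="date")
df = pd.DataFrame({"value_in_question": rng.integers(1, 100, size=len(idx))}, index=idx)

# apply: compute the percentage inside each monthly group.
via_apply = (
    df.groupby(pd.Grouper(freq="MS"), group_keys=False)["value_in_question"]
      .apply(lambda x: x.div(x.sum()).mul(100))
)

# transform: broadcast each month's sum to its rows, then divide once.
month_sum = df.groupby(pd.Grouper(freq="MS"))["value_in_question"].transform("sum")
via_transform = df["value_in_question"].div(month_sum).mul(100)

print(np.allclose(via_apply, via_transform))  # True
```

The transform variant avoids the Python-level lambda, which tends to matter on larger frames.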
