0

I have a table of data below: (the 1st column is date, the 2nd column is the daily return)

2020-01-02    0.022034
2020-01-03   -0.002666
2020-01-06    0.009716
2020-01-07    0.009838
2020-01-08   -0.011690
2020-01-09    0.025103
2020-01-10    0.009325
2020-01-13    0.028888
2020-01-14   -0.009183
2020-01-15    0.012292
2020-01-16   -0.005593
2020-01-17    0.020492
2020-01-20   -0.003878
2020-01-21   -0.032687
2020-01-22    0.034887
2020-01-23   -0.033485
2020-01-24    0.001934
2020-01-29   -0.026629
2020-01-30   -0.039513
2020-01-31   -0.001845
2020-02-03    0.021784
2020-02-04    0.033137
2020-02-05    0.000586
2020-02-06    0.016146
2020-02-07    0.000082
2020-02-10   -0.016997
2020-02-11    0.010172
2020-02-12    0.016836
2020-02-13    0.013530
                  ...
2022-01-31    0.031707
2022-02-04    0.028683
2022-02-07   -0.015853
2022-02-08   -0.024170
2022-02-09    0.045076
2022-02-10    0.013623
2022-02-11   -0.012259
2022-02-14   -0.023093
2022-02-15   -0.008984
2022-02-16    0.023177
2022-02-17    0.003182
2022-02-18   -0.054995
2022-02-21   -0.033302
2022-02-22   -0.028148
2022-02-23    0.012332
2022-02-24   -0.048095
2022-02-25   -0.004944
2022-02-28   -0.002682
2022-03-01    0.006940
2022-03-02    0.002542
2022-03-03   -0.006318
2022-03-04   -0.048641
2022-03-07   -0.050231
2022-03-08   -0.015469
2022-03-09    0.011477
2022-03-10   -0.002236
2022-03-11   -0.038740
2022-03-14   -0.115421
2022-03-15   -0.089573
2022-03-16    0.243084

I want to build a frequency table like below: enter image description here

I think this might involve several steps: (1) categorise daily return data into different ranges (2) use value_counts() on the ranges (3) calculate the percentage on the ranges

For the first step, I think I can try pd.cut with a groupby. However, my dataframe doesnt have a header, and I tried portret_df.columns = ['Dates','Daily Return'] but could not manage to add the header. May I ask how can I add a header there so that I can refer to the 1st and 2nd columns? Much appreciated for your help.

3
  • better show small example data, and first calculate result manually for this data, and show this result. And then we can see what result you really expect. With big data and image which doesn't match to this data we can't confirm if we create correct solution. Commented Mar 24, 2022 at 20:25
  • if you have problem to add headers then show FULL error message. I have no problem with your portret_df.columns = ['Dates','Daily Return']. Maybe you get it as single column and now it has problem to add two headers to single column. How did you get/read data? Commented Mar 24, 2022 at 20:34
  • maybe first check what you have in portret_df.columns and how many columns it shows. You can always select column by number. Commented Mar 24, 2022 at 20:35

2 Answers 2

2

Let's use pd.Series.value_counts with 'bins' parameter.

bins = [-np.inf,-.01,-.005,0,.005,.01,np.inf]
labels = ['ret < -1%', 
          '-1% < ret < -.5%', 
          '-.5% < ret < %0',
          ' %0 < ret < .5%',
          '.5% < ret < 1%',
          'reg > 1%']
df_counts = (df['ret'].value_counts(bins=bins, sort=False)
                      .rename('# of events').to_frame().set_axis(labels).T)
df_pcts = (df['ret'].value_counts(bins=bins, normalize=True, sort=False)
                    .rename('% of events').to_frame()
                    .set_axis(labels).T.mul(100).round(1))

pd.concat([df_counts, df_pcts])

Output:

             ret < -1%  -1% < ret < -.5%  -.5% < ret < %0   %0 < ret < .5%  .5% < ret < 1%  reg > 1%
# of events       20.0               4.0              6.0              5.0             4.0      20.0
% of events       33.9               6.8             10.2              8.5             6.8      33.9
Sign up to request clarification or add additional context in comments.

4 Comments

nice, never noticed value_counts had a bins param
@tdy Yeah... It is one of those often looked over parameters. :)
Thank you very much for your suggestion. For the code .rename('# of events').to_frame().set_axis(labels).T) It throw the error TypeError: set_axis() missing 1 required positional argument: 'labels . I wonder what exactly went wrong?
Older version of pandas. add inplace=False and axis=0(I think zero is rows). Or upgrade pandas. I am away from keyboard. Maybe someelse can edit this solution and test.
1

I have no idea why you have problem to change headers - maybe you read it as single column.


I have no problem to set headers when I load

df = pd.read_csv(..., names=['Date','Daily Return'])

or later

df.columns = ['Date','Daily Return']

And later I can use cut with bins=[min_val, -1, -0.5, 0, 0.5, 1, max_val]

min_val = df['Daily Return'].min() - 1
max_val = df['Daily Return'].max() + 1

regions = pd.cut(df['Daily Return'],
                 bins=[min_val, -1, -0.5, 0, 0.5, 1, max_val],
                 labels=['ret < -1(%)','-1 < ret < -0.5(%)', '-0.5 < ret < 0(%)','0 < ret < 0.5(%)','0.5 < ret < 1(%)','ret > 1(%)'],
                 )

And calculate number of events

count = regions.value_counts(sort=False)
print(count)
ret < -1(%)            0
-1 < ret < -0.5(%)     0
-0.5 < ret < 0(%)     30
0 < ret < 0.5(%)      29
0.5 < ret < 1(%)       0
ret > 1(%)             0

And use it to calculate percentage

size = len(regions)
percentage = (count/size) * 100
print(percentage)
ret < -1(%)            0.000000
-1 < ret < -0.5(%)     0.000000
-0.5 < ret < 0(%)     50.847458
0 < ret < 0.5(%)      49.152542
0.5 < ret < 1(%)       0.000000
ret > 1(%)             0.000000

Now it would need only to put all in DataFrame to format table

results = pd.DataFrame({'# of event': count, '% of event': percentage})

print(results.T.to_string())
            ret < -1(%)  -1 < ret < -0.5(%)  -0.5 < ret < 0(%)  0 < ret < 0.5(%)  0.5 < ret < 1(%)  ret > 1(%)
# of event          0.0                 0.0          30.000000         29.000000               0.0         0.0
% of event          0.0                 0.0          50.847458         49.152542               0.0         0.0

Full working code with example data readed with io.StringIO

but you should use own method to get data

text = '''2020-01-02    0.022034
2020-01-03   -0.002666
2020-01-06    0.009716
2020-01-07    0.009838
2020-01-08   -0.011690
2020-01-09    0.025103
2020-01-10    0.009325
2020-01-13    0.028888
2020-01-14   -0.009183
2020-01-15    0.012292
2020-01-16   -0.005593
2020-01-17    0.020492
2020-01-20   -0.003878
2020-01-21   -0.032687
2020-01-22    0.034887
2020-01-23   -0.033485
2020-01-24    0.001934
2020-01-29   -0.026629
2020-01-30   -0.039513
2020-01-31   -0.001845
2020-02-03    0.021784
2020-02-04    0.033137
2020-02-05    0.000586
2020-02-06    0.016146
2020-02-07    0.000082
2020-02-10   -0.016997
2020-02-11    0.010172
2020-02-12    0.016836
2020-02-13    0.013530
2022-01-31    0.031707
2022-02-04    0.028683
2022-02-07   -0.015853
2022-02-08   -0.024170
2022-02-09    0.045076
2022-02-10    0.013623
2022-02-11   -0.012259
2022-02-14   -0.023093
2022-02-15   -0.008984
2022-02-16    0.023177
2022-02-17    0.003182
2022-02-18   -0.054995
2022-02-21   -0.033302
2022-02-22   -0.028148
2022-02-23    0.012332
2022-02-24   -0.048095
2022-02-25   -0.004944
2022-02-28   -0.002682
2022-03-01    0.006940
2022-03-02    0.002542
2022-03-03   -0.006318
2022-03-04   -0.048641
2022-03-07   -0.050231
2022-03-08   -0.015469
2022-03-09    0.011477
2022-03-10   -0.002236
2022-03-11   -0.038740
2022-03-14   -0.115421
2022-03-15   -0.089573
2022-03-16    0.243084
'''

import pandas as pd
import io

df = pd.read_csv(io.StringIO(text), sep='\s+', names=['date', 'value'])
df.columns = ['Date','Daily Return']
#print(df)

min_val = df['Daily Return'].min() - 1
max_val = df['Daily Return'].max() + 1

regions = pd.cut(df['Daily Return'],
                 bins=[min_val, -1, -0.5, 0, 0.5, 1, max_val],
                 labels=['ret < -1(%)','-1 < ret < -0.5(%)', '-0.5 < ret < 0(%)','0 < ret < 0.5(%)','0.5 < ret < 1(%)','ret > 1(%)'],
                 )

count = regions.value_counts(sort=False)
print(count)

size = len(regions)
percentage = (count/size) * 100
print(percentage)

results = pd.DataFrame({'# of event': count, '% of event': percentage})

print(results.T.to_string())

EDIT:

As @tdy suggests in comment you can also use -np.inf, np.inf instead of min_val, max_val

import numpy as np

regions = pd.cut(df['Daily Return'],
                 bins=[-np.inf, -1, -0.5, 0, 0.5, 1, np.inf],
                 labels=['ret < -1(%)','-1 < ret < -0.5(%)', '-0.5 < ret < 0(%)','0 < ret < 0.5(%)','0.5 < ret < 1(%)','ret > 1(%)'],
                 )

6 Comments

Thank you very much for your reply and advice. I realised the problem was that portret_df is a "Series" object in my code, which is not a data frame. Therefore the column attribute didn't work.
+1, just a couple suggestions: you can use value_counts(normalize=True) to get proportions automatically, and you can avoid calculating the min/max bounds by using -np.inf and np.inf
Following up with my previous comment, I changed the series into dataframe with the following code: portretDF = portret_df.to_frame() and updated the header with portretDF.columns = ['Date', 'Return'] . However, an error message pops up: Length mismatch: Expected axis has 1 elements, new values have 2 elements
@tdy I was thinking about -np.inf and np.inf but finally I resigned :)
@silverpegasus in pd.cut(df['Daily Return'], ... it gets single column so it gets it as Series and as @tdy said you could use your series directly in pd.cut()
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.