
I have a pandas dataframe (stored in a .csv file) with the following format.

val,date,time
0.001,01JAN90,0:00:00
0.002,01JAN90,0:01:00
0.005,01JAN90,0:02:00
0.056,01JAN90,0:03:00
...
0.067,31DEC90,23:55:00
0.007,31DEC90,23:56:00
0.006,31DEC90,23:57:00
0.004,31DEC90,23:58:00
0.003,31DEC90,23:59:00

This is: a single float (val column) for each minute (time column) of each day (date column) in a year. I need to group the val entries from the entire year that fall within a given hour range. I define 15 hour ranges as:

t_range = [['5:30:00', '6:30:00'], ['6:30:00', '7:30:00'], ...,
['19:30:00', '20:30:00']]

The answer given in Pandas Groupby Range of Values deals with ranges defined as floats, but my ranges are defined as strings.

My idea is to first convert all the HH:MM:SS values in time to floats, and then apply the solution based on groupby and pd.cut. Is this the proper approach? If not, how should I be using pandas to do this?
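For reference, the conversion step I have in mind can be sketched like this (toy values for illustration; pd.to_timedelta parses HH:MM:SS strings directly):

```python
import pandas as pd

# Convert HH:MM:SS strings to minutes since midnight, so the hour
# ranges become numeric and usable with pd.cut.
times = pd.Series(['5:30:00', '6:30:00', '23:59:00'])
mins = pd.to_timedelta(times).dt.total_seconds() / 60
print(mins.tolist())  # [330.0, 390.0, 1439.0]
```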

  • Can you be sure that you have data at minutes frequency? Commented Apr 21, 2017 at 17:48
  • You mean if any rows are missing? I'm sure there are not, the length of the columns is exactly (N days in year)*(24 hs)*(60 mins) Commented Apr 21, 2017 at 17:50
  • Then I guess a simple approach is to rely on the number of observations (1h = 60 obs). If you define a dummy variable "count" 1,2,3... you can use modulo to understand in which range you are and then use group by Commented Apr 21, 2017 at 17:57
  • You are correct, but I'd prefer a general solution that does not make this assumption. Commented Apr 21, 2017 at 18:22
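The counter-and-modulo idea from the comments can be sketched as follows (toy data; it assumes exactly one row per minute with no gaps, which is the assumption the asker wants to avoid):

```python
import numpy as np
import pandas as pd

# One row per minute: an integer counter floor-divided by 60
# yields one group label per hour of data.
df = pd.DataFrame({'val': np.ones(180)})      # 3 hours of toy minute data
df['hour_block'] = np.arange(len(df)) // 60   # 0, 0, ..., 1, 1, ..., 2, 2, ...
sums = df.groupby('hour_block')['val'].sum()
print(sums.tolist())  # [60.0, 60.0, 60.0]
```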

1 Answer


IIUC you can do it this way:

import numpy as np
import pandas as pd

start = 5*60 + 30   # 5:30 as minutes since midnight
end = 20*60 + 30    # 20:30
step = 60           # one-hour bins

df['ts'] = pd.to_datetime(df.date + ' ' + df.time, format='%d%b%y %H:%M:%S')
df['mins'] = df.ts.dt.hour*60 + df.ts.dt.minute

# filter out all "non-interesting" entries
x = df.query("@start <= mins <= @end")

bins = np.arange(start-step, end+step, step)
labels = ['({0[0]:02d}:{0[1]:02d}:00, {1[0]:02d}:{1[1]:02d}:00]'.format(divmod(b, 60),
                                                                        divmod(b+step, 60))
          for b in bins[:-1]]

x.groupby(pd.cut(x['mins'], bins=bins, labels=labels))['val'].sum().dropna()

Result:

In [164]: x.groupby(pd.cut(x['mins'], bins=bins, labels=labels))['val'].sum().dropna()
Out[164]:
mins
(05:30:00, 06:30:00]    0.006
(06:30:00, 07:30:00]    0.004
(07:30:00, 08:30:00]    0.003
(08:30:00, 09:30:00]    0.111
(09:30:00, 10:30:00]    0.001
(10:30:00, 11:30:00]    0.002
(11:30:00, 12:30:00]    0.005
(12:30:00, 13:30:00]    0.056
Name: val, dtype: float64

Source DF:

In [166]: df
Out[166]:
     val     date      time
0  0.067  01DEC90  04:00:00
1  0.007  01DEC90  05:00:00
2  0.006  01DEC90  06:00:00
3  0.004  01DEC90  07:00:00
4  0.003  01DEC90  08:00:00
5  0.111  01DEC90  09:00:00
6  0.001  01JAN90  10:00:00
7  0.002  01JAN90  11:00:00
8  0.005  01JAN90  12:00:00
9  0.056  01JAN90  13:00:00

Explanation:

bins: bin edges, in minutes since midnight

In [181]: bins
Out[181]: array([ 270,  330,  390,  450,  510,  570,  630,  690,  750,  810,  870,  930,  990, 1050, 1110, 1170, 1230])

labels

In [182]: labels
Out[182]:
['(04:30:00, 05:30:00]',
 '(05:30:00, 06:30:00]',
 '(06:30:00, 07:30:00]',
 '(07:30:00, 08:30:00]',
 '(08:30:00, 09:30:00]',
 '(09:30:00, 10:30:00]',
 '(10:30:00, 11:30:00]',
 '(11:30:00, 12:30:00]',
 '(12:30:00, 13:30:00]',
 '(13:30:00, 14:30:00]',
 '(14:30:00, 15:30:00]',
 '(15:30:00, 16:30:00]',
 '(16:30:00, 17:30:00]',
 '(17:30:00, 18:30:00]',
 '(18:30:00, 19:30:00]',
 '(19:30:00, 20:30:00]']
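The divmod calls in the label comprehension simply split a minutes-since-midnight edge into an (hours, minutes) pair, which the format string renders as HH:MM:00:

```python
# divmod(minutes, 60) -> (hours, remaining minutes)
print(divmod(330, 60))  # (5, 30)

# Build one interval label from two consecutive bin edges.
label = '({0[0]:02d}:{0[1]:02d}:00, {1[0]:02d}:{1[1]:02d}:00]'.format(divmod(270, 60),
                                                                      divmod(330, 60))
print(label)  # (04:30:00, 05:30:00]
```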