Converting Pandas Series into Dataframe customized values

Question

start = pd.to_datetime("2017-02-21 22:32:41",infer_datetime_format=True)
end = pd.to_datetime("2017-02-22 01:32:41",infer_datetime_format=True)
rng = pd.date_range(start.floor('h'), end.floor('h'), freq='h')
left = pd.Series(rng, index=rng ).clip_lower(start)
right = pd.Series(rng + 1, index=rng).clip_upper(end)
s = right - left

I get result as

2017-02-21 22:00:00   00:27:19
2017-02-21 23:00:00   01:00:00
2017-02-22 00:00:00   01:00:00
2017-02-22 01:00:00   00:32:41

I want to convert the result pandas.Series to dataframe to make my result shown below

id    |hour|day|minute|
+-----+----+---+------+
|10001|  22|Wed|     27|
|10001|  23|Thu|     60|
|10001|  00|Thu|     60|
|10001|  01|Thu|     32|

Any direct conversion option or should I have to loop through it ?

just wanna confirm in your output 23 hour and 60 minutes correct? — Sociopath
– Sociopath, Commented Mar 10, 2018 at 5:11
@AkshayNevrekar Yes Im creating time series between two given dates with hour frequency. Yes missed ID part of code snippet ID will be another parameter into the function. — syv
– syv, Commented Mar 10, 2018 at 5:24
Just to confirm, you are using some data from the index and other from the values? — Alex
– Alex, Commented Mar 10, 2018 at 5:44
I think you can use Series.dt. pandas.pydata.org/pandas-docs/stable/… — Lambda
– Lambda, Commented Mar 10, 2018 at 5:45

piRSquared · Accepted Answer · 2018-03-10 08:25:47Z

5

Option 1
Using np.core.defchararray.split after using strftime
Followed up with assign after using floor division on number of seconds

pd.DataFrame(
    np.core.defchararray.split(s.index.strftime('%H %a')).tolist(),
    columns=['hour', 'day']
).assign(minute=(s.dt.seconds // 60).values)

  hour  day  minute
0   22  Tue      27
1   23  Tue      60
2   00  Wed      60
3   01  Wed      32

Option 2
Using a dictionaries in a list comprehension.
Note that I use Python 3.6 f-strings.
Otherwise use '{:02d}'.format(i.hour)

pd.DataFrame([dict(
    hour=f'{i.hour:02d}',
    day=i.strftime('%a'),
    minute=v.seconds // 60
) for i, v in s.items()])

   day  hour  minute
0  Tue    22      27
1  Tue    23      60
2  Wed    00      60
3  Wed    01      32

Option 3
And since the topic of speed came up, I wanted to offer another option that considers that.

a = np.array('Mon Tue Wed Thu Fri Sat Sun'.split())

pd.DataFrame(dict(
    hour=s.index.hour.astype(str).str.zfill(2),
    day=a[s.index.weekday],
    minute=s.values.astype('timedelta64[m]').astype(int)
))

   day  hour  minute
0  Tue    22      27
1  Tue    23      60
2  Wed    00      60
3  Wed    01      32

Complete Time Test

Note: I altered functions to ensure output was identical. Namely focusing on getting column order correct and Hour column as a string.

Functions

def jez(s):
    a = s.index.strftime('%H')
    b = s.index.strftime('%a')
    c = s.dt.floor('T').dt.total_seconds().div(60).astype(int)
    return pd.DataFrame({'hour':a,'day':b,'minute':c.values}, 
                        columns=['hour','day','minute'])

def pir1(s):
    return pd.DataFrame(
        np.core.defchararray.split(s.index.strftime('%H %a')).tolist(),
        columns=['hour', 'day']
    ).assign(minute=(s.dt.seconds // 60).values)

def pir2(s):
    return pd.DataFrame([dict(
        hour=f'{i.hour:02d}',
        day=i.strftime('%a'),
        minute=v.seconds // 60
    ) for i, v in s.items()], columns=['hour', 'day', 'minute'])

def pir3(s):
    a = np.array('Mon Tue Wed Thu Fri Sat Sun'.split())

    return pd.DataFrame(dict(
        hour=s.index.hour.astype(str).str.zfill(2),
        day=a[s.index.weekday],
        minute=s.values.astype('timedelta64[m]').astype(int)
    ), columns=['hour', 'day', 'minute'])

Back Test

res = pd.DataFrame(
    np.nan,
    [10, 30, 100, 300, 1000, 3000, 10000, 30000],
    'jez pir1 pir2 pir3'.split()
)

for i in res.index:
    start = pd.to_datetime("2007-02-21 22:32:41", infer_datetime_format=True)
    rng = pd.date_range(start.floor('h'), periods=i, freq='h')
    end = rng.max() + pd.to_timedelta("01:32:41")
    left = pd.Series(rng, index=rng).clip_lower(start)
    right = pd.Series(rng + 1, index=rng).clip_upper(end)
    s = right - left
    for j in res.columns:
        stmt = f'{j}(s)'
        setp = f'from __main__ import {j}, s'
        res.at[i, j] = timeit(stmt, setp, number=100)

Results

res.plot(loglog=True)

res.div(res.min(1), 0)

             jez       pir1       pir2      pir3
10      2.364757   1.922064   1.000000  1.124539
30      1.916160   2.092680   1.129115  1.000000
100     3.039881   3.361606   2.180457  1.000000
300     3.967504   5.025567   3.920143  1.000000
1000    7.106132   9.757840   7.607425  1.000000
3000   10.104004  14.741414  11.957978  1.000000
10000  10.522324  15.318158  13.262373  1.000000
30000  11.804760  16.718153  14.289628  1.000000

Conclusions

In the chart, you can see that jez, pir1 and pir2 are all grouped together when plotted in log space. This tells us that their time is growing at the same order of magnitude. However, pir3 has a large separation and gets bigger over larger data. The time complexity of pir3 is smaller and indicates a much larger advantage.

This becomes more clear as we look at the table of multiples. Each row has a lowest value of 1 which indicates the fastest time. All other values in that row are multiples of time it took to accomplish the same task. In other words. The larger the value, the slower the method. As you can see, those multiples get larger over larger data. That means the advantage of pir3 gets better and better.

That is what better looks like. It is pointless to boast of 25% time improvements. Unless you have order of magnitude improvements, it isn't worth trying to convince the readers that an algorithm or approach is "better".

edited Mar 10, 2018 at 8:25

answered Mar 10, 2018 at 6:57

piRSquared

296k68 gold badges509 silver badges654 bronze badges

Sign up to request clarification or add additional context in comments.

7 Comments

jezrael Over a year ago

It seems both solutions are slowier. I think the best is avoid loops in pandas ;)

piRSquared Over a year ago

@jezrael to be fair. In the grand scheme of things, over the size of dataframe you used to test, times of 142, 197, and 210 ms are the same order of magnitude and are all likely of similar time complexity. I wouldn't boast about marginal constant time improvements in execution time. Secondly, I'm using a loop in one solution and not the other. Third, what does best mean? Sometimes it is best to provide a solution that is intuitive and maintainable. And often that is a matter of opinion. I like my option 2 very much. I think it is elegant and obvious.

jezrael Over a year ago

hmmm, I only want to say general ruel in pandas - avoid loops. I timeit more your loop list comprehension solutions and always slowier. So it depends what need - more elegant or faster, but still elegant. And what is more or less elegant is still highly subjective ;)

jezrael Over a year ago

And second reason whoy dont use list comprehension are NaNs - all solutions failed.

piRSquared Over a year ago

Yes, f-strings are Python 3.6 and are awesome

|

jezrael · Accepted Answer · 2018-03-10 08:14:07Z

1

I think you need DatetimeIndex.strftime for hours and days of week and for minutes from timedeltas use Series.dt.floor + Series.dt.total_seconds:

a = s.index.strftime('%H')
b = s.index.strftime('%a')
c = s.dt.floor('T').dt.total_seconds().div(60).astype(int)
#alternative
#c = s.dt.total_seconds().floordiv(60).astype(int)
df = pd.DataFrame({'hour':a,'day':b,'minute':c.values}, 
                  columns=['hour','day','minute'])
print (df)
  hour  day  minute
0   22  Tue      27
1   23  Tue      60
2   00  Wed      60
3   01  Wed      32

edited Mar 10, 2018 at 8:14

answered Mar 10, 2018 at 5:53

jezrael

868k103 gold badges1.4k silver badges1.3k bronze badges

Collectives™ on Stack Overflow

Converting Pandas Series into Dataframe customized values

2 Answers 2

Complete Time Test

Functions

Back Test

Results

Conclusions

7 Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Complete Time Test

Functions

Back Test

Results

Conclusions

7 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related