Sorting datetime objects by hour to a Pandas dataframe, then visualize to histogram with Matplotlib

Question

I need to group viewers by hour and plot the result as a histogram. I have some experience using Matplotlib for the plotting, but I can't figure out the most pragmatic way to group the timestamps by hour.

First I read the data from a JSON file with pandas, then extract the two relevant columns, like this:

import pandas as pd

data = pd.read_json('data/data.json')

session_duration = pd.to_datetime(data.session_duration, unit='s').dt.time
time = pd.to_datetime(data.time, format='%H:%M:%S').dt.time

viewers = []

for x, y in zip(time, session_duration):
    viewers.append({str(x):str(y)})

EDIT: The source file looks like this, leaving out the irrelevant parts.

{
    "time": "00:00:09",
    "session_duration": 91
},
{
    "time": "00:00:16",
    "session_duration": 29
},
{
    "time": "00:00:33",
    "session_duration": 102
},
{
    "time": "00:00:35",
    "session_duration": 203
}

Note that the session_duration is in seconds.
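Because the durations are plain seconds, they can be turned into timedeltas and compared directly, without going through strings. The following is only a minimal sketch built on the four sample records above; the variable names and the "one minute or more" threshold are illustrative assumptions.

```python
import pandas as pd

# The four sample records from the question, as a list of dicts.
records = [
    {"time": "00:00:09", "session_duration": 91},
    {"time": "00:00:16", "session_duration": 29},
    {"time": "00:00:33", "session_duration": 102},
    {"time": "00:00:35", "session_duration": 203},
]
df = pd.DataFrame(records)

# Interpret the integer column as a number of seconds.
df["session_duration"] = pd.to_timedelta(df["session_duration"], unit="s")

# Boolean mask: sessions lasting one minute or more (assumed threshold).
at_least_one_minute = df["session_duration"] >= pd.Timedelta(minutes=1)
print(df[at_least_one_minute])
```

The mask can be negated with `~at_least_one_minute` to select the shorter sessions.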

I have to distinguish two types of viewers:

Those who spent less than 1 minute on the stream
Those who spent 1 minute or more on the stream

For that I do:

from datetime import datetime

viewers_more_than_1min = []
viewers_less_than_1min = []
one_minute = datetime.strptime('00:01:00', '%H:%M:%S').time()

for element in viewers:
    for time, session_duration in element.items():
        if datetime.strptime(session_duration, '%H:%M:%S').time() >= one_minute:
            viewers_more_than_1min.append(element)
        else:
            viewers_less_than_1min.append(element)

As a result I have my values in a list of dictionaries like this: {time: session_duration}, where the key is the time when the viewer ended the session and the value is the time spent watching.

[{'00:00:09': '00:01:31'},
 {'00:00:16': '00:00:29'},
 {'00:00:33': '00:01:42'},
 {'00:00:35': '00:03:23'},
 {'00:00:36': '00:00:32'},
 {'00:00:37': '00:04:47'},
 {'00:00:47': '00:00:42'},
 {'00:00:53': '00:00:44'},
 {'00:00:56': '00:00:28'},
 {'00:00:58': '00:01:17'},
 {'00:01:04': '00:01:16'},
 {'00:01:09': '00:00:46'},
 {'00:01:29': '00:01:07'},
 {'00:01:31': '00:01:02'},
 {'00:01:32': '00:01:01'},
 {'00:01:32': '00:00:36'},
 {'00:01:37': '00:03:03'},
 {'00:01:49': '00:00:57'},
 {'00:02:01': '00:02:15'},
 {'00:02:18': '00:01:16'}]

As a final step I wish to create a histogram with Matplotlib showing the viewer count per hour for each of the two viewer types mentioned above. I assume it would go something like this:

import matplotlib.pyplot as plt
import datetime as dt
hours = [(dt.time(i).strftime('%H:00')) for i in range(24)]
plt.xlabel('Hour')
plt.ylabel('Viewer count')
# sorted_viewcount_byhour: the 24 per-hour viewer counts, which I still need to compute
plt.bar(hours, sorted_viewcount_byhour)

You can do all of this just using DataFrame methods. You shouldn't be using loops here at all. Can you post some sample JSON data so we can recreate your output? Please keep the sample data very small and only include relevant columns.
– Dan, Commented May 8, 2019 at 13:34
Thanks Dan for your rapid answer! I have added the source JSON.
– midi, Commented May 8, 2019 at 13:47
Thanks for the source JSON, but note that I said to leave out irrelevant columns. You only have two relevant features here, time and session_duration, so remove the others. Also, we clearly need more than a single record to make a histogram. I suggest you edit again to improve that sample source JSON.
– Dan, Commented May 8, 2019 at 13:51
I have updated the source file again, thanks for the suggestions!
– midi, Commented May 8, 2019 at 14:20
Last suggestion on the source JSON: think about what output that input would give. All your data occur in the same hour, so your histogram would be a single bar. Consider providing data that produce an output that really tests whether the method has worked. In this case you will want multiple hours, with a different number of counts in each, so that the histogram looks like a histogram.
– Dan, Commented May 8, 2019 at 14:26
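
A sample along the lines Dan describes could be generated synthetically. The sketch below is an illustration only: the seed, the row count, and the duration range are arbitrary assumptions, and the values are random, not real viewer data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sample is reproducible

# Hypothetical parameters: 200 sessions spread over a whole day.
n = 200
seconds_of_day = rng.integers(0, 24 * 60 * 60, size=n)
durations = rng.integers(5, 300, size=n)  # session lengths in seconds

sample = pd.DataFrame({
    # Format seconds-since-midnight as HH:MM:SS strings, like the source file.
    "time": pd.to_datetime(seconds_of_day, unit="s").strftime("%H:%M:%S"),
    "session_duration": durations,
}).sort_values("time", ignore_index=True)

print(sample.head())
```

Calling `sample.to_json(path, orient="records")` would write the rows in the same record-per-object shape as the source file shown in the question.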

Dan · Accepted Answer · 2019-05-08 14:21:48Z

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_json('data/data.json')

df['time'] = pd.to_datetime(df['time'], format='%H:%M:%S')
# timedelta is a more appropriate data type for session_duration
df['session_duration'] = pd.to_timedelta(df['session_duration'], unit='s')

# Example filtering: sessions of at most 60 seconds
df_short_duration = df[df['session_duration'].dt.total_seconds() <= 60]

# Example creating histogram data: number of short sessions per hour
df_hist = df_short_duration.groupby(df_short_duration['time'].dt.hour).size()

# Now just plot df_hist as a bar chart using matplotlib
plt.bar(df_hist.index, df_hist.values)
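
The same idea can be extended to count both viewer types per hour in one pass. This is a sketch under assumptions rather than part of the original answer: the five-row sample is made up so that two hours are covered, the `viewer_type` column and its "short"/"long" labels are hypothetical names, and the 60-second boundary simply follows the `<= 60` filter above.

```python
import pandas as pd

# Small made-up sample covering two hours; swap in pd.read_json(...) for real data.
df = pd.DataFrame({
    "time": ["00:00:09", "00:00:16", "00:30:00", "01:05:00", "01:10:00"],
    "session_duration": [91, 29, 102, 45, 60],
})
df["time"] = pd.to_datetime(df["time"], format="%H:%M:%S")
df["session_duration"] = pd.to_timedelta(df["session_duration"], unit="s")

# Label each session; the boundary follows the "<= 60" filter (an assumption).
is_short = df["session_duration"].dt.total_seconds() <= 60
df["viewer_type"] = is_short.map({True: "short", False: "long"})

# Count sessions per (hour, viewer_type), then pivot viewer types into columns.
counts = (
    df.groupby([df["time"].dt.hour, "viewer_type"])
      .size()
      .unstack(fill_value=0)
      .reindex(range(24), fill_value=0)  # keep hours without sessions as zero
)
print(counts.head(3))
```

Reindexing to `range(24)` keeps empty hours in the table, so every hour gets a position on the x-axis when the counts are plotted.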

edited May 8, 2019 at 14:21

answered May 8, 2019 at 13:48

Dan


6 Comments

midi Over a year ago

It seems that after the conversion, df_short_duration['session_duration'] prints like this: 00:00:00.000000

Dan Over a year ago

that's why I needed to see the source data :) try adding in unit='s' as in the edit above

midi Over a year ago

Thank you, it seems to work! :) I picked up a lot of useful pandas syntax today. Another quick question, about plt.bar(): I need to widen the chart. I found that plt.subplots(figsize=(20, 5)) does that, but it creates a new object instead of modifying the existing one, which already has attributes like the title and labels. What should I do if I want the hours to fit nicely on the x-axis?

Dan Over a year ago

You should really ask this as a new question. But this is how I do it: fig, ax = plt.subplots(figsize=(20, 5)) and then instead of plt.bar(...) do ax.bar(...)
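
Spelled out, that suggestion might look like the sketch below. The counts are placeholder values rather than real data, the title is a made-up example, and the `Agg` backend line is only an assumption so the snippet runs without a display; it can be dropped in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, not needed in a notebook
import matplotlib.pyplot as plt

hours = list(range(24))
viewer_counts = [0] * 24  # placeholder values, not real data
viewer_counts[0], viewer_counts[1] = 3, 2

# Create the wide figure first, then set everything on the Axes object.
fig, ax = plt.subplots(figsize=(20, 5))
ax.bar(hours, viewer_counts)
ax.set_xlabel("Hour")
ax.set_ylabel("Viewer count")
ax.set_title("Viewers per hour")

# One tick per hour, labelled HH:00.
ax.set_xticks(hours)
ax.set_xticklabels([f"{h:02d}:00" for h in hours])
```

Setting the labels and title through `ax` instead of `plt` keeps them attached to the figure that was created with the wider `figsize`.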

midi Over a year ago

Thanks Dan! I have opened a new question as I still am struggling with outputting more than two bars to one X-axis value. The thread is linked with this one.
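
One common way to draw two bars per x-axis value is to shift each series by half a bar width. This is a hedged sketch with placeholder counts and made-up legend labels; it is not necessarily the approach taken in the linked question, and the `Agg` backend line is only an assumption for running without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np

hours = np.arange(24)
short_counts = np.zeros(24, dtype=int)  # placeholder values, not real data
long_counts = np.zeros(24, dtype=int)
short_counts[:2] = [1, 2]
long_counts[:2] = [2, 0]

width = 0.4  # each bar takes a bit less than half of the slot for one hour
fig, ax = plt.subplots(figsize=(20, 5))

# Shift the two series left and right of the hour position so they sit side by side.
ax.bar(hours - width / 2, short_counts, width=width, label="short sessions")
ax.bar(hours + width / 2, long_counts, width=width, label="long sessions")

ax.set_xticks(hours)
ax.set_xticklabels([f"{h:02d}:00" for h in hours])
ax.set_xlabel("Hour")
ax.set_ylabel("Viewer count")
ax.legend()
```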
