I need to group viewers by hour so I can plot them as a histogram. I have some experience with Matplotlib, but I can't work out the most pragmatic way to group the timestamps by hour.

First I read the data from a JSON file, then convert the two relevant columns with pandas, like this:

import pandas as pd

data = pd.read_json('data/data.json')

session_duration = pd.to_datetime(data.session_duration, unit='s').dt.time
time = pd.to_datetime(data.time, format='%H:%M:%S').dt.time

viewers = []

for x, y in zip(time, session_duration):
    viewers.append({str(x):str(y)})

EDIT: The source file looks like this, leaving out the irrelevant parts.

{
    "time": "00:00:09",
    "session_duration": 91
},
{
    "time": "00:00:16",
    "session_duration": 29
},
{
    "time": "00:00:33",
    "session_duration": 102
},
{
    "time": "00:00:35",
    "session_duration": 203
}

Note that the session_duration is in seconds.
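
Because the durations are plain integers, the unit has to be stated explicitly when they are converted. The sketch below is only an illustration, assuming the conversion is done with pd.to_timedelta (the code above uses pd.to_datetime instead); the values are the four durations from the sample above.

```python
import pandas as pd

# Durations in seconds, taken from the sample records above
durations = pd.Series([91, 29, 102, 203])

# Without a unit, pandas interprets the integers as nanoseconds
as_nanoseconds = pd.to_timedelta(durations)

# With unit='s' the integers are interpreted as seconds
as_seconds = pd.to_timedelta(durations, unit="s")

print(as_nanoseconds.dt.total_seconds().tolist())
print(as_seconds.dt.total_seconds().tolist())
```

The first conversion produces values that are tiny fractions of a second, so any "at least one minute" filter applied to them would match nothing.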

I have to distinguish two types of viewers:

  • Those who spent less than 1 minute on the stream
  • Those who spent 1 minute or more on the stream

For that I do:

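from datetime import datetime

viewers_more_than_1min = []
viewers_less_than_1min = []
# Threshold to compare against, parsed once instead of on every iteration
one_minute = datetime.strptime('00:01:00', '%H:%M:%S').time()

for element in viewers:
    for time, session_duration in element.items():
        if datetime.strptime(session_duration, '%H:%M:%S').time() >= one_minute:
            viewers_more_than_1min.append(element)
        else:
            viewers_less_than_1min.append(element)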

As a result I have a list of dictionaries of the form {time: session_duration}, where the key is the time at which the session ended and the value is the time spent watching.

[{'00:00:09': '00:01:31'},
 {'00:00:16': '00:00:29'},
 {'00:00:33': '00:01:42'},
 {'00:00:35': '00:03:23'},
 {'00:00:36': '00:00:32'},
 {'00:00:37': '00:04:47'},
 {'00:00:47': '00:00:42'},
 {'00:00:53': '00:00:44'},
 {'00:00:56': '00:00:28'},
 {'00:00:58': '00:01:17'},
 {'00:01:04': '00:01:16'},
 {'00:01:09': '00:00:46'},
 {'00:01:29': '00:01:07'},
 {'00:01:31': '00:01:02'},
 {'00:01:32': '00:01:01'},
 {'00:01:32': '00:00:36'},
 {'00:01:37': '00:03:03'},
 {'00:01:49': '00:00:57'},
 {'00:02:01': '00:02:15'},
 {'00:02:18': '00:01:16'}]

As a final step I want to create a histogram with Matplotlib showing the viewer count per hour for each of the two viewer types mentioned above. I assume it would go something like this:

import matplotlib.pyplot as plt
import datetime as dt
hours = [(dt.time(i).strftime('%H:00')) for i in range(24)]
plt.xlabel('Hour')
plt.ylabel('Viewer count')
plt.bar(hours, sorted_viewcount_byhour)

  • You can do all of this just using DataFrame methods. You shouldn't be using loops here at all. Can you post some sample json data so we can recreate your output? Please keep the sample data very small and only include relevant columns. Commented May 8, 2019 at 13:34
  • Thanks Dan for your rapid answer! I have added the source JSON. Commented May 8, 2019 at 13:47
  • Thanks for the source JSON, but note that I said to leave out irrelevant columns. You only have two relevant features here, time and session_duration, so remove the others. Also, we clearly need more than one single record to make a histogram. I suggest you edit again to improve that sample source json. Commented May 8, 2019 at 13:51
  • I have updated the source file again, thanks for the suggestions! Commented May 8, 2019 at 14:20
  • Last suggestion on the source JSON: think about what output that input would give. All your data occur in the same hour, so your histogram would be a single bar. Consider providing data that produce an output that really tests whether the method has worked. In this case you will want multiple hours, with a different number of counts in each, so that the histogram looks like a histogram. Commented May 8, 2019 at 14:26

1 Answer

import pandas as pd

df = pd.read_json('data/data.json')

# Parse the clock times; the strings look like '00:00:09'
df['time'] = pd.to_datetime(df['time'], format='%H:%M:%S')
# timedelta is a more appropriate data type for session_duration
df['session_duration'] = pd.to_timedelta(df['session_duration'], unit='s')

# Example filtering: sessions of at most 60 seconds
df_short_duration = df[df['session_duration'].dt.total_seconds() <= 60]

# Example creating histogram: number of sessions per hour of the day
df_hist = df_short_duration.groupby(df_short_duration['time'].dt.hour).size()
# Now just plot df_hist as a bar chart using matplotlib, might be something like plt.bar(df_hist.index, df_hist.values)
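
To make the approach above reproducible without the JSON file, here is a minimal sketch that builds the DataFrame inline and counts both viewer types per hour. The records are hypothetical (same shape as the question's JSON, but spread over several hours), and the 60-second split follows the filter above, so a session of exactly one minute counts as short. Reindexing over range(24) is an added step so that hours without sessions show up as zero.

```python
import pandas as pd

# Hypothetical records in the same shape as the question's JSON,
# spread over several hours so the result has more than one bar.
records = [
    {"time": "00:00:09", "session_duration": 91},
    {"time": "00:00:16", "session_duration": 29},
    {"time": "01:12:33", "session_duration": 102},
    {"time": "01:40:35", "session_duration": 45},
    {"time": "01:59:59", "session_duration": 60},
    {"time": "13:05:00", "session_duration": 203},
]
df = pd.DataFrame(records)
df["time"] = pd.to_datetime(df["time"], format="%H:%M:%S")
df["session_duration"] = pd.to_timedelta(df["session_duration"], unit="s")

# Boolean mask: True for sessions of at most 60 seconds
is_short = df["session_duration"].dt.total_seconds() <= 60
hour = df["time"].dt.hour

# Count sessions per hour for each viewer type, then fill in the
# hours with no sessions so every hour 0..23 has a value.
all_hours = range(24)
short_per_hour = (
    df[is_short].groupby(hour[is_short]).size().reindex(all_hours, fill_value=0)
)
long_per_hour = (
    df[~is_short].groupby(hour[~is_short]).size().reindex(all_hours, fill_value=0)
)

counts = pd.DataFrame({"short": short_per_hour, "long": long_per_hour})
# Show only the hours that have at least one session
print(counts[counts.sum(axis=1) > 0])
```

Each column of counts has one value per hour, which is the shape plt.bar expects for its heights.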

6 Comments

It seems that after the conversion, df_short_duration['session_duration'] outputs values like this: 00:00:00.000000
That's why I needed to see the source data :) Try adding unit='s', as in the edit above.
Thank you, it seems to work! :) I have picked up a lot of useful pandas syntax today. Another quick question, related to plt.bar(): I need to widen the plot. I found that I can do that with plt.subplots(figsize=(20, 5)), but this creates a new object instead of modifying the existing one, which already has attributes like the title and labels. How should I go about it if I want the hours to fit nicely on the x axis?
You should really ask this as a new question. But this is how I do it: fig, ax = plt.subplots(figsize=(20, 5)) and then instead of plt.bar(...) do ax.bar(...)
Thanks Dan! I have opened a new question, as I am still struggling with outputting more than one bar per x-axis value. The thread is linked with this one.
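
For reference, the fig, ax suggestion from the comments can be sketched as follows, with two bars per hour placed side by side. This is only one possible layout, not necessarily what the follow-up question settled on; the per-hour counts are hypothetical, and the Agg backend is selected only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

hours = np.arange(24)

# Hypothetical per-hour counts for the two viewer types
short_counts = np.zeros(24, dtype=int)
long_counts = np.zeros(24, dtype=int)
short_counts[[0, 1]] = [1, 2]
long_counts[[0, 1, 13]] = [1, 1, 1]

# Create the wide figure first, then draw on its Axes object
fig, ax = plt.subplots(figsize=(20, 5))

width = 0.4
# Shift each series by half a bar width so the two bars sit side by side
short_bars = ax.bar(hours - width / 2, short_counts, width, label="<= 1 min")
long_bars = ax.bar(hours + width / 2, long_counts, width, label="> 1 min")

# One tick per hour, labelled like the question's '%H:00' strings
ax.set_xticks(hours)
ax.set_xticklabels([f"{h:02d}:00" for h in hours])
ax.set_xlabel("Hour")
ax.set_ylabel("Viewer count")
ax.legend()
```

Labels and titles are set through ax (set_xlabel, set_ylabel, set_title) rather than plt, so they stay attached to the figure created by plt.subplots. Call fig.savefig(...) or plt.show() afterwards to see the result.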