I have a chat log with multiple participants (exported from WhatsApp) that I have converted to a pandas DataFrame. The aim is to plot messages sent over time, with a different line/colour for each person, in a few different plot styles: bar charts, line plots, etc. (this is mostly a practice exercise for me).
I have a class object myConvo, where myConvo.message_log is a DataFrame of the conversation. There is some dummy data at the bottom of this post if it helps. I start by filtering the desired data by date:
start_date=pd.Timestamp("2019-01-01 00:00:00")
end_date=pd.Timestamp("2019-12-31 00:00:00")
filt = (myConvo.message_log["date"] >= start_date) & (myConvo.message_log["date"] <= end_date)
# Chain set_index onto the filtered frame; inplace=True on a slice can trigger SettingWithCopyWarning
df = myConvo.message_log[filt].set_index("date")
I then get my message counts (y data) by grouping the data by sender, resampling into daily bins, and calling count():
grouped_df = df.groupby(["sender"])
grouped_df = grouped_df[["sender"]].resample("D").count()
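For readers who want to reproduce the shape of grouped_df without the full log, here is a minimal sketch on a hypothetical four-message log. The names toy and daily_counts are made up for illustration, and the sketch counts the message column rather than sender, which should give the same numbers as long as no message text is missing.

```python
import pandas as pd

# Hypothetical miniature log, indexed by message time (a stand-in for df).
toy = pd.DataFrame(
    {
        "sender": ["Person 1", "Person 2", "Person 3", "Person 1"],
        "message": ["hi", "hello", "hey", "bye"],
    },
    index=pd.to_datetime(
        [
            "2019-01-01 09:00",
            "2019-01-01 09:05",
            "2019-01-02 12:00",
            "2019-01-03 18:30",
        ]
    ),
)
toy.index.name = "date"

# Group by sender, bin each sender's messages by calendar day, count per bin.
# The result is a Series with a two-level index: (sender, date).
daily_counts = toy.groupby("sender")["message"].resample("D").count()

print(daily_counts)
```

Each sender is resampled only between their own first and last message, so Person 1 gets three daily rows here while Person 3 gets a single row.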
Side note: my program also has an option to plot cumulative messages sent for each person, which I currently have to compute one person at a time like this:
grouped_df.loc["Person 3"].cumsum()
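As a rough sketch of the cumulative step, the snippet below uses hypothetical daily counts (the names counts, one_person and all_people are illustrative only). It shows the one-person-at-a-time approach next to a grouped cumsum over the sender level, which is one possible way to get every running total in a single call; it has not been checked against the real log.

```python
import pandas as pd

# Hypothetical daily counts with a (sender, date) MultiIndex,
# shaped like the result of the groupby/resample step.
index = pd.MultiIndex.from_tuples(
    [
        ("Person 1", pd.Timestamp("2019-01-01")),
        ("Person 1", pd.Timestamp("2019-01-02")),
        ("Person 1", pd.Timestamp("2019-01-03")),
        ("Person 3", pd.Timestamp("2019-01-02")),
        ("Person 3", pd.Timestamp("2019-01-03")),
    ],
    names=["sender", "date"],
)
counts = pd.Series([1, 0, 2, 3, 1], index=index, name="count")

# One person at a time, as in the line above.
one_person = counts.loc["Person 3"].cumsum()

# All senders at once: the running total restarts for each sender.
all_people = counts.groupby(level="sender").cumsum()

print(all_people)
```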
Ideally I want to plot either the message counts per day for each person (i.e. a plot of grouped_df) OR the cumulative messages sent. I'm not sure how to do this using pandas' built-in plot methods; before I started using pandas, I did this by passing plain lists to matplotlib.
Now that I am using pandas, I have been converting the data to lists and plotting with matplotlib. This works, BUT as you'll see for Person 3, their time index (the date level of grouped_df) isn't the same as Person 1's or Person 2's, so converting to lists produces a list of a different length for each person. Matplotlib then throws an error when I try to plot these against a single list of x-axis values.
# Legend Data
participants = list(df["sender"].unique())
# Create y data; A list of values (message counts) for each person
participants_message_count = [ list(grouped_df["sender"].loc[person]) for person in participants ]
participants_message_cumsum = [ list(grouped_df["sender"].loc[person].cumsum()) for person in participants ]
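The length mismatch described above can be reproduced on a hypothetical five-message log. This is only an illustrative sketch: toy, counts and lengths are made-up names, and the message column is counted instead of sender.

```python
import pandas as pd

# Hypothetical miniature log, indexed by message time.
toy = pd.DataFrame(
    {
        "sender": ["Person 1", "Person 2", "Person 3", "Person 2", "Person 1"],
        "message": ["a", "b", "c", "d", "e"],
    },
    index=pd.to_datetime(
        [
            "2019-01-01 09:00",
            "2019-01-01 09:05",
            "2019-01-02 12:00",
            "2019-01-03 10:00",
            "2019-01-04 18:30",
        ]
    ),
)
toy.index.name = "date"

# Daily message counts per sender, indexed by (sender, date).
counts = toy.groupby("sender")["message"].resample("D").count()

# Same list-building pattern as above: one list of counts per participant.
participants = list(toy["sender"].unique())
per_person = {person: list(counts.loc[person]) for person in participants}

# Each sender only spans their own first-to-last message, so lengths differ.
lengths = {person: len(values) for person, values in per_person.items()}
print(lengths)
```

Because each list covers a different span of days, no single list of dates matches all of them in length.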
So my question is either: How do I plot a MultiIndex DataFrame with the datetime level on the x axis and each sender as a different line? OR How do I make sure the DataFrame uses the same set of dates for every sender, padding the message count column with zeros for any missing data?
Dummy Data:
{'sender': {Timestamp('2019-07-29 19:58:00'): 'Person 2',
Timestamp('2019-07-29 20:03:00'): 'Person 1',
Timestamp('2019-01-08 19:22:00'): 'Person 2',
Timestamp('2019-01-08 19:23:00'): 'Person 1',
Timestamp('2019-01-08 19:25:00'): 'Person 2',
Timestamp('2019-04-08 11:28:00'): 'Person 1',
Timestamp('2019-04-08 11:29:00'): 'Person 1',
Timestamp('2019-04-08 12:43:00'): 'Person 1',
Timestamp('2019-04-08 12:49:00'): 'Person 2',
Timestamp('2019-04-08 12:51:00'): 'Person 2',
Timestamp('2019-08-25 22:33:00'): 'Person 1',
Timestamp('2019-08-27 11:55:00'): 'Person 2',
Timestamp('2019-08-27 18:35:00'): 'Person 2',
Timestamp('2019-06-11 18:53:00'): 'Person 3',
Timestamp('2019-06-11 18:54:00'): 'Person 2',
Timestamp('2019-06-11 20:42:00'): 'Person 1',
Timestamp('2019-07-11 00:16:00'): 'Person 2',
Timestamp('2019-07-11 15:24:00'): 'Person 1',
Timestamp('2019-07-11 16:06:00'): 'Person 2',
Timestamp('2019-08-11 11:48:00'): 'Person 2',
Timestamp('2019-08-11 11:53:00'): 'Person 1',
Timestamp('2019-08-11 11:55:00'): 'Person 2',
Timestamp('2019-08-11 11:59:00'): 'Person 3',
Timestamp('2019-08-11 12:03:00'): 'Person 2',
Timestamp('2019-12-24 13:40:00'): 'Person 2',
Timestamp('2019-12-24 13:42:00'): 'Person 1',
Timestamp('2019-12-24 13:43:00'): 'Person 2',
Timestamp('2019-12-24 13:44:00'): 'Person 2'},
'message': {Timestamp('2019-07-29 19:58:00'): 'Hello',
Timestamp('2019-07-29 20:03:00'): 'Hi there',
Timestamp('2019-01-08 19:22:00'): "How's things",
Timestamp('2019-01-08 19:23:00'): 'good',
Timestamp('2019-01-08 19:25:00'): 'I am glad',
Timestamp('2019-04-08 11:28:00'): 'Me too.',
Timestamp('2019-04-08 11:29:00'): 'Indeed we are.',
Timestamp('2019-04-08 12:43:00'): 'I sure hope this is enough fake conversation for stackoverflow.',
Timestamp('2019-04-08 12:49:00'): 'Better write a few more messages just in case',
Timestamp('2019-04-08 12:51:00'): 'Oh yeah.',
Timestamp('2019-08-25 22:33:00'): "I'm going to stop now.",
Timestamp('2019-08-27 11:55:00'): 'redacted',
Timestamp('2019-08-27 18:35:00'): 'redacted',
Timestamp('2019-06-11 18:53:00'): 'redacted',
Timestamp('2019-06-11 18:54:00'): 'redacted',
Timestamp('2019-06-11 20:42:00'): 'redacted',
Timestamp('2019-07-11 00:16:00'): 'redacted',
Timestamp('2019-07-11 15:24:00'): 'redacted',
Timestamp('2019-07-11 16:06:00'): 'redacted',
Timestamp('2019-08-11 11:48:00'): 'redacted',
Timestamp('2019-08-11 11:53:00'): 'redacted',
Timestamp('2019-08-11 11:55:00'): 'redacted',
Timestamp('2019-08-11 11:59:00'): 'redacted',
Timestamp('2019-08-11 12:03:00'): 'redacted',
Timestamp('2019-12-24 13:40:00'): 'redacted',
Timestamp('2019-12-24 13:42:00'): 'redacted',
Timestamp('2019-12-24 13:43:00'): 'redacted',
Timestamp('2019-12-24 13:44:00'): 'redacted'}}

