Pandas MultiIndex: Using same 2nd index for each 1st index

Question

I have a chat log with multiple participants (from WhatsApp) that I have converted to a pandas DataFrame. The aim is to plot messages sent over time, with a different line/colour for each person, in a few different plot styles: bar charts, line plots, etc. (This is mostly a practice exercise for me.)

I have a class object myConvo, where myConvo.message_log is a DataFrame of the conversation. There is some dummy data at the bottom of this post if it helps. I start by filtering the desired data by date:

start_date = pd.Timestamp("2019-01-01 00:00:00")
end_date = pd.Timestamp("2019-12-31 00:00:00")
filt = (myConvo.message_log["date"] >= start_date) & (myConvo.message_log["date"] <= end_date)
# Chain set_index rather than calling it with inplace=True on a filtered slice
df = myConvo.message_log[filt].set_index("date")
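As a minimal sketch of the same filtering step on a tiny stand-in frame (the data and the name `log` are made up for illustration, they are not from the real chat log): `Series.between` is inclusive on both ends by default, so it matches the `>=` / `<=` mask, and chaining `set_index` onto the filtered result avoids modifying a slice in place.

```python
import pandas as pd

# Hypothetical stand-in for myConvo.message_log
log = pd.DataFrame({
    "date": pd.to_datetime(["2018-12-31 23:59", "2019-01-08 19:22",
                            "2019-06-11 18:53", "2020-01-01 00:05"]),
    "sender": ["Person 1", "Person 2", "Person 3", "Person 1"],
    "message": ["a", "b", "c", "d"],
})

start_date = pd.Timestamp("2019-01-01 00:00:00")
end_date = pd.Timestamp("2019-12-31 00:00:00")

# between() includes both endpoints by default
in_range = log["date"].between(start_date, end_date)

# Filter, then move the timestamps into a sorted DatetimeIndex
df = log[in_range].set_index("date").sort_index()
```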

I then get my message counts (the y data) by grouping by sender, resampling into daily bins and calling count():

grouped_df = df.groupby(["sender"])
grouped_df = grouped_df[["sender"]].resample("D").count()
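To make the shape of this result concrete, here is a small sketch on invented data (the five messages below are hypothetical). It counts the `message` column rather than `sender`, which is an assumption made only to keep the column and index names distinct. Each sender is resampled over its own first-to-last date, so the per-sender slices end up with different lengths.

```python
import pandas as pd

# Hypothetical mini log: Person 3 only posts on a single day
df = pd.DataFrame(
    {"sender": ["Person 1", "Person 2", "Person 1", "Person 3", "Person 2"],
     "message": ["a", "b", "c", "d", "e"]},
    index=pd.to_datetime(["2019-01-01 10:00", "2019-01-01 11:00",
                          "2019-01-03 09:00", "2019-01-02 12:00",
                          "2019-01-04 08:00"]),
).rename_axis("date").sort_index()

# Result is a Series with a MultiIndex: level 0 = sender, level 1 = date
counts = df.groupby("sender")["message"].resample("D").count()

# Number of daily bins each sender ends up with
lengths = {person: len(counts.loc[person]) for person in counts.index.levels[0]}
```

Here `lengths` differs per sender, which is exactly what makes the per-person lists incompatible with a single shared x axis.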

Side note: my program also has an option to plot cumulative messages sent for each person, which I have to compute one person at a time, like this:

grouped_df.loc["Person 3"].cumsum()

Ideally I want to plot either the message counts per day for each person (i.e. a plot of grouped_df) or the cumulative messages sent. I'm not sure how to do this using pandas' built-in plot methods; before switching to pandas I did this by passing lists to matplotlib.

Now that I am using pandas, I have been converting the data to lists and plotting with matplotlib. This works, but as you'll see for Person 3, their time index isn't the same as Person 1's or Person 2's, so converting to lists produces a different-length list for each person. Matplotlib then throws an error when I try to plot these against a single list of x-axis values.

# Legend Data
participants = list(df["sender"].unique())

# Create y data; A list of values (message counts) for each person
participants_message_count = [ list(grouped_df["sender"].loc[person]) for person in participants ]
participants_message_cumsum = [ list(grouped_df["sender"].loc[person].cumsum()) for person in participants ]

So my question is either: How do I plot a MultiIndex DataFrame with the datetime index level as the x axis and each sender as a separate line? OR: How do I make sure the DataFrame uses the same datetime index values for every sender, padding the message count column with zeros for any missing dates?
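For the second variant of the question, one possible approach (a sketch on invented data, not necessarily the best fit for the real log) is to reindex the grouped counts onto the full product of senders and days with `MultiIndex.from_product`, filling missing combinations with zero. The names `all_days`, `full_index` and `padded` are hypothetical.

```python
import pandas as pd

# Hypothetical mini log: Person 3 only posts on a single day
df = pd.DataFrame(
    {"sender": ["Person 1", "Person 2", "Person 1", "Person 3", "Person 2"],
     "message": ["a", "b", "c", "d", "e"]},
    index=pd.to_datetime(["2019-01-01 10:00", "2019-01-01 11:00",
                          "2019-01-03 09:00", "2019-01-02 12:00",
                          "2019-01-04 08:00"]),
).rename_axis("date").sort_index()

counts = df.groupby("sender")["message"].resample("D").count()

# One shared daily axis spanning the earliest to the latest day seen
days = counts.index.get_level_values("date")
all_days = pd.date_range(days.min(), days.max(), freq="D")

# Every (sender, day) combination, whether or not a message exists
full_index = pd.MultiIndex.from_product(
    [counts.index.levels[0], all_days], names=["sender", "date"])

# Missing combinations become 0 instead of being absent
padded = counts.reindex(full_index, fill_value=0)

# Equal-length lists, one per sender, suitable for a shared x axis
participants_message_count = [padded.loc[p].tolist()
                              for p in counts.index.levels[0]]
```

After this step every sender has one value per day in `all_days`, so the lists can be passed to matplotlib against a single x axis.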

Dummy Data:

{'sender': {Timestamp('2019-07-29 19:58:00'): 'Person 2',
  Timestamp('2019-07-29 20:03:00'): 'Person 1',
  Timestamp('2019-01-08 19:22:00'): 'Person 2',
  Timestamp('2019-01-08 19:23:00'): 'Person 1',
  Timestamp('2019-01-08 19:25:00'): 'Person 2',
  Timestamp('2019-04-08 11:28:00'): 'Person 1',
  Timestamp('2019-04-08 11:29:00'): 'Person 1',
  Timestamp('2019-04-08 12:43:00'): 'Person 1',
  Timestamp('2019-04-08 12:49:00'): 'Person 2',
  Timestamp('2019-04-08 12:51:00'): 'Person 2',
  Timestamp('2019-08-25 22:33:00'): 'Person 1',
  Timestamp('2019-08-27 11:55:00'): 'Person 2',
  Timestamp('2019-08-27 18:35:00'): 'Person 2',
  Timestamp('2019-06-11 18:53:00'): 'Person 3',
  Timestamp('2019-06-11 18:54:00'): 'Person 2',
  Timestamp('2019-06-11 20:42:00'): 'Person 1',
  Timestamp('2019-07-11 00:16:00'): 'Person 2',
  Timestamp('2019-07-11 15:24:00'): 'Person 1',
  Timestamp('2019-07-11 16:06:00'): 'Person 2',
  Timestamp('2019-08-11 11:48:00'): 'Person 2',
  Timestamp('2019-08-11 11:53:00'): 'Person 1',
  Timestamp('2019-08-11 11:55:00'): 'Person 2',
  Timestamp('2019-08-11 11:59:00'): 'Person 3',
  Timestamp('2019-08-11 12:03:00'): 'Person 2',
  Timestamp('2019-12-24 13:40:00'): 'Person 2',
  Timestamp('2019-12-24 13:42:00'): 'Person 1',
  Timestamp('2019-12-24 13:43:00'): 'Person 2',
  Timestamp('2019-12-24 13:44:00'): 'Person 2'},
 'message': {Timestamp('2019-07-29 19:58:00'): 'Hello',
  Timestamp('2019-07-29 20:03:00'): 'Hi there',
  Timestamp('2019-01-08 19:22:00'): "How's things",
  Timestamp('2019-01-08 19:23:00'): 'good',
  Timestamp('2019-01-08 19:25:00'): 'I am glad',
  Timestamp('2019-04-08 11:28:00'): 'Me too.',
  Timestamp('2019-04-08 11:29:00'): 'Indeed we are.',
  Timestamp('2019-04-08 12:43:00'): 'I sure hope this is enough fake conversation for stackoverflow.',
  Timestamp('2019-04-08 12:49:00'): 'Better write a few more messages just in case',
  Timestamp('2019-04-08 12:51:00'): 'Oh yeah.',
  Timestamp('2019-08-25 22:33:00'): "I'm going to stop now.",
  Timestamp('2019-08-27 11:55:00'): 'redacted',
  Timestamp('2019-08-27 18:35:00'): 'redacted',
  Timestamp('2019-06-11 18:53:00'): 'redacted',
  Timestamp('2019-06-11 18:54:00'): 'redacted',
  Timestamp('2019-06-11 20:42:00'): 'redacted',
  Timestamp('2019-07-11 00:16:00'): 'redacted',
  Timestamp('2019-07-11 15:24:00'): 'redacted',
  Timestamp('2019-07-11 16:06:00'): 'redacted',
  Timestamp('2019-08-11 11:48:00'): 'redacted',
  Timestamp('2019-08-11 11:53:00'): 'redacted',
  Timestamp('2019-08-11 11:55:00'): 'redacted',
  Timestamp('2019-08-11 11:59:00'): 'redacted',
  Timestamp('2019-08-11 12:03:00'): 'redacted',
  Timestamp('2019-12-24 13:40:00'): 'redacted',
  Timestamp('2019-12-24 13:42:00'): 'redacted',
  Timestamp('2019-12-24 13:43:00'): 'redacted',
  Timestamp('2019-12-24 13:44:00'): 'redacted'}}

Quang Hoang · Accepted Answer · 2020-06-10 14:34:01Z

You can unstack sender after the groupby so that it becomes the columns, then plot:

(df.groupby('sender').message
   .resample('D').count()
   .unstack('sender')
   .plot()
)

Output: [Figure: daily message counts over time, one line per sender]

And if you want a cumulative sum, just apply cumsum before .plot:

(df.groupby('sender').message
   .resample('D').count()
   .unstack('sender')
   .cumsum()
   .plot()
)

Output: [Figure: cumulative message counts over time, one line per sender]
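One thing to be aware of: a plain `unstack` leaves NaN wherever a sender has no row for a given day. If zero padding is wanted, as the question asks, `Series.unstack` accepts a `fill_value` argument. The following is a sketch on invented data (the five messages are hypothetical), with the plotting call left as a comment so the snippet runs without matplotlib.

```python
import pandas as pd

# Hypothetical mini log: Person 3 only posts on a single day
df = pd.DataFrame(
    {"sender": ["Person 1", "Person 2", "Person 1", "Person 3", "Person 2"],
     "message": ["a", "b", "c", "d", "e"]},
    index=pd.to_datetime(["2019-01-01 10:00", "2019-01-01 11:00",
                          "2019-01-03 09:00", "2019-01-02 12:00",
                          "2019-01-04 08:00"]),
).rename_axis("date").sort_index()

# Dates as rows, senders as columns, zeros where a sender has no entry
daily = (df.groupby("sender")["message"]
           .resample("D").count()
           .unstack("sender", fill_value=0))

# Running total per sender, computed column by column
cumulative = daily.cumsum()

# daily.plot() or cumulative.plot() would draw one line per sender
# (plotting requires matplotlib to be installed)
```

With the zeros in place, `cumulative` stays flat on days a sender is silent instead of showing gaps.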

edited Jun 10, 2020 at 14:34

answered Jun 10, 2020 at 14:28


1 Comment

gazm2k5 Over a year ago

Thanks! That solves both problems for me. And such a speedy response.
