How to plot smoothed summary stats in ggplot2

Question

I'm having trouble articulating this question. I have a dataset with daily income and expense for several years. I have been trying a few approaches so there are a lot of date columns now.

> str(df)
'data.frame':   3047 obs. of  8 variables:
 $ Date             : Factor w/ 1219 levels "2014-05-06T00:00:00.0000000",..: 6 9 2 3 4 6 10 11 13 14 ...
 $ YearMonthnumber  : Factor w/ 44 levels "2014/05","2014/06",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ cat              : Factor w/ 10 levels "Account Adjustment",..: 1 2 3 3 3 3 3 3 3 3 ...
 $ Value            : num  2.2 277.7 20 14.1 6.8 ...
 $ Income_or_expense: Factor w/ 2 levels "Expense","Income": 1 1 1 1 1 1 1 1 1 1 ...
 $ ddate            : Date, format: "2014-05-16" "2014-05-19" "2014-05-12" "2014-05-13" ...
 $ monthly          : Date, format: "2014-05-01" "2014-05-01" "2014-05-01" "2014-05-01" ...

Basically what I want to plot is:

1. the sum of each month's income and the sum of each month's expense (i.e. the Value column), where category (cat) is not "Transfer", coloured by Income_or_expense
2. a smoothed line through these summary points.

I can do step one, but not two. Here is what I have:

ggplot(data = subset(df, cat!="Transfer"), aes(x = monthly, y= Value, colour = Income_or_expense)) +
  stat_summary(fun.y = sum, geom = "point") +
  scale_x_date(labels = date_format("%Y-%m"))

How can I add a smooth geom to these resulting summary stats?

Edit: If I add + stat_summary(fun.y = sum, geom = "smooth"), the result is a line graph, not a smoothed model. And if I add it without fun.y = sum, then the smoothed line is based on daily values, not the monthly aggregates

Thanks.

Yes, if I add + stat_summary(fun.y = sum, geom = "smooth"), the result is basically a line graph, not a smoothed model. And if I add it without fun.y = sum, then the smoothed line is based on daily values, not the monthly aggregates.
– Anienumaked, Commented Jan 8, 2018 at 6:20

eipi10 · Accepted Answer · 2018-01-08 07:43:12Z

You could summarize the data by month first and then run geom_smooth on the summarized data. I've created some fake time series data for the example.

library(tidyverse)  
library(lubridate)

# Fake data
set.seed(2)
dat = data.frame(value = c(arima.sim(list(order = c(1,1,0), ar = 0.7), n = 364),
                           arima.sim(list(order = c(1,1,0), ar = 0.7), n = 364)) + 100,
                 IE = rep(c("Income","Expense"), each=365),
                 date = rep(seq(as.Date("2015-01-01"), as.Date("2015-12-31"), by="day"), 2))

Now we sum by month and plot. I've included points for the actual monthly sums to compare with the smoother line:

ggplot(dat %>% group_by(IE, month=month(date, label=TRUE)) %>% 
         summarise(value=sum(value)), 
       aes(month, value, colour=IE, group=IE)) +
  geom_smooth(se=FALSE, span=0.75) +  # span=0.75 is the default
  geom_point() +
  expand_limits(y=0) +
  theme_classic()
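
Applied to a data frame shaped like the one in the question, the same summarize-then-smooth idea might look like the sketch below. The toy df is invented for illustration; only the column names (cat, Value, Income_or_expense, monthly) are taken from the question's str() output. Because the sums are computed before plotting, geom_smooth sees one row per month and type rather than the daily values.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the question's data (all values are invented)
set.seed(1)
df <- data.frame(
  ddate = rep(seq(as.Date("2014-05-01"), as.Date("2016-04-30"), by = "day"), 2)
)
df$Income_or_expense <- rep(c("Income", "Expense"), each = nrow(df) / 2)
df$cat <- sample(c("Groceries", "Salary", "Transfer"), nrow(df), replace = TRUE)
df$Value <- runif(nrow(df), 1, 100)
df$monthly <- as.Date(format(df$ddate, "%Y-%m-01"))  # first day of each month

# One row per month and type, excluding transfers
monthly_totals <- df %>%
  filter(cat != "Transfer") %>%
  group_by(Income_or_expense, monthly) %>%
  summarise(Value = sum(Value), .groups = "drop")

p <- ggplot(monthly_totals, aes(monthly, Value, colour = Income_or_expense)) +
  geom_point() +
  geom_smooth(se = FALSE) +   # smoother fitted to the monthly sums
  scale_x_date(date_labels = "%Y-%m")
p
```

Keeping monthly as a Date (rather than a month label) also keeps separate years apart on the x axis.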

I'm not that familiar with time series analysis, but it seems like a better approach would be to calculate the monthly income and expense rate represented by each daily value and then run a smoother through it. That way you're not summarizing away the variation in the underlying data. In the plot below, I've included the individual points so you can compare them with the smoother line.

ggplot(dat %>% group_by(IE, month=month(date, label=TRUE)) %>% 
         mutate(value = value * n()), 
       aes(date, value, colour=IE)) +
  geom_smooth(se=FALSE, span=0.75) +
  geom_point(alpha=0.3, size=1) +
  expand_limits(y=0) +
  theme_classic()
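
Note that value * n() equals a monthly rate only when each group holds exactly one row per day, as the fake data does. With several transactions per day, as in the question's data, one possible adaptation (an assumption, not part of the approach above) is to total each day first and then scale by lubridate's days_in_month(). The tx data frame below is invented for illustration.

```r
library(dplyr)
library(lubridate)

# Invented example: two transactions on one day, one on another
tx <- data.frame(
  date  = as.Date(c("2015-01-10", "2015-01-10", "2015-02-03")),
  IE    = "Expense",
  value = c(4, 6, 5)
)

daily_rate <- tx %>%
  group_by(IE, date) %>%
  summarise(value = sum(value), .groups = "drop") %>%  # one row per day
  # scale each daily total by the length of its calendar month
  mutate(monthly_rate = value * as.integer(days_in_month(date)))

daily_rate
```

Here the January day totals 10 and is scaled by 31, and the February day totals 5 and is scaled by 28.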

You could also plot the 30-day rolling sum, which avoids grouping the data into arbitrary time periods. Once again, I've included points for the monthly income and expense rate represented by each daily value.

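library(xts)  # attaches zoo, which provides rollsum()

ggplot(dat %>% group_by(IE) %>% 
         # fill=NA pads the ends of the series (na.pad=TRUE is deprecated in zoo)
         mutate(rolling_sum = rollsum(value, k=30, align="center", fill=NA),
                value = value * 30), 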
       aes(date, colour=IE)) +
  geom_line(aes(y=rolling_sum), size=1) +
  geom_point(aes(y=value), alpha=0.2, size=1) +
  expand_limits(y=0) +
  theme_classic()
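
To see what the rolling sum does at the ends of a series, here is a tiny sketch on an invented vector. It calls zoo's rollsum directly (xts attaches zoo) and uses fill = NA to pad positions that lack a full window.

```r
library(zoo)

x <- c(1, 2, 3, 4, 5)

# Centered window of width 3; positions without a full window become NA
rs <- rollsum(x, k = 3, fill = NA, align = "center")
rs
```

The padded NA values at both ends are why the rolling-sum line in the plot starts and stops short of the first and last dates.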

edited Jan 8, 2018 at 7:43

answered Jan 8, 2018 at 7:06


2 Comments

Anienumaked Over a year ago

Thanks for the ideas. A question, though: when you enter dat %>% group_by(IE, month=month(date, label=TRUE)), is it grouping all months across all years (i.e. January 2015 together with January 2016, etc.), or is it grouping January 2015 and January 2016 separately? I hope it is the latter.

eipi10 Over a year ago

It's grouping a given month across all years. To group by individual months, you could do group_by(IE, year = year(date), month = month(date, label=TRUE)). Another option would be group_by(IE, year_month = as.yearmon(date)), using the as.yearmon function from the zoo package (so run library(zoo) first). Your data frame already has YearMonthnumber, so you could also group by that.
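
The difference can be checked on a few invented rows. The sketch below also shows a third option not mentioned in the comment, lubridate's floor_date(), which keeps the grouping key as a Date so it can go straight onto a date axis; dat2 and its values are made up for illustration.

```r
library(dplyr)
library(lubridate)

# Invented data: January values in two different years
dat2 <- data.frame(
  date  = as.Date(c("2015-01-15", "2015-01-20", "2016-01-15")),
  IE    = "Income",
  value = c(1, 2, 4)
)

# month() alone pools January 2015 with January 2016
by_month <- dat2 %>%
  group_by(IE, month = month(date, label = TRUE)) %>%
  summarise(value = sum(value), .groups = "drop")

# floor_date() keeps each calendar month separate and stays a Date
by_year_month <- dat2 %>%
  group_by(IE, year_month = floor_date(date, "month")) %>%
  summarise(value = sum(value), .groups = "drop")

by_month
by_year_month
```

The first summary has a single January row, while the second has one row for 2015-01-01 and one for 2016-01-01.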
