I'm having trouble articulating this question. I have a dataset with daily income and expense for several years. I have been trying a few approaches so there are a lot of date columns now.

> str(df)
'data.frame':   3047 obs. of  8 variables:
 $ Date             : Factor w/ 1219 levels "2014-05-06T00:00:00.0000000",..: 6 9 2 3 4 6 10 11 13 14 ...
 $ YearMonthnumber  : Factor w/ 44 levels "2014/05","2014/06",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ cat              : Factor w/ 10 levels "Account Adjustment",..: 1 2 3 3 3 3 3 3 3 3 ...
 $ Value            : num  2.2 277.7 20 14.1 6.8 ...
 $ Income_or_expense: Factor w/ 2 levels "Expense","Income": 1 1 1 1 1 1 1 1 1 1 ...
 $ ddate            : Date, format: "2014-05-16" "2014-05-19" "2014-05-12" "2014-05-13" ...
 $ monthly          : Date, format: "2014-05-01" "2014-05-01" "2014-05-01" "2014-05-01" ...

Basically what I want to plot is:

  • the sum of each month's income and the sum of each month's expense (i.e. the Value column), where category (cat) is not "Transfer", coloured by Income_or_expense
  • plot a smoothed line through these summary points.

I can do step one, but not two. Here is what I have:

ggplot(data = subset(df, cat!="Transfer"), aes(x = monthly, y= Value, colour = Income_or_expense)) +
  stat_summary(fun.y = sum, geom = "point") +
  scale_x_date(labels = date_format("%Y-%m"))

How can I add a smooth geom to these resulting summary stats?

Edit: If I add + stat_summary(fun.y = sum, geom = "smooth"), the result is a line graph, not a smoothed model. And if I add it without fun.y = sum, then the smoothed line is based on the daily values, not the monthly aggregates.

Thanks.

  • Did you try: stat_summary(geom = 'smooth')? Commented Jan 8, 2018 at 6:17
  • Yes, if I add + stat_summary(fun.y = sum, geom = "smooth"), the result is basically a line graph, not a smoothed model. And if I add it without fun.y = sum, then the smoothed line is based on daily values, not the monthly aggregates. Commented Jan 8, 2018 at 6:20

1 Answer

You could summarize the data by month first and then run geom_smooth on the summarized data. I've created some fake time series data for the example.

library(tidyverse)  
library(lubridate)

# Fake data
set.seed(2)
dat = data.frame(value = c(arima.sim(list(order = c(1,1,0), ar = 0.7), n = 364),
                           arima.sim(list(order = c(1,1,0), ar = 0.7), n = 364)) + 100,
                 IE = rep(c("Income","Expense"), each=365),
                 date = rep(seq(as.Date("2015-01-01"), as.Date("2015-12-31"), by="day"), 2))

Now we sum by month and plot. I've included points for the actual monthly sums to compare with the smoother line:

ggplot(dat %>% group_by(IE, month=month(date, label=TRUE)) %>% 
         summarise(value=sum(value)), 
       aes(month, value, colour=IE, group=IE)) +
  geom_smooth(se=FALSE, span=0.75) +  # span=0.75 is the default
  geom_point() +
  expand_limits(y=0) +
  theme_classic()
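
The same idea can be sketched with the column names from the question (monthly, Value, Income_or_expense, cat). The data frame below is a made-up stand-in, not the asker's data, and the name monthly_totals is just illustrative. The point is that the sums are computed before plotting, so geom_smooth only ever sees one row per month and type:

```r
library(dplyr)
library(ggplot2)

# Made-up stand-in for the question's df: same column names, invented values
set.seed(1)
days <- seq(as.Date("2014-05-01"), as.Date("2016-04-30"), by = "day")
df <- data.frame(
  ddate = rep(days, 2),
  Value = runif(2 * length(days), 0, 100),
  Income_or_expense = rep(c("Income", "Expense"), each = length(days)),
  cat = sample(c("Groceries", "Salary", "Transfer"), 2 * length(days), replace = TRUE)
)
# First day of each row's month, like the question's `monthly` column
df$monthly <- as.Date(format(df$ddate, "%Y-%m-01"))

# Aggregate first: one row per month and Income/Expense, Transfers excluded
monthly_totals <- df %>%
  filter(cat != "Transfer") %>%
  group_by(Income_or_expense, monthly) %>%
  summarise(Value = sum(Value), .groups = "drop")

# Points are the monthly sums; the loess curve is fitted to those sums only
p <- ggplot(monthly_totals, aes(monthly, Value, colour = Income_or_expense)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
  scale_x_date(date_labels = "%Y-%m")
p
```

Because monthly is a real Date that includes the year, months from different years stay separate here.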

I'm not that familiar with time series analysis, but it seems like a better approach might be to convert each daily value into the monthly rate it represents (the daily value multiplied by the number of daily observations in its month) and then run a smoother through those rates. That way you're not summarizing away the variation in the underlying data. In the plot below, I've included the individual points so you can compare them with the smoother line.

ggplot(dat %>% group_by(IE, month=month(date, label=TRUE)) %>% 
         mutate(value = value * n()), 
       aes(date, value, colour=IE)) +
  geom_smooth(se=FALSE, span=0.75) +
  geom_point(alpha=0.3, size=1) +
  expand_limits(y=0) +
  theme_classic()

You could also plot the 30-day rolling sum, which avoids grouping the data into arbitrary time periods. Once again, I've included points for the monthly income and expense rate represented by each daily value.

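library(xts)  # attaches zoo, which provides rollsum()

ggplot(dat %>% group_by(IE) %>% 
         # fill=NA pads the incomplete windows at both ends (replaces the deprecated na.pad=TRUE)
         mutate(rolling_sum = rollsum(value, k=30, align="center", fill=NA),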
                value = value * 30), 
       aes(date, colour=IE)) +
  geom_line(aes(y=rolling_sum), size=1) +
  geom_point(aes(y=value), alpha=0.2, size=1) +
  expand_limits(y=0) +
  theme_classic()
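
One thing to expect from a centered rolling sum is that the line stops short of both ends of the date range, because the windows there are incomplete and get padded with NA. A tiny sketch with a made-up vector and a window of 3 (instead of 30) makes this visible:

```r
library(zoo)

x <- 1:5

# Centered window of width 3; positions without a full window become NA
rs <- rollsum(x, k = 3, align = "center", fill = NA)
rs
# NA  6  9 12 NA
```

With k = 30 on daily data, roughly the first and last two weeks of each series have no rolling sum for the same reason.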

2 Comments

Thanks for the ideas. A question though: when you write dat %>% group_by(IE, month=month(date, label=TRUE)), is it grouping each month across all years (i.e. January 2015 together with January 2016), or are January 2015 and January 2016 kept separate? I hope it is the latter.
It's grouping a given month across all years. To group by individual months, you could do group_by(IE, year = year(date), month = month(date, label=TRUE)). Another option would be group_by(IE, year_month = as.yearmon(date)), using the as.yearmon function from the zoo package (so run library(zoo) first). Your data frame already has YearMonthnumber, so you could also group by that.
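
To make the difference concrete, here is a small sketch on two years of made-up daily data (dat2, pooled and per_month are illustrative names). It uses lubridate's floor_date as a third option alongside the two mentioned above; it returns a Date, which is convenient for a date axis:

```r
library(dplyr)
library(lubridate)

# Two years of made-up daily values, each day worth 1
dat2 <- data.frame(
  date = seq(as.Date("2015-01-01"), as.Date("2016-12-31"), by = "day"),
  value = 1
)

# month() alone pools January 2015 with January 2016
pooled <- dat2 %>%
  group_by(month = month(date, label = TRUE)) %>%
  summarise(value = sum(value))

# floor_date() keeps the year, so each calendar month is its own group
per_month <- dat2 %>%
  group_by(month = floor_date(date, "month")) %>%
  summarise(value = sum(value))

nrow(pooled)     # 12
nrow(per_month)  # 24
```

In pooled, the January total is 62 (two Januaries of 31 days), while per_month has a separate row of 31 for each of them.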
