Plot time series with known error (ggplot2)

Question

I'm working with American Community Survey (ACS) 1-year estimates for a specific location over several years. For example, I'm trying to plot how the proportion of men and women riding a bike to work changes over time. From the ACS, I get estimates and standard errors, which I can then use to calculate the upper and lower bounds of the estimates.

So the simplified data structure in wide format is like this:

| Year | EstimateM | MaxM | MinM | EstimateF | MaxF | MinF |
|------|-----------|------|------|-----------|------|------|
| 2005 | 3.0       | 3.5  | 2.5  | 2.0       | 2.3  | 1.7  |
| 2006 | 3.1       | 3.5  | 2.6  | 2.0       | 2.3  | 1.7  |
| 2007 | 5.0       | 4.2  | 5.8  | 2.5       | 3.0  | 2.0  |
| ...  | ...       | ...  | ...  | ...       | ...  | ...  |

If I only wanted to plot the estimates, I'd melt the data with only the two Estimate variables as measure.vars:

GenderModeCombined_long <- melt(GenderModeCombined,
                            id.vars = "Year",
                            measure.vars = c("EstimateM",
                                             "EstimateF"))

The long data can then be easily plotted with ggplot2:

# column names are case-sensitive: the id column is `Year`, not `year`
ggplot(data = GenderModeCombined_long,
       aes(x = Year, y = value, colour = variable)) +
  geom_point() +
  geom_line()

This produces a graph like so: [image on Imgur]

(sorry, don't have enough rep to post images)

Where I'm stuck is how to add error bars to the two estimate graphs. I could add them as measure vars to the melted dataset, but then how do I tell ggplot what should be plotted as values and what as error bars? Do I have to create a separate data frame with just the min/max data and then load that separately?

geom_errorbar(data = errordataMmax, aes(ymax = ??, ymin = ??))

I have the feeling that I'm somehow approaching this the wrong way and/or have my data set up the wrong way.

If you can make this question reproducible, you are much more likely to get a useful answer.
– Axeman, Commented Dec 28, 2018 at 21:04

lbusett · Accepted Answer · 2018-12-29 14:04:34Z


Welcome to SO. The problem here is that you have three "explicit" variables (Estimate, Min and Max) and an "implicit" one (gender) which is coded in column names. A way to solve this is to make "gender" an explicit grouping variable. After you go to long format, create a "gender" variable, remove the indication of gender from the key column (variable) and then go back to wide format. Something like this would work:

library(ggplot2)
library(dplyr)
library(tidyr)
library(tibble)

GenderModeCombined <- tibble::tribble(
  ~Year,   ~EstimateM,   ~MaxM,   ~MinM,   ~EstimateF,   ~MaxF,   ~MinF,  
  2005,         3.0,    3.5,    2.5,         2.0,    2.3,    1.7,  
  2006,         3.1,    3.5,    2.6,         2.0,    2.3,    1.7,  
  2007,         5.0,    4.2,    5.8,         2.5,    3.0,    2.0
)

GenderModeCombined.long <- GenderModeCombined %>% 
  # switch to long format
  tidyr::gather(variable, value, -Year,  factor_key = TRUE) %>% 
  # add a gender variable
  dplyr::mutate(gender   = stringr::str_sub(variable, -1)) %>% 
  # remove gender indication from the key column `variable`
  dplyr::mutate(variable = stringr::str_sub(variable, end = -2)) %>%
  # back to wide format
  tidyr::spread(variable, value)

GenderModeCombined.long
#> # A tibble: 6 x 5
#>    Year gender Estimate   Max   Min
#>   <dbl> <chr>     <dbl> <dbl> <dbl>
#> 1  2005 F           2     2.3   1.7
#> 2  2005 M           3     3.5   2.5
#> 3  2006 F           2     2.3   1.7
#> 4  2006 M           3.1   3.5   2.6
#> 5  2007 F           2.5   3     2  
#> 6  2007 M           5     4.2   5.8

ggplot(data=GenderModeCombined.long,
       aes(x=Year, y=Estimate,colour = gender)) +
  geom_point() +
  geom_line() + 
  geom_errorbar(aes(ymax = Max, ymin = Min))

Created on 2018-12-29 by the reprex package (v0.2.1)
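For readers on tidyr 1.0.0 or later, where gather() and spread() are superseded, the same gather / split / spread round trip can probably be done in one step with pivot_longer() and the special ".value" name. This is only a sketch, not part of the original answer: it assumes that every measure column ends in a single gender letter (M or F), and the names `long_pivot` and the regular expression are illustrative choices.

```r
library(tidyr)

# Same toy data as in the answer above (values copied from the question).
GenderModeCombined <- tibble::tribble(
  ~Year, ~EstimateM, ~MaxM, ~MinM, ~EstimateF, ~MaxF, ~MinF,
  2005,  3.0,        3.5,   2.5,   2.0,        2.3,   1.7,
  2006,  3.1,        3.5,   2.6,   2.0,        2.3,   1.7,
  2007,  5.0,        4.2,   5.8,   2.5,        3.0,   2.0
)

# Assumption: the last character of each column name is the gender code.
# ".value" tells pivot_longer() that the first capture group (Estimate,
# Max, Min) names the output value columns; the second capture group
# becomes the new `gender` column.
long_pivot <- pivot_longer(
  GenderModeCombined,
  cols          = -Year,
  names_to      = c(".value", "gender"),
  names_pattern = "^(.*)([MF])$"
)

long_pivot
```

The result has one row per Year and gender with Estimate, Max and Min columns, so it can be passed to the ggplot() call above in place of GenderModeCombined.long.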

edited Dec 29, 2018 at 14:04

answered Dec 28, 2018 at 21:25


3 Comments

Uwe Over a year ago

The lines are missing from the chart because Year is of type character, which ggplot2 treats as a discrete variable. Is there a specific reason to pass all numeric data as type character to tibble::tribble()?

lbusett Over a year ago

no reason at all. Just cut and paste laziness... I amended the answer. Thanks.

vgXhc Over a year ago

Thank you! So the block was indeed in the data structure -- I just couldn't wrap my head around how to fix it. Much appreciated!

Community · Accepted Answer · 2020-06-20 09:12:55Z

As explained by lbusett, the answer to this question is not so much about plotting but about reshaping the data from wide to long form. The challenge here is that there are multiple value columns, i.e., Estimate, Max, Min, for each gender.

As of version 1.9.6 (on CRAN 19 Sep 2015), data.table's implementation of the melt() function can melt, i.e., reshape from wide to long format, into multiple value columns in one go:

library(data.table)
options(datatable.print.class = TRUE)
cols <- c("Estimate", "Max", "Min")
long <- melt(setDT(GenderModeCombined), id.vars = "Year", measure.vars = patterns(cols), 
             value.name = cols, variable.name = "Gender")[
               , Gender := forcats::lvls_revalue(Gender, c("M", "F"))][]
long

    Year Gender Estimate   Max   Min
   <int> <fctr>    <num> <num> <num>
1:  2005      M      3.0   3.5   2.5
2:  2006      M      3.1   3.5   2.6
3:  2007      M      5.0   4.2   5.8
4:  2005      F      2.0   2.3   1.7
5:  2006      F      2.0   2.3   1.7
6:  2007      F      2.5   3.0   2.0

Now we have one row per Year and Gender with three value columns (Estimate, Max, Min), which can be plotted as desired:

library(ggplot2)
ggplot(long, aes(x = Year, y = Estimate, colour = Gender)) +
  geom_point() +
  geom_line() +
  geom_errorbar(aes(ymax = Max, ymin = Min), width = 0.1)

Note that this chart also shows lines in addition to points and error bars. This is because Year is of type integer, which ggplot2 treats as a continuous variable.
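If Year does end up as character (or factor), one possible workaround is to set the group aesthetic explicitly so that geom_line() connects the points within each gender. The sketch below is an illustration only: the data frame `long_chr` is a hypothetical copy of the long data with Year stored as character, and converting Year with as.integer() before plotting would be an equally valid fix.

```r
library(ggplot2)

# Hypothetical long-format data with Year stored as character,
# i.e. a discrete variable from ggplot2's point of view.
long_chr <- data.frame(
  Year     = rep(c("2005", "2006", "2007"), times = 2),
  Gender   = rep(c("M", "F"), each = 3),
  Estimate = c(3.0, 3.1, 5.0, 2.0, 2.0, 2.5),
  Max      = c(3.5, 3.5, 4.2, 2.3, 2.3, 3.0),
  Min      = c(2.5, 2.6, 5.8, 1.7, 1.7, 2.0)
)

# With a discrete x, the default grouping puts each point in its own
# group, so no lines are drawn. `group = Gender` restores one line
# per gender.
p <- ggplot(long_chr,
            aes(x = Year, y = Estimate, colour = Gender, group = Gender)) +
  geom_point() +
  geom_line() +
  geom_errorbar(aes(ymax = Max, ymin = Min), width = 0.1)

p
```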

Data

data.table's fread() function is very handy to read various data formats. So, we can read the data as posted by the OP with only a few modifications:

library(data.table)
GenderModeCombined <- fread(
"| Year | EstimateM | MaxM | MinM | EstimateF | MaxF | MinF |
| 2005 | 3.0       | 3.5  | 2.5  | 2.0       | 2.3  | 1.7  |
| 2006 | 3.1       | 3.5  | 2.6  | 2.0       | 2.3  | 1.7  |
| 2007 | 5.0       | 4.2  | 5.8  | 2.5       | 3.0  | 2.0  |
", drop = c(1L, 9L))

GenderModeCombined

    Year EstimateM  MaxM  MinM EstimateF  MaxF  MinF
   <int>     <num> <num> <num>     <num> <num> <num>
1:  2005       3.0   3.5   2.5       2.0   2.3   1.7
2:  2006       3.1   3.5   2.6       2.0   2.3   1.7
3:  2007       5.0   4.2   5.8       2.5   3.0   2.0

Thank you. This solution works for me as well. @lbusett's code with tidyverse is a little easier for me to read, but good to know that it can also be done with melt.
