
I'm working with American Community Survey (ACS) 1-year estimates for a specific location over several years. For example, I'm trying to plot how the proportions of men and women who bike to work change over time. From the ACS, I get estimates and standard errors, which I can then use to calculate the upper and lower bounds of the estimates.

So the simplified data structure in wide format is like this:

| Year | EstimateM | MaxM | MinM | EstimateF | MaxF | MinF |
|------|-----------|------|------|-----------|------|------|
| 2005 | 3.0       | 3.5  | 2.5  | 2.0       | 2.3  | 1.7  |
| 2006 | 3.1       | 3.5  | 2.6  | 2.0       | 2.3  | 1.7  |
| 2007 | 5.0       | 4.2  | 5.8  | 2.5       | 3.0  | 2.0  |
| ...  | ...       | ...  | ...  | ...       | ...  | ...  |
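
For reference, bounds such as MaxM and MinM can be derived from an estimate and its standard error. The following is a minimal sketch, assuming a 90% confidence level (the level the Census Bureau uses for published ACS margins of error, z = 1.645). The data frame `acs`, its `SE` column, and the values in it are hypothetical and not part of the data above.

```r
# Hypothetical input: one estimate and one standard error per year
# (illustrative values only, not real ACS figures)
acs <- data.frame(
  Year     = c(2005, 2006),
  Estimate = c(3.0, 3.1),
  SE       = c(0.30, 0.27)
)

# Assumption: 90% confidence level, i.e. z = 1.645
z <- 1.645

# Upper and lower bounds of each estimate
acs$Max <- acs$Estimate + z * acs$SE
acs$Min <- acs$Estimate - z * acs$SE

acs
```

A different confidence level only changes `z` (for example, 1.96 for 95%).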

If I only wanted to plot the estimates, I'd melt the data with only the two Estimate variables as measure.vars:

GenderModeCombined_long <- melt(GenderModeCombined,
                            id.vars = "Year",
                            measure.vars = c("EstimateM",
                                             "EstimateF"))

The long data can then be easily plotted with ggplot2:

ggplot(data = GenderModeCombined_long,
  aes(x = Year, y = value, colour = variable)) +
  geom_point() +
  geom_line()

This produces a line graph of the two estimates over time (the image was linked on Imgur and is not reproduced here).

Where I'm stuck is how to add error bars to the two estimate graphs. I could add them as measure vars to the melted dataset, but then how do I tell ggplot what should be plotted as values and what as error bars? Do I have to create a separate data frame with just the min/max data and then load that separately?

geom_errorbar(data = errordataMmax, aes(ymax = ??, ymin = ??)) 

I have the feeling that I'm somehow approaching this the wrong way and/or have my data set up the wrong way.

If you can make this question reproducible, you are much more likely to get a useful answer. Commented Dec 28, 2018 at 21:04

2 Answers


Welcome to SO. The problem here is that you have three "explicit" variables (Estimate, Min and Max) and an "implicit" one (gender) which is coded in column names. A way to solve this is to make "gender" an explicit grouping variable. After you go to long format, create a "gender" variable, remove the indication of gender from the key column (variable) and then go back to wide format. Something like this would work:

library(ggplot2)
library(dplyr)
library(tidyr)
library(tibble)

GenderModeCombined <- tibble::tribble(
  ~Year,   ~EstimateM,   ~MaxM,   ~MinM,   ~EstimateF,   ~MaxF,   ~MinF,  
  2005,         3.0,    3.5,    2.5,         2.0,    2.3,    1.7,  
  2006,         3.1,    3.5,    2.6,         2.0,    2.3,    1.7,  
  2007,         5.0,    4.2,    5.8,         2.5,    3.0,    2.0
)

GenderModeCombined.long <- GenderModeCombined %>% 
  # switch to long format
  tidyr::gather(variable, value, -Year,  factor_key = TRUE) %>% 
  # add a gender variable
  dplyr::mutate(gender   = stringr::str_sub(variable, -1)) %>% 
  # remove gender indication from the key column `variable`
  dplyr::mutate(variable = stringr::str_sub(variable, end = -2)) %>%
  # back to wide format
  tidyr::spread(variable, value)

GenderModeCombined.long
#> # A tibble: 6 x 5
#>    Year gender Estimate   Max   Min
#>   <dbl> <chr>     <dbl> <dbl> <dbl>
#> 1  2005 F           2     2.3   1.7
#> 2  2005 M           3     3.5   2.5
#> 3  2006 F           2     2.3   1.7
#> 4  2006 M           3.1   3.5   2.6
#> 5  2007 F           2.5   3     2  
#> 6  2007 M           5     4.2   5.8

ggplot(data=GenderModeCombined.long,
       aes(x=Year, y=Estimate,colour = gender)) +
  geom_point() +
  geom_line() + 
  geom_errorbar(aes(ymax = Max, ymin = Min))  

Created on 2018-12-29 by the reprex package (v0.2.1)
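
As a variation on the gather/spread round trip above, newer versions of tidyr (1.0.0 and later) can do the same reshape in one step with pivot_longer(). This is only a sketch, not the code used in this answer. It assumes that every measure column ends in a single-letter gender suffix (M or F), and the object name `long` is chosen here for illustration.

```r
library(tidyr)

# Same wide data as in the question
GenderModeCombined <- data.frame(
  Year      = c(2005, 2006, 2007),
  EstimateM = c(3.0, 3.1, 5.0), MaxM = c(3.5, 3.5, 4.2), MinM = c(2.5, 2.6, 5.8),
  EstimateF = c(2.0, 2.0, 2.5), MaxF = c(2.3, 2.3, 3.0), MinF = c(1.7, 1.7, 2.0)
)

# Split each column name into a value-column part and a gender part.
# The special name ".value" tells pivot_longer() to keep Estimate, Max and Min
# as separate columns instead of stacking them into one.
long <- pivot_longer(
  GenderModeCombined,
  cols          = -Year,
  names_to      = c(".value", "gender"),
  names_pattern = "(.+)(M|F)$"  # assumption: suffix is exactly M or F
)

long
```

The result has one row per Year and gender with the columns Estimate, Max and Min, so it can be passed to the ggplot() call above unchanged.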


3 Comments

The lines are missing from the chart because Year is of type character which is treated as discrete variable by ggplot2. Is there a specific reason to pass all numeric data as type character to tibble::tribble()?
no reason at all. Just cut and paste laziness... I amended the answer. Thanks.
Thank you! So the block was indeed in the data structure -- I just couldn't wrap my head around how to fix it. Much appreciated!

As explained by lbusett, the answer to this question is not so much about plotting but about reshaping the data from wide to long form. The challenge here is that there are multiple value columns, i.e., Estimate, Max, Min, for each gender.

As of version v1.9.6 (on CRAN 19 Sep 2015), data.table's incarnation of the melt() function allows for melting, i.e., reshaping from wide to long format, into multiple columns in one go:

library(data.table)
options(datatable.print.class = TRUE)
cols <- c("Estimate", "Max", "Min")
long <- melt(setDT(GenderModeCombined), id.vars = "Year", measure.vars = patterns(cols), 
             value.name = cols, variable.name = "Gender")[
               , Gender := forcats::lvls_revalue(Gender, c("M", "F"))][]
long
    Year Gender Estimate   Max   Min
   <int> <fctr>    <num> <num> <num>
1:  2005      M      3.0   3.5   2.5
2:  2006      M      3.1   3.5   2.6
3:  2007      M      5.0   4.2   5.8
4:  2005      F      2.0   2.3   1.7
5:  2006      F      2.0   2.3   1.7
6:  2007      F      2.5   3.0   2.0
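
The relabelling step above relies on the M columns appearing before the F columns, because melt() numbers the groups ("1", "2") in the order in which columns match each pattern. As a hedged alternative that avoids hard-coding that order, the labels could be derived from the column names themselves. The names `value_cols` and `suffixes` are introduced here for illustration only.

```r
library(data.table)

# Same wide data as in the question
GenderModeCombined <- data.table(
  Year      = 2005:2007,
  EstimateM = c(3.0, 3.1, 5.0), MaxM = c(3.5, 3.5, 4.2), MinM = c(2.5, 2.6, 5.8),
  EstimateF = c(2.0, 2.0, 2.5), MaxF = c(2.3, 2.3, 3.0), MinF = c(1.7, 1.7, 2.0)
)

value_cols <- c("Estimate", "Max", "Min")

# Melt into three value columns at once; Gender is a factor with levels "1", "2"
long <- melt(GenderModeCombined, id.vars = "Year",
             measure.vars = patterns(value_cols),
             value.name = value_cols, variable.name = "Gender")

# Derive the labels from the Estimate* column names, in column order
# (assumption: the Max* and Min* columns use the same suffixes in the same order)
suffixes <- sub("^Estimate", "",
                grep("^Estimate", names(GenderModeCombined), value = TRUE))

# Relabel the numeric levels with the derived suffixes
long[, Gender := factor(Gender, labels = suffixes)]

long
```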

Now we have one row per Year and Gender, holding three values (Estimate, Max, and Min), which can be plotted as desired:

library(ggplot2)
ggplot(long, aes(x = Year, y = Estimate, colour = Gender)) +
  geom_point() +
  geom_line() +
  geom_errorbar(aes(ymax = Max, ymin = Min), width = 0.1)


Note that this chart shows lines in addition to points and error bars. This is because Year is of type integer, which ggplot2 treats as a continuous variable.
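
To illustrate the point about the type of Year, here is a small sketch of two ways to get connected lines when Year arrives as character: convert it to integer, or keep the discrete axis and set the group aesthetic explicitly. The data frame `long_chr` is a hypothetical stand-in built from the estimates above.

```r
library(ggplot2)

# Hypothetical long-format data with Year stored as character
long_chr <- data.frame(
  Year     = rep(c("2005", "2006", "2007"), times = 2),
  Gender   = rep(c("M", "F"), each = 3),
  Estimate = c(3.0, 3.1, 5.0, 2.0, 2.0, 2.5),
  stringsAsFactors = FALSE
)

# Option 1: convert Year to integer so that the x axis is continuous
long_int <- transform(long_chr, Year = as.integer(Year))
p1 <- ggplot(long_int, aes(x = Year, y = Estimate, colour = Gender)) +
  geom_point() +
  geom_line()

# Option 2: keep the discrete x axis, but tell geom_line() which points
# belong together by grouping on Gender
p2 <- ggplot(long_chr, aes(x = Year, y = Estimate, colour = Gender,
                           group = Gender)) +
  geom_point() +
  geom_line()
```

Without `group = Gender`, a discrete x axis puts every point in its own group, so geom_line() has nothing to connect.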

Data

data.table's fread() function is very handy for reading various data formats. So, we can read the data as posted by the OP with only a few modifications:

library(data.table)
GenderModeCombined <- fread(
"| Year | EstimateM | MaxM | MinM | EstimateF | MaxF | MinF |
| 2005 | 3.0       | 3.5  | 2.5  | 2.0       | 2.3  | 1.7  |
| 2006 | 3.1       | 3.5  | 2.6  | 2.0       | 2.3  | 1.7  |
| 2007 | 5.0       | 4.2  | 5.8  | 2.5       | 3.0  | 2.0  |
", drop = c(1L, 9L))

GenderModeCombined
    Year EstimateM  MaxM  MinM EstimateF  MaxF  MinF
   <int>     <num> <num> <num>     <num> <num> <num>
1:  2005       3.0   3.5   2.5       2.0   2.3   1.7
2:  2006       3.1   3.5   2.6       2.0   2.3   1.7
3:  2007       5.0   4.2   5.8       2.5   3.0   2.0

1 Comment

Thank you. This solution works for me as well. @lbusett 's code with tidyverse is a little easier for me to read, but good to know that it can also be done with melt
