1

Problem

I have just started R two days back. I have gone through some basic R tutorials and I am able to plot two dimensional data. I pull data from an Oracle database. Now, I am having problems when I try to merge two graph types (Line and Bar) using secondary axis.

I have no problem, plotting this data on Excel. Following is the plot:

enter image description here

I am unable to plot it on R. I searched some related examples but I am unable to tweak it as per my requirements (Combining Bar and Line chart (double axis) in ggplot2)

Code

Following is the code I am using to plot bar and line graphs separately:

Bar:

p <- ggplot(data = df, aes(x = MONTHS, y = BASE)) + 
    geom_bar(stat="identity") + 
    theme_minimal() +
    geom_text(aes(label = BASE), vjust = 1.6, color = "White", size = 2.5)

Line:

p1 <- ggplot(data = df, aes(x = MONTHS, y = df$INTERNETPERCENTAGE, group = 1)) + 
    geom_line() + 
    geom_point()

Data

Update: I have the following data (raw data cleansed of "," and "%" signs):

> dput(head(df,20))
structure(list(MONTHS = structure(c(11L, 10L, 3L, 5L, 4L, 8L, 
1L, 9L, 7L, 6L, 2L, 13L, 12L), .Label = c("Apr-18", "Aug-18", 
"Dec-17", "Feb-18", "Jan-18", "Jul-18", "Jun-18", "Mar-18", "May-18", 
"Nov-17", "Oct-17", "Oct-18", "Sep-18"), class = "factor"), BASE = c(40756228L, 
41088219L, 41642601L, 42017111L, 42439446L, 42847468L, 43375319L, 
43440484L, 43464735L, 43326823L, 43190949L, 43015301L, 42780071L
), INTERNETUSERGREATERTHAN0KB = c(13380576L, 13224502L, 14044105L, 
14239169L, 14011423L, 14736043L, 14487827L, 14460410L, 14632695L, 
14896654L, 15019329L, 14141766L, 14209288L), INTERNETPERCENTAGE = c(33L, 
32L, 34L, 34L, 33L, 34L, 33L, 33L, 34L, 34L, 35L, 33L, 33L), 
    SMARTPHONE = c(11610216L, 11875033L, 12225965L, 12412010L, 
    12760251L, 12781082L, 13142400L, 13295826L, 13422476L, 13408216L, 
    13504339L, 13413596L, 13586438L), SMARTPHONEPERCENTAGE = c(28L, 
    29L, 29L, 30L, 30L, 30L, 30L, 31L, 31L, 31L, 31L, 31L, 32L
    ), INTERNETUSAGEGREATERTHAN0KB4G = c(829095L, 969531L, 1181411L, 
    1339620L, 1474300L, 1733027L, 1871816L, 1967129L, 2117418L, 
    2288215L, 2453243L, 2624865L, 2817199L)), row.names = c(NA, 
13L), class = "data.frame")
4
  • p and p1 works perfectly fine, I am unable to merge these two graphs.While, making primary y-axis of the Line graph as the secondary y-axis of the merged one. I was unable to formalize any code which can return any plot for this merged type hence I have not posted any, in my question. Commented Dec 25, 2018 at 13:56
  • Could you add data to your question using dput(head(df,20))? Commented Dec 25, 2018 at 13:57
  • ; does nothing, just a habit I guess. Commented Dec 25, 2018 at 13:57
  • I can clean it as well, but on database side. Like remove the "%" sign or remove "," Commented Dec 25, 2018 at 14:14

3 Answers 3

6

Note that my answer is based on your original "un-cleaned" data (which I attach at the bottom of my post).

The key here is to transform percentage values such that they use the same range as BASE. We then apply the inverse of the transformation to show the original percentage values as a second y-axis.

A (personal) word of caution: Secondary axes are often not a good idea. Personally, I would use facets or two separate graphs to avoid confusion and overloading of the graph. Also note that Hadley himself is not a fan of dual y axes, so the ggplot2 support of dual axes is (sensibly) limited.

That aside, here is an option:

  1. First off, let's clean-up the data (remove the thousands separator, percentage sign etc.).

    library(tidyverse)
    df.clean <- df %>%
        mutate_if(is.factor, as.character) %>%
        gather(USAGE, PERCENTAGE, INTERNETPERCENTAGE, SMARTPHONEPERCENTAGE) %>%
        mutate(
            MONTHS = factor(MONTHS, levels = df$MONTHS),
            BASE = as.numeric(str_replace_all(BASE, ",", "")),
            PERCENTAGE = as.numeric(str_replace(PERCENTAGE, "%", "")))
    
  2. We now calculate the transformation coefficients:

    y1 <- min(df.clean$BASE)
    y2 <- max(df.clean$BASE)
    x1 <- min(df.clean$PERCENTAGE)
    x2 <- max(df.clean$PERCENTAGE)
    b <- (y2 - y1) / (x2 - x1)
    a <- y1 - b * x1
    
  3. Now for the plotting:

    df.clean %>%
        mutate(perc.scaled = a + b * PERCENTAGE) %>%
        ggplot(aes(MONTHS, BASE)) +
        geom_col(
            data = df.clean %>% distinct(MONTHS, .keep_all = TRUE),
            aes(MONTHS, BASE), fill = "dodgerblue4") +
        geom_point(aes(MONTHS, perc.scaled, colour = USAGE, group = USAGE)) +
        geom_line(aes(MONTHS, perc.scaled, colour = USAGE, group = USAGE)) +
        geom_label(
            aes(MONTHS, perc.scaled, label = PERCENTAGE, fill = USAGE),
            vjust = 1.4,
            show.legend = F) +
        scale_y_continuous(
                name =  "BASE",
                sec.axis = sec_axis(~ (. - a) / b, name = "Percentage")) +
        coord_cartesian(ylim = c(0.99 * min(df.clean$BASE), max(df.clean$BASE))) +
        theme_minimal() +
        theme(legend.position = "bottom")
    

enter image description here


Sample data

df <- structure(list(MONTHS = structure(c(11L, 10L, 3L, 5L, 4L, 8L,
1L, 9L, 7L, 6L, 2L, 13L, 12L), .Label = c("APR-18", "AUG-18",
"DEC-17", "FEB-18", "JAN-18", "JUL-18", "JUN-18", "MAR-18", "MAY-18",
"NOV-17", "OCT-17", "OCT-18", "SEP-18"), class = "factor"), BASE = structure(c(1L,
2L, 3L, 4L, 5L, 7L, 11L, 12L, 13L, 10L, 9L, 8L, 6L), .Label = c("40,756,228",
"41,088,219", "41,642,601", "42,017,111", "42,439,446", "42,780,071",
"42,847,468", "43,015,301", "43,190,949", "43,326,823", "43,375,319",
"43,440,484", "43,464,735"), class = "factor"), INTERNETUSERGREATERTHAN0KB = structure(c(2L,
1L, 4L, 7L, 3L, 11L, 9L, 8L, 10L, 12L, 13L, 5L, 6L), .Label = c("13,224,502",
"13,380,576", "14,011,423", "14,044,105", "14,141,766", "14,209,288",
"14,239,169", "14,460,410", "14,487,827", "14,632,695", "14,736,043",
"14,896,654", "15,019,329"), class = "factor"), INTERNETPERCENTAGE = structure(c(2L,
1L, 3L, 3L, 2L, 3L, 2L, 2L, 3L, 3L, 4L, 2L, 2L), .Label = c("32%",
"33%", "34%", "35%"), class = "factor"), SMARTPHONE = structure(c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 11L, 9L, 12L, 10L, 13L), .Label = c("11,610,216",
"11,875,033", "12,225,965", "12,412,010", "12,760,251", "12,781,082",
"13,142,400", "13,295,826", "13,408,216", "13,413,596", "13,422,476",
"13,504,339", "13,586,438"), class = "factor"), SMARTPHONEPERCENTAGE = structure(c(1L,
2L, 2L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 5L), .Label = c("28%",
"29%", "30%", "31%", "32%"), class = "factor"), INTERNETUSAGEGREATERTHAN0KB4G = structure(c(12L,
13L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L), .Label = c("1,181,411 ",
"1,339,620 ", "1,474,300 ", "1,733,027 ", "1,871,816 ", "1,967,129 ",
"2,117,418 ", "2,288,215 ", "2,453,243 ", "2,624,865 ", "2,817,199 ",
"829,095 ", "969,531 "), class = "factor")), row.names = c(NA,
13L), class = "data.frame")
Sign up to request clarification or add additional context in comments.

2 Comments

Why the downvote? Happy to explain if something is not clear.
I did'nt down vote. In fact, I learnt more than 1 thing in your answer. Thank you so much for your help and time.
2

ggplot is a "high-level" plotting library, meaning that it is built to express clear relationships in data, rather than a simple system for drawing shapes. One of its fundamental assumptions is that secondary or dual data axes are usually a bad idea; such figures plot more than one relationship into the same space, with no guarantee that the two axes actually share a meaningful relationship (see for example spurious correlations).

All that said, ggplot does indeed have the ability to define secondary axes, although using it for the purpose you describe is intentionally difficult. One way to accomplish your goal would be to split your data set into two separate ones, then plot these in the same ggplot object. It's certainly possible, but take note of how much extra code is required to produce the effect you're after.

library(tidyverse)
library(scales)

df.base <- df[c('MONTHS', 'BASE')] %>% 
  mutate(MONTHS = factor(MONTHS, MONTHS, ordered = T))

df.percent <- df[c('MONTHS', 'INTERNETPERCENTAGE', 'SMARTPHONEPERCENTAGE')] %>% 
  gather(variable, value, -MONTHS)

g <- ggplot(data = df.base, aes(x = MONTHS, y = BASE)) +
  geom_col(aes(fill = 'BASE')) +
  geom_line(data = df.percent, aes(x = MONTHS, y = value / 40 * 12500000 + 33500000, color = variable, group = variable)) +
  geom_point(data = df.percent, aes(x = MONTHS, y = value / 40 * 12500000 + 33500000, color = variable)) +
  geom_label(data = df.percent, aes(x = MONTHS, y = value / 40 * 12500000 + 33500000, fill = variable, label = sprintf('%i%%', value)), color = 'white', vjust = 1.6, size = 4) +
  scale_y_continuous(sec.axis = sec_axis(~(. - 33500000) / 12500000 * 40, name = 'PERCENT'), labels = comma) +
  scale_fill_manual(values = c('lightblue', 'red', 'darkgreen')) +
  scale_color_manual(values = c('red', 'darkgreen')) +
  coord_cartesian(ylim = c(33500000, 45500000)) +
  labs(fill = NULL, color = NULL) +
  theme_minimal()
print(g)

enter image description here

1 Comment

Your answer proved to be of great help. No more thing, Can we change the position of Red labels upwards, to avoid overlapping. Thanks!
2

You need to have a transformation factor that resembles the ratio of maximum y-axis 1 vs. maximum y-axis 2. Here, the secondary y-axis should be rougly 100,000 times smaller than the primary y-axis. Therefore:

Code

ggplot(df) + 
    geom_col(aes(x = MONTHS, y = BASE)) +
    # apply transformation factor to line plot
    geom_line(aes(x = MONTHS, y = INTERNETPERCENTAGE/0.000001, group = 1), 
              color = "red", size = 1) +
    theme_minimal() +
    geom_text(aes(x = MONTHS, y = BASE, label=BASE), 
              vjust=1.6, color="White", size=2.5) +
    # add secondary y-axis that is 100,000 times smaller
    scale_y_continuous(sec.axis = sec_axis(~.*0.000001, name = "Internet Percentage in %")) +
    labs(y = "Base", x = "Months")

1

Data

df <- structure(list(MONTHS = structure(c(17440, 17471, 17501, 17532, 17563, 17591, 17622, 17652, 17683, 17713, 17744, 17775, 17805), class = "Date"), BASE = c(40756228L, 41088219L, 41642601L, 42017111L, 42439446L, 42847468L, 43375319L, 43440484L, 43464735L, 43326823L, 43190949L, 43015301L, 42780071L), INTERNETUSERGREATERTHAN0KB = c(13380576L, 13224502L, 14044105L, 14239169L, 14011423L, 14736043L, 14487827L, 14460410L, 14632695L, 14896654L, 15019329L, 14141766L, 14209288L), INTERNETPERCENTAGE = c(33L, 32L, 34L, 34L, 33L, 34L, 33L, 33L, 34L, 34L, 35L, 33L, 33L), SMARTPHONE = c(11610216L, 11875033L, 12225965L, 12412010L, 12760251L, 12781082L, 13142400L, 13295826L, 13422476L, 13408216L, 13504339L, 13413596L, 13586438L), SMARTPHONEPERCENTAGE = c(28L, 29L, 29L, 30L, 30L, 30L, 30L, 31L, 31L, 31L, 31L, 31L, 32L), INTERNETUSAGEGREATERTHAN0KB4G = c(829095L, 969531L, 1181411L, 1339620L, 1474300L, 1733027L, 1871816L, 1967129L, 2117418L, 2288215L, 2453243L, 2624865L, 2817199L)), row.names = c(NA, 13L), class = "data.frame")

Explanation

The secondary y-axis is just visual. ggplot plots our geom_line() on the first y-axis (with values roughly around 33,000,000). The secondary y-axis is added later. You can see that if you check out

> ggplot_build(p)[[1]][[2]]
          y     x group PANEL colour size linetype alpha
1  33000000 17440     1     1    red    1        1    NA
2  32000000 17471     1     1    red    1        1    NA
3  34000000 17501     1     1    red    1        1    NA
4  34000000 17532     1     1    red    1        1    NA
5  33000000 17563     1     1    red    1        1    NA
6  34000000 17591     1     1    red    1        1    NA
7  33000000 17622     1     1    red    1        1    NA
8  33000000 17652     1     1    red    1        1    NA
9  34000000 17683     1     1    red    1        1    NA
10 34000000 17713     1     1    red    1        1    NA
11 35000000 17744     1     1    red    1        1    NA
12 33000000 17775     1     1    red    1        1    NA
13 33000000 17805     1     1    red    1        1    NA

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.