I have a dataset with user IDs and the number of objects each user created. I drew the histogram using ggplot and now I'm trying to add the cumulative sum of the x-values as a line. The aim is to see how much each bin contributes to the total. I tried the following:

ggplot(data=userStats,aes(x=Num_Tours)) + geom_histogram(binwidth = 0.2)+
   scale_x_log10(name = 'Number of planned tours',breaks=c(1,5,10,50,100,200))+
   geom_line(aes(x=Num_Tours, y=cumsum(Num_Tours)/sum(Num_Tours)*3500),color="red")+
   scale_y_continuous(name = 'Number of users', sec.axis = sec_axis(~./3500, name = "Cumulative percentage of routes [%]"))

This does not work because the line does not use any bins.

I also tried:

ggplot(data=userStats,aes(x=Num_Tours)) + geom_histogram(binwidth = 0.2)+
   scale_x_log10(name = 'Number of planned tours',breaks=c(1,5,10,50,100,200))+
   stat_bin(aes(y=cumsum(..count..)),binwidth = 0.2, geom="line",color="red")+
   scale_y_continuous(name = 'Number of users', sec.axis = sec_axis(~./3500, name = "Cumulative percentage of routes [%]"))

Resulting in this: Result 1.

Here the cumsum of the count is used. What I want is the cumsum of the count times the value of the bin, normalized so that it can be displayed in the same plot. What I am trying to do is something like this:

Example

I would appreciate any input! Thanks

Edit: As test data, this should work:

userID <- c(1:100)
Num_Tours <- sample(1:100,100)
userStats <- data.frame(userID,Num_Tours)
userStats$cumulative <- cumsum(userStats$Num_Tours/sum(userStats$Num_Tours))
  • example data please Commented Jun 4, 2017 at 12:16

1 Answer

Here is an illustrative example that could be helpful for you.

set.seed(111)
userID <- c(1:100)
Num_Tours <- sample(1:100, 100, replace = TRUE)
userStats <- data.frame(userID, Num_Tours)

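# Sort the rows by Num_Tours (keeping each userID with its value)
# so that the cumulative sum increases along the x-axis
userStats <- userStats[order(userStats$Num_Tours), ]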
userStats$cumulative <- cumsum(userStats$Num_Tours/sum(userStats$Num_Tours))

library(ggplot2)
# Manually set the y-axis maximum, used to rescale the cumulative line
ymax <- 40
ggplot(data=userStats,aes(x=Num_Tours)) + 
   geom_histogram(binwidth = 0.2, col="white")+
   scale_x_log10(name = 'Number of planned tours',breaks=c(1,5,10,50,100,200))+
   geom_line(aes(x=Num_Tours,y=cumulative*ymax), col="red", lwd=1)+
   scale_y_continuous(name = 'Number of users', sec.axis = sec_axis(~./ymax, 
    name = "Cumulative percentage of routes [%]"))

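Note that the red line above is accumulated over individual users rather than over the histogram bins. If the cumulative share should follow the bins (using the sum of Num_Tours within each bin, which is the exact version of "count times value of the bin"), a minimal sketch could look like the one below. It assumes bins of width 0.2 on the log10 scale with edges at multiples of 0.2. The names bin, toursPerBin, binStats, mid and cumShare are made up for this illustration, and boundary = 0 and closed = "left" are passed to geom_histogram only so that its bins line up with the manual ones.

```r
# Example data as in the answer above
set.seed(111)
userStats <- data.frame(userID = 1:100,
                        Num_Tours = sample(1:100, 100, replace = TRUE))

# Assign every user to a bin of width 0.2 on the log10 scale.
# Assumption: bin edges sit at multiples of the bin width.
binwidth <- 0.2
bin <- floor(log10(userStats$Num_Tours) / binwidth)

# Total number of tours contributed by each (non-empty) bin
toursPerBin <- tapply(userStats$Num_Tours, bin, sum)

# Bin midpoints (back on the original scale) and cumulative share of tours
binStats <- data.frame(
  mid   = 10^((as.numeric(names(toursPerBin)) + 0.5) * binwidth),
  tours = as.numeric(toursPerBin)
)
binStats$cumShare <- cumsum(binStats$tours) / sum(binStats$tours)

# Plot only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  ymax <- 40  # manual scaling factor between the two y-axes, as above
  p <- ggplot(userStats, aes(x = Num_Tours)) +
    geom_histogram(binwidth = binwidth, boundary = 0, closed = "left",
                   col = "white") +
    geom_line(data = binStats, aes(x = mid, y = cumShare * ymax),
              col = "red") +
    scale_x_log10(name = "Number of planned tours",
                  breaks = c(1, 5, 10, 50, 100, 200)) +
    scale_y_continuous(name = "Number of users",
                       sec.axis = sec_axis(~ . / ymax,
                                           name = "Cumulative share of tours"))
  print(p)
}
```

Empty bins are simply skipped here, which leaves the cumulative line unchanged because they contribute nothing to the total.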
