9

I want to add a density line (a normal density actually) to a histogram.

Suppose I have the following data. I can plot the histogram by ggplot2:

set.seed(123)    
df <- data.frame(x = rbeta(10000, shape1 = 2, shape2 = 4))

ggplot(df, aes(x = x)) + geom_histogram(colour = "black", fill = "white", 
                                        binwidth = 0.01) 

enter image description here

I can add a density line using:

ggplot(df, aes(x = x)) + 
  geom_histogram(aes(y = ..density..),colour = "black", fill = "white", 
                 binwidth = 0.01) + 
  stat_function(fun = dnorm, args = list(mean = mean(df$x), sd = sd(df$x)))

enter image description here

But this is not what I actually want, I want this density line to be fitted to the count data.

I found a similar post (HERE) that offered a solution to this problem. But it did not work in my case. I need to an arbitrary expansion factor to get what I want. And this is not generalizable at all:

ef <- 100 # Expansion factor

ggplot(df, aes(x = x)) + 
  geom_histogram(colour = "black", fill = "white", binwidth = 0.01) + 
  stat_function(fun = function(x, mean, sd, n){ 
    n * dnorm(x = x, mean = mean, sd = sd)}, 
    args = list(mean = mean(df$x), sd = sd(df$x), n = ef))

enter image description here

Any clues that I can use to generalize this

  • first to normal distribution,
  • then to any other bin size,
  • and lastly to any other distribution will be very helpful.
2
  • Did you mean for the two chart images to be the same? Looks like you uploaded the same one twice Commented Dec 26, 2014 at 21:36
  • Use fitdistr(...) in the MASS package to fit the distributions. Commented Dec 26, 2014 at 21:52

1 Answer 1

15

Fitting a distribution function does not happen by magic. You have to do it explicitly. One way is using fitdistr(...) in the MASS package.

library(MASS)    # for fitsidtr(...)
# excellent fit (of course...)
ggplot(df, aes(x = x)) + 
  geom_histogram(aes(y=..density..),colour = "black", fill = "white", binwidth = 0.01)+
  stat_function(fun=dbeta,args=fitdistr(df$x,"beta",start=list(shape1=1,shape2=1))$estimate)

# horrible fit - no surprise here
ggplot(df, aes(x = x)) + 
  geom_histogram(aes(y=..density..),colour = "black", fill = "white", binwidth = 0.01)+
  stat_function(fun=dnorm,args=fitdistr(df$x,"normal")$estimate)

# mediocre fit - also not surprising...
ggplot(df, aes(x = x)) + 
  geom_histogram(aes(y=..density..),colour = "black", fill = "white", binwidth = 0.01)+
  stat_function(fun=dgamma,args=fitdistr(df$x,"gamma")$estimate)

EDIT: Response to OP's comment.

The scale factor is binwidth ✕ sample size.

ggplot(df, aes(x = x)) + 
  geom_histogram(colour = "black", fill = "white", binwidth = 0.01)+
  stat_function(fun=function(x,shape1,shape2)0.01*nrow(df)*dbeta(x,shape1,shape2),
                args=fitdistr(df$x,"beta",start=list(shape1=1,shape2=1))$estimate)

Sign up to request clarification or add additional context in comments.

4 Comments

Thanks for the generalization to different distributions. My ultimate aim is to fit these lines to the count data instead of density. Do you have any insights on how to do that? (I want to obtain same plot as the third plot of the original post.)
The 0.01 value in the formula (0.01*nrow(df)*dbeta(x,shape1,shape2)) is not generalizable to different binwidths or sample sizes. Suppose I have a sample size 2474 (instead of 10000) and 0.03 (instead of 0.01). I believe that 0.01 should be the function of bin width and possibly the sample size.
It absolutely is generalizable. The 0.01 is the binwidth. So if you use binwidth=0.03, then you would use fun=function(x,shape1,shape2)0.03*nrow(df)*dbeta(x,shape1,shape2) in the call to stat_function(...). Of course the counts would all change.
My mistake, it is generalizable to different bin sizes and sample sizes. Thank you for the solution.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.