1

I implemented my own correlation function in R. Surprisingly I get slightly different results when using the built-in cor function. The differences seem to disappear when n the number of observations are big enough.

My function:

corr = function(X) {
  Q = X - colMeans(X)
  S_ = colSums(Q**2)
  S = sqrt(S_ %*% t(S_))
  covarr = t(Q) %*% Q
  corrr_ = covarr / S
  return(corrr_)
}

library(mvtnorm)
set.seed(247)
X = rmvnorm(10, sigma = matrix(c(1,0.8,0.8,1), ncol=2)) # change 10 to 100, 1000, or 10000
corr(X)
cor(X)

For n=10 I get 0.8490966 vs. 0.8465363, so the change is in the 3rd decimal. For n=1000 I get 0.7960206 vs. 0.7960925, so the change is in the 5th decimal.

3
  • 1
    You can see the underlying code with body(cor). Commented Apr 29, 2023 at 14:43
  • Look into the C source code of C_cor that cor() calls internally. Commented Apr 29, 2023 at 14:49
  • @jay.sf I've looked, it's a complicated code, this is why I ask here. Commented Apr 29, 2023 at 16:03

1 Answer 1

3

The first line of the function should be this since R stores matrices column by column and not row by row

Q = t(t(X) - colMeans(X))

or

Q = X - matrix(colMeans(X), nrow(X), ncol(X), byrow = TRUE)

or

Q = scale(X, TRUE, FALSE)

or even this which is not the same Q but in the end gives the same answer

Q = scale(X)

If we use cov2cor from the base of R then

corr = function(X) {
  Q = scale(X)
  cov2cor(crossprod(Q))
}
Sign up to request clarification or add additional context in comments.

2 Comments

Yep. Or Q = sweep(X, 2, colMeans(X)).
Thanks. I always mix R and python "broadcasting"

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.