I have a huge data.frame with 5 variables (v1, v2, v3, v4, v5). I need to create several subsets based on a single variable. For example:

DATA
v1   v2    v3 ... 
1    1231  0.1
1    2653  0.3
1    4545  0.4
2    4545  0.6
2    3345  0.1
2    5675  0.7
3    6754  0.2
3    9989  0.85
3    3456  0.4
.
.
.
70000
70000
70000

I would like to create a subset for each value of v1, using a function that generates each dataset automatically, since I have over 70000 measurements for this variable. Once I have the datasets, I would like to compute the correlation between v2 and v3 in each one and get an output with the p-values and rho in separate columns. I'm sorry I haven't attempted any command yet, but I am having trouble understanding how to write the function.

2 Answers

The plyr package has some nice functions for this kind of split-apply-combine analysis; the relevant one here is ddply:

library(plyr)

# For each value of v1, run cor.test() on that subset and keep rho and the p-value
res = ddply(DF, .(v1), function(sub_data) {
   cor_result = cor.test(sub_data$v2, sub_data$v3)
   return(data.frame(p.value = cor_result$p.value, rho = cor_result$estimate))
})

> res
  v1   p.value       rho
1  1 0.1730489 0.9632826
2  2 0.2228668 0.9393458
3  3 0.5311018 0.6717314

Note that you need to use cor.test in order to also get the p value.
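
To illustrate the difference, here is a minimal sketch using the first group of the sample data (the variable names x, y, r and ct are only for illustration). cor() returns a bare number, while cor.test() returns an object that carries both the estimate and the p-value:

```r
# First group (v1 == 1) from the sample data in the question
x <- c(1231, 2653, 4545)
y <- c(0.1, 0.3, 0.4)

r  <- cor(x, y)       # correlation coefficient only, no p-value
ct <- cor.test(x, y)  # "htest" object with several components

ct$estimate           # the same coefficient, named "cor"
ct$p.value            # the p-value of the test
```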

1 Comment

plyr has nice syntax. Now it only needs to be as fast as data.table :)

Here's a base R solution:

DF <- read.table(text="v1   v2    v3 
1    1231  0.1
1    2653  0.3
1    4545  0.4
2    4545  0.6
2    3345  0.1
2    5675  0.7
3    6754  0.2
3    9989  0.85
3    3456  0.4", header=TRUE)

# Correlations and p-values: split by v1 and run cor.test() once per group
Result <- sapply(split(DF[,-1], DF$v1), function(x) {
        ct <- cor.test(x$v2, x$v3)
        c(ct$estimate, P.val=ct$p.value)
})

Result
              1         2         3
cor   0.9632826 0.9393458 0.6717314
P.val 0.1730489 0.2228668 0.5311018
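
The matrix above has one column per group. With many groups, a long layout with one row per v1 value may be easier to work with, and some groups may have too few rows for cor.test(), which needs at least three complete pairs. The following is only a sketch under those assumptions: cor_by_group is a hypothetical helper name, and the extra group 4 is made-up data added to show the NA case.

```r
# Hypothetical helper: one row per v1 value, with rho and p-value as columns
cor_by_group <- function(df) {
  rows <- lapply(split(df, df$v1), function(g) {
    ok <- complete.cases(g$v2, g$v3)
    # cor.test() stops with an error on fewer than 3 complete pairs
    if (sum(ok) < 3) {
      return(data.frame(v1 = g$v1[1], rho = NA_real_, p.value = NA_real_))
    }
    ct <- cor.test(g$v2[ok], g$v3[ok])
    data.frame(v1 = g$v1[1], rho = unname(ct$estimate), p.value = ct$p.value)
  })
  out <- do.call(rbind, rows)
  rownames(out) <- NULL
  out
}

# Sample data from the question, plus an invented group 4 with only two rows
DF <- data.frame(
  v1 = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4),
  v2 = c(1231, 2653, 4545, 4545, 3345, 5675, 6754, 9989, 3456, 100, 200),
  v3 = c(0.1, 0.3, 0.4, 0.6, 0.1, 0.7, 0.2, 0.85, 0.4, 0.5, 0.6)
)

res <- cor_by_group(DF)
res
```

Groups that are too small come back with NA in both columns instead of stopping the whole run.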

If you want to add those results to the original data.frame, use transform(). Note that repeating each value by the group counts assumes the rows of DF are sorted by v1:

transform(DF, 
          correlation=rep(Result[1,], table(DF[,1])),
          Pval=rep(Result[2,], table(DF[,1])))
  v1   v2   v3 correlation      Pval
1  1 1231 0.10   0.9632826 0.1730489
2  1 2653 0.30   0.9632826 0.1730489
3  1 4545 0.40   0.9632826 0.1730489
4  2 4545 0.60   0.9393458 0.2228668
5  2 3345 0.10   0.9393458 0.2228668
6  2 5675 0.70   0.9393458 0.2228668
7  3 6754 0.20   0.6717314 0.5311018
8  3 9989 0.85   0.6717314 0.5311018
9  3 3456 0.40   0.6717314 0.5311018
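
If the rows are not sorted by v1, one possible alternative is to look up each row's group by name in the result matrix rather than repeating values by group size. This is a sketch, not part of the original answer: the data below is the same sample deliberately shuffled, and key is an illustrative variable name.

```r
# Same sample data, deliberately not sorted by v1
DF <- data.frame(
  v1 = c(2, 1, 3, 1, 2, 3, 1, 2, 3),
  v2 = c(4545, 1231, 6754, 2653, 3345, 9989, 4545, 5675, 3456),
  v3 = c(0.6, 0.1, 0.2, 0.3, 0.1, 0.85, 0.4, 0.7, 0.4)
)

# One column per group, rows "cor" and "P.val"
Result <- sapply(split(DF, DF$v1), function(x) {
  ct <- cor.test(x$v2, x$v3)
  c(cor = unname(ct$estimate), P.val = ct$p.value)
})

# Column names of Result are the v1 values as strings, so index by name
key <- as.character(DF$v1)
DF$correlation <- unname(Result["cor", key])
DF$Pval        <- unname(Result["P.val", key])
DF
```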
