creating multiple datasets in R

Question

I have a huge data.frame with 5 variables (v1, v2, v3, v4, v5). I need to create several subsets based on a single variable. For example:

DATA
v1   v2    v3 ... 
1    1231  0.1
1    2653  0.3
1    4545  0.4
2    4545  0.6
2    3345  0.1
2    5675  0.7
3    6754  0.2
3    9989  0.85
3    3456  0.4
.
.
.
70000
70000
70000

I would like to create a subset for each value of v1, using a function that generates each dataset automatically, since this variable has over 70000 values. Once I have the datasets, I would like to compute the correlation between v2 and v3 in each one and get an output with the p-values and rho in separate columns. I'm sorry I haven't attempted any command yet, but I'm having trouble understanding how to write the function.

Paul Hiemstra · Accepted Answer · 2012-11-12 10:33:45Z

The plyr package has some nice functions to perform this kind of analysis, most importantly right now ddply:

library(plyr)

# DF is the data.frame from the question; ddply returns one row per value of v1
res = ddply(DF, .(v1), function(sub_data) {
   cor_result = cor.test(sub_data$v2, sub_data$v3)
   return(data.frame(p.value = cor_result$p.value, rho = cor_result$estimate))
})

> res
  v1   p.value       rho
1  1 0.1730489 0.9632826
2  2 0.2228668 0.9393458
3  3 0.5311018 0.6717314

Note that you need cor.test rather than cor in order to also get the p-value.
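
With roughly 70000 groups, some values of v1 may have fewer than three rows, and cor.test stops with an error when it has fewer than three complete pairs. Below is a minimal base R sketch that returns NA for such groups instead of aborting the whole run. The helper name safe_cor and the small example data (group 3 cut to two rows) are assumptions for illustration, not part of the original answer.

```r
# Example data: groups 1 and 2 have three rows, group 3 has only two
DF <- data.frame(
  v1 = c(1, 1, 1, 2, 2, 2, 3, 3),
  v2 = c(1231, 2653, 4545, 4545, 3345, 5675, 6754, 9989),
  v3 = c(0.1, 0.3, 0.4, 0.6, 0.1, 0.7, 0.2, 0.85)
)

# Hypothetical helper: one row of results per subset, NA if cor.test fails
safe_cor <- function(sub_data) {
  tryCatch({
    ct <- cor.test(sub_data$v2, sub_data$v3)
    data.frame(v1 = sub_data$v1[1], p.value = ct$p.value,
               rho = unname(ct$estimate))
  }, error = function(e) {
    data.frame(v1 = sub_data$v1[1], p.value = NA_real_, rho = NA_real_)
  })
}

# Split by v1, test each subset, and stack the one-row results
res <- do.call(rbind, lapply(split(DF, DF$v1), safe_cor))
rownames(res) <- NULL
res
```

cor.test defaults to Pearson's correlation; if Spearman's rho is what is wanted, method = "spearman" can be passed to it.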


1 Comment

Paul Hiemstra Over a year ago

plyr has nice syntax. Now it only needs to be as fast as data.table :)

Jilber Urbina · Accepted Answer · 2012-11-12 10:57:46Z

Here's a base R solution:

DF <- read.table(text="v1   v2    v3 
1    1231  0.1
1    2653  0.3
1    4545  0.4
2    4545  0.6
2    3345  0.1
2    5675  0.7
3    6754  0.2
3    9989  0.85
3    3456  0.4", header=TRUE)

# Correlations and P-values
Result <- sapply(split(DF[,-1], DF$v1), function(x) {
        ct <- cor.test(x$v2, x$v3)   # run the test once per subset
        c(ct$estimate, P.val=ct$p.value)
})

Result
              1         2         3
cor   0.9632826 0.9393458 0.6717314
P.val 0.1730489 0.2228668 0.5311018
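
The question asked for the p-values and rho in separate columns, whereas sapply returns one column per group. Below is a minimal sketch of transposing that matrix into a data.frame with one row per value of v1. The name Result_df is an assumption for illustration, and v1 ends up as a character column because it is taken from the matrix column names.

```r
DF <- read.table(text = "v1 v2 v3
1 1231 0.1
1 2653 0.3
1 4545 0.4
2 4545 0.6
2 3345 0.1
2 5675 0.7
3 6754 0.2
3 9989 0.85
3 3456 0.4", header = TRUE)

Result <- sapply(split(DF[, -1], DF$v1), function(x) {
  ct <- cor.test(x$v2, x$v3)
  c(ct$estimate, P.val = ct$p.value)
})

# Transpose so each group becomes a row; group labels come from the column names
Result_df <- data.frame(v1 = colnames(Result), t(Result), row.names = NULL)
Result_df
```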

If you want to add those results to the original data.frame, use transform():

transform(DF, 
          correlation=rep(Result[1,], table(DF[,1])),
          Pval=rep(Result[2,], table(DF[,1])))
  v1   v2   v3 correlation      Pval
1  1 1231 0.10   0.9632826 0.1730489
2  1 2653 0.30   0.9632826 0.1730489
3  1 4545 0.40   0.9632826 0.1730489
4  2 4545 0.60   0.9393458 0.2228668
5  2 3345 0.10   0.9393458 0.2228668
6  2 5675 0.70   0.9393458 0.2228668
7  3 6754 0.20   0.6717314 0.5311018
8  3 9989 0.85   0.6717314 0.5311018
9  3 3456 0.40   0.6717314 0.5311018
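
The rep(..., table(...)) call above lines up correctly only when the rows of DF are already sorted by v1. Below is a minimal sketch of an order-independent alternative that looks up each row's group by name in the columns of Result. The deliberately shuffled rows and the variable name key are assumptions for illustration.

```r
DF <- read.table(text = "v1 v2 v3
1 1231 0.1
1 2653 0.3
1 4545 0.4
2 4545 0.6
2 3345 0.1
2 5675 0.7
3 6754 0.2
3 9989 0.85
3 3456 0.4", header = TRUE)

# Shuffle the rows on purpose to show that order does not matter
DF <- DF[c(9, 1, 5, 2, 7, 3, 4, 8, 6), ]

Result <- sapply(split(DF[, -1], DF$v1), function(x) {
  ct <- cor.test(x$v2, x$v3)
  c(ct$estimate, P.val = ct$p.value)
})

# Index the result matrix by group label rather than by position
key <- as.character(DF$v1)
DF$correlation <- unname(Result["cor", key])
DF$Pval <- unname(Result["P.val", key])
DF
```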
