Correlation between columns where they share name

Question

I have a big dataframe where each column has the site the data comes from, a name and a code, followed by an empty row and then the data.

Site  Site1  Site1  Site1  Site1  Site2  Site2
Name  A      B      C      D      E      F
Code  A12    A41    A32    A33    A21    A12

-     3.2    4.2    5.2    3.1    1.11   2.52
-     9.2    0.21   3.12   3.03   2.12   5.12
....

What I am trying to do is find all columns that have the same value in the "Code" row and either extract them to a separate dataframe to then calculate the correlation, or just calculate the correlation between them right there.

In the end I want something like:

SiteA  SiteB  Correlation NameA NameB
Site1  Site2    .87        A12   A12
Site4  Site8    .76        B32   B32

I don't know beforehand how many sites I will have, but I'm only interested in the correlation when the code is the same.

I tried to extract the info (the first 3 rows) into a separate dataframe and tried to put every site in its own dataframe, but wasn't able to do so.

It would be great if you could use dput to share your data as a reproducible example.
– www, Commented Sep 21, 2017 at 12:11
Why is the data set up that way? The transposed version would make more sense, with each column being the same atomic type (char, int, factor, num).
– Parfait, Commented Sep 21, 2017 at 18:53

rr_silva · Accepted Answer · 2017-09-21 14:28:34Z

First you need to "rotate" your dataframe so that each series becomes a row. You can use t() (or transpose() from data.table) or reshape() and call the new dataframe DF, for example.
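As a rough sketch of that rotation, the snippet below builds a toy version of the layout from the question and turns it into DF. The object names raw, meta and vals and all the values are made up for illustration, and it assumes the label column ("Site", "Name", "Code", "-") has already been dropped and everything was read in as text.

```r
# Toy stand-in for the layout in the question (values are made up):
# rows 1-3 hold Site, Name, Code; row 4 is empty; the rest are measurements.
raw <- data.frame(
  V1 = c("Site1", "A", "A12", "", "3.2", "9.2", "4.1"),
  V2 = c("Site1", "B", "A41", "", "4.2", "0.21", "1.7"),
  V3 = c("Site2", "F", "A12", "", "2.52", "5.12", "3.3"),
  stringsAsFactors = FALSE
)

# The three header rows become columns: one row per series
meta <- as.data.frame(t(raw[1:3, ]), stringsAsFactors = FALSE)
names(meta) <- c("Site", "Name", "Code")

# Measurement rows (skipping the empty row 4), transposed and made numeric
vals <- t(raw[-(1:4), ])
vals <- matrix(as.numeric(vals), nrow = nrow(vals))
colnames(vals) <- paste0("X", seq_len(ncol(vals)))

DF <- cbind(meta, as.data.frame(vals))
rownames(DF) <- NULL
DF
```

Converting the measurements with as.numeric() matters here because t() on a dataframe with mixed content returns a character matrix.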

When you have the dataframe DF with columns called "Site", "Name", "Code", "X1", "X2" (etc), you can use:

DF$Code <- as.factor(DF$Code)

You'll get as many levels as the number of different codes (hope they are not too many). Then you just have to select the rows with the same code, like:

DF[which(DF$Code=="A12"),] 
DF[which(DF$Code=="B32") ,]

Because you have hundreds of different codes, you have to go a little further with your script. You can check how many times each code appears in your dataframe with:

table(DF$Code)

And get a vector with the roughly 100 repeated codes with:

Dx <- as.data.frame(table(DF$Code))
Repeated_codes <- as.character(Dx$Var1[which(Dx$Freq > 1)])
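For illustration, the same filter can be tried on the six codes shown in the question. This is only a sketch; it takes a slightly different route by reading the names of the table directly instead of going through as.data.frame(), and the variable names codes, counts and repeated are made up.

```r
# Codes from the example row in the question
codes <- c("A12", "A41", "A32", "A33", "A21", "A12")

# Count how often each code occurs, then keep those seen more than once
counts <- table(codes)
repeated <- names(counts)[counts > 1]
repeated
# [1] "A12"
```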

I will assume that each code only appears once or twice in your original data.

Create an empty dataframe to "put" the results of a loop:

final_df <- data.frame(Site_X = character(),
                       Site_Y = character(),
                       Correlation = numeric(),  # correlations are not integers
                       Name_X = character(),
                       Name_Y = character(),
                       stringsAsFactors = FALSE)

Then, you may use this:

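# Columns that hold the measurements (X1, X2, ...)
data_cols <- setdiff(names(DF), c("Site", "Name", "Code"))

for (i in seq_along(Repeated_codes)) {
  Code_x <- Repeated_codes[i]
  DFi <- DF[which(DF$Code == Code_x), ]
  # Correlate the two series (rows) that share this code;
  # cor() returns a plain number, so there is no $estimate to extract
  cor_i <- cor(as.numeric(unlist(DFi[1, data_cols])),
               as.numeric(unlist(DFi[2, data_cols])))
  final_df[i, "Site_X"] <- as.character(DFi[1, "Site"])
  final_df[i, "Site_Y"] <- as.character(DFi[2, "Site"])
  final_df[i, "Correlation"] <- round(cor_i, 2)
  final_df[i, "Name_X"] <- as.character(DFi[1, "Code"])
  final_df[i, "Name_Y"] <- as.character(DFi[2, "Code"])
}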

It will "import" data from DF to final_df and give you the correlation coefficient.
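If some codes turn out to appear more than twice, the loop above only looks at the first two rows. One possible way around that, sketched here on made-up toy data, is to split DF by code and correlate every pair of rows within each group using combn(). The helper pairs_for_code and the extra Code column in the result are illustrative choices, not part of the approach described above.

```r
# Toy data in the transposed layout (values are made up)
DF <- data.frame(
  Site = c("Site1", "Site1", "Site2", "Site3"),
  Name = c("A", "B", "F", "G"),
  Code = c("A12", "A41", "A12", "A12"),
  X1 = c(1, 5, 2, 3),
  X2 = c(2, 3, 4, 2),
  X3 = c(3, 1, 6, 1),
  stringsAsFactors = FALSE
)
data_cols <- setdiff(names(DF), c("Site", "Name", "Code"))

# Hypothetical helper: all pairwise correlations among rows sharing one code
pairs_for_code <- function(d) {
  if (nrow(d) < 2) return(NULL)          # code is not repeated
  idx <- combn(nrow(d), 2)               # each column is one pair of rows
  do.call(rbind, lapply(seq_len(ncol(idx)), function(k) {
    a <- idx[1, k]
    b <- idx[2, k]
    data.frame(
      Site_X = d$Site[a],
      Site_Y = d$Site[b],
      Correlation = cor(as.numeric(unlist(d[a, data_cols])),
                        as.numeric(unlist(d[b, data_cols]))),
      Name_X = d$Name[a],
      Name_Y = d$Name[b],
      Code = d$Code[a],
      stringsAsFactors = FALSE
    )
  }))
}

result <- do.call(rbind, lapply(split(DF, DF$Code), pairs_for_code))
rownames(result) <- NULL
result
```

With this toy input, code A12 occurs three times, so result holds three pairs, while A41 occurs once and is skipped.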

edited Sep 21, 2017 at 14:28

answered Sep 21, 2017 at 13:15


1 Comment

PrincessJellyfish Over a year ago

The thing is that there are around 1000 codes. I think around 100 of them are repeated.
