How to subset data frames based on top quartile of each column?

Question

#let's make some sample data first
names<- c("t1","t2","t3","t4","t5","t1","t2","t3","t4","t5","t1","t2","t3","t4","t5")
metric1_set1 <- c(2.5,3.1,4.5,2.5,12,7.1,8.5,10,10.1,17.8,12.3,11,10,14,1.5) 
metric1_set2 <- c(2.1,3.1,4.15,2.5,10,7.1,8.5,10,10.1,17.1,12.3,17.3,8,11,1.5) 
metric1_set3 <- c(12.1,13.1,4.15,2.5,10.5,7.1,2.5,10,7.1,11.1,12.3,17.3,8,1.45,1.5) 
dataset1 <- data.frame(names,metric1_set1,metric1_set2,metric1_set3)


names<- c("t1","t2","t3","t4","t5","t1","t2","t3","t4","t5","t1","t2","t3","t4","t5")
metric2_set1 <- c(21.5,13.1,4.5,2.5,12,7.1,8.5,10,10.1,17.8,12.3,11,10,14,1.5) 
metric2_set2 <- c(12.1,3.1,4.15,2.5,10,7.1,8.5,10,8.1,17.1,12.3,17.3,8,1.1,1.5) 
metric2_set3 <- c(2.1,13.1,4.15,2.5,10.5,7.1,21.5,10,7.1,11.1,12.3,12.3,8,1.45,1.5) 
dataset2 <- data.frame(names,metric2_set1,metric2_set2,metric2_set3)

The goal is to calculate the top quartile for each column of dataset1 and then pull out the rows with the corresponding names from dataset2, so that I can compute the correlation between the subsetted values.

quantiles <- apply(dataset1[2:4], 2, quantile, na.rm = TRUE)

This gives the quartiles, but the actual question is how to save the names associated with, say, the top quartile of one dataset and drop every other row from both datasets.
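For reference, here is a small sketch of what that apply call returns and how the 75% row could be picked out by name rather than by position. The sample data is copied from the question; the variable name upper_cutoffs is only an illustration and does not appear in the original post.

```r
# Sample data copied from the question (dataset1 only)
dataset1 <- data.frame(
  names = rep(c("t1", "t2", "t3", "t4", "t5"), times = 3),
  metric1_set1 = c(2.5, 3.1, 4.5, 2.5, 12, 7.1, 8.5, 10, 10.1, 17.8, 12.3, 11, 10, 14, 1.5),
  metric1_set2 = c(2.1, 3.1, 4.15, 2.5, 10, 7.1, 8.5, 10, 10.1, 17.1, 12.3, 17.3, 8, 11, 1.5),
  metric1_set3 = c(12.1, 13.1, 4.15, 2.5, 10.5, 7.1, 2.5, 10, 7.1, 11.1, 12.3, 17.3, 8, 1.45, 1.5)
)

# One column per metric, one row per default probability
quantiles <- apply(dataset1[2:4], 2, quantile, na.rm = TRUE)
rownames(quantiles)  # "0%" "25%" "50%" "75%" "100%"

# Cutoff for the top quartile of each column, selected by row name
# (upper_cutoffs is a hypothetical name used for this sketch)
upper_cutoffs <- quantiles["75%", ]
```

Selecting the row by its "75%" label avoids having to remember which position in the quantile output corresponds to which probability.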

Based on what @sconfluentus suggested we can change it to:

 topQuartile <- function(x) {
   # quantile() returns the 0%, 25%, 50%, 75% and 100% values by default
   y <- quantile(x, na.rm = TRUE)
   z <- y[4]   # 75th percentile, i.e. the cutoff for the top quartile
   return(z)
 }
 quartile_dataset1 <- apply(dataset1[2:4], 2, topQuartile)

This works, but I also need something similar to the following:

 topquartile_set1 <- subset(dataset1$metric1_set1, subset = (dataset1$metric1_set1 >= quartile_dataset1[1]))

I need similar code that works for each column and puts all subsets together in a single final data frame.

Include the desired answer in your question so people know exactly what you want. – Mark Miller, Commented Jan 9, 2018 at 16:01
@MarkMiller I added a few lines, hopefully things are more clear now. – Jack, Commented Jan 9, 2018 at 16:12

sconfluentus · Accepted Answer · 2018-01-09 02:26:16Z

0

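The simplest way is to wrap quantile in a function, extract the element you want inside that function, and return it to apply. By default quantile returns five values (0%, 25%, 50%, 75%, 100%), so the fifth element is the maximum; use y[4] if you want the 75th percentile instead:

fifthQuantile <- function(x) {
  y <- quantile(x, na.rm = TRUE)  # named vector: 0%, 25%, 50%, 75%, 100%
  z <- y[5]                       # fifth element = 100% quantile (the maximum)
  return(z)
}

quantiles <- apply(dataset1[2:4], 2, fifthQuantile)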

This returns a named numeric vector with one value per column, named after your old columns. If you would prefer a single row with one column per metric, try:

quantiles <- t(apply(dataset1[2:4], 2, fifthQuantile))

This gives you a one-row matrix whose columns line up with the columns of the original.
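To go from cutoffs to names, one possible base R sketch is shown below. It is not part of the original answer. It assumes "top quartile" means values at or above the 75th percentile of each column, that rows of dataset1 and dataset2 are aligned by position, and that metric1_setK pairs with metric2_setK; the names metric_cols, top_rows, top_quartile and correlations are hypothetical.

```r
# Sample data copied from the question
nm <- rep(c("t1", "t2", "t3", "t4", "t5"), times = 3)
dataset1 <- data.frame(
  names = nm,
  metric1_set1 = c(2.5, 3.1, 4.5, 2.5, 12, 7.1, 8.5, 10, 10.1, 17.8, 12.3, 11, 10, 14, 1.5),
  metric1_set2 = c(2.1, 3.1, 4.15, 2.5, 10, 7.1, 8.5, 10, 10.1, 17.1, 12.3, 17.3, 8, 11, 1.5),
  metric1_set3 = c(12.1, 13.1, 4.15, 2.5, 10.5, 7.1, 2.5, 10, 7.1, 11.1, 12.3, 17.3, 8, 1.45, 1.5)
)
dataset2 <- data.frame(
  names = nm,
  metric2_set1 = c(21.5, 13.1, 4.5, 2.5, 12, 7.1, 8.5, 10, 10.1, 17.8, 12.3, 11, 10, 14, 1.5),
  metric2_set2 = c(12.1, 3.1, 4.15, 2.5, 10, 7.1, 8.5, 10, 8.1, 17.1, 12.3, 17.3, 8, 1.1, 1.5),
  metric2_set3 = c(2.1, 13.1, 4.15, 2.5, 10.5, 7.1, 21.5, 10, 7.1, 11.1, 12.3, 12.3, 8, 1.45, 1.5)
)

metric_cols <- names(dataset1)[-1]

# For each metric column, keep the rows at or above its 75th percentile
top_rows <- lapply(metric_cols, function(col) {
  x <- dataset1[[col]]
  cutoff <- quantile(x, probs = 0.75, na.rm = TRUE)
  keep <- which(x >= cutoff)  # which() also drops NA comparisons
  data.frame(
    metric = rep(col, length(keep)),
    row    = keep,
    names  = dataset1$names[keep],
    value  = x[keep],
    stringsAsFactors = FALSE
  )
})

# One long data frame: metric, row index, name and value for every kept row
top_quartile <- do.call(rbind, top_rows)

# Assumption: columns pair up by position and rows are aligned by index
correlations <- mapply(function(c1, c2) {
  keep <- top_quartile$row[top_quartile$metric == c1]
  cor(dataset1[[c1]][keep], dataset2[[c2]][keep])
}, metric_cols, names(dataset2)[-1])
```

Keeping the row index alongside the name matters here because the names repeat (each of t1 to t5 appears three times), so a name alone does not identify a row.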


3 Comments

Jack Over a year ago

That obtains the top quartile for each set, which is what I want, but as I mentioned I need to pull out the names (rows) corresponding to fifthQuantile. Again, the idea is to identify the names in each set that are in that range. @sconfluentus

Jack Over a year ago

In other words, I need to somehow use calculated values to pull out the names. @sconfluentus

Jack Over a year ago

Thanks @sconfluentus & @lebelinoz! But that is not really what I asked for.

lebelinoz · Accepted Answer · 2018-01-09 02:28:40Z

0

I would start by gathering the data using the tidyr package:

library(tidyr)
df.gathered = gather(dataset1, key = "category", value = "value", -names)

Result:

names  category    value
--------------------------
 t1 metric1_set1  2.50
 t2 metric1_set1  3.10
 t3 metric1_set1  4.50
 t4 metric1_set1  2.50
 t5 metric1_set1 12.00
 t1 metric1_set1  7.10
 t2 metric1_set1  8.50
 t3 metric1_set1 10.00
 t4 metric1_set1 10.10
 t5 metric1_set1 17.80 
 ...  # and similar rows for metric1_set2 and metric1_set3 ...

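You can then use group_by in dplyr to summarise each name and category. Note that quantile(value, 1) is the 100th percentile, i.e. the maximum of each group; pass 0.75 instead to get the upper-quartile cutoff: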

library(dplyr)
df.gathered %>% group_by(names, category) %>% summarise(Q1 = quantile(value, 1))

names   category    Q1
----------------------------
  t1 metric1_set1  12.3
  t1 metric1_set2  12.3
  t1 metric1_set3  12.3
  t2 metric1_set1  11.0
  t2 metric1_set2  17.3
  t2 metric1_set3  17.3
  ...
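If the aim is to keep the rows themselves rather than a summary, the same gathered layout can be filtered instead of summarised. The following is a sketch under assumptions, not the answerer's code: it groups by category only (so the cutoff is per column, as the question describes), treats "top quartile" as values at or above the 75th percentile, and adds a hypothetical row column to remember the original row position.

```r
library(tidyr)
library(dplyr)

# Sample data copied from the question (dataset1 only)
dataset1 <- data.frame(
  names = rep(c("t1", "t2", "t3", "t4", "t5"), times = 3),
  metric1_set1 = c(2.5, 3.1, 4.5, 2.5, 12, 7.1, 8.5, 10, 10.1, 17.8, 12.3, 11, 10, 14, 1.5),
  metric1_set2 = c(2.1, 3.1, 4.15, 2.5, 10, 7.1, 8.5, 10, 10.1, 17.1, 12.3, 17.3, 8, 11, 1.5),
  metric1_set3 = c(12.1, 13.1, 4.15, 2.5, 10.5, 7.1, 2.5, 10, 7.1, 11.1, 12.3, 17.3, 8, 1.45, 1.5)
)

top_quartile <- dataset1 %>%
  mutate(row = row_number()) %>%                        # remember original row position
  gather(key = "category", value = "value", -names, -row) %>%
  group_by(category) %>%                                # one cutoff per metric column
  filter(value >= quantile(value, 0.75, na.rm = TRUE)) %>%
  ungroup()
```

The result has one row per retained observation, with names, row, category and value columns, so the row values can be used to index the matching rows of dataset2.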


1 Comment

Jack Over a year ago

@lebelinoz I can calculate the quantiles, but I need to save the corresponding row names and go from there.
