This is a follow-up to the question "How to divide between groups of rows using dplyr?".

If I have this data frame:

id = c("a","a","b","b","c","c")
condition = c(0,1,0,1,0,1)
gene1 = sample(1:100,6)
gene2 = sample(1:100,6)
#...
geneN = sample(1:100,6)

df = data.frame(id,condition,gene1,gene2,geneN)

I want to group by id and divide the values in the rows with condition == 0 by the values in the rows with condition == 1, to get this:

df[condition == 0,3:5]/ df[condition == 1,3:5]
#
      gene1     gene2     geneN
1 0.2187500 0.4946237 0.3750000
3 0.4700000 0.6382979 0.5444444
5 0.7674419 0.5471698 2.3750000

I can use dplyr as follows:

df %>% 
    group_by(id) %>%
    summarise(gene1 = gene1[condition == 0] / gene1[condition == 1],
              gene2 = gene2[condition == 0] / gene2[condition == 1],
              geneN = geneN[condition == 0] / geneN[condition == 1])

But I have, for example, 100 variables, as in the data below. How can I do this without having to list all the genes?

id = c("a","a","b","b","c","c")
condition = c(0,1,0,1,0,1)
genes = matrix(1:600,ncol = 100)
df = data.frame(id,condition,genes)
2
  • please, can you revise your example and include "many variables" Commented Feb 13, 2018 at 15:23
  • Updated the question for that. Commented Feb 13, 2018 at 15:32

3 Answers

3

We can use summarise_at to apply the same operation to many columns.

library(dplyr)

df2 <- df %>%
  group_by(id) %>%
  arrange(condition) %>%
  summarise_at(vars(-condition), funs(first(.)/last(.))) %>%
  ungroup()
df2
# # A tibble: 3 x 4
#   id    gene1 gene2 geneN
#   <fct> <dbl> <dbl> <dbl>
# 1 a     0.524 2.28  0.654
# 2 b     1.65  0.616 1.38 
# 3 c     0.578 2.00  2.17 
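For newer dplyr versions, here is a minimal sketch of the same idea with across(), since funs() was deprecated in dplyr 0.8.0 and the summarise_at family was superseded in dplyr 1.0.0. It assumes dplyr >= 1.0.0 and exactly one row per id for each condition value. The small data frame is a made-up stand-in for the question's data, not the original values.

```r
library(dplyr)

# Made-up, deterministic stand-in for the question's data (assumption)
df <- data.frame(
  id        = c("a", "a", "b", "b", "c", "c"),
  condition = c(0, 1, 0, 1, 0, 1),
  gene1     = c(10, 20, 30, 60, 5, 50),
  gene2     = c(8, 4, 9, 3, 12, 6)
)

ratios <- df %>%
  group_by(id) %>%
  # Select the rows by condition explicitly instead of relying on row order,
  # and apply the division to every column except condition
  summarise(across(-condition, ~ .x[condition == 0] / .x[condition == 1]))

ratios
```

Indexing by condition inside the lambda means no arrange() step is needed, at the cost of assuming each group has exactly one row for each condition.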
4 Comments

you might want to add an arrange to ensure that you're dividing the right rows, since first() and last() won't check for it.
@CPak Good idea. I will add that.
This answer is great, but it's very slow with larger data e.g. id = c("a","a","b","b","c","c"); condition = c(0,1,0,1,0,1); genes = matrix(1:30000,ncol = 5000); df = data.frame(id,condition,genes)
If that is the case, perhaps explore the solutions in data.table or use matrix for all the calculation.
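As a rough illustration of the matrix suggestion in the last comment, the division can be done on a numeric matrix in base R. This is only a sketch: it assumes every id has exactly one condition == 0 row and one condition == 1 row, and the data frame and the names gene_cols, num and den are made up for the example.

```r
# Made-up, deterministic stand-in for the question's data (assumption)
df <- data.frame(
  id        = c("a", "a", "b", "b", "c", "c"),
  condition = c(0, 1, 0, 1, 0, 1),
  gene1     = c(10, 20, 30, 60, 5, 50),
  gene2     = c(8, 4, 9, 3, 12, 6)
)

# Every column except id and condition is treated as a gene column
gene_cols <- setdiff(names(df), c("id", "condition"))

# Sort so that the condition == 0 and condition == 1 rows line up by id
df <- df[order(df$id, df$condition), ]
m  <- as.matrix(df[gene_cols])

num <- m[df$condition == 0, , drop = FALSE]  # numerators, one row per id
den <- m[df$condition == 1, , drop = FALSE]  # denominators, same id order

# Element-wise division of two equally shaped matrices
ratio_m <- num / den
rownames(ratio_m) <- as.character(df$id[df$condition == 0])

ratio_m
```

Because the whole computation is a single vectorised division, it avoids evaluating one expression per column per group, which is where the dplyr version spends its time on very wide data.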
1

You can try

df %>% 
  gather(k,v, -id, -condition) %>% 
  spread(condition, v) %>% 
  mutate(ratio=`0`/`1`) %>% 
  select(id, k, ratio) %>% 
  spread(k, ratio)
#   id      gene1     gene2    geneN
# 1  a  0.3670886 0.5955056 1.192982
# 2  b  0.4767442 1.2222222 0.125000
# 3  c 18.2000000 2.0909091 6.000000

I used your data with set.seed(123).
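In tidyr 1.0.0 and later, gather() and spread() are superseded by pivot_longer() and pivot_wider(). A sketch of the same reshape, divide, reshape approach with those functions might look like this. The data frame is a made-up stand-in, and the column names gene, value, cond_0 and cond_1 are chosen for the example only.

```r
library(dplyr)
library(tidyr)

# Made-up, deterministic stand-in for the question's data (assumption)
df <- data.frame(
  id        = c("a", "a", "b", "b", "c", "c"),
  condition = c(0, 1, 0, 1, 0, 1),
  gene1     = c(10, 20, 30, 60, 5, 50),
  gene2     = c(8, 4, 9, 3, 12, 6)
)

ratios_wide <- df %>%
  # Long format: one row per id, condition and gene
  pivot_longer(-c(id, condition), names_to = "gene", values_to = "value") %>%
  # One column per condition value: cond_0 and cond_1
  pivot_wider(names_from = condition, values_from = value,
              names_prefix = "cond_") %>%
  mutate(ratio = cond_0 / cond_1) %>%
  select(id, gene, ratio) %>%
  # Back to one column per gene
  pivot_wider(names_from = gene, values_from = ratio)

ratios_wide
```

The names_prefix argument avoids having to back-quote the numeric column names `0` and `1`.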

0

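If your dataset is sorted and has no irregularities, you can do this using purrr::map_dfr. Note that .x[c(F,T)]/.x[c(T,F)] divides the condition == 1 rows by the condition == 0 rows; swap the two index vectors to get the ratio asked for in the question: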

df[paste0("gene",c(1,2,"N"))] %>% map_dfr(~.x[c(F,T)]/.x[c(T,F)])
# # A tibble: 3 x 3
#       gene1    gene2      geneN
#       <dbl>    <dbl>      <dbl>
# 1 0.1764706 1.323944 38.5000000
# 2 0.4895833 0.531250  0.3478261
# 3 0.3278689 2.705882  1.2424242

Or its base R equivalent:

as.data.frame(lapply(df[paste0("gene",c(1,2,"N"))],function(x) x[c(F,T)]/x[c(T,F)]))

You may need to bind the observations back on; I skipped this step because it's not in your expected output.
