2

I have a data frame containing 92 variables and 1900 observations. Essentially I have species ID as my variable of interest and relative abundance as the other variables.

Columns 69:92 (23 columns) are quality control variables that range from 0-10. I'd like to filter my data so that each species has at least 4 quality control variables that are >5. The best thing I've come up with until now is not what I want but at least filters the data on the basis that I have at least 1 variable with QC>5 and the sum of the row is larger than 100:

df_QC <-  df %>%
  filter_at(vars(contains("QC")), any_vars(. >= 5))%>% 
  rowwise()%>% 
  mutate(total = sum(c_across(69:92)))
  filter(total >99.9) 

Is there a way to solve it the way I'd like?

To reiterate, I want to select species if QC> 5 for at least 4 of the QC variables.

My data is a bit large so let's work on a smaller dataset to replicate the problem. iris data but add some quality control variables:

df <-cbind(iris, data.frame(qc1 = sample(0:10, size=150, replace=TRUE),
                            qc2 = sample(0:10, size=150, replace=TRUE),
                            qc3 = sample(0:10, size=150, replace=TRUE),
                            qc4 = sample(0:10, size=150, replace=TRUE),
                            qc5 = sample(0:10, size=150, replace=TRUE)))

#Then similarly I would do a filtering that is not really what I want:

df_QC <-  df %>%
  filter_at(vars(contains("QC")), any_vars(. >= 5))%>% 
  rowwise()%>% 
  mutate(total = sum(c_across(6:10)))%>%
filter(total >15) 

So with this example data, how could I filter out Species with QC>5 for at least 3 QCs?

Thanks in advance for any help!

1 Answer 1

2

Here's a way to do it using rowSums. Basically counting the number of values >= 5 for every row, and checking if this number is higher or equal to 4.

dplyr

library(dplyr)
df %>% 
  filter(rowSums(.[str_c("qc", 1:5)] >= 5) >= 4)

base R

df[rowSums(df[paste0("qc", 1:5)] >= 5) >= 4, ]

output

   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species qc1 qc2 qc3 qc4 qc5
1           4.8         3.4          1.6         0.2     setosa   6   9   5  10  10
2           5.8         4.0          1.2         0.2     setosa   9   8   3   7   5
3           5.7         3.8          1.7         0.3     setosa   4  10   9   7   7
4           5.1         3.3          1.7         0.5     setosa   7   6  10   1   7
5           4.8         3.4          1.9         0.2     setosa   1   6   6   7   7
6           5.2         3.5          1.5         0.2     setosa   5   1   6   6   7
7           5.5         3.5          1.3         0.2     setosa  10   3   6   6   6
8           4.9         3.6          1.4         0.1     setosa   8   4  10   8   6
9           4.4         3.0          1.3         0.2     setosa   1   8   5   6  10
10          5.0         3.5          1.3         0.3     setosa   9   9   5   9   1
...
Sign up to request clarification or add additional context in comments.

2 Comments

Thank you very much! Bit modified I made it work: ´´´df_PR <- df_srtd%>% filter(rowSums(.[,6:10] >=5) >=4)´´´ I dont Understand what the ´´´str_c´´´ does in this case and I found not make it work on my original data..
Oh. Right, my bad: it’s a function from the tidyr package w chick is basically paste.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.