Filter rows from specific columns based on two conditions with dplyr

Question

I hava a dataframe (df) with 16 numeric columns. Its rownames are gene IDs.

Something like this:

colnames (df)

 [1] "pval_hr6_vs_0"        "ratio_hr6_vs_0"       "adj_pval_hr6_vs_0"    "pval_day1_vs_0"       "ratio_day1_vs_0"      "adj_pval_day1_vs_0"  
 [7] "pval_day3_vs_0"       "ratio_day3_vs_0"      "adj_pval_day3_vs_0"   "pval_day7_vs_0"       "ratio_day7_vs_0"      "adj_pval_day7_vs_0"  
[13] "log2_ratio_hr6_vs_0"  "log2_ratio_day1_vs_0" "log2_ratio_day3_vs_0" "log2_ratio_day7_vs_0"

I wanted to filter rows based on values from the "adj .* " and "log2 .* " columns, so I'd filter the list of genes by adjusted p-value < 0.05 and fold-change > 1.30, and find the up-regulated ones. Because I wanted a sophisticated, clean dplyr solution, I've been trying:

df %>% filter_at (vars (starts_with ("adj")), any_vars (. < 0.05) && vars (starts_with ("log2")), any_vars (. > 1.30))

But it returns:

Error in any_vars(. < 0.05) && vars(starts_with("log2")) : 
  invalid 'x' type in 'x && y'

Does anyone have any idea how to filter rows from multiple, specific columns based on more than one codition? I greatly thank in advance for the help.

DATA

dput (head (df))

structure(list(pval_hr6_vs_0 = c(0.161401, 0.391671, 0.791075, 
0.0309453, 0.359688, 0.71241), ratio_hr6_vs_0 = c(0.881776, 0.959703, 
0.944951, 0.820747, 0.909414, 0.973911), adj_pval_hr6_vs_0 = c(0.61628628422198, 
0.732741131025558, 0.933678589291789, 0.614775043943269, 0.714368197824843, 
0.901268880950645), log2_ratio_hr6_vs_0 = c(-0.181515884379813, 
-0.0593400918883432, -0.0816885739026721, -0.284990523428165, 
-0.136990881023621, -0.0381381559732916), pval_day1_vs_0 = c(0.207777, 
0.41232, 0.0058607, 0.0553732, 0.00856075, 0.949443), ratio_day1_vs_0 = c(0.91995, 
1.02949, 1.57038, 0.879047, 1.23045, 1.00336), adj_pval_day1_vs_0 = c(0.409604022272727, 
0.629769973586899, 0.0270546630122406, 0.158151839578531, 0.0367421798614547, 
0.978376706860707), log2_ratio_day1_vs_0 = c(-0.120372643187602, 
0.0419298163514424, 0.651113704200117, -0.185987790890492, 0.299186034295348, 
0.0048393298085176), pval_day3_vs_0 = c(0.146526, 0.691183, 0.826029, 
0.362553, 0.779175, 0.439895), ratio_day3_vs_0 = c(0.87762, 1.0192, 
0.954112, 0.921447, 1.02938, 1.05708), adj_pval_day3_vs_0 = c(0.936945465536307, 
0.981313440965427, 0.992555570490216, 0.975143191361125, 0.987658414388281, 
0.975143191361125), log2_ratio_day3_vs_0 = c(-0.188331691329437, 
0.0274371827068511, -0.0677694655953243, -0.11802690790943, 0.0417756575607963, 
0.0800845642663144), pval_day7_vs_0 = c(0.593743, 0.776319, 0.949556, 
0.48749, 0.718931, 0.0863084), ratio_day7_vs_0 = c(0.953576, 
0.986492, 0.986573, 0.93962, 1.03788, 1.13281), adj_pval_day7_vs_0 = c(0.999568629333333, 
0.999568629333333, 0.999568629333333, 0.999568629333333, 0.999568629333333, 
0.999568629333333), log2_ratio_day7_vs_0 = c(-0.0685801689831581, 
-0.0196207434895931, -0.0195022899177364, -0.0898506731573727, 
0.0536396485004071, 0.17990590613271)), row.names = c("A1BG", 
"A1CF", "A2M", "A2ML1", "A4GALT", "A4GNT"), class = "data.frame")

Your columns encode stat (pval, ratio, adj_pval, log2_ratio) and time (hr6, day1, day3, day7), and presumably you are looking for matching pairs within time -- eg you presumably don't want rows with adj_pval < 0.05 for a different time than where lo2_ratio is > 1.3? I expect reshaping longer will make that simpler. — Jon Spring
– Jon Spring, Commented Jun 20, 2023 at 15:31
Yes, that's it. I want to find the genes in which the p-value < 0.05 and the log2_ratio > 1.30 for each time interval (hr6_vs_0, day1_vs_0, day3_vs_0, day7_vs_0). — ginn
– ginn, Commented Jun 20, 2023 at 15:34
So that we don't need to retype your data, and to reduce ambiguity, can you please run dput(head(YOUR_TABLE)) and paste the output into your question? — Jon Spring
– Jon Spring, Commented Jun 20, 2023 at 15:40

Melissa Key · Accepted Answer · 2023-06-21 14:08:27Z

2

Here's an example of using pivot_longer to make the comparison easy. Do you need it back in the original format, or does keeping it "long" work?

library(tidyverse)

df |>
  rownames_to_column("gene") |>
  pivot_longer(
    cols = -gene,
    names_to = c(".value", "comparison"),
    names_pattern = "(.*pval|.*ratio)_(.*)"
  ) |>
  filter(
    adj_pval < 0.05 & log2_ratio > 1.3
  )

To prevent issues of character limits, I'm adding the explanation to the answer. This is definitely a more complex split using pivot_longer. The column names have 2 parts - the statistic (p-value, adjusted p-value, ratio, or log ratio), and the comparison. If I had used code such as

df |>
  rownames_to_column("gene") |>
  pivot_longer(
    cols = -gene,
    names_to = c("statistic", "comparison"),
    names_pattern = "(.*pval|.*ratio)_(.*)"
  ) |>
  filter(
    adj_pval < 0.05 & log2_ratio > 1.3
  )

the result would have been a data set with 3 columns - gene, statistic and p-value. What we really wanted though was a data set with 5 columns - one for each statistic, plus one identifying the comparison. This is what .value does - it's a special instruction to pivot_longer to create columns with names coming from some part of the old (wide-format) variable names. As the first part of the wide-format names give the statistic, .value is listed before "comparison" in the code.

As I think you caught, names_pattern tells pivot_longer what part of the column names identify the statistic and what identifies the comparison. The parentheses mark what elements to extract: The first part is (.*pval|.*ratio) (going to .value - and a new column name) and the 2nd part is (.*) becomes the content of comparison.

For the first part, I knew that all the statistics ended with "ratio" or "pval", although both adj_pval and log2_ratio have more information earlier. This part of the pattern thus looks for .*pval (the 0+ characters before pval, plus the text pval) OR (|) .*ratio (the 0+ characters before ratio, plus the text ratio).

I then skip the _ character (it's not in any parentheses), and everything else becomes the comparison.

edited Jun 21, 2023 at 14:08

answered Jun 20, 2023 at 16:08

Melissa Key

4,55114 silver badges22 bronze badges

Sign up to request clarification or add additional context in comments.

5 Comments

Jon Spring Over a year ago

You might add a rownames_to_column() step and modify to cols = -rowname so that the row names (genes?) are preserved, and we can see which gene / time combinations satisfy the conditions.

Melissa Key Over a year ago

good call @JonSpring! thanks for catching that

ginn Over a year ago

Hey, guys... I was reading the dplyr pivoting documentation, but I haven't been able to understand what you guys did here names_to = c(".value", "comparison"), names_pattern = "(.*pval|.*ratio)_(.*). What's happening in ".value" ?? And I couldn't understand the regex in names_pattern at all ... =(( could you explain it, pretty please? =(

Melissa Key Over a year ago

I added a long explanation in the answer. Does this explain it well enough for you?

ginn Over a year ago

Wow!!! Thank you SO much for your explanation, but especially for your disposition in helping me out. I'm so grateful...!!!! ♥️ Thanks to you, now I understand it pretty well

Collectives™ on Stack Overflow

Filter rows from specific columns based on two conditions with dplyr

DATA

1 Answer 1

5 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

DATA

1 Answer 1

5 Comments

Your Answer

Sign up or log in

Post as a guest

Related