0

I hava a dataframe (df) with 16 numeric columns. Its rownames are gene IDs.

Something like this:

colnames (df)

 [1] "pval_hr6_vs_0"        "ratio_hr6_vs_0"       "adj_pval_hr6_vs_0"    "pval_day1_vs_0"       "ratio_day1_vs_0"      "adj_pval_day1_vs_0"  
 [7] "pval_day3_vs_0"       "ratio_day3_vs_0"      "adj_pval_day3_vs_0"   "pval_day7_vs_0"       "ratio_day7_vs_0"      "adj_pval_day7_vs_0"  
[13] "log2_ratio_hr6_vs_0"  "log2_ratio_day1_vs_0" "log2_ratio_day3_vs_0" "log2_ratio_day7_vs_0"

enter image description here

I wanted to filter rows based on values from the "adj .* " and "log2 .* " columns, so I'd filter the list of genes by adjusted p-value < 0.05 and fold-change > 1.30, and find the up-regulated ones. Because I wanted a sophisticated, clean dplyr solution, I've been trying:

df %>% filter_at (vars (starts_with ("adj")), any_vars (. < 0.05) && vars (starts_with ("log2")), any_vars (. > 1.30))

But it returns:

Error in any_vars(. < 0.05) && vars(starts_with("log2")) : 
  invalid 'x' type in 'x && y'

Does anyone have any idea how to filter rows from multiple, specific columns based on more than one codition? I greatly thank in advance for the help.

DATA

dput (head (df))

structure(list(pval_hr6_vs_0 = c(0.161401, 0.391671, 0.791075, 
0.0309453, 0.359688, 0.71241), ratio_hr6_vs_0 = c(0.881776, 0.959703, 
0.944951, 0.820747, 0.909414, 0.973911), adj_pval_hr6_vs_0 = c(0.61628628422198, 
0.732741131025558, 0.933678589291789, 0.614775043943269, 0.714368197824843, 
0.901268880950645), log2_ratio_hr6_vs_0 = c(-0.181515884379813, 
-0.0593400918883432, -0.0816885739026721, -0.284990523428165, 
-0.136990881023621, -0.0381381559732916), pval_day1_vs_0 = c(0.207777, 
0.41232, 0.0058607, 0.0553732, 0.00856075, 0.949443), ratio_day1_vs_0 = c(0.91995, 
1.02949, 1.57038, 0.879047, 1.23045, 1.00336), adj_pval_day1_vs_0 = c(0.409604022272727, 
0.629769973586899, 0.0270546630122406, 0.158151839578531, 0.0367421798614547, 
0.978376706860707), log2_ratio_day1_vs_0 = c(-0.120372643187602, 
0.0419298163514424, 0.651113704200117, -0.185987790890492, 0.299186034295348, 
0.0048393298085176), pval_day3_vs_0 = c(0.146526, 0.691183, 0.826029, 
0.362553, 0.779175, 0.439895), ratio_day3_vs_0 = c(0.87762, 1.0192, 
0.954112, 0.921447, 1.02938, 1.05708), adj_pval_day3_vs_0 = c(0.936945465536307, 
0.981313440965427, 0.992555570490216, 0.975143191361125, 0.987658414388281, 
0.975143191361125), log2_ratio_day3_vs_0 = c(-0.188331691329437, 
0.0274371827068511, -0.0677694655953243, -0.11802690790943, 0.0417756575607963, 
0.0800845642663144), pval_day7_vs_0 = c(0.593743, 0.776319, 0.949556, 
0.48749, 0.718931, 0.0863084), ratio_day7_vs_0 = c(0.953576, 
0.986492, 0.986573, 0.93962, 1.03788, 1.13281), adj_pval_day7_vs_0 = c(0.999568629333333, 
0.999568629333333, 0.999568629333333, 0.999568629333333, 0.999568629333333, 
0.999568629333333), log2_ratio_day7_vs_0 = c(-0.0685801689831581, 
-0.0196207434895931, -0.0195022899177364, -0.0898506731573727, 
0.0536396485004071, 0.17990590613271)), row.names = c("A1BG", 
"A1CF", "A2M", "A2ML1", "A4GALT", "A4GNT"), class = "data.frame")
4
  • Your columns encode stat (pval, ratio, adj_pval, log2_ratio) and time (hr6, day1, day3, day7), and presumably you are looking for matching pairs within time -- eg you presumably don't want rows with adj_pval < 0.05 for a different time than where lo2_ratio is > 1.3? I expect reshaping longer will make that simpler. Commented Jun 20, 2023 at 15:31
  • Yes, that's it. I want to find the genes in which the p-value < 0.05 and the log2_ratio > 1.30 for each time interval (hr6_vs_0, day1_vs_0, day3_vs_0, day7_vs_0). Commented Jun 20, 2023 at 15:34
  • So that we don't need to retype your data, and to reduce ambiguity, can you please run dput(head(YOUR_TABLE)) and paste the output into your question? Commented Jun 20, 2023 at 15:40
  • Ok....... done. Commented Jun 20, 2023 at 15:45

1 Answer 1

2

Here's an example of using pivot_longer to make the comparison easy. Do you need it back in the original format, or does keeping it "long" work?

library(tidyverse)

df |>
  rownames_to_column("gene") |>
  pivot_longer(
    cols = -gene,
    names_to = c(".value", "comparison"),
    names_pattern = "(.*pval|.*ratio)_(.*)"
  ) |>
  filter(
    adj_pval < 0.05 & log2_ratio > 1.3
  )

To prevent issues of character limits, I'm adding the explanation to the answer. This is definitely a more complex split using pivot_longer. The column names have 2 parts - the statistic (p-value, adjusted p-value, ratio, or log ratio), and the comparison. If I had used code such as

df |>
  rownames_to_column("gene") |>
  pivot_longer(
    cols = -gene,
    names_to = c("statistic", "comparison"),
    names_pattern = "(.*pval|.*ratio)_(.*)"
  ) |>
  filter(
    adj_pval < 0.05 & log2_ratio > 1.3
  )

the result would have been a data set with 3 columns - gene, statistic and p-value. What we really wanted though was a data set with 5 columns - one for each statistic, plus one identifying the comparison. This is what .value does - it's a special instruction to pivot_longer to create columns with names coming from some part of the old (wide-format) variable names. As the first part of the wide-format names give the statistic, .value is listed before "comparison" in the code.

As I think you caught, names_pattern tells pivot_longer what part of the column names identify the statistic and what identifies the comparison. The parentheses mark what elements to extract: The first part is (.*pval|.*ratio) (going to .value - and a new column name) and the 2nd part is (.*) becomes the content of comparison.

For the first part, I knew that all the statistics ended with "ratio" or "pval", although both adj_pval and log2_ratio have more information earlier. This part of the pattern thus looks for .*pval (the 0+ characters before pval, plus the text pval) OR (|) .*ratio (the 0+ characters before ratio, plus the text ratio).

I then skip the _ character (it's not in any parentheses), and everything else becomes the comparison.

Sign up to request clarification or add additional context in comments.

5 Comments

You might add a rownames_to_column() step and modify to cols = -rowname so that the row names (genes?) are preserved, and we can see which gene / time combinations satisfy the conditions.
good call @JonSpring! thanks for catching that
Hey, guys... I was reading the dplyr pivoting documentation, but I haven't been able to understand what you guys did here names_to = c(".value", "comparison"), names_pattern = "(.*pval|.*ratio)_(.*). What's happening in ".value" ?? And I couldn't understand the regex in names_pattern at all ... =(( could you explain it, pretty please? =(
I added a long explanation in the answer. Does this explain it well enough for you?
Wow!!! Thank you SO much for your explanation, but especially for your disposition in helping me out. I'm so grateful...!!!! ♥️ Thanks to you, now I understand it pretty well

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.