I have a large data frame in which many variable names follow a numeric sequence. To simplify, I created an example with 8 variables; the last 5 follow a sequence in the column name: Imax.118, Imax.118.5, Imax.119, Imax.119.5, Imax.120.

The sequence in the variable names is just an example and can differ; for instance, the names could run from 60 to 180 in steps of 0.1 (in this example they run from 118 to 120 in steps of 0.5).

The reproducible data frame:

df<-data.frame(Event=c("yes","yes","yes","no","no","no","no","no","no"),
           mois=c(0.3,0.2,0.2,0.3,0.3,0.3,0.3,0.3,0.2),
           I_float=c(96.0,100.8,96.0,21.6,10.8,10.8,16.8,8.4,16.8),
           Imax.118=c(95.0,105.0,77.0,15.0,5.0,49.7,53.8,51.2,57.8),
           Imax.118.5=c(97.0,90.0,100.0,16.0,15.0,50.2,54.3,51.7,58.3),
           Imax.119=c(98.0,110.0,78.0,51.4,8.0,50.7,54.8,52.2,58.8),
           Imax.119.5=c(99.8,71.0,80.0,51.9,51.2,51.2,55.3,52.7,59.3),
           Imax.120=c(54.6,71.5,79.0,52.4,51.7,51.7,55.8,53.2,59.8))

This is how the data frame looks:

[image: screenshot of the data frame]

For each Imax column I would like to count the following and store the counts in a new data frame:

  • number of times I_float >= Imax if Event=yes, as variable TP.
  • number of times I_float < Imax if Event=yes, as variable FN.
  • number of times I_float >= Imax if Event=no, as variable FP.
  • number of times I_float < Imax if Event=no, as variable TN.
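
The four definitions above can be sketched for a single column as follows. This is a minimal illustration, assuming only the Event, I_float and Imax.118 columns of the example data; the helper names exceeds, is_event and counts are hypothetical and not part of the original code.

```r
# Subset of the example data: only the columns needed for one threshold.
df <- data.frame(
  Event    = c("yes", "yes", "yes", "no", "no", "no", "no", "no", "no"),
  I_float  = c(96.0, 100.8, 96.0, 21.6, 10.8, 10.8, 16.8, 8.4, 16.8),
  Imax.118 = c(95.0, 105.0, 77.0, 15.0, 5.0, 49.7, 53.8, 51.2, 57.8)
)

exceeds  <- df$I_float >= df$Imax.118  # TRUE where I_float reaches the threshold
is_event <- df$Event == "yes"          # TRUE where an event was observed

# Summing a logical vector counts its TRUE values.
counts <- c(
  TP = sum(exceeds  & is_event),
  FN = sum(!exceeds & is_event),
  FP = sum(exceeds  & !is_event),
  TN = sum(!exceeds & !is_event)
)
counts
```

Because every row falls into exactly one of the four classes, the counts always add up to the number of rows, which is a quick sanity check when the same logic is later applied to many columns.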

The resulting data frame should look like the following, where Yintercept is equal to the number contained in the name of the Imax variable:

[image: screenshot of the desired result data frame]

So far I have only managed to compute TP, FN, TN and FP for one variable (for example Imax.118, the first row of the desired result) by writing the variable name explicitly in the R code. I cannot do this manually, since the real data frame has hundreds of variables whose names follow a sequence.

Any help will be highly appreciated.

2 Answers

Using gather we can reshape the data to long format and strip the "Imax." prefix from the original column names so that only the number remains. We then group by the resulting Yintercept column and count the rows that satisfy the conditions specified for TP, FN, TN and FP.

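library(tidyverse)
df %>% 
  # long format: one row per observation and Imax column
  gather(Yintercept, val, -Event, -mois, -I_float) %>% 
  # keep only the numeric part of the column name
  mutate(Yintercept = as.numeric(gsub("Imax\\.", "", Yintercept))) %>% 
  group_by(Yintercept) %>% 
  # >= and < as specified in the question, so a tie counts as an exceedance
  summarise(TP = sum(I_float >= val & Event == "yes"),
            FN = sum(I_float < val & Event == "yes"),
            TN = sum(I_float < val & Event == "no"),
            FP = sum(I_float >= val & Event == "no"))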

  Yintercept    TP    FN    TN    FP
       <dbl> <int> <int> <int> <int>
1       118      2     1     4     2
2       118.     1     2     5     1
3       119      1     2     5     1
4       120.     2     1     6     0
5       120      3     0     6     0
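
In the printed output above, rows 2 and 4 appear as "118." and "120." because tibbles display a limited number of significant digits by default; the stored values should still be 118.5 and 119.5. A minimal sketch of how to check this, assuming the tibble package is installed and using a hypothetical stand-in tibble res in place of the real result:

```r
library(tibble)

# Hypothetical stand-in for the Yintercept column of the summarised result.
res <- tibble(Yintercept = c(118, 118.5, 119, 119.5, 120))

# The stored values keep their decimals even when the printed tibble rounds them.
res$Yintercept

# Option 1: ask pillar (the tibble printing backend) for more significant digits.
options(pillar.sigfig = 5)
print(res)

# Option 2: print as a plain data frame, which uses base R number formatting.
print(as.data.frame(res))
```

Only the display changes; grouping and any later joins on Yintercept work on the full values either way.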

6 Comments

Many thanks for the quick response, I was struggling with this.
You're welcome. By the way, you could consider marking one answer as accepted, so that other SO users know this question has been answered sufficiently. Also see: stackoverflow.com/help/someone-answers
Hi @Lennyy, now I'm trying to extract the string from a more complex variable name. Before, I was working with Imax.n (where n was the number we called Yintercept in the result data frame). Now I'm working with variables following the pattern I5min_thresh_m_n, for example I5min_thresh_-140_80. If I create a character variable string <- "I5min_thresh_-140_80" and run gsub("I5min_thresh\_", "", string), I obtain the desired result -140_80. When I replace gsub("Imax\\.", "", Yintercept) with gsub("I5min_thresh\_", "", Yintercept), I get the warning "NAs introduced by coercion" and only 1 observation in the result.
I don't understand what I am doing wrong with the syntax. Furthermore, is it possible to extract the m and n values and store them as two variables, m and n, in the result data frame? For I5min_thresh_-140_80 the desired result would be m = -140 and n = 80, followed by the summarised values TP, FN, TN, FP. Thanks @Lennyy
I tried writing the gsub call as gsub("^(?:[^_]+_){2}(.+?)","",string); the result is "140_80" (and not -140_80). In any case, when I replace gsub("Imax\\.", "", Yintercept) with gsub("^(?:[^_]+_){2}(.+?)","",Yintercept), the result is the same (only 1 observation with Yintercept value NA, and the "NAs introduced by coercion" warning).
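
On the follow-up in these comments: stripping the prefix leaves a string such as "-140_80", which as.numeric cannot parse, so every Yintercept becomes NA and all rows collapse into a single group. The second attempt appears to lose the minus sign because the lazy group (.+?) still consumes one character, and the whole match is replaced by an empty string. A minimal base R sketch of one way to pull out m and n separately, assuming the names always look like I5min_thresh_<m>_<n>; the vector nms, the pattern and the data frame parsed are hypothetical illustrations:

```r
# Hypothetical column names following the pattern I5min_thresh_<m>_<n>.
nms <- c("I5min_thresh_-140_80", "I5min_thresh_-139.5_80.5", "I5min_thresh_60_120")

# Two capture groups: an optional minus sign followed by digits and dots.
pattern <- "^I5min_thresh_(-?[0-9.]+)_(-?[0-9.]+)$"

# sub() with a backreference keeps only the requested capture group.
parsed <- data.frame(
  name = nms,
  m    = as.numeric(sub(pattern, "\\1", nms)),
  n    = as.numeric(sub(pattern, "\\2", nms))
)
parsed

# Why the original attempt produced NA: the stripped name still contains "_".
leftover <- sub("^I5min_thresh_", "", nms[1])
leftover                                         # "-140_80"
as_number <- suppressWarnings(as.numeric(leftover))
as_number                                        # NA
```

With tidyr, the same pattern could probably be passed to pivot_longer as names_pattern together with names_to = c("m", "n") and a names_transform that converts both to numeric, followed by group_by(m, n); that variant is untested here.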

One approach is to use pivot_longer, available in recent versions of tidyr, to put the data into long format.

Then use case_when to do the comparisons and classify each row as a true/false positive/negative.

After summarising by Yintercept and outcome, use pivot_wider to create the final result.

library(dplyr)
library(tidyr)

df %>%
  # names_transform converts the captured part of the name to numeric;
  # since tidyr 1.1.0, names_ptypes only checks the type and no longer converts
  pivot_longer(cols = starts_with("Imax"), names_to = "Yintercept", names_pattern = "^Imax\\.(\\d.+)",
               names_transform = list(Yintercept = as.numeric)) %>%
  mutate(outcome = case_when((I_float >= value) & (Event == "yes") ~ "TP",
                             (I_float < value) & (Event == "yes") ~ "FN",
                             (I_float >= value) & (Event == "no") ~ "FP",
                             (I_float < value) & (Event == "no") ~ "TN")) %>%
  group_by(Yintercept, outcome) %>%
  summarise(count = n()) %>%
  pivot_wider(id_cols = Yintercept, names_from = "outcome", values_from = "count", values_fill = list(count = 0))

Output

# A tibble: 5 x 5
# Groups:   Yintercept [5]
  Yintercept    FN    FP    TN    TP
       <dbl> <int> <int> <int> <int>
1      118       1     2     4     2
2      118.5     2     1     5     1
3      119       2     1     5     1
4      119.5     1     0     6     2
5      120       0     0     6     3
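
For comparison, the same counts can be sketched in base R without reshaping, by looping over the Imax columns by name. This is a swapped-in alternative and not the approach of either answer; it assumes the example data frame from the question, and the names imax_cols, count_one and result are hypothetical.

```r
# Example data from the question.
df <- data.frame(
  Event      = c("yes", "yes", "yes", "no", "no", "no", "no", "no", "no"),
  mois       = c(0.3, 0.2, 0.2, 0.3, 0.3, 0.3, 0.3, 0.3, 0.2),
  I_float    = c(96.0, 100.8, 96.0, 21.6, 10.8, 10.8, 16.8, 8.4, 16.8),
  Imax.118   = c(95.0, 105.0, 77.0, 15.0, 5.0, 49.7, 53.8, 51.2, 57.8),
  Imax.118.5 = c(97.0, 90.0, 100.0, 16.0, 15.0, 50.2, 54.3, 51.7, 58.3),
  Imax.119   = c(98.0, 110.0, 78.0, 51.4, 8.0, 50.7, 54.8, 52.2, 58.8),
  Imax.119.5 = c(99.8, 71.0, 80.0, 51.9, 51.2, 51.2, 55.3, 52.7, 59.3),
  Imax.120   = c(54.6, 71.5, 79.0, 52.4, 51.7, 51.7, 55.8, 53.2, 59.8)
)

# All threshold columns, found by name rather than typed out.
imax_cols <- grep("^Imax\\.", names(df), value = TRUE)
is_event  <- df$Event == "yes"

# Counts for one threshold column, given its name.
count_one <- function(col) {
  exceeds <- df$I_float >= df[[col]]
  c(TP = sum(exceeds  & is_event),
    FN = sum(!exceeds & is_event),
    FP = sum(exceeds  & !is_event),
    TN = sum(!exceeds & !is_event))
}

# vapply returns one column per threshold; transpose to one row per threshold.
counts <- t(vapply(imax_cols, count_one, c(TP = 0L, FN = 0L, FP = 0L, TN = 0L)))

result <- data.frame(
  Yintercept = as.numeric(sub("^Imax\\.", "", imax_cols)),
  counts,
  row.names = NULL
)
result
```

This avoids creating a long copy of the data, which may matter with hundreds of threshold columns, at the cost of being less composable than the pivot-based pipeline.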

3 Comments

Thanks! Is it possible to replace Imax in the output with Yintercept?
@RaülOo Yes, changed to Yintercept - see edited answer.
Thanks for the code, I will save both answers for future use.
