Counting multiple variables separately in data frame which variable name contains a sequence

Question

I have a large data frame in which many variable names follow a numeric sequence. To simplify, I created an example with 8 variables; the names of the last 5 follow a sequence: Imax.118, Imax.118.5, Imax.119, Imax.119.5, Imax.120.

The sequence in the variable names is just an example and can differ; for instance, the names could run from 60 to 180 in steps of 0.1 (in this example they run from 118 to 120 in steps of 0.5).

The reproducible data frame:

df<-data.frame(Event=c("yes","yes","yes","no","no","no","no","no","no"),
           mois=c(0.3,0.2,0.2,0.3,0.3,0.3,0.3,0.3,0.2),
           I_float=c(96.0,100.8,96.0,21.6,10.8,10.8,16.8,8.4,16.8),
           Imax.118=c(95.0,105.0,77.0,15.0,5.0,49.7,53.8,51.2,57.8),
           Imax.118.5=c(97.0,90.0,100.0,16.0,15.0,50.2,54.3,51.7,58.3),
           Imax.119=c(98.0,110.0,78.0,51.4,8.0,50.7,54.8,52.2,58.8),
           Imax.119.5=c(99.8,71.0,80.0,51.9,51.2,51.2,55.3,52.7,59.3),
           Imax.120=c(54.6,71.5,79.0,52.4,51.7,51.7,55.8,53.2,59.8))

For each Imax column, I would like to count the following and store the counts in a new data frame:

number of times I_float >= Imax if Event=yes, as variable TP.
number of times I_float < Imax if Event=yes, as variable FN.
number of times I_float >= Imax if Event=no, as variable FP.
number of times I_float < Imax if Event=no, as variable TN.
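
For a single column, the four definitions above can be sketched in base R as follows. This is only an illustration of the counting rule, using the Event, I_float and Imax.118 values from the example data; the vector names event, i_float and imax are placeholders chosen here.

```r
# Values copied from the example data frame (Imax.118 column)
event   <- c("yes", "yes", "yes", "no", "no", "no", "no", "no", "no")
i_float <- c(96.0, 100.8, 96.0, 21.6, 10.8, 10.8, 16.8, 8.4, 16.8)
imax    <- c(95.0, 105.0, 77.0, 15.0, 5.0, 49.7, 53.8, 51.2, 57.8)

# Summing a logical vector counts the rows where the condition is TRUE
counts <- c(
  TP = sum(i_float >= imax & event == "yes"),
  FN = sum(i_float <  imax & event == "yes"),
  FP = sum(i_float >= imax & event == "no"),
  TN = sum(i_float <  imax & event == "no")
)
counts
```

For Imax.118 this gives TP = 2, FN = 1, FP = 2 and TN = 4; the four counts always add up to the number of rows.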

The resulting data frame should have one row per Imax variable, with the columns Yintercept, TP, FN, FP and TN, where Yintercept is the number contained in the Imax variable name (for example, 118.5 for Imax.118.5).

So far I have only managed to compute TP, FN, TN and FP for one variable at a time, for example Imax.118, by writing the exact variable name in the R code. I cannot do this manually, since the real data frame has hundreds of variables whose names follow a sequence.

Any help will be highly appreciated.

Lennyy · Accepted Answer · 2020-04-07 21:18:43Z

Using gather we can reshape the data to long format, strip the "Imax." prefix so that only the number from each original column name remains, then group by the resulting Yintercept column and count the rows that satisfy the conditions for TP, FN, TN and FP.

library(tidyverse)

df %>% 
  # Long format: one row per original row and Imax column
  gather(Yintercept, val, -Event, -mois, -I_float) %>% 
  # Strip the "Imax." prefix so that only the number remains
  mutate(Yintercept = as.numeric(gsub("Imax\\.", "", Yintercept))) %>% 
  group_by(Yintercept) %>% 
  # Positives use >=, as specified in the question
  summarise(TP = sum(I_float >= val & Event == "yes"),
            FN = sum(I_float < val & Event == "yes"),
            TN = sum(I_float < val & Event == "no"),
            FP = sum(I_float >= val & Event == "no"))

  Yintercept    TP    FN    TN    FP
       <dbl> <int> <int> <int> <int>
1       118      2     1     4     2
2       118.     1     2     5     1
3       119      1     2     5     1
4       120.     2     1     6     0
5       120      3     0     6     0
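
The Yintercept values shown as 118. and 120. in rows 2 and 4 are 118.5 and 119.5: by default a tibble prints numbers with three significant digits, and the trailing dot signals that decimals were cut off. The stored values are unchanged. A minimal sketch of two ways to see them in full, using a small stand-in tibble (res is a name chosen here for illustration):

```r
library(tibble)

# Stand-in for the Yintercept column of the result above
res <- tibble(Yintercept = c(118, 118.5, 119, 119.5, 120))

# The stored values keep their decimals
res$Yintercept

# Option 1: print as a plain data frame
print(as.data.frame(res))

# Option 2: ask tibble to print more significant digits
options(pillar.sigfig = 5)
print(res)
```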

edited Apr 7, 2020 at 21:18

answered Apr 7, 2020 at 21:10

Lennyy


6 Comments

Raül Oo Over a year ago

Many thanks for the quick response, I was struggling with this.

Lennyy Over a year ago

You're welcome. By the way, you could consider marking one answer as accepted; then other SO users know that this question has been answered sufficiently. Also see: stackoverflow.com/help/someone-answers

Raül Oo Over a year ago

Hi @Lennyy, now I'm trying to extract the string from a more complex variable name. Before, I was working with Imax.n (where n was the number we called Yintercept in the result data frame). Now I'm working with variables following the pattern I5min_thresh_m_n, for example I5min_thresh_-140_80. If I create a character variable string <- "I5min_thresh_-140_80" and run gsub("I5min_thresh_", "", string), I obtain the desired result, -140_80. When I replace gsub("Imax\\.", "", Yintercept) with gsub("I5min_thresh_", "", Yintercept), I get the warning "NAs introduced by coercion" and only 1 observation in the result.

Raül Oo Over a year ago

I don't understand what I am doing wrong with the syntax. Furthermore, is it possible to extract the m and n values and store them as 2 variables, m and n, in the result data frame? For I5min_thresh_-140_80 the desired result would be m = -140 and n = 80, followed by the summarised values TP, FN, TN and FP. Thanks @Lennyy

Raül Oo Over a year ago

I tried writing the gsub call as gsub("^(?:[^_]+_){2}(.+?)","",string); the result is "140_80" (and not -140_80). In any case, when I replace gsub("Imax\\.", "", Yintercept) with gsub("^(?:[^_]+_){2}(.+?)","",Yintercept), the result is the same: only 1 observation with Yintercept value NA, and the "NAs introduced by coercion" warning.
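
A likely explanation for the behaviour described in these comments: after the prefix is removed, the remaining string "-140_80" is not a single number, so as.numeric() returns NA for every column and group_by() collapses everything into one NA group. The second pattern also drops the minus sign because the lazy group (.+?) matches exactly one character, the "-", which is then deleted with the prefix. One way to get m and n as separate numeric columns is to let pivot_longer split the names. The sketch below assumes tidyr 1.1.0 or later (for names_transform) and dplyr 1.0.0 or later (for .groups); df2 and its values are made up for illustration, and only the column-name pattern comes from the comment.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: names follow I5min_thresh_m_n, values are invented.
# check.names = FALSE keeps the "-" in the column names.
df2 <- data.frame(
  Event   = c("yes", "yes", "no", "no"),
  I_float = c(96.0, 100.8, 21.6, 10.8),
  `I5min_thresh_-140_80` = c(95.0, 105.0, 15.0, 49.7),
  `I5min_thresh_-130_90` = c(97.0, 90.0, 16.0, 5.0),
  check.names = FALSE
)

res <- df2 %>%
  pivot_longer(
    cols = starts_with("I5min_thresh_"),
    # Two capture groups in the pattern fill the two new columns
    names_to = c("m", "n"),
    names_pattern = "^I5min_thresh_(-?[0-9.]+)_(-?[0-9.]+)$",
    names_transform = list(m = as.numeric, n = as.numeric),
    values_to = "val"
  ) %>%
  group_by(m, n) %>%
  summarise(
    TP = sum(I_float >= val & Event == "yes"),
    FN = sum(I_float <  val & Event == "yes"),
    FP = sum(I_float >= val & Event == "no"),
    TN = sum(I_float <  val & Event == "no"),
    .groups = "drop"
  )
res
```

The result has one row per (m, n) pair, here (-140, 80) and (-130, 90), followed by the four counts.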

Ben · Accepted Answer · 2020-04-08 12:24:06Z

One approach is to use pivot_longer, available in tidyr 1.0.0 and later, to put the data into long format.

Then, use case_when to do the comparisons and classify each row as a true/false positive/negative.

After summarising by Yintercept and outcome, you can use pivot_wider to create the final result.

library(dplyr)
library(tidyr)

df %>%
  # names_transform needs tidyr >= 1.1.0, where names_ptypes only checks types and no longer converts
  pivot_longer(cols = starts_with("Imax"), names_to = "Yintercept", names_pattern = "^Imax\\.(.+)$",
               names_transform = list(Yintercept = as.numeric)) %>%
  # Classify every row against the threshold in its Imax column
  mutate(outcome = case_when((I_float >= value) & (Event == "yes") ~ "TP",
                             (I_float < value) & (Event == "yes") ~ "FN",
                             (I_float >= value) & (Event == "no") ~ "FP",
                             (I_float < value) & (Event == "no") ~ "TN")) %>%
  group_by(Yintercept, outcome) %>%
  summarise(count = n()) %>%
  # Outcomes that never occur for a threshold are filled with 0
  pivot_wider(id_cols = Yintercept, names_from = "outcome", values_from = "count", values_fill = list(count = 0))

Output

# A tibble: 5 x 5
# Groups:   Yintercept [5]
  Yintercept    FN    FP    TN    TP
       <dbl> <int> <int> <int> <int>
1      118       1     2     4     2
2      118.5     2     1     5     1
3      119       2     1     5     1
4      119.5     1     0     6     2
5      120       0     0     6     3
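
For readers who prefer to avoid the tidyverse, the same counts can also be sketched in base R. This is an illustrative alternative rather than the method above; count_outcomes, imax_cols and result are names chosen here, and the sketch assumes there are no missing values in I_float, Event or the Imax columns.

```r
# Example data from the question
df <- data.frame(Event = c("yes","yes","yes","no","no","no","no","no","no"),
                 mois = c(0.3,0.2,0.2,0.3,0.3,0.3,0.3,0.3,0.2),
                 I_float = c(96.0,100.8,96.0,21.6,10.8,10.8,16.8,8.4,16.8),
                 Imax.118 = c(95.0,105.0,77.0,15.0,5.0,49.7,53.8,51.2,57.8),
                 Imax.118.5 = c(97.0,90.0,100.0,16.0,15.0,50.2,54.3,51.7,58.3),
                 Imax.119 = c(98.0,110.0,78.0,51.4,8.0,50.7,54.8,52.2,58.8),
                 Imax.119.5 = c(99.8,71.0,80.0,51.9,51.2,51.2,55.3,52.7,59.3),
                 Imax.120 = c(54.6,71.5,79.0,52.4,51.7,51.7,55.8,53.2,59.8))

# All columns whose name starts with "Imax."
imax_cols <- grep("^Imax\\.", names(df), value = TRUE)

# Counts for one threshold column (assumes no NA values)
count_outcomes <- function(threshold) {
  hit <- df$I_float >= threshold
  yes <- df$Event == "yes"
  c(TP = sum(hit & yes), FN = sum(!hit & yes),
    FP = sum(hit & !yes), TN = sum(!hit & !yes))
}

# sapply returns one column per Imax variable; transpose to one row each
counts <- t(sapply(df[imax_cols], count_outcomes))

result <- data.frame(
  Yintercept = as.numeric(sub("^Imax\\.", "", imax_cols)),
  counts,
  row.names = NULL
)
result
```

Because the columns are selected by a name pattern, the same code should work unchanged when the real data frame has hundreds of Imax columns.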

edited Apr 8, 2020 at 12:24

answered Apr 7, 2020 at 21:10

Ben

3 Comments

Raül Oo Over a year ago

Thanks! Is it possible to rename Imax to Yintercept in the output?

Ben Over a year ago

@RaülOo Yes, changed to Yintercept - see edited answer.

Raül Oo Over a year ago

Thanks for the code, I will save both answers for future use.
