Flag rows based on multiple conditions on specific columns in data.table

Question

I have a data.table with several "Performance" columns, one per year (Perf_2016, Perf_2017, Perf_2018), and an expected-performance column, ExpPerf_2019. I want to create a new column called FLAG that marks rows for manual review when either of these two conditions holds:

Any of the Perf_* columns has a negative value.
ExpPerf_2019 differs from any of the Perf_* columns by more than 50%.

A mock data.table similar to the one I have:

library(data.table)
dt <- data.table(Id = c("N23", "N34", "N11", "N65", "N55", "N78", "N88"),
                 Name = c("ABCD", "ACBD", "ACCD", "ADBN", "ADDD", "DBCA", "CBDA"),
                 Type = c("T", "B", "B", "T", "T", "B", "B"),
                 Sold = c(500, 300, 350, 500, 350, 400, 450),
                 Baseline = c(2000, 2100, 2000, 1500, 1890, 1900, 2000),
                 Perf_2016 = c(-200, 420, 800, 900, -10, 75, 400),
                 Perf_2017 = c(500, 300, -20, 700, 50, 80, 370),
                 Perf_2018 = c(1000, 400, 600, 800, 40, 500, 300),
                 ExpPerf_2019 = c(1500, 380, 500, 850, 30, 400, 350))
dt

Id Name Type Sold Baseline Perf_2016 Perf_2017 Perf_2018 ExpPerf_2019
N23 ABCD T   500  2000     -200      500       1000      1500
N34 ACBD B   300  2100     420       300       400       380
N11 ACCD B   350  2000     800       -20       600       500
N65 ADBN T   500  1500     900       700       800       850
N55 ADDD T   350  1890     -10       50        40        30
N78 DBCA B   400  1900     75        80        500       400
N88 CBDA B   450  2000     400       370       300       350

For this data.table the desired output would add the FLAG column as seen below:

    Id Name Type Sold Baseline Perf_2016 Perf_2017 Perf_2018 ExpPerf_2019  FLAG
1: N23 ABCD    T  500     2000      -200       500      1000         1500  TRUE
2: N34 ACBD    B  300     2100       420       300       400          380 FALSE
3: N11 ACCD    B  350     2000       800       -20       600          500  TRUE
4: N65 ADBN    T  500     1500       900       700       800          850 FALSE
5: N55 ADDD    T  350     1890       -10        50        40           30  TRUE
6: N78 DBCA    B  400     1900        75        80       500          400  TRUE
7: N88 CBDA    B  450     2000       400       370       300          350 FALSE

Frank · Accepted Answer · 2019-07-15 15:41:20Z

Any of the performance columns has a negative value

The expected performance column is different from any of the performance columns by more than 50%.

In other words, there are common min and max bounds for these columns:

the min is max(0, ExpPerf*0.5)
the max is ExpPerf*1.5
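As a rough illustration of these bounds, the sketch below computes them for the ExpPerf_2019 values from the question. The names exp_perf, lower, upper and bounds are made up for this example and are not part of the original answer.

```r
library(data.table)

# ExpPerf_2019 values copied from the question's sample data
exp_perf <- c(1500, 380, 500, 850, 30, 400, 350)

# The lower bound is never below zero, so any negative value fails the check
lower <- pmax(0, exp_perf * 0.5)
upper <- exp_perf * 1.5

bounds <- data.table(ExpPerf_2019 = exp_perf, lower = lower, upper = upper)
bounds

# Row 6 (N78): Perf_2016 = 75 is outside [200, 600], so the row gets flagged
row6_ok <- between(75, lower[6], upper[6])
row6_ok
```

Because the same two bounds apply to every performance column in a row, both conditions collapse into a single range check per column.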

So...

dt[, v := !Reduce(`&`, 
  lapply(.SD, between, pmax(0, ExpPerf_2019*0.5), ExpPerf_2019*1.5)
), .SDcols=grep("^Perf_", names(dt), value=TRUE)]

    Id Name Type Sold Baseline Perf_2016 Perf_2017 Perf_2018 ExpPerf_2019     v
1: N23 ABCD    T  500     2000      -200       500      1000         1500  TRUE
2: N34 ACBD    B  300     2100       420       300       400          380 FALSE
3: N11 ACCD    B  350     2000       800       -20       600          500  TRUE
4: N65 ADBN    T  500     1500       900       700       800          850 FALSE
5: N55 ADDD    T  350     1890       -10        50        40           30  TRUE
6: N78 DBCA    B  400     1900        75        80       500          400  TRUE
7: N88 CBDA    B  450     2000       400       370       300          350 FALSE

How it works:

between checks if a column lies between the min and max
lapply applies the check to each column, returning a list
Reduce with & checks whether all columns meet the condition
! negates the result, so we identify cases where at least one column fails the condition
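The steps above can also be run one at a time outside of j, which may make the intermediate objects easier to inspect. This is only a sketch on a trimmed copy of the question's data; perf_cols, lower, upper, in_range, all_ok and flag are illustrative names.

```r
library(data.table)

# Trimmed copy of the question's data: only the columns the check needs
dt <- data.table(
  Id           = c("N23", "N34", "N11", "N65", "N55", "N78", "N88"),
  Perf_2016    = c(-200, 420, 800, 900, -10, 75, 400),
  Perf_2017    = c(500, 300, -20, 700, 50, 80, 370),
  Perf_2018    = c(1000, 400, 600, 800, 40, 500, 300),
  ExpPerf_2019 = c(1500, 380, 500, 850, 30, 400, 350)
)
perf_cols <- grep("^Perf_", names(dt), value = TRUE)

lower <- pmax(0, dt$ExpPerf_2019 * 0.5)
upper <- dt$ExpPerf_2019 * 1.5

# Step 1: one logical vector per performance column
in_range <- lapply(dt[, perf_cols, with = FALSE], between, lower, upper)

# Step 2: combine the vectors with &, TRUE where every column is in range
all_ok <- Reduce(`&`, in_range)

# Step 3: negate, so TRUE marks rows where at least one column is out of range
flag <- !all_ok
flag
```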

between, & and ! are all vectorized, so the result is a logical vector with one element per row. I would probably write this sequence with magrittr pipes so the steps are easier to follow:

library(magrittr)

dt[, v := .SD %>% 
  lapply(between, pmax(0, ExpPerf_2019*0.5), ExpPerf_2019*1.5) %>%
  Reduce(f=`&`) %>%
  not
, .SDcols=grep("^Perf_", names(dt), value=TRUE)]

not is a relabeling of !, offered by magrittr for convenience.

.SD is a special symbol for the subset of data operated on inside the j part of DT[i, j, by]. In this case, there is no i or by, so only .SDcols is subsetting (to select the columns of interest).
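A small sketch of how .SD and .SDcols interact, using a two-row toy table that is not from the original post: .SD holds only the selected columns, while the remaining columns can still be referred to by name inside j.

```r
library(data.table)

# Two-row toy table for illustration only
dt <- data.table(
  Id           = c("N23", "N34"),
  Perf_2016    = c(-200, 420),
  Perf_2017    = c(500, 300),
  ExpPerf_2019 = c(1500, 380)
)
perf_cols <- grep("^Perf_", names(dt), value = TRUE)

# .SD contains only the columns named in .SDcols
sd_names <- dt[, names(.SD), .SDcols = perf_cols]
sd_names

# ExpPerf_2019 is not in .SD, but it is still visible by name in j
ratios <- dt[, lapply(.SD, function(x) x / ExpPerf_2019), .SDcols = perf_cols]
ratios
```

This is why the answer can pass ExpPerf_2019 to between even though .SDcols excludes it.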

Comment

The code would be simpler if the OP chose to format the data in long format.
My answer uses the same steps as Gilean's, but is vectorised instead of calculating per row.
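To illustrate the long-format suggestion, here is one possible sketch. It assumes Id uniquely identifies each row, and the names long, flags, Year and Perf are chosen for this example rather than taken from the thread.

```r
library(data.table)

dt <- data.table(
  Id           = c("N23", "N34", "N11", "N65", "N55", "N78", "N88"),
  Perf_2016    = c(-200, 420, 800, 900, -10, 75, 400),
  Perf_2017    = c(500, 300, -20, 700, 50, 80, 370),
  Perf_2018    = c(1000, 400, 600, 800, 40, 500, 300),
  ExpPerf_2019 = c(1500, 380, 500, 850, 30, 400, 350)
)

# Wide to long: one row per Id and year
long <- melt(dt,
             id.vars       = c("Id", "ExpPerf_2019"),
             measure.vars  = grep("^Perf_", names(dt), value = TRUE),
             variable.name = "Year",
             value.name    = "Perf")

# One vectorised comparison over the single Perf column, collapsed per Id
flags <- long[, .(FLAG = any(Perf < pmax(0, 0.5 * ExpPerf_2019) |
                             Perf > 1.5 * ExpPerf_2019)),
              by = Id]

# Update join: copy the flag back onto the wide table (assumes unique Id)
dt[flags, on = "Id", FLAG := i.FLAG]
dt
```

In long format the check needs no lapply or Reduce, since all performance values sit in one column.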

Gilean0709 · Accepted Answer · 2019-07-15 15:12:36Z

You can use the following code to check for your two conditions:

dt[, FLAG := any(.SD < 0 | .SD < ExpPerf_2019 - .5*ExpPerf_2019 | .SD > ExpPerf_2019 + .5*ExpPerf_2019),
   by = Id,
   .SDcols = grep("^Perf", colnames(dt), value = TRUE)
   ]

The result:

> dt
    Id Name Type Sold Baseline Perf_2016 Perf_2017 Perf_2018 ExpPerf_2019  FLAG
1: N23 ABCD    T  500     2000      -200       500      1000         1500  TRUE
2: N34 ACBD    B  300     2100       420       300       400          380 FALSE
3: N11 ACCD    B  350     2000       800       -20       600          500  TRUE
4: N65 ADBN    T  500     1500       900       700       800          850 FALSE
5: N55 ADDD    T  350     1890       -10        50        40           30  TRUE
6: N78 DBCA    B  400     1900        75        80       500          400  TRUE
7: N88 CBDA    B  450     2000       400       370       300          350 FALSE
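Grouping by Id evaluates the check once per row only when Id is unique, as it is in the sample data. If that might not hold, one possible variation is to group by row index instead. The table below, with a repeated Id, is invented for this sketch.

```r
library(data.table)

# Invented data with a repeated Id, to show per-row evaluation
dt <- data.table(
  Id           = c("N23", "N34", "N34"),
  Perf_2016    = c(-200, 420, 100),
  Perf_2017    = c(500, 300, 300),
  ExpPerf_2019 = c(1500, 380, 380)
)
perf_cols <- grep("^Perf", names(dt), value = TRUE)

# by = seq_len(nrow(dt)) makes every row its own group
dt[, FLAG := any(.SD < 0 |
                 .SD < 0.5 * ExpPerf_2019 |
                 .SD > 1.5 * ExpPerf_2019),
   by = seq_len(nrow(dt)),
   .SDcols = perf_cols]
dt
```

Per-row grouping gets slow on large tables, which is where the vectorised approach in the other answer is likely to be the better fit.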
