I have a data.table with several columns holding a variable "Performance" for specific years (Perf_2016, Perf_2017, Perf_2018 below) and an expected-performance column "ExPerf" (ExpPerf_2019 below). I want to create a new column called FLAG that marks a row for manual review if either of these two conditions holds:

  1. Any of the "Performance" columns has a negative value
  2. The "ExPerf" column differs from any of the "Performance" columns by more than 50%.

A mock data.table similar to the one I have:

library(data.table)
# Column names match the printed table and the names used in the answers
dt <- data.table(Id = c("N23", "N34", "N11", "N65", "N55", "N78", "N88"),
                 Name = c("ABCD", "ACBD", "ACCD", "ADBN", "ADDD", "DBCA", "CBDA"),
                 Type = c("T", "B", "B", "T", "T", "B", "B"),
                 Sold = c(500, 300, 350, 500, 350, 400, 450),
                 Baseline = c(2000, 2100, 2000, 1500, 1890, 1900, 2000),
                 Perf_2016 = c(-200, 420, 800, 900, -10, 75, 400),
                 Perf_2017 = c(500, 300, -20, 700, 50, 80, 370),
                 Perf_2018 = c(1000, 400, 600, 800, 40, 500, 300),
                 ExpPerf_2019 = c(1500, 380, 500, 850, 30, 400, 350))
dt

Id Name Type Sold Baseline Perf_2016 Perf_2017 Perf_2018 ExpPerf_2019
N23 ABCD T   500  2000     -200      500       1000      1500
N34 ACBD B   300  2100     420       300       400       380
N11 ACCD B   350  2000     800       -20       600       500
N65 ADBN T   500  1500     900       700       800       850
N55 ADDD T   350  1890     -10       50        40        30
N78 DBCA B   400  1900     75        80        500       400
N88 CBDA B   450  2000     400       370       300       350

For this data.table the desired output would add the FLAG column as seen below:

    Id Name Type Sold Baseline Perf_2016 Perf_2017 Perf_2018 ExpPerf_2019  FLAG
1: N23 ABCD    T  500     2000      -200       500      1000         1500  TRUE
2: N34 ACBD    B  300     2100       420       300       400          380 FALSE
3: N11 ACCD    B  350     2000       800       -20       600          500  TRUE
4: N65 ADBN    T  500     1500       900       700       800          850 FALSE
5: N55 ADDD    T  350     1890       -10        50        40           30  TRUE
6: N78 DBCA    B  400     1900        75        80       500          400  TRUE
7: N88 CBDA    B  450     2000       400       370       300          350 FALSE

2 Answers

  1. Any of the performance columns has a negative value
  2. The expected performance column is different from any of the performance columns by more than 50%.

In other words, there are common min and max bounds for these columns:

  • the min is max(0, ExpPerf*0.5)
  • the max is ExpPerf*1.5

So...

dt[, v := !Reduce(`&`, 
  lapply(.SD, between, pmax(0, ExpPerf_2019*0.5), ExpPerf_2019*1.5)
), .SDcols=grep("^Perf_", names(dt), value=TRUE)]

    Id Name Type Sold Baseline Perf_2016 Perf_2017 Perf_2018 ExpPerf_2019     v
1: N23 ABCD    T  500     2000      -200       500      1000         1500  TRUE
2: N34 ACBD    B  300     2100       420       300       400          380 FALSE
3: N11 ACCD    B  350     2000       800       -20       600          500  TRUE
4: N65 ADBN    T  500     1500       900       700       800          850 FALSE
5: N55 ADDD    T  350     1890       -10        50        40           30  TRUE
6: N78 DBCA    B  400     1900        75        80       500          400  TRUE
7: N88 CBDA    B  450     2000       400       370       300          350 FALSE

How it works:

  • between checks if a column lies between the min and max
  • lapply applies the check to each column, returning a list
  • Reduce with & checks whether all columns meet the condition
  • ! negates the result, so we identify cases where at least one column fails the condition
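
To make the intermediate results of these steps visible, here is a minimal sketch on two rows of the mock data (N34 and N78). The object names ex, perf_cols, lower, upper, in_range, all_ok and flag are made up for illustration and are not part of the one-liner above.

```r
library(data.table)

# Two rows from the question's mock data: N34 (all in range) and N78 (two values too low)
ex <- data.table(Id = c("N34", "N78"),
                 Perf_2016 = c(420, 75),
                 Perf_2017 = c(300, 80),
                 Perf_2018 = c(400, 500),
                 ExpPerf_2019 = c(380, 400))

perf_cols <- grep("^Perf_", names(ex), value = TRUE)

# Row-wise bounds, one value per row
lower <- pmax(0, ex$ExpPerf_2019 * 0.5)
upper <- ex$ExpPerf_2019 * 1.5

# between + lapply: one logical vector per performance column
in_range <- lapply(ex[, .SD, .SDcols = perf_cols], between, lower, upper)

# Reduce with &: TRUE only where every column is inside the bounds
all_ok <- Reduce(`&`, in_range)

# !: flag the rows where at least one column is outside the bounds
flag <- !all_ok
print(data.table(Id = ex$Id, lower, upper, all_ok, flag))
```

Each element of in_range has one entry per row, which is why the final flag lines up with the rows of the table.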

between, & and ! are vectorized operators, so we end up with a vector of results, one for each row. I would probably write this sequence in magrittr so the steps are simpler to follow:

library(magrittr)

dt[, v := .SD %>% 
  lapply(between, pmax(0, ExpPerf_2019*0.5), ExpPerf_2019*1.5) %>%
  Reduce(f=`&`) %>%
  not
, .SDcols=grep("^Perf_", names(dt), value=TRUE)]

not is an alias for !, offered by magrittr for convenience.

.SD is a special symbol for the subset of data operated on inside the j part of DT[i, j, by]. In this case, there is no i or by, so only .SDcols is subsetting (to select the columns of interest).
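
As a small illustration of that point, the sketch below shows which columns end up in .SD once .SDcols is set, and that columns outside .SD (here ExpPerf_2019) can still be referenced by name in j. The cut-down table and the names perf_cols, sd_names and ratios are hypothetical, chosen only for this example.

```r
library(data.table)

# Cut-down version of the mock data with two rows
dt <- data.table(Id = c("N34", "N78"),
                 Perf_2016 = c(420, 75),
                 Perf_2017 = c(300, 80),
                 ExpPerf_2019 = c(380, 400))

perf_cols <- grep("^Perf_", names(dt), value = TRUE)

# .SD only holds the columns listed in .SDcols
sd_names <- dt[, names(.SD), .SDcols = perf_cols]
print(sd_names)

# ExpPerf_2019 is not part of .SD, but is still visible inside j
ratios <- dt[, lapply(.SD, function(x) x / ExpPerf_2019), .SDcols = perf_cols]
print(ratios)
```

The anchored pattern "^Perf_" is what keeps ExpPerf_2019 out of .SD, since that name does not start with "Perf_".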

Comment

  • The code would be simpler if the OP chose to format the data in long format.
  • My answer uses the same steps as Gilean's, but is vectorised instead of calculating per row.
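
For the long-format remark, a rough sketch of what that route could look like is shown below. It assumes Id uniquely identifies a row; the column names Year and Perf and the objects long and flags are illustrative choices, not something taken from the question.

```r
library(data.table)

# Two rows of the mock data, wide format
dt <- data.table(Id = c("N34", "N78"),
                 Perf_2016 = c(420, 75),
                 Perf_2017 = c(300, 80),
                 Perf_2018 = c(400, 500),
                 ExpPerf_2019 = c(380, 400))

# One row per (Id, year) instead of one column per year
long <- melt(dt,
             id.vars = c("Id", "ExpPerf_2019"),
             measure.vars = grep("^Perf_", names(dt), value = TRUE),
             variable.name = "Year",
             value.name = "Perf")

# A single vectorised between() call, then one any() per Id
flags <- long[, .(FLAG = any(!between(Perf,
                                      pmax(0, ExpPerf_2019 * 0.5),
                                      ExpPerf_2019 * 1.5))),
              by = Id]

# Join the flag back onto the wide table by reference
dt[flags, on = "Id", FLAG := i.FLAG]
print(dt)
```

In long format there is only one performance column to test, so the lapply and Reduce steps are no longer needed.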

You can use the following code to check for your two conditions:

dt[, FLAG := any(.SD < 0 | .SD < ExpPerf_2019 - .5*ExpPerf_2019 | .SD > ExpPerf_2019 + .5*ExpPerf_2019),
   by = Id,
   .SDcols = grep("^Perf", colnames(dt), value = TRUE)
   ]

The result:

> dt
    Id Name Type Sold Baseline Perf_2016 Perf_2017 Perf_2018 ExpPerf_2019  FLAG
1: N23 ABCD    T  500     2000      -200       500      1000         1500  TRUE
2: N34 ACBD    B  300     2100       420       300       400          380 FALSE
3: N11 ACCD    B  350     2000       800       -20       600          500  TRUE
4: N65 ADBN    T  500     1500       900       700       800          850 FALSE
5: N55 ADDD    T  350     1890       -10        50        40           30  TRUE
6: N78 DBCA    B  400     1900        75        80       500          400  TRUE
7: N88 CBDA    B  450     2000       400       370       300          350 FALSE
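
As a sanity check, the sketch below runs this by-row version and the vectorised between() version from the other answer on the performance columns of the mock data and compares both against the FLAG values requested in the question. The names perf_cols, FLAG_byrow, FLAG_vec and expected are hypothetical, introduced only for the comparison.

```r
library(data.table)

# Performance columns of the question's mock data
dt <- data.table(Id = c("N23", "N34", "N11", "N65", "N55", "N78", "N88"),
                 Perf_2016 = c(-200, 420, 800, 900, -10, 75, 400),
                 Perf_2017 = c(500, 300, -20, 700, 50, 80, 370),
                 Perf_2018 = c(1000, 400, 600, 800, 40, 500, 300),
                 ExpPerf_2019 = c(1500, 380, 500, 850, 30, 400, 350))

perf_cols <- grep("^Perf_", names(dt), value = TRUE)

# By-row version: one group per Id, any() over that row's values
dt[, FLAG_byrow := any(.SD < 0 |
                       .SD < ExpPerf_2019 - .5 * ExpPerf_2019 |
                       .SD > ExpPerf_2019 + .5 * ExpPerf_2019),
   by = Id, .SDcols = perf_cols]

# Vectorised version: between() on whole columns, combined with Reduce
dt[, FLAG_vec := !Reduce(`&`,
                         lapply(.SD, between,
                                pmax(0, ExpPerf_2019 * 0.5), ExpPerf_2019 * 1.5)),
   .SDcols = perf_cols]

# FLAG column from the desired output in the question
expected <- c(TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE)
print(dt[, .(Id, FLAG_byrow, FLAG_vec)])
```

Grouping by Id creates one group per row, so the by-row version is likely to be slower on large tables, while the vectorised version processes each column in a single pass.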
