How to replace multiple nested for loops with apply family functions in R?

Question

I have four main variables in my dataset (dat).

SubjectID
Group (can be Easy1, Easy2, Hard1, Hard2)
Object (x, y, z, w)
Reaction time

For each combination of variables 1, 2 and 3 I want to change the reaction time, so that all values above the 3rd Quartile + 1.5IQR are set to the value of 3rd Quartile + 1.5 IQR.

TUK <- function (a,b,c) {
....
}

Basically, the for loop logic would be:

for (i in unique(dat$SubjectID))
  for (j in unique(dat$Group))
    for (k in unique(dat$Object))
      TUK(i, j, k)

How can I do this with the apply family of functions?

Thank you!

Adding reproducible example:

SubjectID <- c(3772113,3772468)
Group <- c("Easy","Hard")
Object <- c("A","B")
dat <- data.frame(expand.grid(SubjectID,Group,Object))
dat$RT <- rnorm(8,1500,700)
colnames(dat) <- c("SubjectID","Group","Object","RT")

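TUK <- function (SUBJ,GROUP,OBJECT){
  # RT values for one Subject/Group/Object combination (a plain numeric vector)
  p <- dat[dat$SubjectID==SUBJ & dat$Group== GROUP & dat$Object==OBJECT, "RT"]

  # p is a vector, so it is indexed directly (p$RT would fail here)
  p[p < 1000 | p > 2000] <- NA

  # write the result back into the global data frame
  dat[dat$SubjectID==SUBJ & dat$Group== GROUP & dat$Object==OBJECT, "RT"]<<- p
}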

The apply family doesn't handle functions with multiple changing arguments/multiple groupings very well. Would you be open to a dplyr or data.table solution?
– Gregor Thomas, Commented Apr 1, 2016 at 20:42
@rawr It doesn't work. Error in (function (SUBJ, GROUP, OBJECT) : unused arguments (a = dots[[1]][[1]], b = dots[[2]][[1]], c = dots[[3]][[1]])
– User33268, Commented Apr 2, 2016 at 7:54

Gregor Thomas · Accepted Answer · 2016-04-03 17:36:16Z

A big part of your problem is that your TUK function is terrible. Here are some reasons why:

Problem: it depends on having a data frame named dat in the global environment. Change the name of your data and it breaks.
- Solution: you should pass in all arguments needed. In this case, dat should be an argument.
Problem: Global assignment <<- should be avoided. There are certain advanced cases where it is necessary (e.g., sometimes in Shiny apps), but in general it makes a function behave in very un-R-like ways.
- Solution: Simply return() a value and assign it like any other normal R function.
Problem: It's over-complicated. You're passing in SUBJ, GROUP, and OBJECT but only using them to subset, so you're trying to do inside your function the "grouping" bit that dplyr or data.table or base::ave excels at. It's as if you're trying to build your function so that it could only possibly be used embedded in this particular for loop.
- Solution: Functions should be simple building blocks. Make this a function of just a single vector. It will be much cleaner and easier to debug. When it works on a single vector, use dplyr or data.table or ave (or even a for loop) to do the split-apply-combining of it. This also makes your function more generally useful instead of being cemented to this one particular case.

With the above in mind, here's an attempted re-write:

TUK2 <- function (RT){
  RT[RT < 1000 | RT > 2000] <- NA
  return(RT)
}

See how much simpler! Now if we want to apply this function to each of the GROUP:SUBJ:OBJECT groupings in your data and store the result in a new column (use RT = TUK2(RT) instead to overwrite RT in place), we do this with dplyr:

library(dplyr)
group_by(dat, Group, SubjectID, Object) %>%
    mutate(new_RT = TUK2(RT))

dplyr does the grouping of data, the splitting of data, applies the simple function to each piece, and combines it all back together for us.
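The same split-apply-combine step can be sketched in base R with ave, which takes a vector, one or more grouping vectors, and a FUN to run on each group. This is only an illustration: the data below is hypothetical (the Trial column and the seed are assumptions, added so that each Subject/Group/Object cell has more than one row).

```r
# Hypothetical data: 2 subjects x 2 groups x 2 objects x 5 trials each
set.seed(42)
dat <- expand.grid(SubjectID = c(3772113, 3772468),
                   Group     = c("Easy", "Hard"),
                   Object    = c("A", "B"),
                   Trial     = 1:5)
dat$RT <- rnorm(nrow(dat), mean = 1500, sd = 700)

# The single-vector building block from the answer
TUK2 <- function(RT) {
  RT[RT < 1000 | RT > 2000] <- NA
  RT
}

# ave() splits RT by the grouping vectors, applies FUN to each piece,
# and puts the results back in the original row order
dat$new_RT <- ave(dat$RT, dat$Group, dat$SubjectID, dat$Object, FUN = TUK2)

head(dat)
```

Because TUK2 uses fixed thresholds, the grouping makes no difference to the result here; it starts to matter as soon as the function computes something per group, such as a quantile.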

Now, in your question, you said

For each combination of variables 1, 2 and 3 I want to change the reaction time, so that all values above the 3rd Quartile + 1.5IQR are set to the value of 3rd Quartile + 1.5 IQR.

This doesn't sound much like what your function does. Based only on this description, I would code this as

group_by(dat, Group, SubjectID, Object) %>%
    mutate(new_RT = pmin(RT, quantile(RT, probs = 0.75) + 1.5 * IQR(RT)))

pmin stands for parallel minimum; it is a vectorized way to take the element-wise smaller of two vectors (a shorter argument is recycled). Try, e.g., pmin(1:10, 7), to see what it does.
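A small worked example of the capping rule on a single vector may help. The helper name cap_upper is hypothetical, and the numbers are chosen so the quartiles are easy to check by hand with R's default quantile type.

```r
# Hypothetical helper: cap values at Q3 + 1.5 * IQR
cap_upper <- function(x) {
  fence <- quantile(x, probs = 0.75, names = FALSE) + 1.5 * IQR(x)
  pmin(x, fence)  # fence has length 1 and is recycled along x
}

pmin(1:10, 7)
# 1 2 3 4 5 6 7 7 7 7

x <- c(1, 2, 3, 4, 100)
# Q1 = 2, Q3 = 4, so IQR = 2 and the fence is 4 + 1.5 * 2 = 7
cap_upper(x)
# 1 2 3 4 7
```

If the vector can contain NA, quantile() stops with an error unless na.rm = TRUE is passed, so both quantile() and IQR() would need that argument for such data.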

In both examples, the dplyr data frame won't be saved, of course, unless you re-assign it with dat <- group_by(dat, ...) etc. This is the functional programming way of doing things - no global assignment.
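A minimal sketch of that re-assignment, overwriting RT in place. It assumes dplyr is installed and uses hypothetical data with ten trials per Subject/Group/Object cell (the Trial column, the seed, and the names cap_upper and rt_raw are illustration only).

```r
library(dplyr)

# Hypothetical data: 2 subjects x 2 groups x 2 objects x 10 trials each
set.seed(1)
dat <- expand.grid(SubjectID = c(3772113, 3772468),
                   Group     = c("Easy", "Hard"),
                   Object    = c("A", "B"),
                   Trial     = 1:10)
dat$RT <- rnorm(nrow(dat), mean = 1500, sd = 700)
rt_raw <- dat$RT  # keep a copy to compare afterwards

# Hypothetical helper: cap values at Q3 + 1.5 * IQR
cap_upper <- function(x) {
  pmin(x, quantile(x, probs = 0.75, names = FALSE) + 1.5 * IQR(x))
}

# Re-assign the result to dat; nothing is modified until this assignment
dat <- dat %>%
  group_by(Group, SubjectID, Object) %>%
  mutate(RT = cap_upper(RT)) %>%
  ungroup()  # drop the grouping so later steps are not grouped by accident

sum(dat$RT < rt_raw)  # number of values that were capped
```

The ungroup() call is optional; without it, dat stays a grouped data frame and later mutate or summarise calls keep operating per group.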

One additional note: with the re-written function you could still use loops instead of dplyr. I don't know why you would - surely the dplyr syntax is nicer - but I just want to illustrate that the small building-block function is generally useful, it's not "baking in" dplyr in the way that your original function was "baking in" a particular for loop.

for (sub in unique(dat$SubjectID)) {
  for (obj in unique(dat$Object)) {
    for (grp in unique(dat$Group)) {
      dat[dat$SubjectID == sub & 
            dat$Object == obj & 
            dat$Group == grp, "RT"] <-
        TUK2(
          dat[dat$SubjectID == sub & 
                dat$Object == obj & 
                dat$Group == grp, "RT"]
        )
    }
  }
}

Thank you very much for this post, it's been a great help. I knew that global assignments were a no-go, but I was a bit blind to the other solution. I also didn't understand the group_by + mutate combination in dplyr until now. So far I had used it only for summarizing. Thanks once again and God bless!
