
I have four main variables in my dataset (dat).

  1. SubjectID
  2. Group (can be Easy1, Easy2, Hard1, Hard2)
  3. Object (x, y, z, w)
  4. Reaction time

For each combination of variables 1, 2 and 3 I want to change the reaction time, so that all values above the 3rd quartile + 1.5 × IQR are set to the value of the 3rd quartile + 1.5 × IQR.

TUK <- function (a,b,c) {
....
}

Basically, the for loop logic would be:

for (i in unique(dat$SubjectID))
  for (j in unique(dat$Group))
    for (k in unique(dat$Object))
      TUK(i, j, k)

How can I do this with the apply family of functions?

Thank you!

Adding reproducible example:

SubjectID <- c(3772113,3772468)
Group <- c("Easy","Hard")
Object <- c("A","B")
dat <- data.frame(expand.grid(SubjectID,Group,Object))
dat$RT <- rnorm(8,1500,700)
colnames(dat) <- c("SubjectID","Group","Object","RT")

TUK <- function (SUBJ,GROUP,OBJECT){
  p <- dat[dat$SubjectID==SUBJ & dat$Group== GROUP & dat$Object==OBJECT, "RT"]

  # p is already a plain numeric vector of RTs, so index it directly
  p[p < 1000 | p > 2000] <- NA

  dat[dat$SubjectID==SUBJ & dat$Group== GROUP & dat$Object==OBJECT, "RT"]<<- p
}
  • You should include a reproducible example in your question. Commented Apr 1, 2016 at 20:27
  • The apply family doesn't handle functions with multiple changing arguments/multiple groupings very well. Would you be open to a dplyr or data.table solution? Commented Apr 1, 2016 at 20:42
  • @Sotos Added Example Commented Apr 2, 2016 at 7:40
  • @Gregor Yes, dplyr would be good! Commented Apr 2, 2016 at 7:42
  • @rawr It doesn't work. Error in (function (SUBJ, GROUP, OBJECT) : unused arguments (a = dots[[1]][[1]], b = dots[[2]][[1]], c = dots[[3]][[1]]) Commented Apr 2, 2016 at 7:54

1 Answer


A big part of your problem is that your TUK function is terrible. Here are some reasons why:

  • Problem: it depends on having a data frame named dat in the global environment. Change the name of your data and it breaks.

    • Solution: you should pass in all arguments needed. In this case, dat should be an argument.
  • Problem: Global assignment <<- should be avoided. There are certain advanced cases where it is necessary (e.g., sometimes in Shiny apps), but in general it makes a function behave in very un-R-like ways.

    • Solution: Simply return() a value and assign it like any other normal R function.
  • Problem: It's over-complicated. By passing in SUBJ, GROUP, and OBJECT but only using them to subset, you're trying to do inside your function the "grouping" bit that dplyr, data.table, or base::ave excel at. It's as if you're building your function so that it could only possibly be used embedded in this particular for loop.

    • Solution: Functions should be simple building blocks. Make this a function of just a single vector. It will be much cleaner and easier to debug. When it works on a single vector, use dplyr or data.table or ave (or even a for loop) to do the split-apply-combining of it. This also makes your function more generally useful instead of being cemented to this one particular case.

With the above in mind, here's an attempted re-write:

TUK2 <- function (RT){
  RT[RT < 1000 | RT > 2000] <- NA
  return(RT)
}
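A quick check of the single-vector version, as a minimal sketch: the reaction times in rt are made-up values chosen only to straddle the 1000 and 2000 cut-offs.

```r
# TUK2 as defined in the answer: values outside [1000, 2000] become NA
TUK2 <- function(RT) {
  RT[RT < 1000 | RT > 2000] <- NA
  return(RT)
}

rt <- c(850, 1200, 1999, 2400)  # hypothetical reaction times
trimmed <- TUK2(rt)
print(trimmed)                  # first and last values are replaced by NA
```

Because the function only sees a vector, it can be tested like this without any data frame at all.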

See how much simpler! Now if we want to apply this function to each of the GROUP:SUBJ:OBJECT groupings in your data and store the result in a new column, new_RT, we do this with dplyr:

library(dplyr)
group_by(dat, Group, SubjectID, Object) %>%
    mutate(new_RT = TUK2(RT))

dplyr does the grouping of data, the splitting of data, applies the simple function to each piece, and combines it all back together for us.
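If you would rather stay in base R, the same split-apply-combine step can be sketched with ave, which was mentioned above as an alternative. The toy data here is an assumption: it follows the column layout of the question's example but uses three trials per cell so that each group has more than one value.

```r
TUK2 <- function(RT) {
  RT[RT < 1000 | RT > 2000] <- NA
  return(RT)
}

# Hypothetical toy data: 2 subjects x 2 groups x 2 objects, 3 trials each
set.seed(1)
dat <- expand.grid(SubjectID = c(3772113, 3772468),
                   Group     = c("Easy", "Hard"),
                   Object    = c("A", "B"))
dat <- dat[rep(seq_len(nrow(dat)), each = 3), ]
dat$RT <- rnorm(nrow(dat), 1500, 700)

# ave() applies FUN within each combination of the grouping variables
# and returns a vector in the original row order
dat$new_RT <- ave(dat$RT, dat$Group, dat$SubjectID, dat$Object, FUN = TUK2)
```

ave keeps the rows in place, so the result can be assigned straight back as a column.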


Now, in your question, you said

For each combination of variables 1, 2 and 3 I want to change the reaction time, so that all values above the 3rd quartile + 1.5 × IQR are set to the value of the 3rd quartile + 1.5 × IQR.

This doesn't sound much like what your function does. Based only on this description, I would code this as

group_by(dat, Group, SubjectID, Object) %>%
    mutate(new_RT = pmin(RT, quantile(RT, probs = 0.75) + 1.5 * IQR(RT)))

pmin stands for parallel minimum: it's a vectorized way to take the element-wise smaller of two vectors, recycling the shorter one. Try, e.g., pmin(1:10, 7) to see what it does.
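To make the capping rule concrete, here is a small worked sketch on a single vector. The helper name cap_upper and the sample values are hypothetical, introduced only for illustration; the quantile uses R's default method (type 7).

```r
# Hypothetical helper: cap values at Q3 + 1.5 * IQR
cap_upper <- function(x) {
  upper_fence <- quantile(x, probs = 0.75, names = FALSE) + 1.5 * IQR(x)
  pmin(x, upper_fence)  # element-wise minimum against the fence
}

capped_seq <- pmin(1:10, 7)  # 1 2 3 4 5 6 7 7 7 7

rt <- c(1, 2, 3, 4, 100)     # made-up values with one clear outlier
# Q1 = 2, Q3 = 4, so IQR = 2 and the fence is 4 + 1.5 * 2 = 7
capped_rt <- cap_upper(rt)   # 1 2 3 4 7
```

Only the outlier is pulled down to the fence; everything at or below it is left unchanged.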

In both examples, the dplyr data frame won't be saved, of course, unless you re-assign it with dat <- group_by(dat, ...) etc. This is the functional programming way of doing things - no global assignment.


One additional note: with the re-written function you could still use loops instead of dplyr. I don't know why you would - surely the dplyr syntax is nicer - but I just want to illustrate that the small building-block function is generally useful; it's not "baking in" dplyr in the way that your original function was "baking in" a particular for loop.

for (sub in unique(dat$SubjectID)) {
  for (obj in unique(dat$Object)) {
    for (grp in unique(dat$Group)) {
      dat[dat$SubjectID == sub & 
            dat$Object == obj & 
            dat$Group == grp, "RT"] <-
        TUK2(
          dat[dat$SubjectID == sub & 
                dat$Object == obj & 
                dat$Group == grp, "RT"]
        )
    }
  }
}
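Since the question asked specifically about the apply family, the same building block can also be used with split, lapply, and unsplit from base R. This is a sketch under assumptions: the toy data is hypothetical and only mirrors the column layout of the question's example.

```r
TUK2 <- function(RT) {
  RT[RT < 1000 | RT > 2000] <- NA
  return(RT)
}

# Hypothetical toy data: 2 subjects x 2 groups x 2 objects, 3 trials each
set.seed(1)
dat <- expand.grid(SubjectID = c(3772113, 3772468),
                   Group     = c("Easy", "Hard"),
                   Object    = c("A", "B"))
dat <- dat[rep(seq_len(nrow(dat)), each = 3), ]
dat$RT <- rnorm(nrow(dat), 1500, 700)

groups <- list(dat$SubjectID, dat$Group, dat$Object)
pieces <- split(dat$RT, groups)        # one RT vector per combination
pieces <- lapply(pieces, TUK2)         # apply the function to each piece
dat$new_RT <- unsplit(pieces, groups)  # put results back in row order
```

This is essentially what ave does internally, spelled out step by step.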

1 Comment

Thank you very much for this post, it's been a great help. I knew that global assignments were a no-go, but was a bit blind to the other solution. I also didn't understand the group_by + mutate combination in dplyr until now. So far I've used it only for summarizing. Thanks once again and God bless!
