Update outside scope variable with apply function

Question

I'm trying to loop through a data.table and do certain processing on the data:

- provide output which is based on the combined output from each row processed
- record various details of the processing in a separate table called statsTable, which gets updated at this and other stages of the process

The actual processing is more complex (records are added to the output on each iteration of apply) and involves bigger volumes than the code below, which I have simplified right down for this question.

However, I can't see how to update statsTable from inside lapply: the assignment has no effect outside the function, so the processing time remains at zero. I believe this is by design, so that functions can't have unintended consequences. Is there a way to do this and still use one of the apply functions? I know I can use a for loop, but I would prefer not to if possible.

library(data.table)
library(dplyr)  # for bind_rows()

mainTable <- data.table(year = rep(2016:2020), value = runif(5, min=0, max=50000000))
statsTable <- data.table(year = rep(2016:2020), procTime = 0)
setkey(statsTable, year)

output <- bind_rows(lapply(mainTable$year, function(fileYear) {
  randomValue = as.integer(mainTable[year == fileYear]$value)
  print(paste0(fileYear, ":", randomValue))
  start <- proc.time()[[3]]
  for(i in 1:randomValue) {}
  elapsed = proc.time()[[3]]- start
  statsTable[year == fileYear]$procTime = elapsed
  print(elapsed)
  data.table(year = fileYear, loopsPerSecond = randomValue / elapsed)
}))
print(output)
print(statsTable)

Liman · Accepted Answer · 2020-09-06 21:29:32Z

3

One way to modify a variable defined outside the function passed to an apply function is the <<- operator, which searches the enclosing (parent) environments for the variable and assigns to it there. If you change the line

statsTable[year == fileYear]$procTime = elapsed

to

statsTable[year == fileYear]$procTime <<- elapsed

you should be able to update the statsTable variable.

# print(statsTable)
#   year procTime
# 1: 2016    1.071
# 2: 2017    0.496
# 3: 2018    0.623
# 4: 2019    0.771
# 5: 2020    0.941
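
To see the difference between the two assignment forms in isolation, here is a minimal sketch in base R. It assumes a named numeric vector as a stand-in for statsTable and a dummy elapsed value instead of a real timing; the names stats_local and stats_outer are made up for the illustration.

```r
# Two identical vectors defined outside the function (hypothetical stand-ins
# for statsTable).
stats_local <- c("2016" = 0, "2017" = 0, "2018" = 0)
stats_outer <- stats_local

invisible(lapply(names(stats_local), function(yr) {
  elapsed <- as.numeric(yr) - 2015  # dummy value in place of a measured time
  stats_local[yr] <- elapsed    # plain assignment: changes a copy local to the function
  stats_outer[yr] <<- elapsed   # superassignment: changes the variable outside
  NULL
}))

print(stats_local)  # still all zeros
print(stats_outer)  # holds the values assigned inside lapply
```

The plain assignment silently creates a new variable inside the function on every call, which is why the original code ran without error but left the outer table untouched.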

MichaelChirico · Accepted Answer · 2020-09-07 02:05:12Z

2

You're using data.table, but your code is a bit light on canonical data.table idioms, so first I recommend some vignettes to brush up on the essentials, see here: https://github.com/Rdatatable/data.table/wiki/Getting-started

In particular, given that statsTable is a data.table, you should not use = to assign values in statsTable; use the data.table assignment operator, :=, instead (keeping the row filter in i so that only the matching year is updated):

statsTable[year == fileYear, procTime := elapsed]

This sidesteps the issue in your original question: := modifies statsTable by reference, without making a copy, so <<- is not needed. (It is still useful to know about <<- for more general cases like yours, where assignment must happen outside the local scope, with the caveat that this should really come up very rarely.)
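
As a minimal sketch of that point, the snippet below keeps lapply and updates statsTable from inside the anonymous function with := alone. It assumes the data.table package is installed, uses a three-row table, and substitutes a dummy elapsed value for a real timing.

```r
library(data.table)

statsTable <- data.table(year = 2016:2018, procTime = 0)

invisible(lapply(statsTable$year, function(fileYear) {
  elapsed <- fileYear - 2015  # dummy value standing in for a measured time
  # := updates the matching row of the outer table in place; no <<- involved
  statsTable[year == fileYear, procTime := elapsed]
  NULL
}))

print(statsTable)  # procTime now holds the values set inside lapply
```

Because the call is a subset expression rather than an assignment to the name statsTable, no local copy of the table is created inside the function.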

Using more canonical data.table I would rewrite your analysis as:

mainTable[ , by=year, {
  randomValue = as.integer(value)
  cat(sprintf('%d:%d\n', .BY$year, randomValue))
  start <- proc.time()[[3L]]
  for(i in 1:randomValue) {}
  elapsed = proc.time()[[3L]]- start
  statsTable[.BY, procTime := elapsed]
  print(elapsed)
  .(loopsPerSecond = randomValue / elapsed)
}]

(on your example, this runs substantially slower than your original code due to this somewhat technical issue)
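
A stripped-down version of the grouped form may make its shape easier to see. This is only a sketch: it assumes data.table is installed, uses three rows with fixed values, and replaces the timed loop with a dummy elapsed value. What each group returns via .() is row-bound by data.table itself, so no separate bind step is needed.

```r
library(data.table)

mainTable  <- data.table(year = 2016:2018, value = c(10, 20, 30))
statsTable <- data.table(year = 2016:2018, procTime = 0)
setkey(statsTable, year)

output <- mainTable[, {
  elapsed <- value / 10                 # dummy value in place of a measured time
  statsTable[.BY, procTime := elapsed]  # keyed join on the current group's year
  .(loopsPerSecond = value / elapsed)   # one row per group
}, by = year]

print(output)      # one combined table with columns year and loopsPerSecond
print(statsTable)  # procTime filled in as a side effect
```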

edited Sep 7, 2020 at 2:05

answered Sep 7, 2020 at 1:47


6 Comments

MichaelChirico Over a year ago

@Liman you mentioned your actual use case is different from your sample code. Please have a look & try to adapt my suggestion to your own code & see if you are affected by the performance degradation. If so, I can suggest an alternative that doesn't fall victim to the same issue.

Liman Over a year ago

I understand that the main issue raised by @Chris is that the column procTime of the data.table statsTable is not getting updated inside lapply. Specifically, the question is "Is there a way to do this and still use one of the apply functions?" So I do not really understand your point when you said my actual use case is different from my sample code.

MichaelChirico Over a year ago

@Liman I'm referring to this paragraph in your Q "The actual processing is more complex"

Liman Over a year ago

Ah @MichaelChirico! @Chris asked the question, not me :)

Chris Over a year ago

@MichaelChirico (and Liman), thanks, this is really interesting - I've looked at the vignette and can see there is more to DT than I realised. I can see you've replaced the apply with the data.table function that iterates over the table. I just have a few questions. If I want to merge the rows created as a result of each iteration (as per my original), would I need to include a row bind to another data.table within each? I can't see from the syntax how the .(assignment) fits inside the {}. Do you mean this approach runs slower and is there a way around that? Thanks, Chris.

Uwe · Accepted Answer · 2020-09-09 07:59:25Z

Perhaps lapply() is not the best choice to meet the OP's expectations.

The OP expects two results from the operation:

- the result output, and
- an updated version of statsTable

Unfortunately, according to the documentation, lapply() "returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X". In other words, it hands back a single result object, not two.
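
For reference, staying within that contract would mean packing both pieces into each list element and separating them afterwards. The sketch below is not taken from the question or the other answers; it assumes data.table is installed, uses a three-row mainTable with fixed values, and substitutes a dummy elapsed value for the timed loop.

```r
library(data.table)

mainTable <- data.table(year = 2016:2018, value = c(10, 20, 30))

# Each call returns a two-element list: one output row and one stats row.
results <- lapply(seq_len(nrow(mainTable)), function(idx) {
  fileYear <- mainTable$year[idx]
  value    <- mainTable$value[idx]
  elapsed  <- value / 10  # dummy value in place of a measured time
  list(
    output = data.table(year = fileYear, loopsPerSecond = value / elapsed),
    stats  = data.table(year = fileYear, procTime = elapsed)
  )
})

# Split the combined list into the two tables the OP wants.
output     <- rbindlist(lapply(results, `[[`, "output"))
statsTable <- rbindlist(lapply(results, `[[`, "stats"))
```

This avoids side effects entirely, at the cost of an extra pass to pull the two results apart.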

Instead of twisting lapply(), I suggest using a for loop to iterate over mainTable$year and update both results in the same pass:

out_list <- vector("list", length(mainTable$year))
for (idx in seq_along(mainTable$year)) {
  fileYear <- mainTable$year[idx]
  randomValue = as.integer(mainTable$value[idx])
  cat(fileYear, ":", randomValue, "\n")
  start <- proc.time()[[3]]
  for(i in 1:randomValue) {}
  elapsed = proc.time()[[3]]- start
  statsTable[year == fileYear]$procTime = elapsed
  cat(elapsed, "\n")
  out_list[[idx]] <- data.table(year = fileYear, loopsPerSecond = randomValue / elapsed)
}
output <- rbindlist(out_list)
print(output)

   year loopsPerSecond
1: 2016       71127692
2: 2017       79373691
3: 2018       96125167
4: 2019       90166990
5: 2020       83897274

print(statsTable)

   year procTime
1: 2016     0.24
2: 2017     0.11
3: 2018     0.03
4: 2019     0.29
5: 2020     0.38

out_list <- vector("list", length(mainTable$year)) initialises an empty list with as many slots as are needed to store the results. This avoids growing an object inside the loop, which is considered bad practice because it can degrade performance.
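
The pre-allocation pattern can be sketched on its own in base R. The example is illustrative only: the variable n and the columns id and square are made up, and data.frame with do.call(rbind, ...) stands in for data.table and rbindlist() so that the snippet needs no packages.

```r
n <- 5

# Allocate all slots up front instead of appending inside the loop.
out_list <- vector("list", n)

for (idx in seq_len(n)) {
  # Fill the slot reserved for this iteration.
  out_list[[idx]] <- data.frame(id = idx, square = idx^2)
}

# Combine the per-iteration pieces once, after the loop.
output <- do.call(rbind, out_list)
print(output)
```

Binding once at the end also avoids the repeated copying that comes from calling rbind() on a growing result in every iteration.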
