I'm trying to loop through a data.table and do certain processing on the data:

  • provide output which is based on the combined output from each row processed

  • record various details of the processing in a separate table called statsTable which gets updated at this and other stages of the process

The actual processing is more complex (with records being included in the output for each iteration of apply) and with bigger volumes than the code below which I have simplified right down for this question.

However, I can't see how to update statsTable, because lapply prevents this from happening (by design, I believe, so that functions can't have unintended side effects), and so the processing time remains at zero. Is there a way to do this and still use one of the apply functions? I know I can use a for loop but would prefer not to if possible.

library(data.table)
library(dplyr)  # for bind_rows()

mainTable <- data.table(year = rep(2016:2020), value = runif(5, min=0, max=50000000))
statsTable <- data.table(year = rep(2016:2020), procTime = 0)
setkey(statsTable, year) 

output <- bind_rows(lapply(mainTable$year, function(fileYear) {
  randomValue = as.integer(mainTable[year == fileYear]$value)
  print(paste0(fileYear, ":", randomValue))
  start <- proc.time()[[3]]
  for(i in 1:randomValue) {}
  elapsed = proc.time()[[3]]- start
  statsTable[year == fileYear]$procTime = elapsed
  print(elapsed)
  data.table(year = fileYear, loopsPerSecond = randomValue / elapsed)
}))
print(output)
print(statsTable)

3 Answers

3

One way to modify a variable outside the function passed to lapply is the <<- operator, which assigns in the enclosing (parent) environments rather than in the function's local environment. If you change the line

statsTable[year == fileYear]$procTime = elapsed

to

statsTable[year == fileYear]$procTime <<- elapsed

you should be able to update the statsTable variable.

# print(statsTable)
#   year procTime
# 1: 2016    1.071
# 2: 2017    0.496
# 3: 2018    0.623
# 4: 2019    0.771
# 5: 2020    0.941
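
To see the difference between <- and <<- in isolation, here is a minimal base R sketch. The names years and timings are made up for illustration and do not come from the question; a plain numeric vector stands in for statsTable.

```r
# Hypothetical names for illustration only.
years <- 2016:2020
timings <- setNames(numeric(length(years)), years)

# With `<-`, the replacement creates a local copy inside the function,
# so the global `timings` is left untouched.
invisible(lapply(seq_along(years), function(i) {
  timings[i] <- i * 0.1
}))
unchanged <- all(timings == 0)

# With `<<-`, R looks for `timings` in the enclosing environments
# and modifies the object it finds there (here, the global one).
invisible(lapply(seq_along(years), function(i) {
  timings[i] <<- i * 0.1
}))
print(timings)
```

After the first lapply call the vector is still all zeros; only the second call, using <<-, fills it in.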

2

You're using data.table but are a bit light on some canonical data.table, so first I will recommend some vignettes to brush up on data.table essentials, see here: https://github.com/Rdatatable/data.table/wiki/Getting-started

In particular, given that statsTable is a data.table, you should not use = to assign to rows of statsTable, but instead use the data.table assignment operator, :=, restricted to the row you want to update:

statsTable[year == fileYear, procTime := elapsed]

This skirts around the issue in your original question because assignment is done without copies, so <<- is not needed (however, it is useful to know about <<- for more general use cases like yours where assignment should happen outside the local scope, but with the caveat that this should really come up very rarely).
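
As a small sketch of that point, assuming the data.table package is installed: the table stats and the values written to it are hypothetical, chosen only to show that := called inside lapply changes the table seen by the caller.

```r
library(data.table)

# Hypothetical table for illustration only.
stats <- data.table(year = 2016:2020, procTime = 0)

invisible(lapply(stats$year, function(y) {
  # `:=` modifies `stats` in place (by reference), so the change is
  # visible outside the function without `<<-`.
  stats[year == y, procTime := (y - 2015) * 0.1]
}))
print(stats)
```

Each call updates only the row whose year matches y, which is why the i expression (year == y) is needed alongside :=.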

Using more canonical data.table I would rewrite your analysis as:

mainTable[ , by=year, {
  randomValue = as.integer(value)
  cat(sprintf('%d:%d\n', .BY$year, randomValue))
  start <- proc.time()[[3L]]
  for(i in 1:randomValue) {}
  elapsed = proc.time()[[3L]]- start
  statsTable[.BY, procTime := elapsed]
  print(elapsed)
  .(loopsPerSecond = randomValue / elapsed)
}]

(on your example, this runs substantially slower than your original code due to this somewhat technical issue)

6 Comments

@Liman you mentioned your actual use case is different from your sample code. Please have a look & try to adapt my suggestion to your own code & see if you are affected by the performance degradation. If so, I can suggest an alternative that doesn't fall victim to the same issue.
I understand that the main issue raised by @Chris is that the column procTime of the data.table statsTable is not getting updated inside lapply. Specifically, the question is "Is there a way to do this and still use one of the apply functions?". So I do not really understand your point when you said my actual use case is different from my sample code.
@Liman I'm referring to this paragraph in your Q "The actual processing is more complex"
Ah @MichaelChirico! @Chris asked the question, not me :)
@MichaelChirico (and Liman), thanks, this is really interesting - I've looked at the vignette and can see there is more to DT than I realised. I can see you've replaced the apply with the data.table function that iterates over the table. I just have a few questions. If I want to merge the rows created as a result of each iteration (as per my original), would I need to include a row bind to another data.table within each? I can't see from the syntax how the .(assignment) fits inside the {}. Also, do you mean this approach runs slower, and is there a way around that? Thanks, Chris.
1

Perhaps lapply() is not the best choice to meet the OP's expectations.

The OP expects two results from his operation,

  • the result output and
  • an updated version of statsTable

Unfortunately, according to the documentation, lapply() returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.

Instead of twisting lapply(), I suggest using a for loop to iterate over mainTable$year and update both results simultaneously:

out_list <- vector("list", length(mainTable$year))
for (idx in seq_along(mainTable$year)) {
  fileYear <- mainTable$year[idx]
  randomValue = as.integer(mainTable[idx, "value"])
  cat(fileYear, ":", randomValue, "\n")
  start <- proc.time()[[3]]
  for(i in 1:randomValue) {}
  elapsed = proc.time()[[3]]- start
  statsTable[year == fileYear]$procTime = elapsed
  cat(elapsed, "\n")
  out_list[[idx]] <- data.table(year = fileYear, loopsPerSecond = randomValue / elapsed)
}
output <- rbindlist(out_list)
print(output)
#    year loopsPerSecond
# 1: 2016       71127692
# 2: 2017       79373691
# 3: 2018       96125167
# 4: 2019       90166990
# 5: 2020       83897274
print(statsTable)
#    year procTime
# 1: 2016     0.24
# 2: 2017     0.11
# 3: 2018     0.03
# 4: 2019     0.29
# 5: 2020     0.38

out_list <- vector("list", length(mainTable$year)) initialises an empty list with as many slots as are needed to store the results. This avoids growing an object inside a loop, which is considered bad practice because it can degrade performance.
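
The two patterns can be compared side by side in a minimal base R sketch. The iteration count n and the squared values are placeholders for illustration, not part of the original processing.

```r
n <- 5L  # hypothetical number of iterations

# Growing: the list is extended by one element on every pass,
# which may force R to reallocate and copy it repeatedly.
grown <- list()
for (i in seq_len(n)) {
  grown[[length(grown) + 1L]] <- i^2
}

# Preallocating: the list has its final length from the start,
# so each pass only fills an existing slot.
prealloc <- vector("list", n)
for (i in seq_len(n)) {
  prealloc[[i]] <- i^2
}

same <- identical(grown, prealloc)
print(same)
```

Both loops produce the same list; the difference only matters for speed and memory, and it tends to become noticeable as the number of iterations grows.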
