
I am trying to get away from loops in R and want to both vectorize and speed up a section of my code.

I want to replace a for loop with lapply, but I am getting an error.

Reproducible example:

library(dplyr)

# This works using a For loop -----------------------------------

# create sample data frame
df <- data.frame(Date  = rep(c("Jan1", "Jan2", "Jan3"), 3),
                 Item  = c(rep("A", 3), rep("B", 3), rep("C", 3)),
                 Value = 10:18)


diff <- numeric() # initialize

# Loop through each item and take difference of latest value from earlier values
for (myitem in unique(df$Item)) {

    y = df[df$Date == last(df$Date) & df$Item == myitem, "Value"]  # Latest value for an item

    x = df[df$Item == myitem, "Value"]                             # Every value for an item

    diff <- c(diff, y-x)

}

df_final <- mutate(df, Difference = diff)
df_final


I found related questions here (lapply), here (lapply), and here ($ operator) but none really helped me with my question.

Here is how I tried to vectorize using lapply:

# Same thing using vectorized approach ----------------------------------

mylist <- list(unique(df$Item))

myfunction <- function(df = df, diff = numeric()) {

    y = df[df$Date == last(df$Date) & df$Item == mylist, "Value"]  # Latest value for an item

    x = df[df$Item == mylist, "Value"]                             # Every value for an item

    diff <- c(diff, y-x)

}

# throws error
diff_vector <- unlist(lapply(mylist, myfunction))

df_final2 <- mutate(df, Difference = diff_vector)
df_final2

My real data set has hundreds of thousands of rows. If someone could point me in the right direction on how to vectorize this so that it gives the same output as the for loop, I would appreciate it.

Thanks!

lapply is a loop. It generally won't make your code faster. It's just nicer and more convenient syntax. Commented Jun 27, 2018 at 5:52

3 Answers

So lapply isn't being used quite right here, that's all!

lapply applies a function to each element of a list. To be explicit, it takes each element of a list, and applies the function to that element.

So if you want it to apply a function to several subsets of a data frame, you need to give it a list whose elements are those subsets. So let's create that list first.

We can do this using the split function, which splits your data frame into several data frames based on a column and stores them as a list. A list of subsets of a data frame. Perfect!

So let's replace the line where you create mylist with this line instead.

mylist <- split(df,df[,c("Item")])
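To see what split hands to lapply, here is a small sketch using the sample data from the question. The name pieces is just an illustrative choice, and stringsAsFactors = FALSE is set only so the result does not depend on the R version.

```r
# Sample data from the question
df <- data.frame(Date  = rep(c("Jan1", "Jan2", "Jan3"), 3),
                 Item  = c(rep("A", 3), rep("B", 3), rep("C", 3)),
                 Value = 10:18,
                 stringsAsFactors = FALSE)

# One data frame per distinct Item, stored in a named list
pieces <- split(df, df$Item)

names(pieces)               # the item labels become the list names
sapply(pieces, nrow)        # number of rows in each subset
is.data.frame(pieces[["B"]]) # each element is itself a data frame
```

Each element of pieces is a complete data frame for one item, which is exactly what the function passed to lapply will receive.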

Now we just need to make some changes to myfunction. Remember we're now passing in data that is already subsetted, so we can remove the conditions that check whether the Item matches. The function will be applied to each of these data frames in its entirety.

myfunction <- function(df) {
    y <- df[df$Date == last(df$Date), "Value"]  # Latest value for an item

    x <- df[, "Value"]                          # Every value for an item

    y - x                                       # Return the differences for this item
}

And the rest my friend, is exactly as you have it :)
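Putting the pieces together, a minimal end-to-end sketch might look like the following. The helper name diff_from_latest is hypothetical, and the sketch assumes the rows of df are already grouped by Item in the same order that split returns the subsets (sorted item labels); if they are not, the unlisted vector will not line up with the rows.

```r
library(dplyr)

# Sample data from the question
df <- data.frame(Date  = rep(c("Jan1", "Jan2", "Jan3"), 3),
                 Item  = c(rep("A", 3), rep("B", 3), rep("C", 3)),
                 Value = 10:18,
                 stringsAsFactors = FALSE)

# Hypothetical helper: receives the data frame for ONE item
diff_from_latest <- function(piece) {
  latest <- piece[piece$Date == last(piece$Date), "Value"]  # latest value for the item
  latest - piece[, "Value"]                                 # difference from every value
}

pieces <- split(df, df$Item)

# Assumption: df is ordered by Item, so the pieces concatenate back in row order
diff_vector <- unlist(lapply(pieces, diff_from_latest), use.names = FALSE)

df_final2 <- mutate(df, Difference = diff_vector)
df_final2
```

On the sample data this should reproduce the Difference column from the for loop version.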


2 Comments

Got it, so my code was effectively giving the function a list of characters and not a list of data frames. As a result, when it hit the code df$Date, it threw an error because lapply had not supplied a data frame. Is that correct?
Bingo! It couldn't find Date in the vector of item names it was handed, so it was getting cranky with you :) My apologies, I didn't end up addressing that concern in the original answer! I'll take solace that it was sufficiently clear that you were able to work that out, though!
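For anyone who wants to check this for themselves, here is a small sketch of what the original mylist actually contains. It assumes Item is stored as a character column (forced here with stringsAsFactors = FALSE), and it catches the error with tryCatch rather than quoting the message from the question.

```r
# Sample data from the question, with Item kept as character
df <- data.frame(Date  = rep(c("Jan1", "Jan2", "Jan3"), 3),
                 Item  = c(rep("A", 3), rep("B", 3), rep("C", 3)),
                 Value = 10:18,
                 stringsAsFactors = FALSE)

# The list built in the question
mylist <- list(unique(df$Item))

length(mylist)               # a single element, holding all item names at once
is.data.frame(mylist[[1]])   # FALSE: it is a plain vector, not a data frame

# Using $ on that vector (what df$Date does inside the function) raises an error
res <- tryCatch(mylist[[1]]$Date,
                error = function(e) conditionMessage(e))
res
```

So lapply calls the function once, with the whole vector of item names standing in for df.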

I'm not sure lapply is the right approach to take. I'd stick with mutate - which you already seem to be using:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
df <- data.frame(Date  = rep(c("Jan1", "Jan2", "Jan3"), 3),
  Item  = c(rep("A", 3), rep("B", 3), rep("C", 3)),
  Value = 10:18)

df <- df %>%
  group_by(Item) %>%
  mutate(diff = last(Value) - Value)

df
#> # A tibble: 9 x 4
#> # Groups:   Item [3]
#>   Date  Item  Value  diff
#>   <fct> <fct> <int> <int>
#> 1 Jan1  A        10     2
#> 2 Jan2  A        11     1
#> 3 Jan3  A        12     0
#> 4 Jan1  B        13     2
#> 5 Jan2  B        14     1
#> 6 Jan3  B        15     0
#> 7 Jan1  C        16     2
#> 8 Jan2  C        17     1
#> 9 Jan3  C        18     0

Created on 2018-06-27 by the reprex package (v0.2.0).

This does assume that the observations (at least within the "Item" group) are arranged in order. If not, add arrange(Date) %>% as a step after group_by.
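As a rough illustration of the ordering caveat, the sketch below shuffles the sample rows and sorts them before computing the difference. Note two assumptions: it sorts with arrange(Item, Date) ahead of group_by, which is a slight variation on the step suggested above, and it relies on the Date strings sorting correctly as text, which holds for "Jan1" to "Jan3" but would not for something like "Jan10". With real data a proper Date column would be safer.

```r
library(dplyr)

# Sample data from the question
df <- data.frame(Date  = rep(c("Jan1", "Jan2", "Jan3"), 3),
                 Item  = c(rep("A", 3), rep("B", 3), rep("C", 3)),
                 Value = 10:18,
                 stringsAsFactors = FALSE)

# Deliberately put the rows out of order within each item
shuffled <- df[c(3, 1, 2, 6, 4, 5, 9, 7, 8), ]

result <- shuffled %>%
  arrange(Item, Date) %>%              # restore chronological order within each item
  group_by(Item) %>%
  mutate(diff = last(Value) - Value) %>%
  ungroup()

result
```

Without the arrange step, last(Value) would pick whichever row happens to come last in each group rather than the latest date.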

1 Comment

Melissa, good point! Thank you for pointing this out as this would also work with the rest of my code and is simpler.

You could create a table with the latest value for each item, join it with the original table, and take the difference, or use data.table to add a column holding the latest value:

library(data.table)
df <- data.frame(Date  = rep(c("Jan1", "Jan2", "Jan3"), 3),
                 Item  = c(rep("A", 3), rep("B", 3), rep("C", 3)),
                 Value = 10:18)

setDT(df)

df[,latestVal:=last(Value),by=.(Item)][,diff:=latestVal-Value][,.(Date,Item,Value,diff)]
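The first option mentioned above (a separate table of latest values, joined back to the original) is not shown in code, so here is one possible sketch using dplyr rather than data.table. The names latest and joined are illustrative, and it assumes the rows within each item are already in date order.

```r
library(dplyr)

# Sample data from the question
df <- data.frame(Date  = rep(c("Jan1", "Jan2", "Jan3"), 3),
                 Item  = c(rep("A", 3), rep("B", 3), rep("C", 3)),
                 Value = 10:18,
                 stringsAsFactors = FALSE)

# One row per item with its latest value (assumes rows are in date order)
latest <- df %>%
  group_by(Item) %>%
  summarise(latestVal = last(Value))

# Join the latest value back onto every row, then take the difference
joined <- df %>%
  left_join(latest, by = "Item") %>%
  mutate(diff = latestVal - Value) %>%
  select(Date, Item, Value, diff)

joined
```

This does the same work as the grouped mutate, just in two explicit steps, which can be handy when the table of latest values is needed elsewhere.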
