I'm relatively new to R programming and am trying to figure out how to use custom functions to compute new columns of a data frame with dplyr or data.table in a memory-efficient manner. Can someone please help?
Here is a brief summary of my problem.
Data frames 1 and 2 have the same type and number of columns:
df1 <- data.frame(col1 = c("A", "B", "C"), col2 = c(10,20,30))
df2 <- data.frame(col1 = c("DA", "EE", "FB", "C"), col2 = c(10,20,30,40))
These data frames have millions of records.
Now I want to add a new column to one of the data frames (say df1) by using the values in df2.
library(dplyr)
calculateCol3 <- function(word) {
  df2 %>%
    filter(grepl(paste0(word, "$"), col1)) %>%  # rows of df2 whose col1 ends with word
    summarize(col3 = sum(col2)) %>%
    pull(col3)                                  # return the sum as a plain number
}
df1 %>% group_by(col1) %>% mutate(col3 = calculateCol3(col1))
This method works but it is painfully slow and I guess this is because of copying the data sets too many times. Can someone suggest a better way of doing the same? The expected result is:
col1 col2 col3
A 10 10
B 20 30
C 30 40
I also tried converting the data frames to data.table as follows:
library(data.table)
dt1 <- data.table(df1)
dt2 <- data.table(df2)
dt1[, col3 := calculateCol3(col1), by = 1:nrow(dt1)]
Everything seems to be slow. I'm sure there is a better way to achieve this. Can someone help?
Thanks
df3 = grepl(paste0(word, '$'), df2$col1) should be a binary TRUE/FALSE. How do you expect df3$col2 to behave?
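One possible direction, offered as a rough sketch rather than a definitive answer: the slow part is likely that df2 is scanned with a regular expression once for every row of df1. If the values in df1$col1 are meant as literal suffixes (an assumption; the grepl version treats them as regex patterns), the suffix sums can be computed once per distinct suffix length and then looked up. The helper name suffix_sums below is made up for illustration, and words with no match get 0, mirroring sum() over an empty filter.

```r
df1 <- data.frame(col1 = c("A", "B", "C"), col2 = c(10, 20, 30))
df2 <- data.frame(col1 = c("DA", "EE", "FB", "C"), col2 = c(10, 20, 30, 40))

# Hypothetical helper: for each word, sum `values` over all `targets`
# that end with that word (literal match, no regex).
suffix_sums <- function(words, targets, values) {
  words   <- as.character(words)    # guard against factor columns
  targets <- as.character(targets)
  out <- numeric(length(words))
  for (n in unique(nchar(words))) {
    idx  <- nchar(words) == n
    keep <- nchar(targets) >= n     # shorter targets cannot match
    # last n characters of every remaining target
    suf  <- substring(targets[keep], nchar(targets[keep]) - n + 1)
    sums <- tapply(values[keep], suf, sum)   # one pass over df2 per length
    m    <- match(words[idx], names(sums))   # look each word up by suffix
    out[idx] <- ifelse(is.na(m), 0, as.vector(sums)[m])
  }
  out
}

df1$col3 <- suffix_sums(df1$col1, df2$col1, df2$col2)
print(df1)
```

With the sample data this loops once (all words have length 1), so df2 is aggregated a single time instead of once per row of df1. The same idea should carry over to data.table by aggregating dt2 on the computed suffix and joining, though that variant is not shown here.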