Adding new column using custom function in data frame using dplyr/data.table in R

Question

I'm relatively new to R programming and am trying to figure out how to use custom functions to compute new columns of a data frame using dplyr or data.table in a memory-efficient manner. Can someone please help?

Here is a brief summary of my problem

Data frames 1 and 2 have the same type and number of columns

df1 <- data.frame(col1 = c("A", "B", "C"), col2 = c(10,20,30))
df2 <- data.frame(col1 = c("DA", "EE", "FB", "C"), col2 = c(10,20,30,40))

These data frames have millions of records.

Now I want to add a new column to one of the data frames (say df1) by using the values in df2.

library(dplyr)

calculateCol3 <- function(word) {
  # Sum col2 over the rows of df2 whose col1 ends with `word`
  res <- df2 %>%
    filter(grepl(paste0(word, "$"), col1)) %>%
    summarize(col3 = sum(col2))
  res$col3
}

df1 %>% group_by(col1) %>% mutate(col3 = calculateCol3(col1))

This method works but it is painfully slow and I guess this is because of copying the data sets too many times. Can someone suggest a better way of doing the same? The expected result is:

col1 col2 col3
   A   10   10
   B   20   30
   C   30   40

I also tried converting the data frames to data.table as follows

library(data.table)

dt1 <- data.table(df1)
dt2 <- data.table(df2)

dt1[, col3 := calculateCol3(col1), by = 1:nrow(dt1)]

Everything seems to be slow. I'm sure there is a better way to achieve this. Can someone help?

Thanks

Yeah, as a general rule, you should try to write your function so that it doesn't need to be applied NROW separate times. (It's not clear to me what your function is supposed to do, so I can't help with anything more specific.)
– Frank, Commented Dec 22, 2016 at 4:44
df3 = grepl(paste0(word, '$'), df2$col1) should be a binary TRUE/FALSE. How do you expect df3$col2 to behave?
– Aramis7d, Commented Dec 22, 2016 at 5:13
I have edited the function. I expect the result as follows:
head(df1)
col1 col2 col3
   A   10   10
   B   20   30
   C   30   40
– user7328626, Commented Dec 22, 2016 at 5:24

David Arenburg · Accepted Answer · 2016-12-22 09:18:41Z

3

If you want an efficient solution, I would suggest that you avoid regex and by-row operations. If all your function does is join on the last letter, you can extract that letter without regex and then do a binary join using data.table (for efficiency):

library(data.table)
setDT(df2)[, EndWith := substring(col1, nchar(as.character(col1)))]
setDT(df1)[df2, col3 := i.col2, on = .(col1 = EndWith)]
df1
#    col1 col2 col3
# 1:    A   10   10
# 2:    B   20   30
# 3:    C   30   40
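
As a small base R sketch of why the regex can be dropped: the last character can be taken for the whole vector in one call, and for a one-letter key it gives the same matches as `grepl()` with a `$` anchor. This assumes the keys in df1$col1 are single characters, as in the example; the names `x`, `last_char`, `same_as_regex` and `ends_fb` are only illustrative. Note that `nchar()` does not accept a factor, which is why the code above wraps col1 in `as.character()`.

```r
# Keys as in the question's df2$col1
x <- c("DA", "EE", "FB", "C")

# Last character of each string, computed for the whole vector at once
last_char <- substring(x, nchar(x))
last_char
# [1] "A" "E" "B" "C"

# For a one-letter key, comparing the last character gives the same
# logical vector as the anchored regex used in the question
same_as_regex <- identical(last_char == "A", grepl("A$", x))

# If the keys were longer than one character, endsWith() (base R >= 3.3.0)
# tests a fixed suffix without regex
ends_fb <- endsWith(x, "FB")
```

If the keys can have different lengths, the single-column equi-join above no longer applies directly, and a suffix test such as `endsWith()` or a regex join would be needed instead.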

Now, looking at your function, it seems you are also trying to sum the values in df2$col2 per join. No problem: you can run functions while doing a binary join in data.table too. Let's say this is your df2 (just to illustrate the case where there is more than one value per last letter):

df2 <- data.frame(col1 = c("DA", "FA", "EE", "FB", "C", "fC"), col2 = c(10,20,10,30,40,30))
df2
#   col1 col2
# 1   DA   10
# 2   FA   20
# 3   EE   10
# 4   FB   30
# 5    C   40
# 6   fC   30

The first step is the same

setDT(df2)[, EndWith := substring(col1, nchar(as.character(col1)))]

The second step is again a binary join, but with the tables swapped (df1 is now looked up in df2), adding by = .EACHI and specifying your desired function:

setDT(df2)[df1, .(col2 = i.col2, col3 = sum(col2)), on = .(EndWith = col1), by = .EACHI]
#    EndWith col2 col3
# 1:       A   10   30
# 2:       B   20   30
# 3:       C   30   70
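
Since the question also mentions dplyr, here is a rough equivalent of the same idea: aggregate df2 once per last letter, then join the sums onto df1. This is only a sketch under the same single-letter-key assumption; the names `sums` and `result` are made up for illustration, and `stringsAsFactors = FALSE` is set so that col1 is character in both tables.

```r
library(dplyr)

df1 <- data.frame(col1 = c("A", "B", "C"), col2 = c(10, 20, 30),
                  stringsAsFactors = FALSE)
df2 <- data.frame(col1 = c("DA", "FA", "EE", "FB", "C", "fC"),
                  col2 = c(10, 20, 10, 30, 40, 30),
                  stringsAsFactors = FALSE)

# Step 1: one pass over df2, summing col2 per last letter of col1
sums <- df2 %>%
  mutate(EndWith = substring(col1, nchar(col1))) %>%
  group_by(EndWith) %>%
  summarise(col3 = sum(col2))

# Step 2: attach the sums to df1; keys without a match would get NA
result <- df1 %>%
  left_join(sums, by = c("col1" = "EndWith"))

result
#   col1 col2 col3
# 1    A   10   30
# 2    B   20   30
# 3    C   30   70
```

Like the data.table version, this touches each table once instead of filtering df2 again for every row of df1.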

edited Dec 22, 2016 at 9:18

answered Dec 22, 2016 at 9:13

David Arenburg


2 Comments

user7328626 Over a year ago

Thanks a lot. That solution is really fast.

nik Over a year ago

@David Arenburg I liked your answer here because that question was duplicated! You rock

thelatemail · Answer · 2016-12-22 05:48:11Z

0

Using the fuzzyjoin package, I think you can make this work. E.g.:

#install.packages("fuzzyjoin")
library(fuzzyjoin)
df1$col1regex <- paste0(df1$col1,"$")
regex_join(df2, df1, by=c(col1="col1regex"), mode="right")

#  col1.x col2.x col1.y col2.y col1regex
#1     DA     10      A     10        A$
#2     FB     30      B     20        B$
#3      C     40      C     30        C$

answered Dec 22, 2016 at 5:48

thelatemail
