1

I'm relatively new to R programing and am trying to figure out how to use custom functions to evaluate new columns of a data frame using dplyr or data.table in a memory efficient manner. Can someone please help

Here is a brief summary of my problem

Data frames 1 and 2 have the same type and number of columns

df1 <- data.frame(col1 = c("A", "B", "C"), col2 = c(10,20,30))
df2 <- data.frame(col1 = c("DA", "EE", "FB", "C"), col2 = c(10,20,30,40))

These data frames have millions of records.

Now I want to add a new column to one of the data frames (say df1) by using the values in df2.

library(dplyr)

calculateCol3 <- function(word) {
df2 %>%
    filter(grepl(paste0(word, "$"),col1) )%>%
    summarize(col3= sum(col2))
col3
}

df1 %>% group_by(col1) %>% mutate(col3 = calcualteCol3(col1))

This method works but it is painfully slow and I guess this is because of copying the data sets too many times. Can someone suggest a better way of doing the same? The expected result is:

col1 col2 col3
   A   10   10
   B   20   30
   C   30   40 

I also tried converting the data frames to data.table as follows

dt1 <- data.table(df1)
dt2 <- data.table(df2)

dt1[, col3 := calculateCol3(col1)}, by = 1:nrow(dt1)]

Everything seems to be slow. Am sure there is a better way to achieve this. Can someone help

Thanks

7
  • Yeah, as a general rule, you should try to write your function so that it doesn't need to be applied NROW separate times. (It's not clear to me what your function is supposed to do, so I can't help with anything more specific.) Commented Dec 22, 2016 at 4:44
  • 2
    Your example code doesn't work at all. Commented Dec 22, 2016 at 4:45
  • df3 = grepl(paste0(word, '$'), df2$col1)) should be a binary TRUE/FALSE . How do you expect df3$col2 to behave? Commented Dec 22, 2016 at 5:13
  • I have edited the function. I expect the result as follows Commented Dec 22, 2016 at 5:21
  • I have edited the function. I expect the result as follows head(df1) col1 col2 col3 A 10 10 B 20 30 C 30 40 Commented Dec 22, 2016 at 5:24

2 Answers 2

3

If you want an efficient solution I would suggest you won't use regex and don't do by-row operations. If all your function is doing is to join by the last letter, you could just get that latter without using regex and then do a binary join using data.table (for efficiency)

library(data.table)
setDT(df2)[, EndWith := substring(col1, nchar(as.character(col1)))]
setDT(df1)[df2, col3 := i.col2, on = .(col1 = EndWith)]
df1
#    col1 col2 col3
# 1:    A   10   10
# 2:    B   20   30
# 3:    C   30   40

Now, by looking at your function, it seems like you also trying to sum the values in df2$col2 per join. No problem, you can run functions while doing a binary join in data.table too. Lets say this is your df2 (just to illustrate when you have more than a single value per last letter)

df2 <- data.frame(col1 = c("DA", "FA", "EE", "FB", "C", "fC"), col2 = c(10,20,10,30,40,30))
df2
#   col1 col2
# 1   DA   10
# 2   FA   20
# 3   EE   10
# 4   FB   30
# 5    C   40
# 6   fC   30

The first step is the same

setDT(df2)[, EndWith := substring(col1, nchar(as.character(col1)))]

While the second step will involve a binary join- just to the opposite way, while adding , by = .EACHI and specifying your desired function

setDT(df2)[df1, .(col2 = i.col2, col3 = sum(col2)), on = .(EndWith = col1), by = .EACHI]
#    EndWith col2 col3
# 1:       A   10   30
# 2:       B   20   30
# 3:       C   30   70
Sign up to request clarification or add additional context in comments.

2 Comments

Thanks a lot. that solution is really fast
@David Arenburg I liked your answer here because that question was duplicated ! you Rock
0

Using the fuzzyjoin package, I think you can make this work. E.g.:

#install.packages("fuzzyjoin")
df1$col1regex <- paste0(df1$col1,"$")
regex_join(df2, df1, by=c(col1="col1regex"), mode="right")

#  col1.x col2.x col1.y col2.y col1regex
#1     DA     10      A     10        A$
#2     FB     30      B     20        B$
#3      C     40      C     30        C$

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.