I'm trying to automate a matching task: I have built a loop to match people across my two data frames. Here is an example of those two data frames.

Some libraries might be needed, so here are all the ones I use:

library(readr)
library(stringi)
library(stringr)
library(dplyr)
library(tidyr)
library(rlang)
library(Hmisc)
library(sqldf)
library(tcltk)
library(tcltk2)
library(gWidgets2)
library(gWidgets2tcltk)
df1 <- data.frame(
  id = c(1, 2, 3),
  name = c("John", "Jane", "Jim"),
  age = c(25, 30, 35)
)

df2 <- data.frame(
  id_2 = c(4, 5, 3,9),
  name_2 = c("Johny", "Janey", "Jim","Gar"),
  age_2 = c(26, 31, 35,NA)
)

I also have some conditions, stored as strings, that I get from readline():

# conditions obtained via readline()
condition =c("df1$id[i]==df2$id_2[j]","df1$name[i]==df2$name_2[j]","df1$age[i]==df2$age_2[j] && !is.na(df2$age_2[j])")

I want a loop that adds one point to a score for every condition that is met, and puts pairs of people who reach a minimum similarity score into a new data frame.

Here is what I tried, but it takes far too long to execute:

# new data frame that stores the similarity score between people
df3 <- data.frame(id=integer(), id_2=integer(), score=integer())

# start the timer
start_time <- Sys.time()

# loop over every pair of rows in df1 and df2
for (i in 1:nrow(df1)){
  for (j in 1:nrow(df2)){
    score = 0

    # check each condition
    # (eval(parse()) is the only way I know to use a string condition in an if statement)
    for (k in 1:length(condition)){
      if(eval(parse(text = condition[k]))){
        score = score + 1
      }
    }

    # add the pair to the new data frame if the score test passes
    if(score == (length(condition))/2){
      df3 <- rbind(df3, data.frame(id=df1[i, 1], id_2=df2[j, 1], score=score))
    }
  }
}

# measure the elapsed time
end_time <- Sys.time()
elapsed_time <- difftime(end_time, start_time, units = "secs")
elapsed_time

View(df3)
# the expected result in df3:
id  id_2  score
3   3      3
# then I just need to merge by id and id_2 to get all the other information
# what you should get after merging:
id  id_2  score  name  age  name_2 age_2
3   3      3     "Jim"  35  "Jim"  35
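
A minimal sketch of the merge step described above, using base R merge(). It assumes df3 already holds the single matching pair shown in the expected result (hard-coded here so the snippet runs on its own); the name `merged` is only illustrative.

```r
# Example data from the question
df1 <- data.frame(
  id = c(1, 2, 3),
  name = c("John", "Jane", "Jim"),
  age = c(25, 30, 35)
)
df2 <- data.frame(
  id_2 = c(4, 5, 3, 9),
  name_2 = c("Johny", "Janey", "Jim", "Gar"),
  age_2 = c(26, 31, 35, NA)
)

# Assumption: the scoring step produced this one pair
df3 <- data.frame(id = 3, id_2 = 3, score = 3)

# Attach the df1 columns by id, then the df2 columns by id_2
merged <- merge(df3, df1, by = "id")
merged <- merge(merged, df2, by = "id_2")

# merge() puts the key column first, so restore the column order from the question
merged <- merged[, c("id", "id_2", "score", "name", "age", "name_2", "age_2")]
merged
```

Both merges are inner joins, so only pairs present in df3 are kept.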

I need to reproduce this for 2 data frames of 683870 * 3681.

I already wrote SQL code that is almost instant, but it has no score, so some clients with small differences are left out. I'm also going to add a readline() for the minimum score later.

Edit: I get the conditions from readline() because I am trying to make the code accessible to people who do not regularly use R:

condition = readline("what data you wanna compare exemple: [data1==data2] ")

As for if(score == (length(condition))/2): I know it is not useful here, but it is just for the example. I will add a readline() for the score too.

  • 2
    Can you provide an expected output, please? There's almost certainly a much more efficient way of doing this. Seems like you might just need to do a left join? Commented Aug 16, 2023 at 14:09
  • 2
    This code is never going to produce anything in df3. Your test for adding a row to df3 is if(score == (length(condition))/2). However, condition is length 3, so you are testing whether score == 1.5, which it never can, because it starts at 0 and you only ever increment by 1. Can you explain in words what you're trying to do as well as posting expected output? Commented Aug 16, 2023 at 14:13
  • 4
    BTW, iteratively adding rows to a frame using rbind(old, newrow) works in practice but scales horribly, see "Growing Objects" in The R Inferno. For each row added, it makes a complete copy of all rows in old, which works but starts to slow down a lot. It is far better to produce a list of these new rows and then rbind them at one time; e.g., out <- list(); for (...) { out <- c(out, list(newrow)); }; alldat <- do.call(rbind, out);. Commented Aug 16, 2023 at 14:16
  • 5
    Why do you have R code stored as character strings? That is your main performance issue. Parsing code is sloooooow. Commented Aug 16, 2023 at 14:21
  • 2
    If you must use these conditions in a string, you should parse once cond <- parse(text = paste(condition, collapse = ";")) and then do eval(cond[[i]]) in the loop. However, we most likely have an xy problem here and the code in the character strings and these loops are not needed at all. I suspect, your first step should be a join of the data.frames. Alternatively, this looks like a distance calculation and you should use a dedicated function for those. An actual reproducible example, which includes representative output would be useful. Commented Aug 16, 2023 at 14:32
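
A minimal sketch combining two of the suggestions in the comments above: parse the condition strings once, and collect matching rows in a list that is bound together at the end. It keeps the nested loops, so it is unlikely to be fast enough for the full data; it only removes the repeated parsing and the repeated rbind. The threshold `min_score` and the use of `>=` are assumptions standing in for the minimum score the question plans to read with readline().

```r
# Example data and conditions from the question
df1 <- data.frame(
  id = c(1, 2, 3),
  name = c("John", "Jane", "Jim"),
  age = c(25, 30, 35)
)
df2 <- data.frame(
  id_2 = c(4, 5, 3, 9),
  name_2 = c("Johny", "Janey", "Jim", "Gar"),
  age_2 = c(26, 31, 35, NA)
)
condition <- c(
  "df1$id[i]==df2$id_2[j]",
  "df1$name[i]==df2$name_2[j]",
  "df1$age[i]==df2$age_2[j] && !is.na(df2$age_2[j])"
)

# Parse all conditions once, outside the loops
cond <- parse(text = paste(condition, collapse = ";"))

# Hypothetical threshold: keep pairs that meet at least this many conditions
min_score <- 3

out <- list()
for (i in seq_len(nrow(df1))) {
  for (j in seq_len(nrow(df2))) {
    score <- 0
    for (k in seq_along(cond)) {
      # isTRUE() treats an NA comparison as "condition not met"
      if (isTRUE(eval(cond[[k]]))) {
        score <- score + 1
      }
    }
    if (score >= min_score) {
      out[[length(out) + 1]] <- data.frame(
        id = df1$id[i], id_2 = df2$id_2[j], score = score
      )
    }
  }
}

# Bind all collected rows in one call (gives NULL if nothing matched)
df3 <- do.call(rbind, out)
df3
```

On the example data this leaves one row in df3 (id 3, id_2 3, score 3). Lowering `min_score` to 1 or 2 returns the same row here, because no other pair satisfies any condition.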