I'm trying to do an automatic code, I have build a loop to match people in my 2 databases. here an example of thoses two database:
some library might be needed so here all I used:
library(readr)
library(stringi)
library(stringr)
library(dplyr)
library(tidyr)
library(readr)
library(rlang)
library(Hmisc)
library(sqldf)
library(tcltk)
library(tcltk2)
library(gWidgets2)
library(gWidgets2tcltk)
df1 <- data.frame(
id = c(1, 2, 3),
name = c("John", "Jane", "Jim"),
age = c(25, 30, 35)
)
df2 <- data.frame(
id_2 = c(4, 5, 3,9),
name_2 = c("Johny", "Janey", "Jim","Gar"),
age_2 = c(26, 31, 35,NA)
)
and I have some condition in string that I got by readline():
#condtion I got by readline()
condition =c("df1$id[i]==df2$id_2[j]","df1$name[i]==df2$name_2[j]","df1$age[i]==df2$age_2[j] && !is.na(df2$age_2[j])")
I wanna have a loop that apply a score for every condition respected and put people with a minimal score of similarity in a new dataframe.
here what I tried but it take so much time to execute:
#new dataframe that assign similarity score between people
df3 <- data.frame(id=integer(), id_2=integer(), score=integer())
#only way I know to use string condition in if statement
start_time <- Sys.time()
# loop to search on df1 and df2
for (i in 1:nrow(df1)){
for (j in 1:nrow(df2)){
score = 0
# Vérifie condition
for (k in 1:length(condition)){
if(eval(parse(text = condition[k]))){
score = score + 1
}
}
#add to new dataframe if condition verified
if(score == (length(condition))/2){
df3 <- rbind(df3, data.frame(id=df1[i, 1], id_2=df2[j, 1], score=score))
}
}
}
#system to verify time spend
end_time <- Sys.time()
elapsed_time <- difftime(end_time, start_time, units = "secs")
elapsed_time
View(df3)
#the result of df3:
id id_2 score
3 3 3
#then just need to merge by ID and id_2 to get all other information
#what you should get after merging
id id_2 score name age name_2 age_2
3 3 3 "Jim" 35 "Jim" 35
I need to repoduce this for 2 dataframe of 683870 * 3681
I already did a sql code that is almost instant but there no score so some client with some small difference are left out. Im also gonna add a readline for the minimal score intended later.
edit: I got the condition by readline() since I try to make the code accesible for not regular R user:
condition = readline("what data you wanna compare exemple: [data1==data2] ")
for the if(score == (length(condition))/2) I know its not useful here but its just for the exemple I will add a readline() for the score too
df3. Your test for adding a row todf2isif(score == (length(condition))/2). However,conditionis length 3, so you are testing to see whetherscore == 1.5, which it can't, because it starts at0and you only ever increment by1. Can you explain in words what you're trying to do as well as posting expected output?rbind(old, newrow)works in practice but scales horribly, see "Growing Objects" in The R Inferno. For each row added, it makes a complete copy of all rows inold, which works but starts to slow down a lot. It is far better to produce a list of these new rows and thenrbindthem at one time; e.g.,out <- list(); for (...) { out <- c(out, list(newrow)); }; alldat <- do.call(rbind, out);.cond <- parse(text = paste(condition, collapse = ";"))and then doeval(cond[[i]])in the loop. However, we most likely have an xy problem here and the code in the character strings and these loops are not needed at all. I suspect, your first step should be a join of the data.frames. Alternatively, this looks like a distance calculation and you should use a dedicated function for those. An actual reproducible example, which includes representative output would be useful.