I have two data frames:
> head(k)
V1
1 1814338070
2 1199215279
3 1283239083
4 1201972527
5 404900682
6 3093614019
> head(g)
start end state value
1 16777216 16777471 queensland 15169
2 16777472 16778239 fujian 0
3 16778240 16779263 victoria 56203
4 16779264 16781311 guangdong 0
5 16781312 16781823 tokyo 0
6 16781824 16782335 aichi 0
> dim(k)
[1] 624979 1
> dim(g)
[1] 5510305 4
I want to check each value of k$V1 against g: if it falls between the start and end of some row of g, return that row's state and value.
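For a single number, the lookup I have in mind looks like this (16777500 is just a made-up test value that happens to fall inside the second interval of g shown above, not a value from k):

x <- 16777500
g[g$start <= x & g$end >= x, c("state", "value")]
#    state value
# 2 fujian     0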
The problem is the size of the two data frames: doing the match and returning my desired values takes about 5 hours on my computer. I've tried the following, but I can't make use of all the cores on my machine and can't even get it to work correctly:
return_first_match_position <- function(int, start, end) {
  match <- which(int >= start & int <= end)
  if (length(match) > 0) {
    match[1]
  } else {
    NA_integer_  # no interval contains this value; keep result the same length as k$V1
  }
}

library(parallel)
cl <- makeCluster(detectCores())
clusterExport(cl, c("return_first_match_position", "g"))  # make these visible on the workers
matches <- parSapply(cl, k$V1,
                     function(int) return_first_match_position(int, g$start, g$end))
stopCluster(cl)
The desired output is, for every number from k that matches a row of g, the state and value of that row, plus the percentage of times each state/value combination shows up across all the matches.
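Assuming matches ends up holding, for each element of k$V1, the row index of g it falls into (or NA), the kind of summary I'm after would be roughly:

hits <- g[matches[!is.na(matches)], c("state", "value")]          # state/value of each matched row
prop.table(table(state = hits$state, value = hits$value)) * 100   # % per state/value combination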
I was wondering whether there is an intelligent way of doing parallel processing in R for this. And can anyone suggest any sources on how I can learn to write better R functions?