Parallel Processing in R using "parallel" package

Question

I have two data frames:

> head(k)
          V1
1 1814338070
2 1199215279
3 1283239083
4 1201972527
5  404900682
6 3093614019

> head(g)
  start    end      state      value
1 16777216 16777471 queensland 15169
2 16777472 16778239     fujian     0
3 16778240 16779263   victoria 56203
4 16779264 16781311  guangdong     0
5 16781312 16781823      tokyo     0
6 16781824 16782335      aichi     0

> dim(k)
[1] 624979      1
> dim(g)
[1] 5510305       4

I want to check each value in data.frame(k) against the ranges in data.frame(g): if a value falls between start and end of a row in g, return the state and value from that row.

The problem is the size of the two data frames: matching and returning the desired values takes 5 hours on my computer. I tried the following, but I can't get it to use all the cores on my computer, and it doesn't even work correctly:

return_first_match_position <- function(int, start,end) {
  match = which(int >= start & int <= end)
  if(length(match) > 0){
    return(match[1])
  }
  else {
    return(match)
  }
}

library(parallel)
cl = makeCluster(detectCores())
matches = Vectorize(return_first_match_position, 'int')(k$V1,g$start, g$end)
p = parSapply(cl, Vectorize(return_first_match_position, 'int')(k$V1,g$start, g$end), return_first_match_position)
stopCluster(cl)

The desired output is the percentage of times each state and value show up across all matches of the numbers from data.frame(k) in data.frame(g).

Is there an intelligent way of doing parallel processing in R? And can anyone suggest sources for learning to write better functions in R?

Can you give an example of the desired output? I believe I have a solution for you, but I want to make sure I know exactly what you're looking for.
– James Tobin, Commented Mar 21, 2014 at 17:30
For example, if the value 1814338070 in data.frame(k) falls within the range 16777472-16778239 in data.frame(g), the desired output is %state and %value. Just an FYI: state and value in data.frame(g) are factors.
– user3006691, Commented Mar 21, 2014 at 17:33
So for each value in k, you are looking for the state and value where start<k<end? So your desired output will also be length(k)?
– James Tobin, Commented Mar 21, 2014 at 17:53
The desired output is the % number of times state and value show up for every match of the number from data.frame(k) in data.frame(g). I apologize, I should have made this clear; I'm editing my question on this.
– user3006691, Commented Mar 21, 2014 at 18:02
Can you please dput() your 'g' and 'k' and then show a sample of what the desired output would look like? That would help A LOT.
– James Tobin, Commented Mar 21, 2014 at 18:14

Roland · Accepted Answer · 2014-03-21 17:32:37Z

2

I think you want to do a rolling join. This can be done very efficiently with data.table:

DF1 <- data.frame(V1=c(1.5, 2, 0.3, 1.7, 0.5))
DF2 <- data.frame(start=0:3, end=0.9:3.9, 
                  state=c("queensland", "fujian", "victoria", "guangdong"),
                  value=1:4)

library(data.table)
DT1 <- data.table(DF1, key="V1")
DT1[, pos:=V1]
#    V1 pos
#1: 0.3 0.3
#2: 0.5 0.5
#3: 1.5 1.5
#4: 1.7 1.7
#5: 2.0 2.0
DT2 <- data.table(DF2, key="start")
#   start end      state value
#1:     0 0.9 queensland     1
#2:     1 1.9     fujian     2
#3:     2 2.9   victoria     3
#4:     3 3.9  guangdong     4

DT2[DT1, roll=TRUE]
#   start end      state value pos
#1:     0 0.9 queensland     1 0.3
#2:     0 0.9 queensland     1 0.5
#3:     1 1.9     fujian     2 1.5
#4:     1 1.9     fujian     2 1.7
#5:     2 2.9   victoria     3 2.0
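A rolling join matches each V1 to the nearest preceding start and never looks at end, so a value sitting in a gap between two ranges would still be matched. As a rough base R sketch of the same lookup with an explicit end check, assuming the ranges are sorted by start and do not overlap (the extra value 0.95 and the names idx and result are introduced here only for illustration):

```r
# Toy data from the answer, plus 0.95, which falls in the gap between 0.9 and 1.
DF1 <- data.frame(V1 = c(1.5, 2, 0.3, 1.7, 0.5, 0.95))
DF2 <- data.frame(start = 0:3, end = 0.9:3.9,
                  state = c("queensland", "fujian", "victoria", "guangdong"),
                  value = 1:4)

# Assumption: DF2 is sorted by start and its ranges do not overlap.
# findInterval() returns the row of the last start <= V1 (0 if there is none).
idx <- findInterval(DF1$V1, DF2$start)
idx[idx == 0] <- NA

# Drop matches where V1 lies beyond the end of the matched range.
idx[!is.na(idx) & DF1$V1 > DF2$end[idx]] <- NA

result <- data.frame(V1 = DF1$V1,
                     state = DF2$state[idx],
                     value = DF2$value[idx])

# Share of matched values per state (unmatched rows are NA and left out).
prop.table(table(result$state))
```

findInterval is vectorised and uses a binary search over the sorted starts, so it avoids comparing every value in k against every row of g.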


5 Comments

user3006691 Over a year ago

I tried your method and got stuck as well: > df2 <- data.table(g, key="start") Error in forder(x, cols, sort = TRUE, retGrp = FALSE) : Internal error: isort passed all-NA. isorted should have caught this before this point > df2 <- data.table(g, key="start",na.rm=T) Error in chmatch("data.frame", tt) : Internal error: savetl_init checks failed (0 100 0x19db2830 0x3f3aeb0). Please report to datatable-help.

Roland Over a year ago

Well, obviously you don't give enough information. What happens if you do df2 <- data.table(g); setkey(df2, start)?

user3006691 Over a year ago

df2 <- data.table(g) Error in chmatch("data.frame", tt) : Internal error: savetl_init checks failed (0 100 0x19db2830 0x3f3aeb0). Please report to datatable-help.

Roland Over a year ago

Either there is something fishy with your data or you found a bug in data.table. Can't tell without a reproducible example. If you can create one you should report this to the data.table maintainers.

Matt Dowle Over a year ago

+1. @user3006691 As it happens, those errors look familiar and are fixed in v1.9.3 available from R-Forge. If upgrading doesn't fix it, yes we'll need to see a reproducible example.

James Tobin · Accepted Answer · 2014-03-21 19:00:56Z

1

So instead of editing my last answer a lot (pretty much making a new one), is this what you want? I noticed that your end is always 1 before the next row's start, so what you want (I think) is to find out how many values fall within each interval and give that interval the state and value for that range. So:

set.seed(123)
c1=seq(1,25,4)
c2=seq(4,30,4)
c3=letters[1:7]
c4=sample(seq(1,7),7)
c.all=cbind(c1,c2,c3,c4)

> c.all  ### example data.frame that looks similar to yours
     c1   c2   c3  c4 
[1,] "1"  "4"  "a" "3"
[2,] "5"  "8"  "b" "7"
[3,] "9"  "12" "c" "2"
[4,] "13" "16" "d" "1"
[5,] "17" "20" "e" "6"
[6,] "21" "24" "f" "5"
[7,] "25" "28" "g" "4"

k1 <- sample(seq(1,18),20,replace=T)

k1
 [1]  2  1 15 14  4 15  3 17 18  1  4  3 16 15  2  4  8 11  7 16

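# c.all is a character matrix, so convert c2 to numeric before taking the max;
# the +1 lets the last interval include its end value, because right=F
fallsin <- cut(k1,  c(as.numeric(c.all[,1]), max(as.numeric(c.all[,2])) + 1), labels=paste(c.all[,3],  c.all[,4],sep=':'), right=F)

fallsin
 [1] a:3 a:3 d:1 d:1 a:3 d:1 a:3 e:6 e:6 a:3 a:3 a:3 d:1 d:1 a:3 a:3 b:7 c:2 b:7 d:1
Levels: a:3 b:7 c:2 d:1 e:6 f:5 g:4
prop.table(table(fallsin))

 a:3  b:7  c:2  d:1  e:6  f:5  g:4 
0.45 0.10 0.05 0.30 0.10 0.00 0.00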

where the column names are the 'state:value' labels and the numbers are the proportion of k1 that falls within the range of that label
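A sketch of the same per-interval tally on a data frame instead of a cbind matrix, so that start and end stay numeric (cbind on mixed numeric and character columns produces a character matrix). The names g_small and k_small are hypothetical stand-ins for g and k, with the k1 values from above typed in directly instead of sampled:

```r
# Hypothetical stand-in for g: sorted, contiguous ranges with numeric bounds.
g_small <- data.frame(start = seq(1, 25, 4),
                      end   = seq(4, 28, 4),
                      state = letters[1:7],
                      value = c(3, 7, 2, 1, 6, 5, 4))

# Hypothetical stand-in for k: the k1 values shown above.
k_small <- c(2, 1, 15, 14, 4, 15, 3, 17, 18, 1, 4, 3, 16, 15, 2, 4, 8, 11, 7, 16)

# Row of the last start <= k (0 if k lies below every start).
idx <- findInterval(k_small, g_small$start)

# Keep only values that also respect the end of the matched range.
inside <- idx > 0 & k_small <= g_small$end[pmax(idx, 1)]

# Count hits per row of g_small, including rows with zero hits.
counts <- tabulate(idx[inside], nbins = nrow(g_small))
g_small$share <- counts / length(k_small)
g_small
```

Keeping the result as a column of g_small leaves state and value as separate columns, which may be easier to work with than a pasted 'state:value' label.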


2 Comments

user3006691 Over a year ago

Yes, this is exactly what I was trying to explain (but failed amazingly). It's just that I am unable to do this fast enough for the two data.frames k and g, which are huge.

James Tobin Over a year ago

I'm not as good with data.table yet, but I think doing cut (or something similar) with data.table will work better because of sorting and indexing, so I'd check out 'data.table' and look for a similar function (close to what Roland suggested).
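On the "parallel" part of the question: a minimal sketch of how the lookup could be spread over workers with parLapply, by splitting k into one chunk per worker and running a vectorised lookup on each chunk. The data, the helper lookup, and the choice of 2 workers are assumptions made for illustration, not code from the thread:

```r
library(parallel)

# Hypothetical stand-ins for g (sorted, non-overlapping ranges) and k.
g_small <- data.frame(start = c(10, 20, 30),
                      end   = c(19, 29, 39),
                      state = c("a", "b", "c"))
k_small <- c(12, 25, 31, 5, 38, 22)

# Hypothetical helper: row of `ranges` containing each x, or NA if none.
lookup <- function(x, ranges) {
  idx <- findInterval(x, ranges$start)
  idx[idx == 0] <- NA
  idx[!is.na(idx) & x > ranges$end[idx]] <- NA
  idx
}

cl <- makeCluster(2)  # assumption: 2 workers; detectCores() is an alternative

# One chunk of k per worker, in the original order.
k_chunks <- lapply(splitIndices(length(k_small), length(cl)),
                   function(i) k_small[i])

# Each worker runs the vectorised lookup on its chunk; `ranges` is copied to it.
pieces <- parLapply(cl, k_chunks, lookup, ranges = g_small)
stopCluster(cl)

idx <- unlist(pieces)
g_small$state[idx]
```

Because findInterval is already vectorised, a single call on all of k may well be fast enough without a cluster; copying g to every worker has a cost of its own, so it is worth timing both variants.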
