
I have two data frames:

> head(k)
          V1
1 1814338070
2 1199215279
3 1283239083
4 1201972527
5  404900682
6 3093614019

> head(g)
  start    end      state      value
1 16777216 16777471 queensland 15169
2 16777472 16778239     fujian     0
3 16778240 16779263   victoria 56203
4 16779264 16781311  guangdong     0
5 16781312 16781823      tokyo     0
6 16781824 16782335      aichi     0

> dim(k)
[1] 624979      1
> dim(g)
[1] 5510305       4

I want to check each value in k against the ranges in g: if a value falls between start and end in some row of g, I want to return the state and value from that row.

The problem is the size of the two data frames: doing the match and returning the desired values takes 5 hours on my computer. I've tried the following approach, but I can't get it to use all the cores on my machine, or even to work correctly:

return_first_match_position <- function(int, start,end) {
  match = which(int >= start & int <= end)
  if(length(match) > 0){
    return(match[1])
  }
  else {
    return(match)
  }
}

library(parallel)
cl = makeCluster(detectCores())
matches = Vectorize(return_first_match_position, 'int')(k$V1,g$start, g$end)
p = parSapply(cl, Vectorize(return_first_match_position, 'int')(k$V1,g$start, g$end), return_first_match_position)
stopCluster(cl)

The desired output is the percentage of times each state and value pair shows up across all matches of the numbers from k in g.

Is there an intelligent way of doing parallel processing in R? And can anyone suggest sources for learning to write better functions in R?

  • can you give an example of the desired output? i believe i have a solution for you, but wanna make sure I know exactly what you're looking for Commented Mar 21, 2014 at 17:30
  • For example if in data.frame(k) the value 1814338070 falls between the range of 16777472-16778239 in data.frame(g) desired output is %state and %value.. just an FYI state and value in data.frame(g) are factors Commented Mar 21, 2014 at 17:33
  • so for each value in k, you are looking for the state,value start<k<end ? so your desired output will also be length(k)? Commented Mar 21, 2014 at 17:53
  • desired output is the % of times each state and value show up across all matches of the numbers from data.frame(k) in data.frame(g). I apologize, I should have made this clear; I am editing my question to say so Commented Mar 21, 2014 at 18:02
  • can you please dput() your 'g' and 'k' and then show a sample of what the desired output would look like? that would help A LOT Commented Mar 21, 2014 at 18:14

2 Answers


I think you want to do a rolling join. This can be done very efficiently with data.table:

DF1 <- data.frame(V1=c(1.5, 2, 0.3, 1.7, 0.5))
DF2 <- data.frame(start=0:3, end=0.9:3.9, 
                  state=c("queensland", "fujian", "victoria", "guangdong"),
                  value=1:4)

library(data.table)
DT1 <- data.table(DF1, key="V1")
DT1[, pos:=V1]
#    V1 pos
#1: 0.3 0.3
#2: 0.5 0.5
#3: 1.5 1.5
#4: 1.7 1.7
#5: 2.0 2.0
DT2 <- data.table(DF2, key="start")
#   start end      state value
#1:     0 0.9 queensland     1
#2:     1 1.9     fujian     2
#3:     2 2.9   victoria     3
#4:     3 3.9  guangdong     4

DT2[DT1, roll=TRUE]
#   start end      state value pos
#1:     0 0.9 queensland     1 0.3
#2:     0 0.9 queensland     1 0.5
#3:     1 1.9     fujian     2 1.5
#4:     1 1.9     fujian     2 1.7
#5:     2 2.9   victoria     3 2.0
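
A rolling join only checks the lower bound (start), so it can still return a row when the value lies past that row's end. As a minimal sketch of how the join could be turned into the percentages the question asks for, the following uses hypothetical toy tables (k_dt, g_dt) and a made-up value 0.95 that falls in a gap; it assumes a data.table version that supports the on= argument:

```r
library(data.table)

# Hypothetical toy data; 0.95 deliberately falls in the gap between 0.9 and 1
k_dt <- data.table(V1 = c(1.5, 2, 0.3, 1.7, 0.5, 0.95))
g_dt <- data.table(start = c(0, 1, 2, 3),
                   end   = c(0.9, 1.9, 2.9, 3.9),
                   state = c("queensland", "fujian", "victoria", "guangdong"),
                   value = 1:4)

k_dt[, pos := V1]       # keep a copy of the looked-up value
setkey(g_dt, start)     # sort g by the lower bound

# Roll each V1 back to the last start that is <= V1
joined <- g_dt[k_dt, on = .(start = V1), roll = TRUE]

# Keep only rows where the value also respects the upper bound
hits <- joined[!is.na(end) & pos <= end]

# Percentage of matches per (state, value) pair
share <- hits[, .N, by = .(state, value)][, pct := 100 * N / sum(N)][]
print(share)
```

Whether unmatched values should count toward the denominator is a design choice; here they are dropped before the percentages are computed.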

5 Comments

I tried your method and got stuck as well ..... > df2 <- data.table(g, key="start") Error in forder(x, cols, sort = TRUE, retGrp = FALSE) : Internal error: isort passed all-NA. isorted should have caught this before this point > df2 <- data.table(g, key="start",na.rm=T) Error in chmatch("data.frame", tt) : Internal error: savetl_init checks failed (0 100 0x19db2830 0x3f3aeb0). Please report to datatable-help.
Well, obviously you don't give enough information. What happens if you do df2 <- data.table(g); setkey(df2, start)?
df2 <- data.table(g) Error in chmatch("data.frame", tt) : Internal error: savetl_init checks failed (0 100 0x19db2830 0x3f3aeb0). Please report to datatable-help.
Either there is something fishy with your data or you found a bug in data.table. Can't tell without a reproducible example. If you can create one you should report this to the data.table maintainers.
+1. @user3006691 As it happens, those errors look familiar and are fixed in v1.9.3 available from R-Forge. If upgrading doesn't fix it, yes we'll need to see a reproducible example.

Rather than heavily editing my last answer, here is a new one. Is this what you want? I noticed that each end is always 1 less than the next row's start, so what you want (I think) is to count how many values of k fall within each interval and label each interval with its state and value. So:

set.seed(123)
c1=seq(1,25,4)
c2=seq(4,30,4)
c3=letters[1:7]
c4=sample(seq(1,7),7)
c.all=cbind(c1,c2,c3,c4)

> c.all  ### example (character) matrix that looks similar to your data.frame
     c1   c2   c3  c4 
[1,] "1"  "4"  "a" "3"
[2,] "5"  "8"  "b" "7"
[3,] "9"  "12" "c" "2"
[4,] "13" "16" "d" "1"
[5,] "17" "20" "e" "6"
[6,] "21" "24" "f" "5"
[7,] "25" "28" "g" "4"

k1 <- sample(seq(1,18),20,replace=T)

k1
 [1]  2  1 15 14  4 15  3 17 18  1  4  3 16 15  2  4  8 11  7 16

# c.all is a character matrix, so convert to numeric before taking max();
# +1 because right=F makes the upper break exclusive
fallsin <- cut(k1,  c(as.numeric(c.all[,1]), max(as.numeric(c.all[,2])) + 1), labels=paste(c.all[,3],  c.all[,4],sep=':'), right=F)

fallsin
[1] a:3 a:3 d:1 d:1 a:3 d:1 a:3 e:6 e:6 a:3 a:3 a:3 d:1 d:1 a:3 a:3 b:7 c:2 b:7 d:1
Levels: a:3 b:7 c:2 d:1 e:6 f:5 g:4
prop.table(table(fallsin))

 a:3  b:7  c:2  d:1  e:6  f:5  g:4 
0.45 0.10 0.05 0.30 0.10 0.00 0.00 

where the column names are the 'state:value' labels and the numbers are the proportion of k1 values that fall within that label's range (multiply by 100 for percentages).
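
For the full-size k and g, the same idea can be sketched in base R with findInterval, which does a binary search over sorted break points and so avoids comparing every value against every row. The toy objects g_toy and k_toy below are hypothetical, and the sketch assumes the ranges in g do not overlap:

```r
# Hypothetical toy data standing in for g and k
g_toy <- data.frame(start = c(1, 5, 9, 13),
                    end   = c(4, 8, 12, 16),
                    state = c("a", "b", "c", "d"),
                    value = c(3, 7, 2, 1),
                    stringsAsFactors = FALSE)
k_toy <- c(2, 1, 15, 14, 4, 20, 8, 11, 7, 0)

# findInterval needs the break points sorted in increasing order
g_toy <- g_toy[order(g_toy$start), ]

# Index of the last start <= each value; 0 means below the first start
idx <- findInterval(k_toy, g_toy$start)
idx[idx == 0] <- NA

# A hit must also respect the upper bound of its candidate row
hit <- !is.na(idx) & k_toy <= g_toy$end[idx]

# Proportion of hits per 'state:value' label
labels <- paste(g_toy$state[idx[hit]], g_toy$value[idx[hit]], sep = ":")
pct <- prop.table(table(labels))
print(pct)
```

Because the lookup is vectorised, this may already be fast enough without any parallel code, though that has not been measured on the original data.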

2 Comments

yes, this is exactly what I was trying to explain (but failed amazingly). It's just that I am unable to do this faster for the two data.frames k,g which are huge
I'm not as good with data.table yet, but I think doing cut (or something similar) with data.table will work better because of sorting and indexing, so I'd check out 'data.table' and look for a similar function (close to what Roland suggested)
