0

I have one data frame (Table_A) with 3.4 million rows and 33 columns. I have another data frame with 384 rows and 3 columns (Table_B). (This is for one participant, I should have 40 at the end)

Table_A

Col1
100
143
178
245
265

Table_B

start stop name
101 144 Name1
154 254 Name2

What I want to do is subset Table A by Col1, by start and stop columns in Table B and give each subset row a name. To return

Table_A adapted

Col1 name
143 Name1
178 Name2
245 Name2

I have tried

df_sub <- subset(Table_A, (Col1 >= (Table_B$start)) & (Col1 <= (Table_B$stop))```
names <- Table_B$name[(Table_A$Col1 >= (Table_B$start)) & (Table_A$Col1 <= (Table_B$Col2))]

df_out <- cbind(df_sub, names)

However, df_sub seems to only return one/two rows per subset and there should be ~187 in half (192) and ~375 in the other half. Whereas names returns 2million + rows.

I tried

(Table_A$Col1 >= (Table_B$Col1)) & (Table_A$Col1 <= (Table_B$Col2))

and this returns a list of False up to 384 and NA after

2 Answers 2

1

this is a job for a data.table non-equi join

library(data.table)
# if tablea and tableb are not already data.table
setDT(tablea);setDT(tableb)

# non-equi join
tablea[tableb, name := i.name, on = .(Col1 >= start, Col1 <= stop)]
#    Col1  name
# 1:  100  <NA>
# 2:  143 Name1
# 3:  178 Name2
# 4:  245 Name2
# 5:  265  <NA>

sample data

tablea <- fread("Col1
                100
                143
                178
                245
                265")
tableb <- fread("start  stop    name
                101     144     Name1
                154     254     Name2")
Sign up to request clarification or add additional context in comments.

Comments

0

Example using a for loop.

Say we have the following dataframes

#dt, corresponds to table a
dt <- structure(list(col1 = c(100, 143, 178, 245, 265)), class = "data.frame", row.names = c(NA, 
-5L))

#dt2, corresponds to table b
structure(list(start = c(101, 154), stop = c(144, 254), name = c("Name1", 
"Name2")), class = "data.frame", row.names = c(NA, -2L))

We could loop through the rows of table B and add the names to table A if col1 falls within the range. Using data.table:

setDT(dt) #table A to data.table
setDT(dt2) #table B to data.table

#add column 'name'
dt[, name := ""]

#loop through table B
for(i in 1:nrow(dt2)){
  dt[col1 >= dt2[i,]$start & col1 <= dt2[i,]$stop, name := paste0(name, dt2[i,]$name)]
}

Output

dt
  col1  name
1:  100      
2:  143 Name1
3:  178 Name2
4:  245 Name2
5:  265 

Note I used a paste0 when adding the name. Otherwise, in the loop the names would be overwritten if the range in table B is not unique. Now, if the values fall within two ranges in table B, the name column will get two names pasted after eachother.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.