I have the following portion of my dataset:
structure(list(domain = c("A1BG_-_-_0", "A1BG_-_-_1", "A1BG_-_-_2",
"A1BG_-_-_3", "A1BG_-_-_4", "A1BG_143228_143228_0", "A1BG_143228_143228_1",
"A1BG_143228_143228_2", "A1BG_143228_143228_3", "A1CF_-_-_0"),
chr = c("19", "19", "19", "19", "19", "19", "19", "19", "19",
"10"), positions = c("(58858387..58858395,58858718..58858719)",
"(58858998..58859006,58861735..58862017,58862756..58862766)",
"(58863018..58863053,58863648..58863673)", "(58863913..58863921,58864293..58864303)",
"(58864552..58864563,58864657..58864693,58864769..58864803)",
"(58858719..58858998)", "(58862766..58863018)", "(58863673..58863913)",
"(58864303..58864552)", "(52566488..52566640,52569653..52569717)"
), length = c(11L, 303L, 62L, 20L, 84L, 280L, 253L, 241L,
250L, 218L)), class = "data.frame", row.names = c(NA, -10L
))
The positions column holds one or more start..stop ranges, separated by commas.
Additionally, I have a dataset of locations (a portion is shown):
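To make the format concrete, here is a minimal sketch of how one such string might be split into numeric bounds. The helper name parse_positions is hypothetical, and the sketch assumes every string looks like the examples above: parentheses around comma-separated start..stop pairs.

```r
# Hypothetical helper: turn one positions string into a start/stop matrix.
parse_positions <- function(x) {
  x <- gsub("[()]", "", x)                        # drop the surrounding parentheses
  ranges <- strsplit(x, ",", fixed = TRUE)[[1]]   # one element per start..stop pair
  bounds <- strsplit(ranges, "..", fixed = TRUE)  # split each pair on the literal ".."
  matrix(as.numeric(unlist(bounds)), ncol = 2, byrow = TRUE,
         dimnames = list(NULL, c("start", "stop")))
}

# Returns a two-row matrix, one row per range.
parse_positions("(58863018..58863053,58863648..58863673)")
```

Passing fixed = TRUE matters here because an unescaped "." in a regular expression matches any character.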
structure(list(VarID = 1:9, chr = c(19L, 19L, 19L, 19L, 19L,
19L, 19L, 19L, 10L), position = c(58864801, 58863673, 58863673, 58863673,
58863673, 58863673, 58863673, 58863041, 52569689)), class = "data.frame", row.names = c(NA,
-9L))
I would like to add a column to the second dataset that specifies the domain to which each VarID belongs.
My desired output is:
structure(list(VarID = 1:9, chr = c(19L, 19L, 19L, 19L, 19L,
19L, 19L, 19L, 10L), position = c(58864801, 58863673, 58863673,
58863673, 58863673, 58863673, 58863673, 58863041, 52569689),
domain = c("A1BG_-_-_4", "A1BG_-_-_2", "A1BG_-_-_2", "A1BG_-_-_2",
"A1BG_-_-_2", "A1BG_-_-_2", "A1BG_-_-_2", "A1BG_-_-_2", "A1CF_-_-_0"
)), row.names = c(NA, -9L), class = "data.frame")
Specifically, I'm having trouble getting a gsub call to work, which I need in order to check whether a position falls within a start..stop range.
The chr column is common to both datasets.
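For reference, the lookup I am after could be sketched in base R roughly as follows. The names domains, vars, and ranges are placeholders I chose for illustration. The sketch assumes that range bounds are inclusive and that, when a position falls in more than one domain, the first matching row wins. With inclusive bounds, 58863673 also falls inside A1BG_143228_143228_2, so that tie-break is what reproduces the desired output.

```r
# Assumed names: `domains` and `vars` stand in for the two datasets above.
domains <- data.frame(
  domain = c("A1BG_-_-_0", "A1BG_-_-_1", "A1BG_-_-_2", "A1BG_-_-_3",
             "A1BG_-_-_4", "A1BG_143228_143228_0", "A1BG_143228_143228_1",
             "A1BG_143228_143228_2", "A1BG_143228_143228_3", "A1CF_-_-_0"),
  chr = c(rep("19", 9), "10"),
  positions = c(
    "(58858387..58858395,58858718..58858719)",
    "(58858998..58859006,58861735..58862017,58862756..58862766)",
    "(58863018..58863053,58863648..58863673)",
    "(58863913..58863921,58864293..58864303)",
    "(58864552..58864563,58864657..58864693,58864769..58864803)",
    "(58858719..58858998)", "(58862766..58863018)", "(58863673..58863913)",
    "(58864303..58864552)", "(52566488..52566640,52569653..52569717)"),
  stringsAsFactors = FALSE
)
vars <- data.frame(
  VarID = 1:9,
  chr = c(rep(19L, 8), 10L),
  position = c(58864801, rep(58863673, 6), 58863041, 52569689)
)

# Build one row per start..stop range, carrying its domain and chromosome.
clean <- gsub("[()]", "", domains$positions)    # strip the parentheses
pairs <- strsplit(clean, ",", fixed = TRUE)     # list of "start..stop" strings
n <- lengths(pairs)                             # number of ranges per domain
bounds <- do.call(rbind, strsplit(unlist(pairs), "..", fixed = TRUE))
ranges <- data.frame(
  domain = rep(domains$domain, n),
  chr = rep(domains$chr, n),
  start = as.numeric(bounds[, 1]),
  stop = as.numeric(bounds[, 2]),
  stringsAsFactors = FALSE
)

# For each variant, take the first range on the same chromosome that contains it.
vars$domain <- vapply(seq_len(nrow(vars)), function(i) {
  hit <- ranges$chr == as.character(vars$chr[i]) &
    ranges$start <= vars$position[i] &
    vars$position[i] <= ranges$stop
  if (any(hit)) ranges$domain[which(hit)[1]] else NA_character_
}, character(1))

vars
```

Positions that match no range get NA in this sketch. For a much larger dataset, an interval join would likely scale better than this row-by-row loop.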