-2

There is an error when i try to select data from large dataset(d1) based on small dataset(d2). Below is my script and the problem.

**d1 <- read.table("MSv25.txt",header=T)

d2 <- read.table("Flairall.Gene.txt",header=T)

d2$low <- d2$start-10000 ; d2$high <- d2$end+10000

d1$matched <- apply(d1,1,function(p) which(p['POS'] >=d2[,'low'] & p['POS'] <= d2[,'high'] & p['CHR']==d2[,"chromosome"]))

d3 <- cbind(d1[which(d1$matched >0),], d2[unlist(d1$matched[which(d1$matched>0)]),])

write.table(d3,file="Flairall.GOBSgene.txt",quote=FALSE,sep="\t",row.names=FALSE,col.names=TRUE)**

The d1 is like this:

SNP CHR POS A1  A2  OR  P
rs1000007   2   237416793   C   T   0.9785  0.4868
rs1000003   3   99825597    G   A   0.9091  0.009774
rs1000002   3   185118462   C   T   1.0111  0.6765
rs10000012  4   1347325 G   C   1.0045  0.9087
rs10000042  4   5288053 T   C   1.0622  0.3921
rs10000062  4   5305645 G   C   1.0116  0.779
rs10000132  4   7450570 T   C   0.9734  0.4748
rs10000081  4   16957461    C   T   1.0288  0.3585
rs10000100  4   19119591    A   G   1.0839  0.1417
rs10000010  4   21227772    C   T   0.971   0.2693
rs10000092  4   21504615    C   T   1.0342  0.27
rs10000068  4   36600682    T   C   1.055   0.103
rs10000013  4   36901464    C   A   1.0198  0.5379
rs10000037  4   38600725    A   G   1.0249  0.4217
rs10000017  4   84997149    T   C   0.9576  0.1912
rs10000109  4   91586292    A   T   0.9956  0.9349
rs10000023  4   95952929    T   G   0.9998  0.9951
rs10000030  4   103593179   A   G   1.0839  0.04208
rs10000111  4   107137517   A   G   1.0812  0.3128
rs10000124  4   109325900   A   C   1.0642  0.1906
rs10000064  4   128029071   C   T   1.0388  0.1578
rs10000029  4   138905074   C   T   0.7832  0.14
rs10000036  4   139438712   T   C   0.9848  0.5683
rs10000033  4   139819348   C   T   0.9918  0.7542
rs10000121  4   157793485   A   G   1.0008  0.9769
rs10000041  4   165841405   G   T   1.0042  0.9146
rs10000082  4   167529642   T   C   0.9733  0.6612

The d2 is like this:

gene    start   end chromosome
WFDC9   237416000   237418000   2
SRGAP3  19119590    21504615    4

In general, i want to select the SNPs within the genes by extending a window of 10kb at the start and end position .

Here is my script results:

   SNP CHR       POS A1 A2     OR      P matched  gene     start       end chromosome       low      high
    1 rs1000007   2 237416793  C  T 0.9785 0.4868       1 WFDC9 237416000 237418000          2 237406000 237428000

Which is not correct, as the there is one gene missing. The correct one should be:

gene    start   end chromosome  SNP CHR POS A1  A2  OR  P
WFDC9   237416000   237418000   2   rs1000007   2   237416793   C   T   0.9785  0.4868
SRGAP3  19119590    21504615    4   rs10000100  4   19119591    A   G   1.0839  0.1417
SRGAP3  19119590    21504615    4   rs10000010  4   21227772    C   T   0.971   0.2693
SRGAP3  19119590    21504615    4   rs10000092  4   21504615    C   T   1.0342  0.27

Could anyone help me to point out what is wrong.... Many thanks...

1

1 Answer 1

2

Your code seems to be completely backwards to what you're trying to achieve:

"For each gene (in d2) which SNPs (from d1) are within 10kb of that gene?"

First of all, your code for d1$matched is backwards. All your p's and d2s should be the other way round (currently it doesn't make much sense?), giving you a list of SNPs whom are in cis with each gene (+/- 10kb).

I would approach it the way i've phrased your question:

cisWindow <- 10000 # size of your +/- window, in this case 10kb.
d3 <- data.frame()
# For each gene, locate the cis-SNPs
for (i in 1:nrow(d2)) {
  # Broken down into steps for readability.
  inCis <- d1[which(d1[,"CHR"] == d2[i, "chromosome"]),]
  inCis <- inCis[which(inCis[,"POS"] >= (d2[i, "start"] - cisWindow)),]
  inCis <- inCis[which(inCis[,"POS"] <= (d2[i, "end"] + cisWindow)),]
  # Now we have the cis-SNPs, so lets build the data.frame for this gene,
  # and grow our data.frame d3:
  if (nrow(inCis) > 0) {
    d3 <- rbind(d3, cbind(d2[i,], inCis))
  }
}

I tried to find a solution which didn't involve growing d3 in the loop, but because you're attaching each row of d2 to 0 or more rows from d1 I wasn't able to come up with a solution that's not horribly inefficient.

Sign up to request clarification or add additional context in comments.

4 Comments

I've fixed the code so that it won't crash for genes who have 0 cis-SNPs.
@user2669497, this may do the trick, but depending on the number of SNPs, genes and your genome, this may be (very) inefficient. The algorithm you're looking for is Interval tree. The package you should be using is GenomicRanges.
Huh, I didn't know about that package! I spawned a new discussion due to my frustration formulating an answer here: stackoverflow.com/questions/18840410/…
An sqldf query was able to give me all cis-SNPs (+/- 10Kb range) for 35k genes and 2 million markers in just under 30 minutes.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.