Data selection error [closed]

Question

Closed. This question needs details or clarity. It is not currently accepting answers.

Want to improve this question? As written, this question is lacking some of the information it needs to be answered. If the author adds details in comments, consider editing them into the question. Once there's sufficient detail to answer, vote to reopen the question.

Closed 12 years ago.

Improve this question

There is an error when i try to select data from large dataset(d1) based on small dataset(d2). Below is my script and the problem.

**d1 <- read.table("MSv25.txt",header=T)

d2 <- read.table("Flairall.Gene.txt",header=T)

d2$low <- d2$start-10000 ; d2$high <- d2$end+10000

d1$matched <- apply(d1,1,function(p) which(p['POS'] >=d2[,'low'] & p['POS'] <= d2[,'high'] & p['CHR']==d2[,"chromosome"]))

d3 <- cbind(d1[which(d1$matched >0),], d2[unlist(d1$matched[which(d1$matched>0)]),])

write.table(d3,file="Flairall.GOBSgene.txt",quote=FALSE,sep="\t",row.names=FALSE,col.names=TRUE)**

The d1 is like this:

SNP CHR POS A1  A2  OR  P
rs1000007   2   237416793   C   T   0.9785  0.4868
rs1000003   3   99825597    G   A   0.9091  0.009774
rs1000002   3   185118462   C   T   1.0111  0.6765
rs10000012  4   1347325 G   C   1.0045  0.9087
rs10000042  4   5288053 T   C   1.0622  0.3921
rs10000062  4   5305645 G   C   1.0116  0.779
rs10000132  4   7450570 T   C   0.9734  0.4748
rs10000081  4   16957461    C   T   1.0288  0.3585
rs10000100  4   19119591    A   G   1.0839  0.1417
rs10000010  4   21227772    C   T   0.971   0.2693
rs10000092  4   21504615    C   T   1.0342  0.27
rs10000068  4   36600682    T   C   1.055   0.103
rs10000013  4   36901464    C   A   1.0198  0.5379
rs10000037  4   38600725    A   G   1.0249  0.4217
rs10000017  4   84997149    T   C   0.9576  0.1912
rs10000109  4   91586292    A   T   0.9956  0.9349
rs10000023  4   95952929    T   G   0.9998  0.9951
rs10000030  4   103593179   A   G   1.0839  0.04208
rs10000111  4   107137517   A   G   1.0812  0.3128
rs10000124  4   109325900   A   C   1.0642  0.1906
rs10000064  4   128029071   C   T   1.0388  0.1578
rs10000029  4   138905074   C   T   0.7832  0.14
rs10000036  4   139438712   T   C   0.9848  0.5683
rs10000033  4   139819348   C   T   0.9918  0.7542
rs10000121  4   157793485   A   G   1.0008  0.9769
rs10000041  4   165841405   G   T   1.0042  0.9146
rs10000082  4   167529642   T   C   0.9733  0.6612

The d2 is like this:

gene    start   end chromosome
WFDC9   237416000   237418000   2
SRGAP3  19119590    21504615    4

In general, i want to select the SNPs within the genes by extending a window of 10kb at the start and end position .

Here is my script results:

   SNP CHR       POS A1 A2     OR      P matched  gene     start       end chromosome       low      high
    1 rs1000007   2 237416793  C  T 0.9785 0.4868       1 WFDC9 237416000 237418000          2 237406000 237428000

Which is not correct, as the there is one gene missing. The correct one should be:

gene    start   end chromosome  SNP CHR POS A1  A2  OR  P
WFDC9   237416000   237418000   2   rs1000007   2   237416793   C   T   0.9785  0.4868
SRGAP3  19119590    21504615    4   rs10000100  4   19119591    A   G   1.0839  0.1417
SRGAP3  19119590    21504615    4   rs10000010  4   21227772    C   T   0.971   0.2693
SRGAP3  19119590    21504615    4   rs10000092  4   21504615    C   T   1.0342  0.27

Could anyone help me to point out what is wrong.... Many thanks...

This is the third time you've asked this question: stackoverflow.com/questions/18573957/… stackoverflow.com/questions/18157323/… Why were these answers insufficient, and what have you tried since? — Scott Ritchie
– Scott Ritchie, Commented Sep 16, 2013 at 7:32

Scott Ritchie · Accepted Answer · 2013-09-17 06:36:59Z

2

Your code seems to be completely backwards to what you're trying to achieve:

"For each gene (in d2) which SNPs (from d1) are within 10kb of that gene?"

First of all, your code for d1$matched is backwards. All your p's and d2s should be the other way round (currently it doesn't make much sense?), giving you a list of SNPs whom are in cis with each gene (+/- 10kb).

I would approach it the way i've phrased your question:

cisWindow <- 10000 # size of your +/- window, in this case 10kb.
d3 <- data.frame()
# For each gene, locate the cis-SNPs
for (i in 1:nrow(d2)) {
  # Broken down into steps for readability.
  inCis <- d1[which(d1[,"CHR"] == d2[i, "chromosome"]),]
  inCis <- inCis[which(inCis[,"POS"] >= (d2[i, "start"] - cisWindow)),]
  inCis <- inCis[which(inCis[,"POS"] <= (d2[i, "end"] + cisWindow)),]
  # Now we have the cis-SNPs, so lets build the data.frame for this gene,
  # and grow our data.frame d3:
  if (nrow(inCis) > 0) {
    d3 <- rbind(d3, cbind(d2[i,], inCis))
  }
}

I tried to find a solution which didn't involve growing d3 in the loop, but because you're attaching each row of d2 to 0 or more rows from d1 I wasn't able to come up with a solution that's not horribly inefficient.

edited Sep 17, 2013 at 6:36

answered Sep 16, 2013 at 8:23

Scott Ritchie

10.6k3 gold badges34 silver badges68 bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

Scott Ritchie Over a year ago

I've fixed the code so that it won't crash for genes who have 0 cis-SNPs.

Arun Over a year ago

@user2669497, this may do the trick, but depending on the number of SNPs, genes and your genome, this may be (very) inefficient. The algorithm you're looking for is Interval tree. The package you should be using is GenomicRanges.

Scott Ritchie Over a year ago

Huh, I didn't know about that package! I spawned a new discussion due to my frustration formulating an answer here: stackoverflow.com/questions/18840410/…

Scott Ritchie Over a year ago

An sqldf query was able to give me all cis-SNPs (+/- 10Kb range) for 35k genes and 2 million markers in just under 30 minutes.

Collectives™ on Stack Overflow

Data selection error [closed]

1 Answer 1

4 Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

4 Comments

Linked

Related