StackOverflow question

Hello fellows,

I am trying to "cross" multiple dataframes with R.

My data frames come from high-throughput sequencing experiments and look like the following:

df1 :

         chr  pos orient weight in_nucleosome in_subtelo
1  NC_001133  999      +      1          TRUE       TRUE
2  NC_001133 1505      -     14         FALSE       TRUE
3  NC_001133 1525      -      2          TRUE       TRUE
4  NC_001134  480      +      1          TRUE       TRUE
5  NC_001134  509      +      2         FALSE       TRUE
6  NC_001134  539      +      3         FALSE       TRUE
7  NC_001135 1218      +      1          TRUE       TRUE
8  NC_001135 1228      +      2          TRUE       TRUE
9  NC_001135 1273      +      1          TRUE       TRUE
10 NC_001136  362      +      1          TRUE       TRUE

and

df2:

         chr                feature  start    end orient
1  NC_001133                    ARS    707    776      .
2  NC_001133                    ARS   7997   8547      .
3  NC_001133                    ARS  30946  31183      .
4  NC_001133 ARS_consensus_sequence  31002  31018      +
5  NC_001133 ARS_consensus_sequence  70418  70434      -
6  NC_001133 ARS_consensus_sequence 124463 124479      -
7  NC_001136  blocked_reading_frame 721071 721481      -
8  NC_001137  blocked_reading_frame 375215 377614      -
9  NC_001141  blocked_reading_frame  29032  30048      +
10 NC_001133                    CDS    335    649      +

What I want to know is, for a given chromosome ("chr" here) and for each df2$feature, whether or not df2$start < df1$pos < df2$end. I would then like to add one column to df1 per df2$feature, named after that feature and filled with TRUE or FALSE according to the condition above.

I am pretty sure that the apply family of functions has to be used, maybe nested in one another, but after hours of trying I can't manage to do it.

I did it in a very inelegant, long and error-prone way with nested for loops, but I am convinced there is a better, simpler and maybe faster solution.

Thank you for reading this,

Antoine.

  • You may try foverlaps from data.table or findOverlaps from library(GenomicRanges). Commented Mar 30, 2015 at 16:23
  • Can you provide data that would provide some matches? I see nothing that would meet your constraints. Commented Mar 30, 2015 at 16:58
  • Thanks to you I realize my examples were not so well chosen and there were typos in my question. I'll update it right away. Commented Mar 31, 2015 at 13:10

1 Answer


Though it may be possible with dplyr (I tried but am not that proficient), I got it to work (I think) with foreach and iterators:

Your data:

df1 <- structure(list(chr = c("NC_001133", "NC_001133", "NC_001133", "NC_001134", "NC_001134", "NC_001134", "NC_001135", "NC_001135", "NC_001135", "NC_001136"),
                      pos = c(999L, 1505L, 1525L, 480L, 509L, 539L, 1218L, 1228L, 1273L, 362L),
                      orient = c("+", "-", "-", "+", "+", "+", "+", "+", "+", "+"),
                      weight = c(1L, 14L, 2L, 1L, 2L, 3L, 1L, 2L, 1L, 1L),
                      in_nucleosome = c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE),
                      in_subtelo = c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE)),
                 .Names = c("chr", "pos", "orient", "weight", "in_nucleosome", "in_subtelo"),
                 class = "data.frame",
                 row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"))

df2 <- structure(list(chr = c("NC_001133", "NC_001133", "NC_001133", "NC_001133", "NC_001133", "NC_001133", "NC_001136", "NC_001137", "NC_001141", "NC_001133"),
                      feature = c("ARS", "ARS", "ARS", "ARS_consensus_sequence", "ARS_consensus_sequence", "ARS_consensus_sequence", "blocked_reading_frame", "blocked_reading_frame", "blocked_reading_frame", "CDS"),
                      start = c(707L, 7997L, 30946L, 31002L, 70418L, 124463L, 721071L, 375215L, 29032L, 335L),
                      end = c(776L, 8547L, 31183L, 31018L, 70434L, 124479L, 721481L, 377614L, 30048L, 649L),
                      orient = c(".", ".", ".", "+", "-", "-", "-", "-", "+", "+")),
                 .Names = c("chr", "feature", "start", "end", "orient"),
                 class = "data.frame",
                 row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"))

Since I think your data does not have any matches, I'll inject some:

## to be able to find *something*
df1$pos <- c(999, 1505, 8000, 480, 509, 539, 1218, 1228, 1272, 721072)

The code:

library(foreach)
library(iterators)

## pre-populate df1 with necessary columns
for (col in unique(df2$feature)) df1[,col] <- FALSE

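## iterate over df1 one row at a time and rbind the updated rows back together
df1a <- foreach (subdf1 = iter(df1, by='row'), .combine=rbind) %do% {
    ## only the features annotated on this row's chromosome can match
    features <- unique(df2$feature[df2$chr == subdf1$chr])
    for (feature in features) {
        ## rows of df2 for this chromosome and this feature
        idx <- (df2$chr == subdf1$chr) & (feature == df2$feature)
        ## any(idx), not length(idx): idx always has nrow(df2) elements
        if (any(idx)) {
            subdf1[feature] <- any((df2$start[idx] < subdf1$pos) & (subdf1$pos < df2$end[idx]))
        }
    }
    subdf1
}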

df1a
##          chr    pos orient weight in_nucleosome in_subtelo   ARS
## 1  NC_001133    999      +      1          TRUE       TRUE FALSE
## 2  NC_001133   1505      -     14         FALSE       TRUE FALSE
## 3  NC_001133   8000      -      2          TRUE       TRUE  TRUE
## 4  NC_001134    480      +      1          TRUE       TRUE FALSE
## 5  NC_001134    509      +      2         FALSE       TRUE FALSE
## 6  NC_001134    539      +      3         FALSE       TRUE FALSE
## 7  NC_001135   1218      +      1          TRUE       TRUE FALSE
## 8  NC_001135   1228      +      2          TRUE       TRUE FALSE
## 9  NC_001135   1272      +      1          TRUE       TRUE FALSE
## 10 NC_001136 721072      +      1          TRUE       TRUE FALSE
##    ARS_consensus_sequence blocked_reading_frame   CDS
## 1                   FALSE                 FALSE FALSE
## 2                   FALSE                 FALSE FALSE
## 3                   FALSE                 FALSE FALSE
## 4                   FALSE                 FALSE FALSE
## 5                   FALSE                 FALSE FALSE
## 6                   FALSE                 FALSE FALSE
## 7                   FALSE                 FALSE FALSE
## 8                   FALSE                 FALSE FALSE
## 9                   FALSE                 FALSE FALSE
## 10                  FALSE                  TRUE FALSE
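
If you would rather stay in base R, here is a minimal sketch of the same test using vapply and sapply instead of foreach. Treat it as an alternative to the code above rather than a drop-in replacement: in_feature is a made-up helper name, and the small df1/df2 below are cut-down stand-ins for the data in this answer.

```r
## cut-down stand-ins for df1 and df2 (assumption: same column names as above)
df1 <- data.frame(
    chr = c("NC_001133", "NC_001133", "NC_001134", "NC_001136"),
    pos = c(999, 8000, 480, 721072),
    stringsAsFactors = FALSE)
df2 <- data.frame(
    chr     = c("NC_001133", "NC_001133", "NC_001136", "NC_001133"),
    feature = c("ARS", "ARS", "blocked_reading_frame", "CDS"),
    start   = c(707, 7997, 721071, 335),
    end     = c(776, 8547, 721481, 649),
    stringsAsFactors = FALSE)

## hypothetical helper: for one feature name, return one logical per row of df1,
## TRUE when pos lies strictly inside any interval of that feature on the same chr
in_feature <- function(feature, df1, df2) {
    f <- df2[df2$feature == feature, , drop = FALSE]
    vapply(seq_len(nrow(df1)), function(i) {
        same_chr <- f$chr == df1$chr[i]
        any(same_chr & f$start < df1$pos[i] & df1$pos[i] < f$end)
    }, logical(1))
}

## one logical column per feature, named after the feature
features <- unique(df2$feature)
flags <- sapply(features, in_feature, df1 = df1, df2 = df2)
df1a <- cbind(df1, flags)
df1a
```

This still does one comparison per df1 row per feature, so on large data it will probably not be faster than the foreach version; it only trims the bookkeeping.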

A convenient side effect of using foreach and iterators is that, if the data is large, you can load doParallel and replace %do% with %dopar% to run the loop in parallel on as many workers as you register. You could preface all of the above with something like:

library(doParallel)
cl <- makeCluster(detectCores() - 1) # leaving one available is "A Good Thing (tm)"
registerDoParallel(cl)

## replace %do% with %dopar%, do all of the above code

## clean up
stopCluster(cl)

4 Comments

I'll try to understand and master your code, then I'll try it on my real data frames and update this post.
Your code runs perfectly on the example given here but fails on my real data without giving any error: it just runs and never replaces the FALSE with TRUE when it should. It is as if the condition were never fulfilled. I tried for (col in unique(df2$feature)) df1[,col] <- "XXX" and it never replaces the "XXX". I'm not sure I'll manage to debug that. I'll try working on a more intricate solution with for loops and hope someone can explain how to obtain the same result with some kind of four-liner with dplyr or ddply! Thank you for your time.
Are df1 and df2 just subsets of the true data, or did you randomly and separately create them? Perhaps there's something about the data that I can't see here.
They are a more or less randomly (poorly) selected subset of the true data. They might not capture the whole complexity of the real data, but the structure is supposedly the same. I have to admit that I am a bit lost.
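
A silent failure like the one described above can sometimes be narrowed down by checking whether the two data frames share any chromosome labels at all and whether the coordinate columns are really numeric. The sketch below is only a guess at where to look, since nothing in the thread confirms the cause. check_inputs is a hypothetical helper, and the toy data contain a deliberate trailing space and text positions to show what a mismatch looks like.

```r
## hypothetical helper: report common reasons why no position ever matches
check_inputs <- function(df1, df2) {
    list(
        shared_chr    = intersect(unique(as.character(df1$chr)),
                                  unique(as.character(df2$chr))),
        pos_numeric   = is.numeric(df1$pos),
        start_numeric = is.numeric(df2$start),
        end_numeric   = is.numeric(df2$end),
        any_na        = anyNA(df1$pos) || anyNA(df2$start) || anyNA(df2$end))
}

## toy data (assumption): a trailing space in chr and positions read as text
df1 <- data.frame(chr = c("NC_001133 ", "NC_001134"),
                  pos = c("999", "480"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(chr = "NC_001133", start = 707, end = 1200,
                  stringsAsFactors = FALSE)

report <- check_inputs(df1, df2)
str(report)
```

An empty shared_chr or a FALSE in one of the *_numeric entries would point at how the files were read in rather than at the matching loop itself.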
