0

I am trying to extract regions from a chromosome/position data frame that correspond with a related list of chromosomal positions depicting histone locations. My current "pipeline":

import original dataset containing chromosome and position:

> mapinfo<- read.table()
> colnames(mapinfo)<-c("CHR","M")
> mapinfo$CHR<- paste("chr",mapinfo$CHR),sep="")
> head(mapinfo)
CHR         M
1  chrX  24072640
2  chr9 131463936
3 chr14 105176736
4 chr13 115000168
5  chr8  74791285
6 chr19   3676340

import bed file containing histone locations

> bed<-read.table()
> names(bed)<- c("Chr","Start","Stop")
> head(bed)
Chr     Start      Stop
1  chr4  76806896  76807598
2  chrY  10034763  10036639
3  chr2 133036421 133037716
4 chr21  27227897  27228500
5  chr1 145036931 145041607
6  chr2  91777964  91779762

Generate chromosome specific codes for use in subsetting from mapinfo data frame

> Mcodes<- by(bed,bed$Chr,function(x){paste("M>=",bed$Start,"&M<=",bed$Stop,sep="",collapse="|")})
> Mcodes[chr1]
chr1
"M>=130786932&M<=130787255|M>=133156512&M<=133156894..."

Subset original mapinfo dataset by chromosome:

> subs<- split(mapinfo,mapinfo$CHR)

At this point I can use the following line in order to subset my desired regions individually by chromosome:

> CHR1<- eval(parse(text=paste0('subset(subs$chr1,',Mcodes["chr1"],')')))

I would like to subset all chromosome specific data frames contained in "subs" by their chromosome specific corresponding list in "Mcodes" without having to run that last line of code 24 different times (as I have multiple bed files for various histones/histone variants that will eventually need to be put through the same pipeline). Is there a way to loop/apply/something to make that possible?

Sorry if it seems a trivial question - I am still very new to the R/programming game. Thanks for any advice.

1 Answer 1

1

Here's a toy example that, if I've understood you correctly, should help:

Here's a toy mapinfo:

mapinfo <- data.frame(
  CHR = paste0("chr", rep(1:3, each = 3)),
  M   = c(1, 10, 100)
)
mapinfo
#>    CHR   M
#> 1 chr1   1
#> 2 chr1  10
#> 3 chr1 100
#> 4 chr2   1
#> 5 chr2  10
#> 6 chr2 100
#> 7 chr3   1
#> 8 chr3  10
#> 9 chr3 100

A toy bed:

bed <- data.frame(
  CHR   = paste0("chr", rep(1:3, each = 2)),
  Start = c(0, 5, 15),
  Stop  = c(5, 15, 120)
)
bed
#>    CHR Start Stop
#> 1 chr1     0    5
#> 2 chr1     5   15
#> 3 chr2    15  120
#> 4 chr2     0    5
#> 5 chr3     5   15
#> 6 chr3    15  120

Then, if you don't have them, install the packages dplyr and purrr (using install.packages(c("dplyr", "purrr")), and run the following:

library(dplyr)
library(purrr)

mapinfo %>%
  left_join(bed) %>%
  filter(pmap_lgl(list(M, Start, Stop), between))
#>    CHR   M Start Stop
#> 1 chr1   1     0    5
#> 2 chr1  10     5   15
#> 3 chr2   1     0    5
#> 4 chr2 100    15  120
#> 5 chr3  10     5   15
#> 6 chr3 100    15  120

This is a data frame of all elements in mapinfo that were between Start and Stop values specified in bed. At this point, you can go ahead and split it up as desired, remove Start and Stop columns, etc.

An important note: CHR must have exactly the same label in both data frames (e.g., all caps). I noticed that you had in all caps in one and just first-letter caps in another. Be consistent here (a) for this code to work and (b) because consistency makes your code easier to digest.

To help you develop your understanding of the code, here are some relevant links to "R for Data Science" by Garrett Grolemund and Hadley Wickham:

Sign up to request clarification or add additional context in comments.

4 Comments

Getting NA values for "Start" and "Stop" following left_join(mapinfo,bed). Unlike your toy example my dataset contains multiple "M" positions for a chromosome that fall w/in the start/stop region for that chromosome in the bed file - doesn't that create a fundamental problem for joining? Or maybe I am still not understanding the join function fully.
I tried to get that sort of overlap in my example. Best I can think of - make sure that there are matches for all values of CHR. E.g., chr1, chr2, ... appear in both data frames (n.b. these will be case sensitive). In case, check whether they are characters or factors (class(mapinfo$CHR)).
No luck. I still think there is a fundamental problem with joining in this case - as you need to be able to specify "or" like you can using "|" when subsetting mapinfo based on conditions specified in bed. Otherwise once things are joined you can only traverse row-wise, you can't filter based on values contained in other rows. Your example doesn't contain multiple values for M that fall into a singular region of the bed file.
Are you able to post up an example data set that contains the problems you're dealing with? If so, please use dput(), and please post your expected output based on the data you provide.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.