1

I have a df like:

   SampleID Chr Start End    Strand  Value
1:   rep1     1 11001 12000     -     10
2:   rep1     1 15000 20100     -     5
3:   rep2     1 11070 12050     -     1
4:   rep3     1 14950 20090     +     20
...

And I want to join the rows that share the same chr and strand and that have similar starting and end points (say like with 100 +/- distance). For those columns that the row join is performed, I would also like to concatenate the SampleID names and the Value. With the previous example, something like:

   SampleID Chr Start End    Strand  Value
1:rep1,rep2   1 11001 12000     -     10,1
2:   rep1     1 15000 20100     -     5
4:   rep3     1 14950 20090     +     20
...

Ideas? Thanks!

EDIT:

I found the fuzzyjoin package for R (https://cran.r-project.org/web/packages/fuzzyjoin/index.html). Does anyone have experience with this package?

EDIT2:

It would be also nice if just one of the variables (SampleID or Value) would be concatenated.

0

1 Answer 1

1

We could do group by 'Chr', 'Strand', create a grouping ID based on the difference between adjacent elements in 'Start' and 'End' columns after ordering by 'Start', 'End', then grouped by 'Chr', 'Strand' and 'ind', get the first element of 'Start', 'End', while pasteing the elements in 'SampleID' and 'Value' column

library(data.table)
df[order(Start, End), ind := rleid((Start - shift(Start, fill = Start[1])) < 100 & 
     (End -  shift(End, fill = End[1])) < 100), by =.(Chr, Strand)
   ][, .(Start = Start[1], End = End[1], 
     SampleID = toString(SampleID), Value = toString(Value)) , .(Strand, Chr, ind),]
#     Strand Chr ind Start   End   SampleID Value
#1:      -   1   1 11001 12000 rep1, rep2 10, 1
#2:      -   1   2 15000 20100       rep1     5
#3:      +   1   1 14950 20090       rep3    20

NOTE: Assumed that 'df' is a data.table

Sign up to request clarification or add additional context in comments.

4 Comments

WOW! That's great, I would like to know if you could be a little bit more explicit with the command that you just used. Correct me if I'm wrong: You generate a "Label" for each condition looking at the Chr, Strand and the Start and End which fulfills the requirement that I asked (+/- 100). Then you joined the ones with the same "Label" and add the SampleID and Value as a string. Am I right? Another curious thing that happens in my data.table is that the rows that are exactly at the same position do not perform the join. Where would be the problem?
@Tato14 In the first set of df[i, j, by] we specify the i with order so that the values are ordered in the ascending order. The by is the grouping variables 'Chr', 'Strand', In the 'j', we assign the run-length-id (rleid) of logical output as a new columns, then in the second set, we are doing a group by operation. It is not clear when u say that rows that are exactly at the sam position do not perform the join
Thanks for the explanation! Regarding the last part, by "position" I meant same Chrom, Strand, Start and End.
I also found in the output some rows with joined SampleIDs that do not match the conditions about Start and End but get the same id. If I only run the first part of the command a warning message appear: RHS 1 is length 344 (greater than the size (32) of group 1). The last 312 element(s) will be discarded. RHS 1 is length 344 (greater than the size (5) of group 2)...

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.