Join rows in a data frame which have similar (but not equal) values

Question

I have a df like:

   SampleID Chr Start   End Strand Value
1:     rep1   1 11001 12000      -    10
2:     rep1   1 15000 20100      -     5
3:     rep2   1 11070 12050      -     1
4:     rep3   1 14950 20090      +    20
...

I want to join the rows that share the same Chr and Strand and that have similar Start and End positions (say, within a distance of +/- 100). For the rows that are joined, I would also like to concatenate the SampleID names and the Values. With the previous example, the result would be something like:

    SampleID Chr Start   End Strand Value
1: rep1,rep2   1 11001 12000      -  10,1
2:      rep1   1 15000 20100      -     5
4:      rep3   1 14950 20090      +    20
...

Ideas? Thanks!

EDIT:

I found the fuzzyjoin package for R (https://cran.r-project.org/web/packages/fuzzyjoin/index.html). Does anyone have experience with this package?

EDIT2:

It would also be nice if just one of the variables (SampleID or Value) could be concatenated.

akrun · Accepted Answer · 2017-11-18 12:29:39Z

1

We could group by 'Chr' and 'Strand' and, after ordering by 'Start' and 'End', create a grouping ID ('ind') from the differences between adjacent elements of the 'Start' and 'End' columns. Then, grouped by 'Chr', 'Strand' and 'ind', we take the first element of 'Start' and 'End' while pasting together the elements of the 'SampleID' and 'Value' columns:

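library(data.table)
# Step 1 (first [ ]): per Chr/Strand group, in Start/End order, label runs of rows with rleid()
# Step 2 (second [ ]): collapse each Chr/Strand/ind group into a single row
df[order(Start, End), ind := rleid((Start - shift(Start, fill = Start[1])) < 100 & 
     (End -  shift(End, fill = End[1])) < 100), by = .(Chr, Strand)
   ][, .(Start = Start[1], End = End[1], 
     SampleID = toString(SampleID), Value = toString(Value)), by = .(Strand, Chr, ind)]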
#     Strand Chr ind Start   End   SampleID Value
#1:      -   1   1 11001 12000 rep1, rep2 10, 1
#2:      -   1   2 15000 20100       rep1     5
#3:      +   1   1 14950 20090       rep3    20

NOTE: This assumes that 'df' is a data.table.
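If the data starts out as a plain data.frame, it has to be converted before the `:=` syntax above will work. A minimal sketch, assuming the four example rows from the question and assuming the column types shown here (the real data may use different types):

```r
library(data.table)

# The question's example rows as an ordinary data.frame (types are assumed)
df <- data.frame(
  SampleID = c("rep1", "rep1", "rep2", "rep3"),
  Chr      = c(1, 1, 1, 1),
  Start    = c(11001, 15000, 11070, 14950),
  End      = c(12000, 20100, 12050, 20090),
  Strand   = c("-", "-", "-", "+"),
  Value    = c(10, 5, 1, 20)
)

# setDT() converts the object to a data.table by reference (no copy is made)
setDT(df)

is.data.table(df)
```

`as.data.table(df)` is an alternative that returns a converted copy and leaves the original data.frame untouched.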


4 Comments

Tato14 Over a year ago

WOW! That's great. Could you be a little more explicit about the command you just used? Correct me if I'm wrong: you generate a "label" for each row by looking at the Chr, the Strand, and whether the Start and End fulfil the requirement that I asked for (+/- 100). Then you join the rows with the same "label" and combine the SampleID and Value as strings. Am I right? Another curious thing that happens in my data.table is that rows at exactly the same position are not joined. What could be the problem?

akrun Over a year ago

@Tato14 In the first df[i, j, by] call, we specify order in i so that the rows are processed in ascending order. The by holds the grouping variables 'Chr' and 'Strand'. In j, we assign the run-length ID (rleid) of the logical output as a new column. In the second call, we do a group-by operation. It is not clear what you mean when you say that rows at exactly the same position are not joined.
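To see what the first `df[i, j, by]` call produces on its own, here is a small sketch that runs only that step on the four example rows from the question. The `data.table()` construction and the column types are assumptions about how the data is stored:

```r
library(data.table)

# Example rows from the question (column types are assumed)
df <- data.table(
  SampleID = c("rep1", "rep1", "rep2", "rep3"),
  Chr      = c(1, 1, 1, 1),
  Start    = c(11001, 15000, 11070, 14950),
  End      = c(12000, 20100, 12050, 20090),
  Strand   = c("-", "-", "-", "+"),
  Value    = c(10, 5, 1, 20)
)

# i  = order(Start, End): visit rows in ascending Start/End order
# by = .(Chr, Strand):    compute separately within each Chr/Strand group
# j  = ind := rleid(...): a new ID each time the logical "close to the
#                         previous row" value changes
df[order(Start, End),
   ind := rleid((Start - shift(Start, fill = Start[1])) < 100 &
                (End - shift(End, fill = End[1])) < 100),
   by = .(Chr, Strand)]

print(df)
# The ind column, in the original row order, is 1 2 1 1:
# rep1 (11001) and rep2 (11070) share ind 1 within Chr 1 / Strand "-"
```

The second call then only has to group by 'Chr', 'Strand' and 'ind' to collapse the rows that share a label.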

Tato14 Over a year ago

Thanks for the explanation! Regarding the last part, by "position" I meant same Chrom, Strand, Start and End.

Tato14 Over a year ago

I also found some rows in the output whose SampleIDs were joined and that share the same id even though they do not meet the conditions on Start and End. If I run only the first part of the command, a warning message appears:

RHS 1 is length 344 (greater than the size (32) of group 1). The last 312 element(s) will be discarded. RHS 1 is length 344 (greater than the size (5) of group 2)...
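One possible explanation for the unjoined identical rows and the unexpected joins, offered as a sketch and not as a confirmed diagnosis of this data: `rleid()` starts a new ID whenever the logical value changes. A row that opens a new cluster evaluates to FALSE and an identical row right behind it evaluates to TRUE, so the two get different IDs, while two consecutive rows that are each far from their predecessor (FALSE, FALSE) end up sharing one. The variant below counts breaks with `cumsum()` instead. The `rep4` row and the names `tol`, `sorted` and `merged` are made up for illustration. Each row is compared only with the previous row in sort order, so this is chain-style grouping and not an all-pairs comparison. It does not address the warning message.

```r
library(data.table)

# Example rows from the question, plus a hypothetical rep4 row that sits at
# exactly the same position as the second rep1 row
df <- data.table(
  SampleID = c("rep1", "rep1", "rep2", "rep3", "rep4"),
  Chr      = c(1, 1, 1, 1, 1),
  Start    = c(11001, 15000, 11070, 14950, 15000),
  End      = c(12000, 20100, 12050, 20090, 20100),
  Strand   = c("-", "-", "-", "+", "-"),
  Value    = c(10, 5, 1, 20, 7)
)

tol <- 100  # maximum allowed distance between neighbouring rows (assumed)

# Work on a sorted copy so that neighbouring rows are adjacent
sorted <- df[order(Chr, Strand, Start, End)]

# A row starts a new cluster when Start or End is more than tol away from
# the previous row; cumsum() of those breaks gives one ID per cluster
sorted[, ind := cumsum(
         abs(Start - shift(Start, fill = Start[1])) > tol |
         abs(End - shift(End, fill = End[1])) > tol
       ) + 1L,
       by = .(Chr, Strand)]

# Collapse each cluster into a single row
merged <- sorted[, .(Start    = Start[1],
                     End      = End[1],
                     SampleID = toString(SampleID),
                     Value    = toString(Value)),
                 by = .(Chr, Strand, ind)]

print(merged)
```

With this data, the two rows at 15000 to 20100 end up in the same cluster ("rep1, rep4"). To concatenate only one of the variables, as asked in EDIT2, one option is to replace `Value = toString(Value)` with something like `Value = Value[1]`, which keeps just the first value of each cluster.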
