I have a big file (~100k rows and 100 columns) and I want to extract the information from four columns based on another column. There is a column named Caller, and that column tells you which of the columns ending in .sample will have info other than noSample.
I have tried if and else if statements, but sometimes two conditions are met, and writing out all the possible combinations would take a lot of effort. I am pretty sure there is a better way of doing it.
My real data.frame looks like this one:
EDIT
Df <- data.frame(A = c("chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1"),
                 B = c(10, 12, 13, 14, 15, 16, 17),
                 Caller = c("A", "B", "C", "D", "A,C", "A,B,C", "B,D"),
                 A.sample = c("3xd|432", "noSample", "noSample", "noSample", "1234|567|87sd", "234|456|897a", "noSample"),
                 dummy1 = 1:7,
                 B.sample = c("noSample", "456|789|asd", "noSample", "noSample", "noSample", "674e|7892|123|432", "bgcf|12er|567|zxs3|12ple"),
                 dummy2 = 1:7,
                 C.sample = c("noSample", "noSample", "zxc|vbn|mn", "noSample", "gfd3|123|456|789", "674e|7892|123", "noSample"),
                 dummy3 = 1:7,
                 D.sample = c("noSample", "noSample", "noSample", "poi|uyh|gfrt|562", "noSample", "noSample", "567|zxs3|12ple"),
                 stringsAsFactors = FALSE)
I want to extract a vector of samples for each row. This could be stored in a list or another R object. I will then match these samples against a data.frame where each sample is associated with a process.
My desired output would be
>row1
3xd|432
>row2
456|789|asd
>row3
zxc|vbn|mn
>row4
poi|uyh|gfrt|562
>row5
[1]1234|567|87sd [2]gfd3|123|456|789
>row6
[1]234|456|897a [2]674e|7892|123|432 [3]674e|7892|123
>row7
[1]bgcf|12er|567|zxs3|12ple [2]567|zxs3|12ple
My desired output wouldn't include the pipe | between samples, but I can get rid of it using strsplit.
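For reference, a minimal sketch of that strsplit step on a single cell from the example data. The variable names `cell` and `samples` are only illustrative. Note the `fixed = TRUE` argument: without it, `|` is interpreted as regex alternation rather than a literal pipe.

```r
# One cell taken from the example data (row 5, A.sample)
cell <- "1234|567|87sd"

# fixed = TRUE makes strsplit treat "|" as a literal character
# instead of a regular-expression alternation operator
samples <- strsplit(cell, "|", fixed = TRUE)[[1]]

print(samples)
```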
Since the data.frame is big, speed is essential.
In short, I need the values in the .sample columns that are not noSample, and that info has to be somehow indexed by row.
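To make the goal concrete, here is a rough sketch of one possible way to do this without if/else chains. It is not a tested solution for the full 100k-row file. It assumes every name listed in Caller has a matching `<name>.sample` column, and it uses a shortened three-row version of the example data. The names `callers`, `rows`, `cols`, `sample_mat`, `per_row` and `samples_per_row` are made up for illustration.

```r
# Shortened version of the example data (assumption: same layout as Df above)
Df <- data.frame(
  Caller   = c("A", "A,C", "B,D"),
  A.sample = c("3xd|432", "1234|567|87sd", "noSample"),
  B.sample = c("noSample", "noSample", "bgcf|12er"),
  C.sample = c("noSample", "gfd3|123", "noSample"),
  D.sample = c("noSample", "noSample", "567|zxs3"),
  stringsAsFactors = FALSE
)

# Split Caller into the caller names for each row
callers <- strsplit(Df$Caller, ",", fixed = TRUE)

# Build one (row, column) pair per caller mentioned in each row
rows <- rep(seq_len(nrow(Df)), lengths(callers))
cols <- paste0(unlist(callers), ".sample")

# Keep only the .sample columns as a character matrix
sample_mat <- as.matrix(Df[, grep("\\.sample$", names(Df)), drop = FALSE])

# Two-column matrix indexing extracts all requested cells in one step
values <- sample_mat[cbind(rows, match(cols, colnames(sample_mat)))]

# Regroup the extracted cells by row; list names are the row numbers
per_row <- split(values, factor(rows, levels = seq_len(nrow(Df))))

# Optionally drop the pipes, giving one vector of samples per row
samples_per_row <- lapply(per_row, function(x) {
  unlist(strsplit(x, "|", fixed = TRUE))
})

print(per_row)
```

Because the lookup is done with a single matrix index rather than a loop over rows, the cost should grow roughly with the number of caller entries, though I have not timed it on data of the real size.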