
I have a big file (~100k rows and 100 columns) and I want to extract the information from four columns based on another column. There is a column named Caller, and that column tells you which of the columns ending in .sample hold information other than noSample.

I have tried if and else if statements, but sometimes two conditions are met, and writing out all the possible combinations would take a lot of effort. I am pretty sure there is a better way of doing it.

My real data.frame looks like this one:

EDIT

 Df <- data.frame(A = c("chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1"),
             B= c(10,12,13,14,15,16,17),
             Caller = c("A", "B", "C",  "D", "A,C", "A,B,C", "B,D"),
             A.sample = c("3xd|432", "noSample","noSample","noSample","1234|567|87sd","234|456|897a","noSample"),
             dummy1 = 1:7,
             B.sample = c("noSample", "456|789|asd", "noSample","noSample","noSample","674e|7892|123|432","bgcf|12er|567|zxs3|12ple"),
             dummy2 = 1:7,
             C.sample = c("noSample","noSample", "zxc|vbn|mn","noSample","gfd3|123|456|789","674e|7892|123","noSample" ),
             dummy3 = 1:7,
             D.sample = c("noSample","noSample", "noSample", "poi|uyh|gfrt|562", "noSample", "noSample", "567|zxs3|12ple"), stringsAsFactors=FALSE)

For each row, I want to extract a vector of samples. This could be stored in a list or another R object. I will then match these samples against a data.frame where each sample is associated with a process.

  My desired output would be

  >row1
  3xd|432 
  >row2
   456|789|asd
  >row3
  zxc|vbn|mn
  >row4
  poi|uyh|gfrt|562
  >row5
  [1]1234|567|87sd [2]gfd3|123|456|789
  >row6
  [1]234|456|897a [2]674e|7892|123|432  [3]674e|7892|123
  >row7
  [1]bgcf|12er|567|zxs3|12ple  [2]567|zxs3|12ple

My desired output wouldn't include the pipe | between samples but I can get rid of it using strsplit
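For reference, a minimal sketch of that splitting step on a single cell from the example data. The variable names `cell` and `samples` are just for illustration. Note that `|` is a regex metacharacter, so `fixed = TRUE` is used to treat it as a literal separator.

```r
# One cell taken from the example data
cell <- "1234|567|87sd"

# fixed = TRUE makes strsplit treat "|" literally instead of as regex alternation
samples <- strsplit(cell, "|", fixed = TRUE)[[1]]
samples
```

Applied to a whole column, `strsplit` returns a list with one character vector per row.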

Since the data.frame is big, speed is essential.

  • It looks like you are trying to take band diagonals from your data frame. You might want to format your data like a table/matrix so this point gets across. Commented Dec 26, 2018 at 13:26
  • @TimBiegeleisen, it is not always a perfect diagonal, in some cases a whole column of samples could have all values as noSample Commented Dec 26, 2018 at 13:28
  • How about that formatting? Try to give us a minimal question. Commented Dec 26, 2018 at 13:28
  • Sorry if I don't understand your point, but I want to extract only the sample info from those columns with noSample, and that info has to be somehow indexed by row Commented Dec 26, 2018 at 13:30
  • How important is it to denote the samples with [1], etc. in the output vector? Commented Dec 26, 2018 at 16:09

2 Answers

Here is a possible solution:

Df <- data.frame(A = c("chr1", "chr1", "chr1", "chr1", "chr1", "chr1", "chr1"),
                 B= c(10,12,13,14,15,16,17),
                 Caller = c("A", "B", "C",  "D", "A,C", "A,B,C", "B,D"),
                 A.sample = c("3xd|432", "noSample","noSample","noSample","1234|567|87sd","234|456|897a","noSample"),
                 B.sample = c("noSample", "456|789|asd", "noSample","noSample","noSample","674e|7892|123|432","bgcf|12er|567|zxs3|12ple"),
                 C.sample = c("noSample","noSample", "zxc|vbn|mn","noSample","gfd3|123|456|789","674e|7892|123","noSample" ),
                 D.sample = c("noSample","noSample", "noSample", "poi|uyh|gfrt|562", "noSample", "noSample", "567|zxs3|12ple"),
                 stringsAsFactors=FALSE)

#find names of columns
names<-substr(names(Df), 1, 1)
#Set unwanted names to NA
names[-c(4:ncol(Df))]<-NA

#create a regular expression by replacing the comma with the or |
reg<-gsub(",", "\\|", Df$Caller)

#find the column matches (lapply always returns a list, even when every row matches the same number of columns)
columns<-lapply(reg, function(x){grep(x, names)})

#extract the desired columns out into a list
lapply(seq_along(columns), function(x){Df[x,columns[[x]]]})

I added stringsAsFactors=FALSE to the data frame definition in order to remove the baggage related to the Factor levels.
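As an alternative sketch (not the approach above), the columns can also be looked up by name rather than by position, which sidesteps the question of where the .sample columns sit. It assumes that every entry in Caller is a comma-separated list without spaces and that each letter has a matching `<letter>.sample` column. `Df_small`, `callers` and `res` are hypothetical names used only for this illustration.

```r
# Small stand-in for the question's data, with a dummy column between samples
Df_small <- data.frame(
  Caller   = c("A", "A,C", "B,D"),
  A.sample = c("3xd|432", "1234|567|87sd", "noSample"),
  dummy1   = 1:3,
  B.sample = c("noSample", "noSample", "bgcf|12er"),
  C.sample = c("noSample", "gfd3|123", "noSample"),
  D.sample = c("noSample", "noSample", "567|zxs3"),
  stringsAsFactors = FALSE
)

# Split each Caller entry into its letters, e.g. "A,C" -> c("A", "C")
callers <- strsplit(Df_small$Caller, ",", fixed = TRUE)

# Build the column names ("A" -> "A.sample") and pull them from row i;
# unlist() flattens the one-row result into a plain character vector
res <- lapply(seq_along(callers), function(i) {
  unlist(Df_small[i, paste0(callers[[i]], ".sample")], use.names = FALSE)
})
res
```

Each element of `res` is a character vector for one row, so it can be passed straight to `strsplit` if the individual samples are needed.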


2 Comments

it works perfectly but in my real data.set the columns (A.sample, B.sample, C.sample, D.sample) are not consecutive they are in positions c(8,10,12,14), I don't know how to fix the columns step to get the correct columns, since you have used a +3 to get the correct index, right?
@user2380782, I made an edit above to take care of the nonconsecutive columns; just substitute your vector c(8, 10, 12, 14) into the line names[-c(...)]<-NA


Showing just one of many possible ways to achieve the desired result. Note that I use the same dataframe as @Dave2e, i.e. I have added stringsAsFactors=F to the call to data.frame.

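library(tidyverse)
# note: the data frame is named Df (capital D); lowercase df would refer to stats::df
out <- Df %>% rowid_to_column() %>% # adding explicit row IDs
       gather(key, value, -rowid, -A, -B, -Caller) %>% # reshaping the dataframe
       filter(value != "noSample") # keeping only cells that hold samples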

The resulting dataframe will look like this:

out
   rowid    A  B Caller      key                    value
1      1 chr1 10      A A.sample                  3xd|432
2      5 chr1 15    A,C A.sample            1234|567|87sd
3      6 chr1 16  A,B,C A.sample             234|456|897a
4      2 chr1 12      B B.sample              456|789|asd
5      6 chr1 16  A,B,C B.sample        674e|7892|123|432
6      7 chr1 17    B,D B.sample bgcf|12er|567|zxs3|12ple
7      3 chr1 13      C C.sample               zxc|vbn|mn
8      5 chr1 15    A,C C.sample         gfd3|123|456|789
9      6 chr1 16  A,B,C C.sample            674e|7892|123
10     4 chr1 14      D D.sample         poi|uyh|gfrt|562
11     7 chr1 17    B,D D.sample           567|zxs3|12ple

Now we can simply subset to retrieve the desired result:

out[out$rowid == 1,"value"]
[1] "3xd|432"
out[out$rowid == 5,"value"]
[1] "1234|567|87sd"    "gfd3|123|456|789"
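If a list with one element per row is preferred over subsetting each time, base R's `split` can group the values by row ID in one call. This is a minimal sketch on a hand-built stand-in for the long table; `out_small` and `by_row` are hypothetical names used only here.

```r
# A few rows in the shape of the long table above (rowid and value only)
out_small <- data.frame(
  rowid = c(1, 5, 6, 5, 6),
  value = c("3xd|432", "1234|567|87sd", "234|456|897a",
            "gfd3|123|456|789", "674e|7892|123"),
  stringsAsFactors = FALSE
)

# Group the values by row ID; list names are the row IDs as strings
by_row <- split(out_small$value, out_small$rowid)
by_row[["5"]]
```

Because the list is named by row ID, rows without any samples simply do not appear in it.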
