R - fail to store multiple output from for loop in vector or data frame

Question

I have a data frame called lbt_all_epitopes of 38282 rows and three columns, as shown below:

         sequence    score epitope.
1 RPGGPPGYRTPYTAK 1.724911  Epitope
2 TQGDRQKIQDAVSAA 1.664611  Epitope
3 EVKSRYNVDVSQNKR 1.593236  Epitope
4 VIEMTRAFEDDDFDK 1.578200  Epitope
5 ITQGDRQKIQDAVSA 1.533208  Epitope
6 GSADLTPSNLTRPAS 1.532700  Epitope

In the first column (named sequence) I have multiple similar strings, which I want to remove (I will be looking for similar strings using str_sub). For example, considering the first string of lbt_all_epitopes$sequence ("RPGGPPGYRTPYTAK") I want to look for similar strings in the whole column and store them in a vector or in a data.frame, which will be called to_be_removed. I want to do this iteration for the first 30 elements present in lbt_all_epitopes$sequence. For the sake of simplicity, let's just consider the top five rows. When I run the loop, like the one below:

# Iterate over the first 5 rows
top_30 <- 1:5

for(i in top_30) {
  print(agrep(str_sub(lbt_all_epitopes$sequence[i], start = 5, end = 11), lbt_all_epitopes$sequence, value = T))
}

The output:

 [1] "RPGGPPGYRTPYTAK" "VGTRPGGPPGYRTPY" "TRPGGPPGYRTPYTA" "GGPPGYRTPYTAKPF" "PGGPPGYRTPYTAKP"
 [6] "LVGTRPGGPPGYRTP" "TLVGTRPGGPPGYRT" "GPPGYRTPYTAKPFV" "PPGYRTPYTAKPFVM" "GTRPGGPPGYRTPYT"
[11] "PGYRTPYTAKPFVMC"
 [1] "TQGDRQKIQDAVSAA" "ITQGDRQKIQDAVSA" "GITQGDRQKIQDAVS" "NGITQGDRQKIQDAV" "QGDRQKIQDAVSAAS"
 [6] "QNGITQGDRQKIQDA" "GDRQKIQDAVSAASS" "VQNGITQGDRQKIQD" "DRQKIQDAVSAASSW" "RQKIQDAVSAASSWL"
[11] "QKIQDAVSAASSWLE"
 [1] "EVKSRYNVDVSQNKR" "VKSRYNVDVSQNKRA" "NEVKSRYNVDVSQNK" "KSRYNVDVSQNKRAR" "LNEVKSRYNVDVSQN"
 [6] "YNVDVSQNKRARLRL" "RYNVDVSQNKRARLR" "MLNEVKSRYNVDVSQ" "SRYNVDVSQNKRARL" "HMLNEVKSRYNVDVS"
[11] "EHMLNEVKSRYNVDV"
 [1] "VIEMTRAFEDDDFDK" "RVIEMTRAFEDDDFD" "GDRVIEMTRAFEDDD" "DRVIEMTRAFEDDDF" "IEMTRAFEDDDFDKF"
 [6] "RGDRVIEMTRAFEDD" "EMTRAFEDDDFDKFD" "FRGDRVIEMTRAFED" "MTRAFEDDDFDKFDR" "TRAFEDDDFDKFDRV"
[11] "RAFEDDDFDKFDRVR"
 [1] "TQGDRQKIQDAVSAA" "ITQGDRQKIQDAVSA" "GITQGDRQKIQDAVS" "NGITQGDRQKIQDAV" "QGDRQKIQDAVSAAS"
 [6] "QNGITQGDRQKIQDA" "GDRQKIQDAVSAASS" "VQNGITQGDRQKIQD" "DVQNGITQGDRQKIQ" "DRQKIQDAVSAASSW"
[11] "RQKIQDAVSAASSWL"

This is exactly what I want, i.e. it printed all the strings (11 per iteration) similar to the first, second, third ... fifth elements of lbt_all_epitopes$sequence. However, when I try to store the output in a vector (called to_be_removed) with the following loop:

# create the empty vector where I will store the output
to_be_removed <- c()

for(i in top_30) {
  to_be_removed[i] <- agrep(str_sub(lbt_all_epitopes$sequence[i], start = 5, end = 11), lbt_all_epitopes$sequence, value = T)
}

I noticed that each iteration produced only a single string as output (as opposed to 11 strings for each iteration), as below:

> to_be_removed
[1] "RPGGPPGYRTPYTAK" "TQGDRQKIQDAVSAA" "EVKSRYNVDVSQNKR" "VIEMTRAFEDDDFDK" "TQGDRQKIQDAVSAA"

The following warning message was displayed:

Warning messages:
1: In to_be_removed[i] <- agrep(str_sub(lbt_all_epitopes$sequence[i],  :
  number of items to replace is not a multiple of replacement length
2: In to_be_removed[i] <- agrep(str_sub(lbt_all_epitopes$sequence[i],  :
  number of items to replace is not a multiple of replacement length
3: In to_be_removed[i] <- agrep(str_sub(lbt_all_epitopes$sequence[i],  :
  number of items to replace is not a multiple of replacement length
4: In to_be_removed[i] <- agrep(str_sub(lbt_all_epitopes$sequence[i],  :
  number of items to replace is not a multiple of replacement length
5: In to_be_removed[i] <- agrep(str_sub(lbt_all_epitopes$sequence[i],  :
  number of items to replace is not a multiple of replacement length

I am then assuming that I am missing the code telling R that it should also concatenate all the strings produced by each iteration, then go to the next iteration. Does anyone know how to correctly store the output in a vector, or even in a data.frame?

I'm pretty sure that you cannot store an object of length > 1 in a single entry of a vector. Why not use a list? Try something like to_be_removed <- lapply(lbt_all_epitopes$sequence[1:5], function(x) agrep(str_sub(x, start = 5, end = 11), lbt_all_epitopes$sequence, value = T))
– LAP, Commented Jan 26, 2017 at 9:19
By the way, could you provide your dataset in the form of dput(head(lbt_all_epitopes))?
– LAP, Commented Jan 26, 2017 at 9:21
Thanks, it does the job, just like the adapted loop from the colleague below. Do you know any other way to store the output in a data.frame? In this case, it would be best to have a data frame, so that I can look for the strings in to_be_removed in my original dataset (lbt_all_epitopes) and remove them. Thanks. Yes, next time I will post with dput.
– BCArg, Commented Jan 26, 2017 at 10:05
Well, do you want a single string in every column of the data.frame, or just all strings together in one column?
– LAP, Commented Jan 26, 2017 at 10:09
I want to store the output such that I can further look for the strings in my lbt_all_epitopes. For example, I tried to exclude what was in the to_be_removed list with subset <- lbt_all_epitopes[!lbt_all_epitopes$sequence %in% to_be_removed, ] but it did not work.
– BCArg, Commented Jan 26, 2017 at 10:24

Etienne Kintzler · Accepted Answer · 2017-01-26 09:26:45Z

2

You can create a list, since each list element can hold a vector of any length:

# create the empty list where the output will be stored
to_be_removed <- list()

for(i in top_30) {
  to_be_removed[[i]] <- agrep(str_sub(lbt_all_epitopes$sequence[i], start = 5, end = 11), lbt_all_epitopes$sequence, value = TRUE)
}

Note the double brackets ([[i]]), which assign the whole result vector to the i-th element of the list.
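To make the difference concrete, here is a minimal sketch on toy data. The vector seqs and the prefix match with startsWith() are assumptions made for illustration (they stand in for lbt_all_epitopes$sequence and agrep(), so the toy output stays predictable); only the bracket behaviour is the point.

```r
# Toy stand-in for lbt_all_epitopes$sequence (hypothetical values)
seqs <- c("AAAB", "AAAC", "ZZZZ")

# Single brackets on a vector: one slot can hold only one value, so R keeps
# the first match and warns about the rest (warning suppressed here)
vec <- c()
suppressWarnings(vec[1] <- seqs[startsWith(seqs, "AAA")])

# Double brackets on a list: each slot holds the full vector of matches
hits <- list()
for (i in 1:2) {
  prefix <- substr(seqs[i], 1, 3)            # base R analogue of str_sub()
  hits[[i]] <- seqs[startsWith(seqs, prefix)]
}

print(vec)            # only the first match was kept
print(lengths(hits))  # number of matches stored per iteration
```

The same pattern should carry over to the agrep() call above: the list slot stores all matches of an iteration, however many there are.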

Also, next time please post your data using dput so we can use it directly. To do so, run dput(lbt_all_epitopes), which returns:

structure(list(X = 1:6, sequence = structure(c(4L, 5L, 1L, 6L, 
3L, 2L), .Label = c("EVKSRYNVDVSQNKR", "GSADLTPSNLTRPAS", "ITQGDRQKIQDAVSA", 
"RPGGPPGYRTPYTAK", "TQGDRQKIQDAVSAA", "VIEMTRAFEDDDFDK"), class = "factor"), 
    score = structure(c(6L, 5L, 4L, 3L, 2L, 1L), .Label = c("1.532700", 
    "1.533208", "1.578200", "1.593236", "1.664611", "1.724911"
    ), class = "factor"), epitope. = structure(c(1L, 1L, 1L, 
    1L, 1L, 1L), .Label = "Epitope", class = "factor")), .Names = c("X", 
"sequence", "score", "epitope."), class = "data.frame", row.names = c(NA, 
-6L))


9 Comments

BCArg Over a year ago

if I use the command you mentioned, I got: class = "factor"), score = c(1.7249113, 1.6646106, 1.5932359, 1.5782, 1.5332078, 1.5326996), epitope. = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Epitope", "Non-Epitope"), class = "factor")), .Names = c("sequence", "score", "epitope."), row.names = c(NA, 6L), class = "data.frame") > Is this what you mean? Thanks the loop does the job, although it would be nicer to store the output in a data.frame. Any idea on how to do that?

LAP Over a year ago

Thanks for providing the dput(), @EtienneKintzler!

Etienne Kintzler Over a year ago

Yes dput() is really awesome @LeoP. I found it on stackoverflow.com/questions/1295955/… you can check, you might learn some interesting functions

Etienne Kintzler Over a year ago

@BCArg you cannot store the results directly in a data frame because the result of each iteration doesn't have the same length. If you know the maximal length of the output for each iteration (for instance 6), you can use the following code: tmp <- lapply(to_be_removed, function(x) {length(x) <- 6; x}); data.frame(tmp). The function in lapply sets the length of every vector in the list to 6, padding with NA those that have fewer than 6 elements.
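A small sketch of that padding idea, under the assumption that the maximal length is not known in advance and is computed with max(lengths(...)) instead of being hard-coded. The list contents and the query_ column names are made up for illustration.

```r
# Hypothetical list with unequal lengths, standing in for to_be_removed
to_be_removed <- list(c("AAA", "AAB"), "CCC", c("DDD", "DDE", "DDF"))

# Pad every element with NA up to the longest length
max_len <- max(lengths(to_be_removed))
padded <- lapply(to_be_removed, function(x) {
  length(x) <- max_len  # extending a vector fills the new slots with NA
  x
})

# Name the elements so the data frame gets readable column names
names(padded) <- paste0("query_", seq_along(padded))
result <- data.frame(padded, stringsAsFactors = FALSE)

print(result)
```

Each column then holds the matches for one query, with NA marking the unused rows.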

Etienne Kintzler Over a year ago

@BCArg You didn't copy and paste the output of dput correctly, since it doesn't begin with structure(... Also, since I copied your data into Excel and then imported it into R, the elements within the structure may differ; for instance, the .Names field in my dput output contains the value X because read.csv imported the row names (1, 2, ..., 6) as a column (with default name X).


LAP · Accepted Answer · 2017-01-26 10:23:16Z

1

To avoid growing an object inside a for() loop, we can use lapply(), which builds the result list for us. This may also be faster when handling huge datasets.

to_be_removed <- lapply(lbt_all_epitopes$sequence[1:5], function(x) {
  agrep(str_sub(x, start = 5, end = 11), lbt_all_epitopes$sequence, value = TRUE)
})

gives a list with the extracted strings for each row in a separate list entry:

[[1]]
[1] "RPGGPPGYRTPYTAK"

[[2]]
[1] "TQGDRQKIQDAVSAA" "ITQGDRQKIQDAVSA"

[[3]]
[1] "EVKSRYNVDVSQNKR"

[[4]]
[1] "VIEMTRAFEDDDFDK"

[[5]]
[1] "TQGDRQKIQDAVSAA" "ITQGDRQKIQDAVSA"

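Now you can flatten the list into a single vector (which you could use to subset). The strsplit() call below leaves these space-free strings unchanged, so plain unlist(to_be_removed) would give the same result: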

to_be_removed <- unlist(lapply(to_be_removed, function(x) strsplit(x, " ")))

Output:

[1] "RPGGPPGYRTPYTAK" "TQGDRQKIQDAVSAA" "ITQGDRQKIQDAVSA" "EVKSRYNVDVSQNKR" "VIEMTRAFEDDDFDK" "TQGDRQKIQDAVSAA"
[7] "ITQGDRQKIQDAVSA"
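As a sketch of the subsetting step, assuming a toy data frame (lbt, with made-up sequences and scores) in place of lbt_all_epitopes: the list is flattened first, duplicates are dropped with unique(), and %in% then filters the rows. A likely reason the earlier %in% attempt in the comments failed is that to_be_removed was still a list at that point rather than a character vector.

```r
# Toy stand-in for lbt_all_epitopes (hypothetical values)
lbt <- data.frame(
  sequence = c("AAA", "BBB", "CCC", "DDD"),
  score = c(4, 3, 2, 1),
  stringsAsFactors = FALSE
)

# Matches collected per query, as a list of character vectors
to_be_removed <- list(c("AAA", "CCC"), "CCC")

# Flatten to one character vector and drop repeated strings
flat <- unique(unlist(to_be_removed))

# Keep only the rows whose sequence is not in the removal set
kept <- lbt[!lbt$sequence %in% flat, ]
print(kept)
```

Note that in the question's output each query string also matches itself, so the query rows are removed as well unless they are taken out of the removal vector first.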


2 Comments

BCArg Over a year ago

Excellent, this is exactly what I want! I also tried the dput command from etiennekintzler (dput(lbt_all_epitopes)) and I got something completely different. Do you know why?

LAP Over a year ago

Glad to help! dput() gives you an output for your whole dataframe, which usually is quite big and therefore the code is pretty long. For an example for SO, use either dput(head(yourdata)) or - if that is insufficient - manually limit the dimensions: dput(yourdata[1:20, 1:5]).
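For illustration, a minimal sketch of both variants on a made-up data frame (df and its columns are assumptions, not the asker's data). It also shows that the printed structure(...) text can be turned back into an object, which is what makes dput() output convenient to paste into a question.

```r
# Hypothetical data frame standing in for a large dataset
df <- data.frame(id = 1:100, value = seq(0.5, 50, by = 0.5))

# Variant 1: only the first six rows
txt <- capture.output(dput(head(df)))
cat(txt, sep = "\n")

# Variant 2: limit rows and columns manually
small <- df[1:20, 1:2]

# The dput() text is valid R code that rebuilds the object
rebuilt <- eval(parse(text = paste(txt, collapse = "\n")))
print(nrow(rebuilt))
```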
