
I have a data frame called lbt_all_epitopes of 38282 rows and three columns, as shown below:

         sequence    score epitope.
1 RPGGPPGYRTPYTAK 1.724911  Epitope
2 TQGDRQKIQDAVSAA 1.664611  Epitope
3 EVKSRYNVDVSQNKR 1.593236  Epitope
4 VIEMTRAFEDDDFDK 1.578200  Epitope
5 ITQGDRQKIQDAVSA 1.533208  Epitope
6 GSADLTPSNLTRPAS 1.532700  Epitope

The first column (named sequence) contains multiple similar strings, which I want to remove (I look for similar strings using str_sub and agrep). For example, taking the first string of lbt_all_epitopes$sequence ("RPGGPPGYRTPYTAK"), I want to find similar strings in the whole column and store them in a vector or a data.frame called to_be_removed. I want to repeat this for the first 30 elements of lbt_all_epitopes$sequence. For the sake of simplicity, let's just consider the top five rows. When I run a loop like the one below:

# Iterate over the first 5 rows
top_30 <- 1:5

for(i in top_30) {
  print(agrep(str_sub(lbt_all_epitopes$sequence[i], start = 5, end = 11), lbt_all_epitopes$sequence, value = T))
}

The output:

 [1] "RPGGPPGYRTPYTAK" "VGTRPGGPPGYRTPY" "TRPGGPPGYRTPYTA" "GGPPGYRTPYTAKPF" "PGGPPGYRTPYTAKP"
 [6] "LVGTRPGGPPGYRTP" "TLVGTRPGGPPGYRT" "GPPGYRTPYTAKPFV" "PPGYRTPYTAKPFVM" "GTRPGGPPGYRTPYT"
[11] "PGYRTPYTAKPFVMC"
 [1] "TQGDRQKIQDAVSAA" "ITQGDRQKIQDAVSA" "GITQGDRQKIQDAVS" "NGITQGDRQKIQDAV" "QGDRQKIQDAVSAAS"
 [6] "QNGITQGDRQKIQDA" "GDRQKIQDAVSAASS" "VQNGITQGDRQKIQD" "DRQKIQDAVSAASSW" "RQKIQDAVSAASSWL"
[11] "QKIQDAVSAASSWLE"
 [1] "EVKSRYNVDVSQNKR" "VKSRYNVDVSQNKRA" "NEVKSRYNVDVSQNK" "KSRYNVDVSQNKRAR" "LNEVKSRYNVDVSQN"
 [6] "YNVDVSQNKRARLRL" "RYNVDVSQNKRARLR" "MLNEVKSRYNVDVSQ" "SRYNVDVSQNKRARL" "HMLNEVKSRYNVDVS"
[11] "EHMLNEVKSRYNVDV"
 [1] "VIEMTRAFEDDDFDK" "RVIEMTRAFEDDDFD" "GDRVIEMTRAFEDDD" "DRVIEMTRAFEDDDF" "IEMTRAFEDDDFDKF"
 [6] "RGDRVIEMTRAFEDD" "EMTRAFEDDDFDKFD" "FRGDRVIEMTRAFED" "MTRAFEDDDFDKFDR" "TRAFEDDDFDKFDRV"
[11] "RAFEDDDFDKFDRVR"
 [1] "TQGDRQKIQDAVSAA" "ITQGDRQKIQDAVSA" "GITQGDRQKIQDAVS" "NGITQGDRQKIQDAV" "QGDRQKIQDAVSAAS"
 [6] "QNGITQGDRQKIQDA" "GDRQKIQDAVSAASS" "VQNGITQGDRQKIQD" "DVQNGITQGDRQKIQ" "DRQKIQDAVSAASSW"
[11] "RQKIQDAVSAASSWL"

This is exactly what I want, i.e. it printed all the strings (11 per iteration) similar to the first, second, third...fifth elements of lbt_all_epitopes$sequence. However, when I try to store the output in a vector (called to_be_removed) with the following loop:

# create the empty vector where I will store the output
to_be_removed <- c()

for(i in top_30) {
  to_be_removed[i] <- agrep(str_sub(lbt_all_epitopes$sequence[i], start = 5, end = 11), lbt_all_epitopes$sequence, value = T)
}

I noticed that each iteration produced only a single string as output (as opposed to 11 strings for each iteration), as below:

> to_be_removed
[1] "RPGGPPGYRTPYTAK" "TQGDRQKIQDAVSAA" "EVKSRYNVDVSQNKR" "VIEMTRAFEDDDFDK" "TQGDRQKIQDAVSAA"

The following warning message was displayed:

Warning messages:
1: In to_be_removed[i] <- agrep(str_sub(lbt_all_epitopes$sequence[i],  :
  number of items to replace is not a multiple of replacement length
2: In to_be_removed[i] <- agrep(str_sub(lbt_all_epitopes$sequence[i],  :
  number of items to replace is not a multiple of replacement length
3: In to_be_removed[i] <- agrep(str_sub(lbt_all_epitopes$sequence[i],  :
  number of items to replace is not a multiple of replacement length
4: In to_be_removed[i] <- agrep(str_sub(lbt_all_epitopes$sequence[i],  :
  number of items to replace is not a multiple of replacement length
5: In to_be_removed[i] <- agrep(str_sub(lbt_all_epitopes$sequence[i],  :
  number of items to replace is not a multiple of replacement length

I assume I am missing the code that tells R to concatenate all the strings produced by each iteration before moving on to the next one. Does anyone know how to correctly store the output in a vector, or even in a data.frame?

  • I'm pretty sure that you cannot store an object of length > 1 in a single entry of a vector. Why not use a list? Try something like to_be_removed <- lapply(lbt_all_epitopes$sequence[1:5], function(x) agrep(str_sub(x, start = 5, end = 11), lbt_all_epitopes$sequence, value = T)) Commented Jan 26, 2017 at 9:19
  • By the way, could you provide your dataset in the form of dput(head(lbt_all_epitopes))? Commented Jan 26, 2017 at 9:21
  • Thanks, it does the job, just like the adapted loop from the colleague below. Do you know any other way to store the output in a data.frame? In this case, it would be best to have a data frame, so that I can look up the strings in to_be_removed in my original dataset (lbt_all_epitopes) and remove them. Thanks. Yes, next time I will post with dput. Commented Jan 26, 2017 at 10:05
  • Well, do you want a single string in every column of the data.frame, or just all strings together in one column? Commented Jan 26, 2017 at 10:09
  • I want to store the output so that I can then look for the strings in my lbt_all_epitopes. For example, I tried to exclude what was in the to_be_removed list with subset <- lbt_all_epitopes[!lbt_all_epitopes$sequence %in% to_be_removed, ] but it did not work. Commented Jan 26, 2017 at 10:24

2 Answers

You can store the results in a list:

# create the empty list where the output will be stored
to_be_removed <- list()

for(i in top_30) {
  to_be_removed[[i]] <- agrep(str_sub(lbt_all_epitopes$sequence[i], start = 5, end = 11), lbt_all_epitopes$sequence, value = TRUE)
}

Note the double brackets ([[ ]]) used to fill the list.
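To make the difference concrete, here is a minimal sketch with made-up data (the `results` list and the names `as_vector` and `as_list` are hypothetical, not from the question). It contrasts single-bracket assignment into an atomic vector, which keeps only the first value and warns, with double-bracket assignment into a list, which keeps the whole result of each iteration.

```r
# Toy stand-in for the per-iteration agrep() results: each has a different length.
results <- list(c("AAA", "AAB"), "CCC", c("DDD", "DDE", "DDF"))

# Atomic vector + single bracket: only the first value of each result is kept.
# The warning "number of items to replace is not a multiple of replacement
# length" is raised by the assignment itself, so it is suppressed around it.
as_vector <- character(0)
for (i in seq_along(results)) {
  suppressWarnings(as_vector[i] <- results[[i]])
}

# List + double bracket: each slot holds the complete result of one iteration.
as_list <- vector("list", length(results))
for (i in seq_along(results)) {
  as_list[[i]] <- results[[i]]
}

print(as_vector)
print(lengths(as_list))
```

Preallocating with vector("list", n) instead of starting from list() is optional here; it only avoids growing the list on every iteration.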

Also, next time please post your data using dput() so we can use it directly. To do so, run dput(lbt_all_epitopes), which returns:

structure(list(X = 1:6, sequence = structure(c(4L, 5L, 1L, 6L, 
3L, 2L), .Label = c("EVKSRYNVDVSQNKR", "GSADLTPSNLTRPAS", "ITQGDRQKIQDAVSA", 
"RPGGPPGYRTPYTAK", "TQGDRQKIQDAVSAA", "VIEMTRAFEDDDFDK"), class = "factor"), 
    score = structure(c(6L, 5L, 4L, 3L, 2L, 1L), .Label = c("1.532700", 
    "1.533208", "1.578200", "1.593236", "1.664611", "1.724911"
    ), class = "factor"), epitope. = structure(c(1L, 1L, 1L, 
    1L, 1L, 1L), .Label = "Epitope", class = "factor")), .Names = c("X", 
"sequence", "score", "epitope."), class = "data.frame", row.names = c(NA, 
-6L))

9 Comments

If I use the command you mentioned, I get: class = "factor"), score = c(1.7249113, 1.6646106, 1.5932359, 1.5782, 1.5332078, 1.5326996), epitope. = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Epitope", "Non-Epitope"), class = "factor")), .Names = c("sequence", "score", "epitope."), row.names = c(NA, 6L), class = "data.frame") Is this what you mean? Thanks, the loop does the job, although it would be nicer to store the output in a data.frame. Any idea how to do that?
Thanks for providing the dput(), @EtienneKintzler!
Yes, dput() is really awesome, @LeoP. I found it at stackoverflow.com/questions/1295955/… Have a look, you might learn some interesting functions.
@BCArg you can't store the results directly in a data frame because the results of the iterations don't all have the same length. If you know the maximal length of the output across iterations (for instance 6), you can use the following code: tmp <- lapply(to_be_removed, function(x) {length(x) <- 6; x}); data.frame(tmp) The function passed to lapply sets the length of every element of the list to 6, padding with NA any element that has fewer than 6 entries.
@BCArg You didn't copy and paste the dput output correctly, since it doesn't begin with structure(... Also, since I copied your data into Excel and then imported it into R, the elements within the structure may differ; for instance, the .Names field in my dput output contains the value X because read.csv imported the row names (1, 2, ..., 6) as a column (with the default name X).
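As a rough sketch of the data.frame options discussed in these comments, the following uses a small hypothetical list `matches` in place of to_be_removed (the names `long_df`, `wide_df`, `max_len` and `padded` are also made up for illustration). It shows a long layout with one row per match, and a wide layout padded with NA, where the maximal length is computed with lengths() rather than assumed.

```r
# Hypothetical stand-in for the list produced by the loop above.
matches <- list(
  c("RPGGPPGYRTPYTAK", "TRPGGPPGYRTPYTA"),
  "EVKSRYNVDVSQNKR",
  c("TQGDRQKIQDAVSAA", "ITQGDRQKIQDAVSA", "GITQGDRQKIQDAVS")
)

# Long layout: one row per matched string, tagged with the index of the
# list element (i.e. the query row) it came from.
long_df <- data.frame(
  query = rep(seq_along(matches), times = lengths(matches)),
  match = unlist(matches),
  stringsAsFactors = FALSE
)

# Wide layout: pad every element with NA up to the longest one,
# then bind the padded vectors as columns (one column per query).
max_len <- max(lengths(matches))
padded  <- lapply(matches, function(x) { length(x) <- max_len; x })
wide_df <- as.data.frame(do.call(cbind, padded), stringsAsFactors = FALSE)

print(long_df)
print(wide_df)
```

The long layout is usually the easier one to filter or join against the original data, since every match sits in a single column.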

To avoid a growing for()-loop, we can use lapply(). This should be faster when handling huge datasets.

to_be_removed <- lapply(lbt_all_epitopes$sequence[1:5], function(x) agrep(str_sub(x, start = 5, end = 11), lbt_all_epitopes$sequence, value = TRUE))

gives a list with the extracted strings for each row in a separate list entry:

[[1]]
[1] "RPGGPPGYRTPYTAK"

[[2]]
[1] "TQGDRQKIQDAVSAA" "ITQGDRQKIQDAVSA"

[[3]]
[1] "EVKSRYNVDVSQNKR"

[[4]]
[1] "VIEMTRAFEDDDFDK"

[[5]]
[1] "TQGDRQKIQDAVSAA" "ITQGDRQKIQDAVSA"

Now you can unlist() them into a single vector (which you could use to subset); the list elements are already separate strings, so no strsplit() is needed:

to_be_removed <- unlist(to_be_removed)

Output:

[1] "RPGGPPGYRTPYTAK" "TQGDRQKIQDAVSAA" "ITQGDRQKIQDAVSA" "EVKSRYNVDVSQNKR" "VIEMTRAFEDDDFDK" "TQGDRQKIQDAVSAA"
[7] "ITQGDRQKIQDAVSA"
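A minimal sketch of the subsetting step, assuming the six rows shown in the question and using base substr() in place of stringr's str_sub() so that the example has no package dependency (the names `hits` and `kept` are hypothetical). The `%in%` attempt mentioned in the question's comments likely failed because to_be_removed was still a list at that point; flattening it to a character vector first is what makes the membership test behave as intended.

```r
# Six example rows copied from the question.
lbt_all_epitopes <- data.frame(
  sequence = c("RPGGPPGYRTPYTAK", "TQGDRQKIQDAVSAA", "EVKSRYNVDVSQNKR",
               "VIEMTRAFEDDDFDK", "ITQGDRQKIQDAVSA", "GSADLTPSNLTRPAS"),
  score = c(1.724911, 1.664611, 1.593236, 1.578200, 1.533208, 1.532700),
  stringsAsFactors = FALSE
)

# Fuzzy-match characters 5 to 11 of the first two sequences against the column.
hits <- lapply(lbt_all_epitopes$sequence[1:2], function(x) {
  agrep(substr(x, 5, 11), lbt_all_epitopes$sequence, value = TRUE)
})

# Flatten the list to a character vector and drop repeated strings.
to_be_removed <- unique(unlist(hits))

# Keep only the rows whose sequence is not in to_be_removed.
kept <- lbt_all_epitopes[!lbt_all_epitopes$sequence %in% to_be_removed, ]

print(to_be_removed)
print(kept)
```

Note that this also removes the query sequences themselves, since each one matches its own substring; if the top-ranked sequence should be kept, it would need to be excluded from to_be_removed before subsetting.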

2 Comments

Excellent, this is exactly what I want! I also tried the dput command from etiennekintzler (dput(lbt_all_epitopes)) and I got something completely different. Do you know why?
Glad to help! dput() gives you an output for your whole dataframe, which usually is quite big and therefore the code is pretty long. For an example for SO, use either dput(head(yourdata)) or - if that is insufficient - manually limit the dimensions: dput(yourdata[1:20, 1:5]).
