2

Suppose I have a data frame with a few numbers in the first column. I want to use each number as a position in a string and extract the substring that runs from 2 characters before that position to 2 characters after it. To clarify:

aggSN <- data.frame(V1=c(5,6,7,8),V2="blah")
gen <- "AJSDAFKSDAFJKLASDFKJKA"  # <- take this string
aggSN                            # <- take the numbers in the first column
# V1    V2
#  5  blah
#  6  blah
#  7  blah
#  8  blah

and create a new column V3 that looks like

aggSN                           
# V1    V2    V3
#  5  blah SDAFK   # <- took the two characters before and after the 5th character
#  6  blah DAFKS   # <- took the two characters before and after the 6th character 
#  7  blah AFKSD   # <- took the two characters before and after the 7th character 
# 10  blah SDAFJ   # <- took the two characters before and after the 10th character 
#  2  blah  AJSD   # <- here the substring is cut off at the start of the string

Currently I am using a for loop, which works, but takes a lot of time on very large data frames and large strings. Are there any alternatives to this? Thank you.

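fillvector <- character(nrow(aggSN))   # preallocate the result
for (j in seq_len(nrow(aggSN))) {
  # window of two characters either side of position V1[j]
  fillvector[j] <- substr(gen, aggSN[j, "V1"] - 2, aggSN[j, "V1"] + 2)
}
aggSN$V3 <- fillvector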
5 Comments

  • What happens if boundaries are out of range 1 .. length(gen)? Commented Aug 3, 2015 at 2:24
  • I will edit my post shortly after this comment, but I would want the substring to just cut off. Commented Aug 3, 2015 at 2:26
  • It should be aggSN[j, V1], shouldn't it? Commented Aug 3, 2015 at 2:27
  • Where does the "10" in aggSN come from, after creating V3? Commented Aug 3, 2015 at 2:36
  • Sorry, you're right, I changed it to aggSN[j,V1] Commented Aug 3, 2015 at 2:38

2 Answers

4

You can use substring(), which is vectorized over its start and end positions, so no loop is needed:

aggSN <- data.frame(V1=c(5,6,7,8,2),V2="blah")
gen <- "AJSDAFKSDAFJKLASDFKJKA" 

with(aggSN, substring(gen, V1-2, V1+2))
# [1] "SDAFK" "DAFKS" "AFKSD" "FKSDA" "AJSD" 

So to add the new column,

aggSN$V3 <- with(aggSN, substring(gen, V1-2, V1+2))
aggSN
#   V1   V2    V3
# 1  5 blah SDAFK
# 2  6 blah DAFKS
# 3  7 blah AFKSD
# 4  8 blah FKSDA
# 5  2 blah  AJSD
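On the out-of-range question raised in the comments: substring() truncates the window when it runs past either end of the string, which is where the four-character "AJSD" above comes from. A small sketch, using a short made-up string (x and pos are illustrative names, not part of the question's data):

```r
x <- "ABCDEF"        # hypothetical short string, for illustration only
pos <- c(1, 2, 6)    # positions close to both ends of x

# start values below 1 are treated as 1; end values past nchar(x) are truncated
win <- substring(x, pos - 2, pos + 2)
win
# [1] "ABC"  "ABCD" "DEF"
nchar(win)
# [1] 3 4 3
```

So no extra bounds checking should be needed if a shortened substring at the edges is acceptable.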

If you are after something a bit faster, I would go with stringi::stri_sub in place of substring().
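A minimal sketch of that swap, assuming the stringi package is installed. The from and to helper vectors are my own names. Both bounds are clamped explicitly here because stri_sub() interprets negative indices as counting from the end of the string, and I would rather not rely on how it handles 0 or positions past the end:

```r
aggSN <- data.frame(V1 = c(5, 6, 7, 8, 2), V2 = "blah")
gen <- "AJSDAFKSDAFJKLASDFKJKA"

# clamp both bounds to the valid range 1 .. nchar(gen)
from <- pmax(aggSN$V1 - 2, 1)
to   <- pmin(aggSN$V1 + 2, nchar(gen))

if (requireNamespace("stringi", quietly = TRUE)) {
  aggSN$V3 <- stringi::stri_sub(gen, from, to)
} else {
  aggSN$V3 <- substring(gen, from, to)   # base R fallback when stringi is absent
}
aggSN$V3
```

Whether this is actually faster on your data is something to measure rather than assume.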


4 Comments

I guess your answer hadn't loaded. This is the way to go. One of these days I'll learn how to use with() like a pro.
Interesting, I didn't know the difference between substr and substring. My initial attempt used with() and substr, which produced a constant vector, before I switched to sapply. This is definitely better.
@MichaelChirico - I had deleted it for a few minutes then realized I had it right. Sorry about that.
@Ricky I had to ?substr to remind myself. I initially thought it was substr that accepts vector indices. See the examples in ?substr
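
To make the substr/substring difference from these comments concrete, here is a small sketch (pos, one and many are illustrative names). substr() returns a result of the same length as its first argument, so with a single string only the first start/stop pair is used, while substring() recycles to the longest argument:

```r
gen <- "AJSDAFKSDAFJKLASDFKJKA"
pos <- c(5, 6, 7)

# gen has length 1, so substr() returns a single value
one <- substr(gen, pos - 2, pos + 2)
one
# [1] "SDAFK"

# substring() returns one value per position
many <- substring(gen, pos - 2, pos + 2)
many
# [1] "SDAFK" "DAFKS" "AFKSD"

# repeating gen to the length of pos makes substr() behave the same way
identical(substr(rep(gen, length(pos)), pos - 2, pos + 2), many)
# [1] TRUE
```

Assigning the length-one result to a data frame column recycles it, which would explain the constant vector mentioned above.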
2
aggSN$V3 <- sapply(aggSN$V1, function(x) substr(gen, x-2, x+2))

should do the trick.

> aggSN
  V1   V2    V3
1  5 blah SDAFK
2  6 blah DAFKS
3  7 blah AFKSD
4  8 blah FKSDA

With the data from your different example (V1 = 5, 6, 7, 10, 2):

> aggSN
  V1   V2    V3
1  5 blah SDAFK
2  6 blah DAFKS
3  7 blah AFKSD
4 10 blah SDAFJ
5  2 blah  AJSD
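
Note that sapply() still calls substr() once per row at the R level, so it is closer to the original loop than to a single vectorised call. A sketch checking that both routes give the same column, with vapply() swapped in for sapply() so the return type is fixed (that substitution and the names by_row and vectorised are mine):

```r
aggSN <- data.frame(V1 = c(5, 6, 7, 10, 2), V2 = "blah")
gen <- "AJSDAFKSDAFJKLASDFKJKA"

# one substr() call per row; character(1) makes vapply() fail on any other result type
by_row <- vapply(aggSN$V1, function(x) substr(gen, x - 2, x + 2), character(1))

# a single vectorised call over all positions
vectorised <- substring(gen, aggSN$V1 - 2, aggSN$V1 + 2)

identical(by_row, vectorised)
# [1] TRUE
```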

1 Comment

The reproducible example differs from the example where V3 is shown. The latter is what I refer to as the "different example".
