I have a vector of strings, in the following format:

strings <- c("UUDBK", "KUVEB", "YVCYE")

I also have a data frame like this:

replacewith <- c(8, 4, 2)
searchhere <- c("UUDBK, YVCYE, KUYVE, IHVYV, IYVEK", "KUVEB, UGEVB", "KUEBN, IHBEJ, KHUDN")
dataframe <- data.frame(replacewith, searchhere)

I want each element of strings to be replaced with the value from the "replacewith" column of the row whose "searchhere" entry contains it. Currently I am doing it this way:

final <- sapply(as.character(strings), function(x)
as.numeric(dataframe[grep(x, dataframe$searchhere), 1]))

However, this is far too slow for a character vector of length 10^9.

What is a better way to do this?

Thanks!

2 Answers

Similar to @AntoniosK's idea, but this uses hashmap to map the strings to their values. hashmap is implemented with Rcpp internally, so it is very fast:

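library(hashmap)
library(tidyr)

# Split searchhere into one key per row (long format)
search_replace = separate_rows(dataframe, searchhere)

# Keys: the individual strings; values: the matching replacewith
search_hash = hashmap(search_replace$searchhere, search_replace$replacewith)

# Vectorized lookup: one value per element of strings
search_hash[[strings]]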

Results:

> search_hash
## (character) => (numeric)  
##     [KHUDN] => [+2.000000]
##     [KUEBN] => [+2.000000]
##     [UGEVB] => [+4.000000]
##     [KUVEB] => [+4.000000]
##     [IYVEK] => [+8.000000]
##     [IHVYV] => [+8.000000]
##       [...] => [...] 

> search_hash[[strings]]
[1] 8 4 8
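
For readers who cannot install hashmap, the same lookup idea can be sketched in base R with a named vector. This is an alternative technique, not the method used above: `keys` and `lookup` are hypothetical names, and the sketch assumes the entries in searchhere are always separated by a comma followed by a space.

```r
replacewith <- c(8, 4, 2)
searchhere <- c("UUDBK, YVCYE, KUYVE, IHVYV, IYVEK", "KUVEB, UGEVB", "KUEBN, IHBEJ, KHUDN")
strings <- c("UUDBK", "KUVEB", "YVCYE")

# Split each searchhere entry into its individual keys (assumes ", " as separator)
keys <- strsplit(searchhere, ", ", fixed = TRUE)

# Named vector: names are the keys, values are the matching replacewith,
# repeated once per key in the corresponding row
lookup <- setNames(rep(replacewith, lengths(keys)), unlist(keys))

# Index by name; unname() drops the key labels from the result
final <- unname(lookup[strings])
final
```

A string that is absent from the table comes back as NA under this approach, so it may be worth checking anyNA(final) before using the result.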

Benchmarks:

> OP_func = function(){sapply(as.character(strings), function(x)
    as.numeric(dataframe[grep(x,dataframe$searchhere), 1]))}

Unit: microseconds
                           expr     min       lq      mean   median      uq      max neval
                      OP_func() 121.191 124.9410 190.36472 129.8760 151.193 3370.047   100
 d[d$searchhere %in% strings, ]  36.714  40.6605  52.85093  43.8185  61.583  147.246   100
         search_hash[[strings]]  14.212  18.1590  25.05212  21.5150  29.608   58.820   100

Also note that @AntoniosK's solution does not work if strings contains duplicates: %in% filters the rows of the lookup table, so it returns one row per matching key, in table order. hashmap returns the correct value for each element, in the correct position.

Example:

> strings_large = sample(search_replace$searchhere, 100, replace = TRUE)
> strings_large
  [1] "YVCYE" "KUVEB" "KUYVE" "KHUDN" "KUYVE" "KHUDN" "KUEBN" "UUDBK" "KHUDN" "YVCYE" "IYVEK"
 [12] "KUEBN" "KHUDN" "IHBEJ" "YVCYE" "KHUDN" "KUEBN" "UGEVB" "UUDBK" "KUYVE" "KHUDN" "IHBEJ"
 [23] "IHVYV" "KUVEB" "IYVEK" "KHUDN" "KHUDN" "KUYVE" "YVCYE" "UUDBK" "KUYVE" "IHVYV" "KUYVE"
 [34] "KUEBN" "KUYVE" "UUDBK" "KUYVE" "KUVEB" "KUVEB" "YVCYE" "KUYVE" "KHUDN" "KUVEB" "YVCYE"
 [45] "IHBEJ" "YVCYE" "KHUDN" "UUDBK" "KUEBN" "IYVEK" "IHVYV" "UUDBK" "KUYVE" "KUEBN" "YVCYE"
 [56] "UGEVB" "YVCYE" "KUYVE" "IHVYV" "KUEBN" "IHVYV" "IHBEJ" "KUVEB" "IHVYV" "KUYVE" "KUEBN"
 [67] "IYVEK" "KUVEB" "KUEBN" "UGEVB" "KUEBN" "KUVEB" "IHBEJ" "KUYVE" "YVCYE" "YVCYE" "IHVYV"
 [78] "YVCYE" "KHUDN" "KHUDN" "YVCYE" "IYVEK" "KUYVE" "KHUDN" "UGEVB" "YVCYE" "IHVYV" "KUVEB"
 [89] "IYVEK" "KUEBN" "UGEVB" "UUDBK" "IYVEK" "IHBEJ" "IHBEJ" "UUDBK" "KUVEB" "UGEVB" "IYVEK"
[100] "IYVEK"

> search_hash[[strings_large]]
  [1] 8 4 8 2 8 2 2 8 2 8 8 2 2 2 8 2 2 4 8 8 2 2 8 4 8 2 2 8 8 8 8 8 8 2 8 8 8 4 4 8 8 2 4 8
 [45] 2 8 2 8 2 8 8 8 8 2 8 4 8 8 8 2 8 2 4 8 8 2 8 4 2 4 2 4 2 8 8 8 8 8 2 2 8 8 8 2 4 8 8 4
 [89] 8 2 4 8 8 2 2 8 4 4 8 8
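
The duplicate problem with the %in% approach can be reproduced in a few lines. This is a minimal sketch under stated assumptions: `d` is a hand-written subset of the long-format lookup table, and `strings_dup` and `filtered` are hypothetical names introduced only for illustration.

```r
# Hand-written subset of the long-format lookup table
d <- data.frame(
  replacewith = c(8, 8, 4),
  searchhere  = c("UUDBK", "YVCYE", "KUVEB"),
  stringsAsFactors = FALSE
)

# Hypothetical input containing a repeated element
strings_dup <- c("KUVEB", "UUDBK", "KUVEB")

# Filtering the table returns rows in table order, one per matching key,
# so the repeated "KUVEB" contributes only a single value
filtered <- d[d$searchhere %in% strings_dup, "replacewith"]

# Compare the number of values returned with the number of inputs
length(filtered)
length(strings_dup)
```

The two lengths differ, and the values no longer line up with the input positions, which is why a per-element lookup is needed when the input has repeats.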

library(tidyr)

strings <- c("UUDBK", "KUVEB", "YVCYE")

replacewith <- c(8, 4, 2)
searchhere <- c("UUDBK, YVCYE, KUYVE, IHVYV, IYVEK", "KUVEB, UGEVB", "KUEBN, IHBEJ, KHUDN")
dataframe <- data.frame(replacewith, searchhere, stringsAsFactors = FALSE)

# split strings to one row each
# like a look up table
d = separate_rows(dataframe, searchhere)

# get the number based on the look up table
d[d$searchhere %in% strings,]

#   replacewith searchhere
# 1           8      UUDBK
# 2           8      YVCYE
# 6           4      KUVEB

Not sure if you like this format, but you can always reshape it.
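
One possible way to reshape it into a vector aligned with strings is base R's match(), which returns, for each element of its first argument, the position of its first match in the second. The following is only a sketch: it assumes each key appears once in the lookup table, adds a duplicate to strings for illustration, and rebuilds `d` with strsplit() (assuming ", " as the separator) so that it runs without tidyr.

```r
strings <- c("UUDBK", "KUVEB", "YVCYE", "KUVEB")  # duplicate added for illustration

dataframe <- data.frame(
  replacewith = c(8, 4, 2),
  searchhere = c("UUDBK, YVCYE, KUYVE, IHVYV, IYVEK", "KUVEB, UGEVB", "KUEBN, IHBEJ, KHUDN"),
  stringsAsFactors = FALSE
)

# Build the long-format lookup table: one row per key
keys <- strsplit(dataframe$searchhere, ", ", fixed = TRUE)
d <- data.frame(
  replacewith = rep(dataframe$replacewith, lengths(keys)),
  searchhere = unlist(keys),
  stringsAsFactors = FALSE
)

# For each element of strings, find its row in d and take that row's value
final <- d$replacewith[match(strings, d$searchhere)]
final
```

Because match() works element by element, the result has the same length and order as strings, and strings missing from the table yield NA.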

6 Comments

final shows only UUDBK KUVEB YVCYE 8 4 8. Am I missing something?
@AntoniosK Are you talking about my variable final in the original question? That is the desired output, a vector which now has the replaced values in it.
That was a reply to @RichScriven, because he mentioned something before. Does my code work for you?
Isn't working because I have duplicates in strings, like @useR mentioned. Thanks for the help though!
You're welcome. Happy to help. Your question/example should be representative of your real data. You didn't mention anything about duplicates. However, there's a way to modify the look up table to have unique values of strings depending on how you want to treat duplicates.