Detect string in a different dataframe, return value from column in R

Question

In a data frame I have a character column (one word per row) where each word can appear multiple times:

word = c(
   "OMEPRAZOL",
   "PARACETAMOL",
   "HIDROFEROL",
   "ENALAPRIL",
   "PARACETAMOL",
   "NOISE"
)

In a different dataframe I have a column with strings and another with an associated ID code:

string_code = data.frame(
   string = c(
   "OMEPRAZOL XXXX",
   "OMEPRAZOL YYYY",
   "PARACETAMOL/A XXXX",
   "PARACETAMOL/B YYYY",
   "HIDROFEROL XXXX",
   "ENALAPRIL XXXX",
   "ENALAPRIL YYYY"),
   code = c(
   "11",
   "11",
   "22",
   "22",
   "33",
   "44",
   "44")
)

I would like to look up each element of word in string_code$string and, when there is a match, return the associated ID from string_code$code (only the first match, since there may be several and the ID is the same anyway), or NA if there is no match. The expected result:

word_code = data.frame(
   word = c(
   "OMEPRAZOL",
   "PARACETAMOL",
   "HIDROFEROL",
   "ENALAPRIL",
   "PARACETAMOL",
   "NOISE"),
   code = c(
   "11",
   "22",
   "33",
   "44",
   "22",
   NA)
)

Taufi · Accepted Answer · 2021-04-21 17:41:57Z

1

This is a potential application for regex_full_join() from the fuzzyjoin package.

Try

    library(dplyr)  # provides %>%, select() and distinct()
    fuzzyjoin::regex_full_join(string_code, word) %>% select(-1) %>% distinct

to obtain

    > fuzzyjoin::regex_full_join(string_code, word) %>% select(-1) %>% distinct
    Joining by: "string"
      code    string.y
    1   11   OMEPRAZOL
    2   22 PARACETAMOL
    3   33  HIDROFEROL
    4   44   ENALAPRIL
    5 <NA>       NOISE

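Note that you first need to turn word into a data frame whose column is named string, so that the join key matches the string column of string_code:

    word <- as.data.frame(word)
    colnames(word) <- "string"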
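The join result above has one row per distinct word, while the question asks for one row per element of the original vector (PARACETAMOL appears twice). Below is a minimal sketch of mapping the codes back. It assumes the join result is stored in a data frame called lookup (a hypothetical name) with the code and string.y columns from the output above. The data frame is rebuilt by hand here so the snippet runs without fuzzyjoin, and word_vec is likewise a hypothetical name for the original character vector.

```r
# Original word vector from the question, repeats included.
word_vec <- c("OMEPRAZOL", "PARACETAMOL", "HIDROFEROL",
              "ENALAPRIL", "PARACETAMOL", "NOISE")

# Hand-built stand-in for the distinct join result (assumed name: lookup).
lookup <- data.frame(
  code     = c("11", "22", "33", "44", NA),
  string.y = c("OMEPRAZOL", "PARACETAMOL", "HIDROFEROL", "ENALAPRIL", "NOISE"),
  stringsAsFactors = FALSE
)

# match() returns, for each word, its row position in lookup; indexing
# code by that position repeats the code wherever the word repeats.
word_code <- data.frame(
  word = word_vec,
  code = lookup$code[match(word_vec, lookup$string.y)],
  stringsAsFactors = FALSE
)
```

word_code then has six rows, in the same order as the original vector.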


2 Comments

supagleech Over a year ago

Thanks so much, this seems to be working fine. But now I have another issue: "Error: cannot allocate vector of size X Gb". I guess this is a different question, but since I'm here: what would you recommend? Splitting the large string_code df? Or maybe keeping only the first X characters of string_code$string, where X = max(nchar(word))?

Taufi Over a year ago

You could split the dataframe and apply the solution above to each part. I wouldn't truncate the strings: given the size of the data, you may not know precisely what you are deleting.
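One way the splitting idea might look, as a rough sketch rather than a tested recipe for large data. To stay dependency-free, the per-chunk step uses base grep() with literal matching instead of the fuzzyjoin call, which is a substitution made here and not what the answer itself ran. chunk_size and the other helper names are hypothetical, and the value 3 is only for illustration.

```r
word <- c("OMEPRAZOL", "PARACETAMOL", "HIDROFEROL",
          "ENALAPRIL", "PARACETAMOL", "NOISE")
string_code <- data.frame(
  string = c("OMEPRAZOL XXXX", "OMEPRAZOL YYYY", "PARACETAMOL/A XXXX",
             "PARACETAMOL/B YYYY", "HIDROFEROL XXXX", "ENALAPRIL XXXX",
             "ENALAPRIL YYYY"),
  code = c("11", "11", "22", "22", "33", "44", "44"),
  stringsAsFactors = FALSE
)

chunk_size <- 3  # illustrative; choose a size that fits in memory
# Assign consecutive rows of string_code to chunks of chunk_size rows.
chunk_id <- ceiling(seq_len(nrow(string_code)) / chunk_size)
chunks <- split(string_code, chunk_id)

# One slot per distinct word; NA means "no match found so far".
codes <- setNames(rep(NA_character_, length(unique(word))), unique(word))

for (chunk in chunks) {
  # Only search for words that earlier chunks did not resolve, so the
  # first match in row order wins.
  for (w in names(codes)[is.na(codes)]) {
    hit <- grep(w, chunk$string, fixed = TRUE)[1]
    if (!is.na(hit)) codes[[w]] <- chunk$code[hit]
  }
}

# Expand back to one row per element of word, repeats included.
word_code <- data.frame(word = word, code = unname(codes[word]),
                        stringsAsFactors = FALSE)
```

Because each distinct word is looked up once and dropped from the search as soon as it is resolved, later chunks do progressively less work.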

Simon · Accepted Answer · 2021-04-21 17:43:57Z

0

You could also simply do

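    word_code <- data.frame(
      word = word,
      code = sapply(word, function(w) {
        # Row of the first string containing w (literal match); grep() finds
        # nothing for unmatched words, so [1] is NA and the code is NA too.
        string_code$code[grep(w, string_code$string, fixed = TRUE)[1]]
      }, USE.NAMES = FALSE)
    )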
