
In a dataframe I have a character column (one word per row) where each word can appear multiple times:

word = c(
   "OMEPRAZOL",
   "PARACETAMOL",
   "HIDROFEROL",
   "ENALAPRIL",
   "PARACETAMOL",
   "NOISE"
)

In a different dataframe I have a column with strings and another with an associated ID code:

string_code = data.frame(
   string = c(
   "OMEPRAZOL XXXX",
   "OMEPRAZOL YYYY",
   "PARACETAMOL/A XXXX",
   "PARACETAMOL/B YYYY",
   "HIDROFEROL XXXX",
   "ENALAPRIL XXXX",
   "ENALAPRIL YYYY"),
   code = c(
   "11",
   "11",
   "22",
   "22",
   "33",
   "44",
   "44")
)

I would like to look up each element of word in string_code$string and, when there is a match, get back the associated ID from string_code$code (only the first match, since there might be several and the ID is the same anyway), or NA if there is no match. The desired result is:

word_code = data.frame(
   word = c(
   "OMEPRAZOL",
   "PARACETAMOL",
   "HIDROFEROL",
   "ENALAPRIL",
   "PARACETAMOL",
   "NOISE"),
   code = c(
   "11",
   "22",
   "33",
   "44",
   "22",
   NA)
)

2 Answers


This is a potential application for regex_full_join() from the fuzzyjoin package.

With dplyr loaded (it provides %>%, select() and distinct()), try

    fuzzyjoin::regex_full_join(string_code, word) %>% select(-1) %>% distinct()

to obtain

    > fuzzyjoin::regex_full_join(string_code, word) %>% select(-1) %>% distinct
    Joining by: "string"
      code    string.y
    1   11   OMEPRAZOL
    2   22 PARACETAMOL
    3   33  HIDROFEROL
    4   44   ENALAPRIL
    5 <NA>       NOISE

Note that you first need to turn word into a data frame whose column is named string, so that both inputs share the join column:

    word <- as.data.frame(word)
    colnames(word) <- "string"

2 Comments

Thanks so much, this seems to be working fine. But now I have another issue: "Error: cannot allocate vector of size X Gb"... I guess this is a different question, but now that I'm here... what would you recommend? Splitting the large string_code df? Or maybe just keeping the first X characters of string_code$string, where X = max(nchar(word))?
You could split the dataframe and apply the solution above to each part. I wouldn't truncate the strings: with a dataframe that large, you may not know precisely what you are deleting.
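
As a rough illustration of the splitting idea from the comment, here is a minimal sketch that processes string_code in row chunks. Several things in it are assumptions rather than part of the original answer: the chunk_size value, the helper names patterns, chunk_id and hits, the use of regex_inner_join() instead of regex_full_join() (so each chunk returns only its matches), and deduplicating word before joining. It also assumes every chunk contains at least one match, as is the case with this toy data.

```r
library(fuzzyjoin)

word <- c("OMEPRAZOL", "PARACETAMOL", "HIDROFEROL",
          "ENALAPRIL", "PARACETAMOL", "NOISE")

string_code <- data.frame(
  string = c("OMEPRAZOL XXXX", "OMEPRAZOL YYYY",
             "PARACETAMOL/A XXXX", "PARACETAMOL/B YYYY",
             "HIDROFEROL XXXX", "ENALAPRIL XXXX", "ENALAPRIL YYYY"),
  code = c("11", "11", "22", "22", "33", "44", "44"),
  stringsAsFactors = FALSE
)

# Hypothetical chunk size; pick whatever fits in memory.
chunk_size <- 3

# Join against the unique words only, so duplicates do not inflate the join.
patterns <- data.frame(string = unique(word), stringsAsFactors = FALSE)

# Assign each row of string_code to a chunk: 1, 1, 1, 2, 2, 2, 3.
chunk_id <- ceiling(seq_len(nrow(string_code)) / chunk_size)

# Run the regex join per chunk and keep one (word, code) pair per match.
hits <- do.call(rbind, lapply(split(string_code, chunk_id), function(chunk) {
  joined <- regex_inner_join(chunk, patterns, by = "string")
  unique(as.data.frame(joined)[, c("string.y", "code")])
}))

# Map the codes back onto the original (duplicated) words; no match gives NA.
word_code <- data.frame(
  word = word,
  code = hits$code[match(word, hits$string.y)],
  stringsAsFactors = FALSE
)
```

Because match() returns the first hit only, a word that appears in several chunks still gets a single code, and words that never match (here NOISE) end up as NA.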

You could also simply do

word_code <- data.frame(word=word, code=sapply(word, function(w){string_code$code[grep(w, string_code$string)[1]]}))
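
The one-liner works because grep() returns integer(0) when nothing matches, and indexing with integer(0)[1] yields NA. Below is a self-contained variant of the same idea, offered as a sketch rather than a replacement. The changes are assumptions on my part: first_code is a hypothetical helper name, vapply() is used instead of sapply() to guarantee a character result, and fixed = TRUE treats each word as a literal substring rather than a regular expression, which only makes sense if the words are plain text as in the question.

```r
word <- c("OMEPRAZOL", "PARACETAMOL", "HIDROFEROL",
          "ENALAPRIL", "PARACETAMOL", "NOISE")

string_code <- data.frame(
  string = c("OMEPRAZOL XXXX", "OMEPRAZOL YYYY",
             "PARACETAMOL/A XXXX", "PARACETAMOL/B YYYY",
             "HIDROFEROL XXXX", "ENALAPRIL XXXX", "ENALAPRIL YYYY"),
  code = c("11", "11", "22", "22", "33", "44", "44"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: code of the first string containing w, or NA.
first_code <- function(w) {
  hits <- grep(w, string_code$string, fixed = TRUE)  # literal substring match
  string_code$code[hits[1]]                          # hits[1] is NA if no match
}

word_code <- data.frame(
  word = word,
  code = vapply(word, first_code, character(1), USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
```

Unlike the fuzzyjoin answer with distinct, this keeps one row per element of word, so the repeated PARACETAMOL appears twice, as in the desired output.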
