1

Is there a way where I can find all the partial matches from df_2 to df_1?

partial match (if part of DF_1 string is in the whole string of DF_2) For example, part of "for solution" is in the whole string of "solution"

df_1=data.frame(
  DF_1=c("suspension","tablet","for solution","capsule")
)

df_2=data.frame(
  index=c("1","2","3","4","5"),
  DF_2=c("for suspension", "suspension", "solution", "tablet,ER","tablet,IR")
)

df_out=data.frame(
  DF_1=c("suspension","suspension","tablet","tablet","for solution"),
  DF_2=c("for suspension", "suspension","tablet,ER","tablet,IR","solution"),
  index=c("1","2","4","5","3")
)
5
  • How would you define a partial match? According to your example, would it be "the string chain in df_a is totally contained in df_b"? Commented Aug 4, 2020 at 21:03
  • Does this answer your question? Test if characters are in a string Commented Aug 4, 2020 at 21:03
  • @Arault, I defined a partial match above. If part of my string in DF_1 is in DF_2. so for example, part of "for solution" is in "solution", so that's a match. Commented Aug 4, 2020 at 21:06
  • @Ashti In that case, shouldn't "for solution" merge with "for suspension"? Both have "for" Commented Aug 4, 2020 at 21:11
  • no because part of "for solution" is in "solution" as a whole but not "for suspension" as a whole string. Commented Aug 4, 2020 at 21:13

4 Answers 4

1

Following @Akrun suggestion of using fuzzyjoin

According to your expected output, you want to join twice, and you want to perform inner_join. Finally you'll match twice if there is a perfect match, which is why you want to deduplicate (I did it with distinct from dplyr but you can do it with what you want.

df_out = distinct(
  rbind(
    regex_inner_join(df_1, df_2, by = c("DF_1"= "DF_2")),
    regex_inner_join(df_2, df_1, by = c("DF_2"= "DF_1"))
  )
)
df_out

The output is:

          DF_1 index           DF_2
1   suspension     2     suspension
2 for solution     3       solution
3   suspension     1 for suspension
4       tablet     4      tablet,ER
5       tablet     5      tablet,IR

You find your expected table, not in the same order though (lines & columns).

Sign up to request clarification or add additional context in comments.

Comments

0

This sounds like a job for grepl()!

E.g. grepl(value, chars, fixed = TRUE) Let me quote an example from a different answer:

> chars <- "test"
> value <- "es"
> grepl(value, chars)
[1] TRUE
> chars <- "test"
> value <- "et"
> grepl(value, chars)
[1] FALSE

Comments

0

We can use fuzzyjoin

library(fuzzyjoin)
regex_left_join(df_2, df_1, by = c("DF_2"= "DF_1"))

1 Comment

thanks! this didn't return a match for one of the rows.
0

Here is a base R option using nested *apply + grepl

df_out <- within(
  df_2,
  DF_1 <- unlist(sapply(
    DF_2,
    function(x) {
      Filter(
        Negate(is.na),
        lapply(
          df_1$DF_1,
          function(y) ifelse(grepl(y, x), y, ifelse(grepl(x, y), x, NA))
        )
      )
    }
  ), use.names = FALSE)
)

such that

> df_out
  index           DF_2       DF_1
1     1 for suspension suspension
2     2     suspension suspension
3     3       solution   solution
4     4      tablet,ER     tablet
5     5      tablet,IR     tablet

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.