Replace values in one column based on part of text in another dataframe in R

Question

This is my first post on Stack Overflow and English is not my first language, so I apologize in advance for any mistakes in grammar or programming.

I need to replace values in one column of my data frame based on partial matches against values in another data frame. My question is similar to this post here, but in that example all the possible errors are mapped out. In my case, I only need part of the string to decide whether a value should be replaced.

I already tried "if_else" and "grepl" with dplyr. "grepl" works as long as the second data frame has only one row; when I add another row I get an error.

Right now my real data frame has around 30k rows and 33 variables, and the second data frame with the correct values may grow every month, so I'm trying to avoid loops as much as I can.

I made a mock table with random data that simulates what I need:

library(dplyr)


df1 <- data.frame(Supplier = c("AAA","CCC","CCE","DDD","EEE","EED","GGG","HHH","III","JJJ"), 
                  Value = c(100,200,300,400,200, 100,200,40,150,70))
                  
                  
df2 <- data.frame(Supplier =c("CC","EE","GG"), 
                  New_Supplier = c("Red","Blue","Green"))


# Example 1: unfortunately this only works for exact matches:
df1$Supplier <- if_else(df1$Supplier %in% df2$Supplier, df2$New_Supplier, df1$Supplier)

# Example 2: only works when df2 has a single row:
df1$Supplier <- if_else(grepl(df2$Supplier, df1$Supplier), df2$New_Supplier, df1$Supplier)

So I have this on the first data frame:

   Supplier Value
1       AAA   100
2       CCC   200
3       CCE   300
4       DDD   400
5       EEE   200
6       EED   100
7       GGG   200
8       HHH    40
9       III   150
10      JJJ    70

And this on the second data frame:

  Supplier New_Supplier
1       CC          Red
2       EE         Blue
3       GG        Green

My end goal is to have something like this:

   Supplier Value
1       AAA   100
2       Red   200
3       Red   300
4       DDD   400
5      Blue   200
6      Blue   100
7     Green   200
8       HHH    40
9       III   150
10      JJJ    70

Thanks in advance!

akrun · Accepted Answer · 2021-08-19 23:03:48Z


This looks like a case for regex_left_join from the fuzzyjoin package. After the regex_left_join, coalesce the columns so that each row gets its first non-NA element: the replacement from df2 where a pattern matched, otherwise the original supplier.

library(fuzzyjoin)
library(dplyr)
regex_left_join(df1, df2, by = 'Supplier') %>% 
    transmute(Supplier = coalesce(New_Supplier, Supplier.x), Value)

Output:

   Supplier Value
1       AAA   100
2       Red   200
3       Red   300
4       DDD   400
5      Blue   200
6      Blue   100
7     Green   200
8       HHH    40
9       III   150
10      JJJ    70
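
For readers who would rather not add the fuzzyjoin dependency, the same idea can be sketched in base R. This is only a sketch under stated assumptions: the helper name replace_by_pattern is hypothetical, the patterns are treated as literal substrings (fixed = TRUE) rather than regular expressions, and when several patterns match the same supplier the first row of df2 wins. The loop runs over the rows of the lookup table, not over the 30k rows of the main data frame, since grepl itself is vectorised over the suppliers.

```r
df1 <- data.frame(Supplier = c("AAA","CCC","CCE","DDD","EEE","EED","GGG","HHH","III","JJJ"),
                  Value = c(100,200,300,400,200,100,200,40,150,70),
                  stringsAsFactors = FALSE)

df2 <- data.frame(Supplier = c("CC","EE","GG"),
                  New_Supplier = c("Red","Blue","Green"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: replace every element of x that contains one of
# `patterns` with the corresponding entry of `replacements`.
replace_by_pattern <- function(x, patterns, replacements) {
  out <- x
  # Track replaced elements so that the first matching pattern wins
  done <- rep(FALSE, length(x))
  for (i in seq_along(patterns)) {
    # Assumption: literal substring match, not a regular expression
    hit <- !done & grepl(patterns[i], x, fixed = TRUE)
    out[hit] <- replacements[i]
    done <- done | hit
  }
  out
}

result <- df1
result$Supplier <- replace_by_pattern(df1$Supplier, df2$Supplier, df2$New_Supplier)
```

Dropping fixed = TRUE would make the patterns behave as regular expressions, which is closer to what regex_left_join does.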

hello_friend · Accepted Answer · 2021-08-20 02:55:59Z

A Base R approach:

# Coerce zero-length vectors to NA values of the appropriate type:
# zero_to_nas => function()
zero_to_nas <- function(x){
  if(identical(x, character(0))){
    res <- NA_character_ 
  }else if(identical(x, integer(0))){
    res <- NA_integer_
  }else if(identical(x, numeric(0))){
    res <- NA_real_
  }else if(identical(x, complex(0))){
    res <- NA_complex_
  }else if(identical(x, logical(0))){
    res <- NA
  }else{
    res <- x
  }
  
  # If the result is Null return the vector:
  if(is.null(res)){
    res <- x
  }else{
    invisible() 
  }
  
  # Explicitly define returned object: vector => Global Env
  return(res)
  
}

# Unlist handling 0 length vectors: list_2_vec => function()
list_2_vec <- function(lst){
  # Unlist cleaned list: res => vector
  res <- unlist(lapply(lst, zero_to_nas))
  # Explicitly define returned object: vector => Global Env
  return(res)
}

# Function to perform a fuzzy match: 
# fuzzy_match => function()
fuzzy_match <- function(vec_to_match_to, vec_to_match_on){
  # Perform a fuzzy match: res => character vector:
  res <- list_2_vec(
    regmatches(
      vec_to_match_to, 
      gregexpr(
        paste0(
          vec_to_match_on, 
          collapse = "|"
        ),
        vec_to_match_to
      )
    )
  )
  # Explicitly define returned object: 
  # character vector => Global Env
  return(res)
}

# Function to coalesce vectors: br_coalesce => function()
br_coalesce <- function(vec, ..., to_vec = TRUE){
  
  # Coalesce the vectors: res_ir => list
  res_ir <- apply(
    cbind(
        as.list(...), 
        as.list(vec)
      ),
    1,
    function(x){
      head(zero_to_nas(x[!(is.na(x))]), 1)
    }
  )
  
  # If the result is null return the original vector:
  if(is.null(unlist(res_ir))){
    res_ir <- vec
  }else{
    invisible() 
  }

  # If we want the result to be a vector rather than a list:
  if(isTRUE(to_vec)){
    # Unlist the resultant list: res => vector
    res <- unlist(res_ir)
    # Otherwise
  }else{
    # Keep the result as a list: res => list
    res <- res_ir
  }
  
  # Explicitly define returned object: 
  # list or vector => Global Env
  return(res)
  
}

# Apply the fuzzy match and coalesce functions: 
# clean_df => data.frame
clean_df <- transform(
  df1, 
  Supplier = br_coalesce(
    df1$Supplier, 
    df2$New_Supplier[
      match(
        fuzzy_match(
          df1$Supplier, 
          df2$Supplier
        ),
        df2$Supplier
      )
    ]
  )
)
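
If each supplier is expected to contain at most one of the patterns, the lookup performed by the helpers above can be sketched more compactly. The names pattern, key and mapped are illustrative and not part of the answer, and the sketch assumes the Supplier column has no missing values. Using regexpr keeps only the first match per string, which is an assumption about the desired behaviour.

```r
df1 <- data.frame(Supplier = c("AAA","CCC","CCE","DDD","EEE","EED","GGG","HHH","III","JJJ"),
                  Value = c(100,200,300,400,200,100,200,40,150,70),
                  stringsAsFactors = FALSE)

df2 <- data.frame(Supplier = c("CC","EE","GG"),
                  New_Supplier = c("Red","Blue","Green"),
                  stringsAsFactors = FALSE)

# Combine all lookup keys into a single alternation pattern, e.g. "CC|EE|GG"
pattern <- paste(df2$Supplier, collapse = "|")

# Position of the first match in each supplier name (-1 when nothing matches)
m <- regexpr(pattern, df1$Supplier)

# Extract the matched key per row; rows without a match stay NA
key <- rep(NA_character_, nrow(df1))
key[m > 0] <- regmatches(df1$Supplier, m)

# Look up the replacement for each key and fall back to the original value
mapped <- df2$New_Supplier[match(key, df2$Supplier)]
clean_df <- transform(df1, Supplier = ifelse(is.na(mapped), Supplier, mapped))
```

Because the keys are pasted into a regular expression, any key containing regex metacharacters would need escaping first.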

Data:

df1 <- data.frame(Supplier = c("AAA","CCC","CCE","DDD","EEE","EED","GGG","HHH","III","JJJ"), 
                  Value = c(100,200,300,400,200, 100,200,40,150,70))


df2 <- data.frame(Supplier =c("CC","EE","GG"), 
                  New_Supplier = c("Red","Blue","Green"))
