This is my first post on Stack Overflow and English is not my first language, so I apologize in advance for any mistakes in grammar or programming.

I need to replace values in one column of my data frame based on partial matches against values in another data frame. My question is similar to this post here, but in their example all the possible errors are mapped out in advance. In my case, only part of the string tells me whether a value needs to be replaced.

I already tried "if_else" and "grepl" with dplyr. "grepl" works as long as the second data frame has only one row; as soon as I add another row I get an error.

Right now my real data frame has around 30k rows and 33 variables, and the second data frame with the correct values may grow every month, so I'm trying to avoid loops as much as I can.

I made a mock table with random data to simulate my need:

library(dplyr)


df1 <- data.frame(Supplier = c("AAA","CCC","CCE","DDD","EEE","EED","GGG","HHH","III","JJJ"), 
                  Value = c(100,200,300,400,200, 100,200,40,150,70))
                  
                  
df2 <- data.frame(Supplier =c("CC","EE","GG"), 
                  New_Supplier = c("Red","Blue","Green"))


# Example 1: Unfortunately this won't work unless I have an exact match:
df1$Supplier <- if_else(df1$Supplier %in% df2$Supplier, df2$New_Supplier, df1$Supplier)

# Example 2: Only works if I have one example:
df1$Supplier <- if_else(grepl(df2$Supplier, df1$Supplier), df2$New_Supplier, df1$Supplier)
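For what it's worth, Example 2 seems to break because grepl() expects a single pattern, so handing it all of df2$Supplier at once does not test each pattern in turn, and if_else() then receives a replacement vector (length 3) that does not line up with the 10 rows of df1. Below is a minimal base R sketch of one way around that, not a definitive solution. It assumes the entries of df2$Supplier can be treated as literal substrings (fixed = TRUE) and that the first matching row of df2 wins when several match; the names hits, first_hit and result are only for illustration.

```r
df1 <- data.frame(
  Supplier = c("AAA","CCC","CCE","DDD","EEE","EED","GGG","HHH","III","JJJ"),
  Value = c(100,200,300,400,200,100,200,40,150,70),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Supplier = c("CC","EE","GG"),
  New_Supplier = c("Red","Blue","Green"),
  stringsAsFactors = FALSE
)

# One logical column per pattern: hits[i, j] is TRUE when
# df2$Supplier[j] occurs somewhere in df1$Supplier[i]
# (assumption: patterns are literal substrings, hence fixed = TRUE)
hits <- vapply(
  df2$Supplier,
  function(p) grepl(p, df1$Supplier, fixed = TRUE),
  logical(nrow(df1))
)

# Index of the first matching pattern per row (NA when nothing matches)
first_hit <- apply(hits, 1, function(row) unname(which(row)[1]))

# Keep the original supplier where nothing matched
result <- df1
result$Supplier <- ifelse(
  is.na(first_hit),
  df1$Supplier,
  df2$New_Supplier[first_hit]
)
```

The only iteration here is over the rows of df2 (one grepl() call per pattern), not over the 30k rows of the main data frame, so it should stay cheap while the lookup table is small.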

So I have this on the first data frame:

   Supplier Value
1       AAA   100
2       CCC   200
3       CCE   300
4       DDD   400
5       EEE   200
6       EED   100
7       GGG   200
8       HHH    40
9       III   150
10      JJJ    70

And this on the second data frame:

  Supplier New_Supplier
1       CC          Red
2       EE         Blue
3       GG        Green

My end goal is to have something like this:

   Supplier Value
1       AAA   100
2       Red   200
3       Red   300
4       DDD   400
5      Blue   200
6      Blue   100
7     Green   200
8       HHH    40
9       III   150
10      JJJ    70

Thanks in advance!

2 Answers

This seems to be a case for the fuzzyjoin package and its regex_left_join function. After the regex_left_join, coalesce the columns together so that the first non-NA element is returned for each row.

library(fuzzyjoin)
library(dplyr)
regex_left_join(df1, df2, by = 'Supplier') %>% 
    transmute(Supplier = coalesce(New_Supplier, Supplier.x), Value)

-output

   Supplier Value
1       AAA   100
2       Red   200
3       Red   300
4       DDD   400
5      Blue   200
6      Blue   100
7     Green   200
8       HHH    40
9       III   150
10      JJJ    70
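
To make the coalesce step concrete without any packages, here is a small base R sketch of "take the first non-NA value per row". The two vectors are made up for illustration and only mimic the shape of the New_Supplier and Supplier.x columns that the join leaves behind; this is not how dplyr implements coalesce.

```r
# Hypothetical columns as they might look after the join:
# NA in new_supplier means no pattern matched that row
new_supplier <- c(NA, "Red", "Red", NA, "Blue")
old_supplier <- c("AAA", "CCC", "CCE", "DDD", "EEE")

# Prefer the new value, fall back to the old one where it is NA
coalesced <- ifelse(is.na(new_supplier), old_supplier, new_supplier)
```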

A Base R approach:

# Coerce 0 length vectors to na values of the appropriate type: 
# zero_to_nas => function()
zero_to_nas <- function(x){
  if(identical(x, character(0))){
    res <- NA_character_ 
  }else if(identical(x, integer(0))){
    res <- NA_integer_
  }else if(identical(x, numeric(0))){
    res <- NA_real_
  }else if(identical(x, complex(0))){
    res <- NA_complex_
  }else if(identical(x, logical(0))){
    res <- NA
  }else{
    res <- x
  }
  
  # If the result is Null return the vector:
  if(is.null(res)){
    res <- x
  }else{
    invisible() 
  }
  
  # Explicitly define returned object: vector => Global Env
  return(res)
  
}

# Unlist handling 0 length vectors: list_2_vec => function()
list_2_vec <- function(lst){
  # Unlist cleaned list: res => vector
  res <- unlist(lapply(lst, zero_to_nas))
  # Explicitly define return object: vector => GlobalEnv()
  return(res)
}

# Function to perform a fuzzy match: 
# fuzzy_match => function()
fuzzy_match <- function(vec_to_match_to, vec_to_match_on){
  # Perform a fuzzy match: res => character vector:
  res <- list_2_vec(
    regmatches(
      vec_to_match_to, 
      gregexpr(
        paste0(
          vec_to_match_on, 
          collapse = "|"
        ),
        vec_to_match_to
      )
    )
  )
  # Explicitly define returned object: 
  # character vector => Global Env
  return(res)
}

# Function to coalesce vectors: br_coalesce => function()
br_coalesce <- function(vec, ..., to_vec = TRUE){
  
  # Coalesce the vectors: res_ir => list
  res_ir <- apply(
    cbind(
        as.list(...), 
        as.list(vec)
      ),
    1,
    function(x){
      head(zero_to_nas(x[!(is.na(x))]), 1)
    }
  )
  
  # If the result is null return the original vector:
  if(is.null(unlist(res_ir))){
    res_ir <- vec
  }else{
    invisible() 
  }

  # If we want the result to be a vector rather than a list then:
  if(isTRUE(to_vec)){
    # Unlist the resultant list: res => vector
    res <- unlist(res_ir)
    # Otherwise
  }else{
    # Deep copy the list: res => list
    res <- res_ir
  }
  
  # Explicitly define returned object: 
  # list or vector => Global Env
  return(res)
  
}

# Apply the fuzzy match and coalesce functions: 
# clean_df => data.frame
clean_df <- transform(
  df1, 
  Supplier = br_coalesce(
    df1$Supplier, 
    df2$New_Supplier[
      match(
        fuzzy_match(
          df1$Supplier, 
          df2$Supplier
        ),
        df2$Supplier
      )
    ]
  )
)
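
For comparison, the same idea (extract the matching key with a combined regex, look it up with match(), then fall back to the original value) can be condensed quite a bit. This is only a sketch under a few assumptions: each supplier matches at most one pattern that matters (regexpr() returns the first match only), the entries of df2$Supplier are valid regular expressions, and the names pattern, key, new_name and clean_df2 are made up for illustration.

```r
df1 <- data.frame(
  Supplier = c("AAA","CCC","CCE","DDD","EEE","EED","GGG","HHH","III","JJJ"),
  Value = c(100,200,300,400,200,100,200,40,150,70),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Supplier = c("CC","EE","GG"),
  New_Supplier = c("Red","Blue","Green"),
  stringsAsFactors = FALSE
)

# Combine all keys into one alternation: "CC|EE|GG"
pattern <- paste(df2$Supplier, collapse = "|")

# Position of the first match per element (-1 when there is none)
m <- regexpr(pattern, df1$Supplier)

# Matched key per row, NA where nothing matched
key <- rep(NA_character_, nrow(df1))
key[m > 0] <- regmatches(df1$Supplier, m)

# Look up the replacement and fall back to the original supplier
new_name <- df2$New_Supplier[match(key, df2$Supplier)]
clean_df2 <- transform(
  df1,
  Supplier = ifelse(is.na(new_name), Supplier, new_name)
)
```

Because regexpr() yields at most one match per element, key always has one entry per row, which avoids the zero-length handling needed with gregexpr().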

Data:

df1 <- data.frame(Supplier = c("AAA","CCC","CCE","DDD","EEE","EED","GGG","HHH","III","JJJ"), 
                  Value = c(100,200,300,400,200, 100,200,40,150,70))


df2 <- data.frame(Supplier =c("CC","EE","GG"), 
                  New_Supplier = c("Red","Blue","Green"))
