R: Replace string when partial match to another column by row

Question

I would like to replace/remove those parts of a string (name) that match other columns (state and city) in my data table.

I managed to identify the rows, e.g. with city, like so: dt %>% filter(str_detect(name, city)), but I am missing a way to use gsub (or grep) with the row-wise value of the column city.

I know that a rather manual approach, such as storing all city names in a vector and passing them to gsub, would work, but it would also wrongly remove the "dallas" in row 2. (This was manageable for states, though, and could be combined with gsub to also remove "of".)
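To make the row-wise requirement concrete, here is a minimal base R sketch. It assumes that pairing each name with the city of its own row is what is wanted. The three rows are copied from the data below, and fixed = TRUE (an assumption made here) treats each city as literal text rather than a regular expression.

```r
# Three rows copied from the question's data
city <- c("arecibo", "arecibo", "dallas")
name <- c("frutas of pr arecibo", "dallas frutas of pr", "cowboy shoes dallas tx")

# mapply() pairs city[i] with name[i], so each name is only cleaned with the
# city of its own row; fixed = TRUE matches the city as literal text.
pairwise <- mapply(function(p, x) gsub(p, "", x, fixed = TRUE),
                   city, name, USE.NAMES = FALSE)

print(pairwise)
```

Because row 2 is paired with "arecibo" rather than with every known city, its "dallas" is left alone. Leftover spaces still need trimming afterwards.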

Data and desired output

library(data.table)

dt <- data.table(city = c("arecibo", "arecibo", "cabo rojo", "new york", "dallas"),
                 state = c("pr", "pr", "pr", "ny", "tx"),
                 name = c("frutas of pr arecibo", "dallas frutas of pr", "cabo rojo metal plant", "greens new york", "cowboy shoes dallas tx"),
                 desired = c("frutas", "dallas frutas", "metal plant", "greens", "cowboy shoes"))

GuedesBF · Accepted Answer · 2021-11-30 11:32:58Z

3

With dplyr, we can use rowwise. First, collapse all the words to remove into a single pattern joined by the OR metacharacter (as in 'arecibo|pr|of'), then call str_remove_all with that pattern. Finally, trim the remaining whitespace with trimws.

library(dplyr)
library(stringr)

dt %>%
    rowwise() %>%
    mutate(desired_2 = str_remove_all(name, paste(c(city, state, 'of'), collapse = '|')) %>%
               trimws())

# A tibble: 5 × 5
# Rowwise: 
  city      state name                   desired       desired_2    
  <chr>     <chr> <chr>                  <chr>         <chr>        
1 arecibo   pr    frutas of pr arecibo   frutas        frutas       
2 arecibo   pr    dallas frutas of pr    dallas frutas dallas frutas
3 cabo rojo pr    cabo rojo metal plant  metal plant   metal plant  
4 new york  ny    greens new york        greens        greens       
5 dallas    tx    cowboy shoes dallas tx cowboy shoes  cowboy shoes
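The pattern above matches anywhere in the string, so a state code such as "pr" would also be stripped from inside a longer word. The following is a sketch of one possible variation, not part of the original answer: it wraps the alternatives in \\b word boundaries and relies on stringr being vectorised over the pattern, so rowwise is not needed. Row 6 and the column name desired_3 are made up for illustration.

```r
library(stringr)

# Rows 1-5 come from the question; row 6 is a made-up name added here to
# show what happens when the state code also appears inside a longer word.
df <- data.frame(
  city  = c("arecibo", "arecibo", "cabo rojo", "new york", "dallas", "arecibo"),
  state = c("pr", "pr", "pr", "ny", "tx", "pr"),
  name  = c("frutas of pr arecibo", "dallas frutas of pr",
            "cabo rojo metal plant", "greens new york",
            "cowboy shoes dallas tx", "produce of pr arecibo")
)

# One pattern per row; \\b restricts each alternative to whole words.
pattern <- paste0("\\b(", df$city, "|", df$state, "|of)\\b")

# str_remove_all() pairs name[i] with pattern[i]; str_squish() trims the ends
# and collapses the runs of spaces left behind by the removals.
df$desired_3 <- str_squish(str_remove_all(df$name, pattern))

print(df$desired_3)
```

With the boundaries in place, "produce" in row 6 survives intact. City or state values containing regex metacharacters would need escaping first, which this sketch does not handle.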

edited Nov 30, 2021 at 11:32

answered Nov 30, 2021 at 10:08

GuedesBF


1 Comment

GuedesBF Over a year ago

Sorry, we have to use rowwise to achieve that, which I used originally, but eventually removed. It is ok now. Please test the updated answer now.

s_baldur · Accepted Answer · 2021-11-30 11:35:54Z

2

A data.table solution:

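# Helper: remove the first match of rmv[i] from string[i], element by element
subxy <- function(string, rmv) mapply(function(x, y) sub(x, '', y), rmv, string)

# Strip city, then state, then 'of', and trim the leftover whitespace
dt[, desired2 := name |> subxy(city) |> subxy(state) |> subxy('of') |> trimws()]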

#         city state                   name       desired      desired2
# 1:   arecibo    pr   frutas of pr arecibo        frutas        frutas
# 2:   arecibo    pr    dallas frutas of pr dallas frutas dallas frutas
# 3: cabo rojo    pr  cabo rojo metal plant   metal plant   metal plant
# 4:  new york    ny        greens new york        greens        greens
# 5:    dallas    tx cowboy shoes dallas tx  cowboy shoes  cowboy shoes

edited Nov 30, 2021 at 11:35

answered Nov 30, 2021 at 10:20

s_baldur


4 Comments

s_baldur Over a year ago

@Magasinus You are using an old version of R, but as a workaround you can replace \(x, y) with function(x, y)

s_baldur Over a year ago

and |> with the magrittr pipe %>%.

Magasinus Over a year ago

This solution is great but has the same issue when applied to real data: it also strips the state abbreviation from real names that begin with it. Is there a more elegant way to account for this than:

subxy <-  function(string, rmv) mapply(function(x, y) sub(paste(" ",x," ",sep=""), '', paste(" ",y," ",sep="")), rmv, string)

and does this really do the job?

s_baldur Over a year ago

@Magasinus You can take that into account by requiring a space before the state name, simply by passing paste0(' ', state) to subxy().
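The whole-word idea raised in these comments can also be sketched with regex word boundaries instead of padding with spaces. This is an illustration under assumptions, not the answerer's code: subxy_word is a hypothetical variant of the helper, and the sample names are invented.

```r
# Variant of the subxy() helper above: the term is wrapped in \\b so that it
# only matches as a whole word. perl = TRUE selects the PCRE engine.
subxy_word <- function(string, rmv) {
  mapply(
    function(x, y) sub(paste0("\\b", x, "\\b"), "", y, perl = TRUE),
    rmv, string,
    USE.NAMES = FALSE
  )
}

# Made-up names for illustration: "prime" starts with the state code "pr".
name  <- c("prime frutas pr", "frutas of pr arecibo")
state <- c("pr", "pr")

# Remove the state as a whole word, collapse repeated spaces, trim the ends.
cleaned <- trimws(gsub(" +", " ", subxy_word(name, state)))

print(cleaned)
```

Unlike the leading-space approach, this also handles a state code at the very start of the string, while "prime" is left untouched.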

MonJeanJean · Accepted Answer · 2021-11-30 10:10:27Z

2

Here's a solution, though it can probably be done faster with gsub-based methods:

library(tidyverse)

dt %>%
  mutate(test = str_remove_all(name, city)) %>%                        # drop the row's city
  mutate(test = str_remove_all(test, paste(" of ", state, sep = ""))) %>%  # drop " of <state>"
  mutate(test = str_remove_all(test, state)) %>%                       # drop any remaining state
  mutate(test = str_remove_all(test, "^ ")) %>%                        # trim a leading space
  mutate(test = str_remove_all(test, " *$"))                           # trim trailing spaces

Output:

        city state                   name       desired          test
1:   arecibo    pr   frutas of pr arecibo        frutas        frutas
2:   arecibo    pr    dallas frutas of pr dallas frutas dallas frutas
3: cabo rojo    pr  cabo rojo metal plant   metal plant   metal plant
4:  new york    ny        greens new york        greens        greens
5:    dallas    tx cowboy shoes dallas tx  cowboy shoes  cowboy shoes

edited Nov 30, 2021 at 10:10

answered Nov 30, 2021 at 9:57

MonJeanJean


3 Comments

Magasinus Over a year ago

Can we combine this to replace "of" only when followed by a space and state?

MonJeanJean Over a year ago

Sure, see the edit @Magasinus

Magasinus Over a year ago

With real data, we have to be careful with states like "ma". It might be advisable to use mutate(test = str_remove_all(test, paste(" ", state, " ", sep = ""))) instead.
