I would like to replace/remove those parts of a string (name) that match other columns (state and city) in my data table.

I managed to identify the matching rows, e.g. for city, with dt %>% filter(str_detect(name, city)), but I am missing a way to use gsub (or grep) with the row-wise value of the column city.

I know that a rather manual approach, such as storing all city names in a vector and passing them to gsub, would work, but it would also wrongly remove the "dallas" in row 2. (This was manageable for states, though, and could be combined with gsub to also remove "of".)


Data and desired output

library(data.table)

dt <- data.table(
  city    = c("arecibo", "arecibo", "cabo rojo", "new york", "dallas"),
  state   = c("pr", "pr", "pr", "ny", "tx"),
  name    = c("frutas of pr arecibo", "dallas frutas of pr", "cabo rojo metal plant", "greens new york", "cowboy shoes dallas tx"),
  desired = c("frutas", "dallas frutas", "metal plant", "greens", "cowboy shoes")
)
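The pitfall described above can be reproduced in a few lines of base R. This is only a sketch: the vectors `city` and `name` are copied from the example data, while `all_cities`, `global` and `per_row` are names made up here for illustration.

```r
city <- c("arecibo", "arecibo", "cabo rojo", "new york", "dallas")
name <- c("frutas of pr arecibo", "dallas frutas of pr",
          "cabo rojo metal plant", "greens new york",
          "cowboy shoes dallas tx")

# Manual approach: one pattern containing every city name
all_cities <- paste(unique(city), collapse = "|")
global <- trimws(gsub(all_cities, "", name))

# Row-wise approach: each name is only cleaned with its own city
# (fixed = TRUE treats the city as a literal string, not a regex)
per_row <- trimws(mapply(function(p, x) gsub(p, "", x, fixed = TRUE),
                         city, name, USE.NAMES = FALSE))

global[2]   # "frutas of pr"        ("dallas" is wrongly removed)
per_row[2]  # "dallas frutas of pr" ("dallas" is kept)
```

The row-wise version keeps "dallas" in row 2 because that row's city is "arecibo".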

3 Answers

With dplyr, we can use rowwise. First collapse all words to remove into a single regular expression with the OR metacharacter (as in 'arecibo|pr|of'), then call str_remove_all with that pattern. Finally, trim the leading and trailing whitespace with trimws.

library(dplyr)
library(stringr)

dt %>%
    rowwise() %>%
    mutate(desired_2 = str_remove_all(name, paste(c(city, state, 'of'), collapse = '|')) %>%
               trimws())

# A tibble: 5 × 5
# Rowwise: 
  city      state name                   desired       desired_2    
  <chr>     <chr> <chr>                  <chr>         <chr>        
1 arecibo   pr    frutas of pr arecibo   frutas        frutas       
2 arecibo   pr    dallas frutas of pr    dallas frutas dallas frutas
3 cabo rojo pr    cabo rojo metal plant  metal plant   metal plant  
4 new york  ny    greens new york        greens        greens       
5 dallas    tx    cowboy shoes dallas tx cowboy shoes  cowboy shoes 
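For comparison, the same idea can be sketched in base R with gsub, which is what the question asked about. This is a hedged variant rather than the method above: it wraps each row's pattern in `\\b` word boundaries so that only whole words are removed, and it also collapses the inner spaces left behind. The data frame `df` and the column `desired_3` are names assumed here. City names containing regex metacharacters (for example a dot) would need escaping first.

```r
# Toy data copied from the question (plain data.frame, no extra packages)
df <- data.frame(
  city  = c("arecibo", "arecibo", "cabo rojo", "new york", "dallas"),
  state = c("pr", "pr", "pr", "ny", "tx"),
  name  = c("frutas of pr arecibo", "dallas frutas of pr",
            "cabo rojo metal plant", "greens new york",
            "cowboy shoes dallas tx"),
  stringsAsFactors = FALSE
)

# One pattern per row, e.g. "\\b(arecibo|pr|of)\\b"
pattern <- paste0("\\b(", df$city, "|", df$state, "|of)\\b")

# gsub() is not vectorised over `pattern`, so pair each pattern with its name
cleaned <- mapply(function(p, x) gsub(p, "", x, perl = TRUE),
                  pattern, df$name, USE.NAMES = FALSE)

# Collapse runs of spaces left behind, then trim both ends
df$desired_3 <- trimws(gsub(" +", " ", cleaned))
df$desired_3
```

On the example data this should reproduce the desired column. On real data the word boundaries keep a state such as "pr" from being cut out of longer words.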

1 Comment

Sorry, we do have to use rowwise to achieve that. I used it originally but eventually removed it. It is fixed now; please test the updated answer.

A data.table solution:

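# Helper function: for each row, remove the first match of rmv from string
# (sub() replaces only the first match; rmv is treated as a regular expression)
subxy <-  function(string, rmv) mapply(function(x, y) sub(x, '', y), rmv, string)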

dt[,  desired2 := name |> subxy(city) |> subxy(state) |> subxy('of') |> trimws()]

#         city state                   name       desired      desired2
# 1:   arecibo    pr   frutas of pr arecibo        frutas        frutas
# 2:   arecibo    pr    dallas frutas of pr dallas frutas dallas frutas
# 3: cabo rojo    pr  cabo rojo metal plant   metal plant   metal plant
# 4:  new york    ny        greens new york        greens        greens
# 5:    dallas    tx cowboy shoes dallas tx  cowboy shoes  cowboy shoes

4 Comments

@Magasinus You are using an old version of R. As a workaround, you can replace \(x, y) with function(x, y) and |> with the magrittr pipe %>% (both the shorthand and the native pipe need R 4.1 or later).
This solution is great, but it has the same issue when applied to real data: it also strips the state abbreviation from real names that merely start with it. Is there a more elegant way to account for this than subxy <- function(string, rmv) mapply(function(x, y) sub(paste(" ",x," ",sep=""), '', paste(" ",y," ",sep="")), rmv, string), and does this really do the job?
@Magasinus You can take that into account by requiring a space before the state name, simply by feeding subxy() with paste0(' ', state).
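To make the difference concrete, here is a small sketch comparing the plain pattern, the leading-space pattern, and a word-boundary pattern. The two names are hypothetical examples, not taken from the real data, and the `\\b` variant is an alternative assumed here rather than something proposed in the thread. The helper is the same subxy() idea, with perl = TRUE added for the boundary pattern.

```r
# Same helper idea as in the answer; perl = TRUE is added for the \\b variant
subxy <- function(string, rmv) {
  mapply(function(x, y) sub(x, "", y, perl = TRUE),
         rmv, string, USE.NAMES = FALSE)
}

# Hypothetical names: one starts with the letters "pr", one starts with the state
name  <- c("prime cuts pr", "pr frutas co")
state <- c("pr", "pr")

plain   <- trimws(subxy(name, state))                          # bare abbreviation
spaced  <- trimws(subxy(name, paste0(" ", state)))             # space required before
bounded <- trimws(subxy(name, paste0("\\b", state, "\\b")))    # whole word only

plain    # "ime cuts pr"  "frutas co"
spaced   # "prime cuts"   "pr frutas co"
bounded  # "prime cuts"   "frutas co"
```

The leading-space version protects "prime" but cannot remove a state at the very start of the name. The word-boundary version handles both cases in this toy example.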

Here's a solution, although gsub-based methods are probably faster:

library(tidyverse)

dt %>%
  mutate(test = str_remove_all(name, city)) %>%
  mutate(test = str_remove_all(test, paste(" of ", state, sep = ""))) %>%
  mutate(test = str_remove_all(test, state)) %>%
  mutate(test = str_remove_all(test, "^ ")) %>%
  mutate(test = str_remove_all(test, " *$"))

Output:

        city state                   name       desired          test
1:   arecibo    pr   frutas of pr arecibo        frutas        frutas
2:   arecibo    pr    dallas frutas of pr dallas frutas dallas frutas
3: cabo rojo    pr  cabo rojo metal plant   metal plant   metal plant
4:  new york    ny        greens new york        greens        greens
5:    dallas    tx cowboy shoes dallas tx  cowboy shoes  cowboy shoes

3 Comments

Can we combine this to replace "of" only when followed by a space and state?
Sure, see the edit @Magasinus
With real data, we have to be careful with states like "ma". It might be advisable to use mutate(test = str_remove_all(test,paste(" ",state," ", sep=""))).
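One possible way to combine both comments, removing "of" only when it is directly followed by the state and not touching states embedded in longer words, is a single pattern with an optional "of " and word boundaries. This is a base R sketch under those assumptions. The three rows are hypothetical, and `pattern`, `removed` and `cleaned` are names chosen here.

```r
# Hypothetical rows: "ma" also occurs inside "mama", and "of" is part of a real name
state <- c("ma", "ma", "pr")
name  <- c("mama pizza ma", "house of cards of ma", "frutas of pr")

# The optional "of " is consumed only when the whole-word state follows it
pattern <- paste0("\\b(of )?", state, "\\b")

# Pair each row's pattern with its name, since gsub() takes a single pattern
removed <- mapply(function(p, x) gsub(p, "", x, perl = TRUE),
                  pattern, name, USE.NAMES = FALSE)

# Collapse repeated spaces and trim the ends
cleaned <- trimws(gsub(" +", " ", removed))
cleaned
# "mama pizza" "house of cards" "frutas"
```

Unlike a pattern padded with spaces on both sides, the word boundary also matches when the state is the last word of the name.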
