R: delete a substring from a string on a dataframe [duplicate]

Question

I have a dataframe containing the mark and the name of many products as follows:

    mark      name
    Caudalie  Caudalie Eau démaquillante 200ml
    Mustela   Mustela Bébé lait hydra corps 300ml
    Lierac    Lierac Phytolastil gel prévention

In many rows, the mark appears in the product name. I want to detect whether the mark exists in the product name and, if so, remove it.

Edit: I used this sample of code to detect if the mark exists in the product name:

    df1$CheckMark <- Vectorize(grepl)(df1$mark, df1$name)

My dataframe looks like this now:

    mark      name                                 CheckMark
    Caudalie  Caudalie Eau démaquillante 200ml     TRUE
    Mustela   Mustela Bébé lait hydra corps 300ml  TRUE
    Lierac    Lierac Phytolastil gel prévention    TRUE

I want to remove the mark from the product name.

UPDATE: After many attempts, I split my big dataframe into a list by mark:

    list.mark.name <- split(df1, df1$mark)

And I found this awesome combination of sapply and gsub:

    listt <- sapply(seq_along(list.mark.name), function(i) {
      # dtfr holds all rows that share one mark
      dtfr <- list.mark.name[[i]]
      if (dtfr$CheckMark == TRUE) {
        # strip the mark from every column of this group
        as.data.frame(sapply(dtfr, gsub, pattern = dtfr$mark, replacement = ""))
      } else {
        dtfr
      }
    })

I thought everything was okay, but I noticed these warnings:

     Warning messages:
     1: In if (dtfr$CheckMark == TRUE) { ... :
      the condition has length > 1 and only the first element will be used

What is the problem?

Any help would be appreciated.

Can you elaborate on what you've tried already, i.e. post some of the code?
– erc, Commented Jan 18, 2016 at 11:58

Community · Accepted Answer · 2020-06-20 09:12:55Z

If we need to subset the rows by removing those whose "name" element starts with 'mark', then use grepl:

    df1[!grepl('^mark', df1$name),]

The ^ anchors the match at the start of the string.
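
Note that '^mark' as written matches the literal text "mark". If the intent is instead to test whether each name begins with the mark from its own row, one possible sketch uses base R's startsWith, which compares the two vectors element by element. The fourth row below is hypothetical, added only so that one name does not begin with its mark:

```r
# Toy data: the first three rows come from the question; the fourth is hypothetical
df1 <- data.frame(
  mark = c("Caudalie", "Mustela", "Lierac", "Avene"),
  name = c("Caudalie Eau démaquillante 200ml",
           "Mustela Bébé lait hydra corps 300ml",
           "Lierac Phytolastil gel prévention",
           "Coffret Mustela 3 produits"),
  stringsAsFactors = FALSE
)

# TRUE where the name of row i begins with the mark of row i
starts_with_mark <- startsWith(df1$name, df1$mark)

# Keep only the rows whose name does not start with its own mark
df_no_prefix <- df1[!starts_with_mark, ]
```

Because startsWith compares literal strings, marks containing regex metacharacters need no escaping.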

NOTE: The subtract part in the title is not clear.

Update

Based on the updated dataset, if we want to keep only the rows whose 'name' does not contain any of the 'mark' elements, we can paste the 'mark' elements together with "|", use grepl to get a logical index, and then subset with [:

    df1[!grepl(paste(df1$mark, collapse="|"), df1$name),]

Or, if the idea is to subset rows based on the corresponding elements of 'name' and 'mark', stri_detect_fixed from stringi is an option:

    library(stringi)
    df1[!stri_detect_fixed(df1$name, df1$mark),]
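
To make the difference between the "any mark" test and the row-wise test concrete, here is a minimal sketch on toy data. The fourth row is hypothetical, chosen so that its name contains a mark belonging to a different row. The row-wise check uses base R's mapply with grepl(fixed = TRUE) as a stand-in for the stringi call, so it runs without extra packages:

```r
# Toy data: the first three rows come from the question; the fourth is hypothetical
df1 <- data.frame(
  mark = c("Caudalie", "Mustela", "Lierac", "Avene"),
  name = c("Caudalie Eau démaquillante 200ml",
           "Mustela Bébé lait hydra corps 300ml",
           "Lierac Phytolastil gel prévention",
           "Coffret Mustela 3 produits"),
  stringsAsFactors = FALSE
)

# Any mark, anywhere: one alternation pattern tested against every name
any_mark <- grepl(paste(df1$mark, collapse = "|"), df1$name)

# Row-wise: does the name of row i contain the mark of row i?
own_mark <- unname(mapply(grepl, df1$mark, df1$name,
                          MoreArgs = list(fixed = TRUE)))

any_mark  # TRUE TRUE TRUE TRUE
own_mark  # TRUE TRUE TRUE FALSE
```

The fourth row is flagged by the collapsed pattern because its name mentions "Mustela", even though its own mark is "Avene"; the row-wise check leaves it alone.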

answered Jan 18, 2016 at 11:57 by akrun

edited Jun 20, 2020 at 9:12 by CommunityBot

11 Comments

akrun Over a year ago

@user5779182 Check if the update helps.

talat Over a year ago

df1[!grepl(paste(df1$mark, collapse="|"), df1$name),] will also remove a row whose name contains a mark from a different row; not sure if this is desired.

akrun Over a year ago

@docendodiscimus As the OP didn't show the expected result, I added both options. The stringi should work on each row.

sarah Over a year ago

@akrun The grepl part is fine: I used df1$CheckMark <- Vectorize(grepl)(df1$mark, df1$name). Now I want to remove the mark from the product name. Any idea?

sarah Over a year ago

@akrun, replacing sapply with mapply resolved the problem: df1 = as.data.frame(mapply(gsub, df1$mark, "", df1$name)). Thank you for your time and effort.
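
The mapply idea from the comment above can be sketched as follows. This is an illustrative variant, not the commenter's exact code: it assumes the mark should be treated as a literal string (fixed = TRUE), removes only the first occurrence with sub, trims the leftover space, and writes the result to a new column with the hypothetical name name_clean instead of overwriting df1:

```r
# Toy data taken from the question
df1 <- data.frame(
  mark = c("Caudalie", "Mustela", "Lierac"),
  name = c("Caudalie Eau démaquillante 200ml",
           "Mustela Bébé lait hydra corps 300ml",
           "Lierac Phytolastil gel prévention"),
  stringsAsFactors = FALSE
)

# For each row i, delete the first occurrence of mark[i] from name[i].
# mapply passes the arguments positionally: sub(pattern, replacement, x)
cleaned <- mapply(sub, df1$mark, "", df1$name,
                  MoreArgs = list(fixed = TRUE))

# Drop the names mapply attaches and the space left where the mark was
df1$name_clean <- trimws(unname(cleaned))

df1$name_clean
# "Eau démaquillante 200ml" "Bébé lait hydra corps 300ml" "Phytolastil gel prévention"
```

Rows whose name does not contain the mark come back unchanged, so no separate CheckMark test is needed. Since no condition of length greater than one is ever passed to `if`, the warning from the question does not arise.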
