1

I have a dataset with 150k rows and 40 columns. It contains character strings as well as integers. I want to remove (set to NA) every cell that contains a specific keyword or phrase. For example, if the initial dataset looks like

            A            B
 1)             1 | My name is Sam.
 2)         Hello | 2
 3)   Who are you | The water is green.

I want to remove every cell that contains the keyword "is", so that the final dataset looks like:

            A            B
 1)             1 | NA
 2)         Hello | 2
 3)   Who are you | NA

The keyword can be in any format, such as a combination of strings, numbers, or both.
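
To make the example reproducible, the two tables above could be written as data frames along these lines. This is only a sketch: the names `df_example` and `expected` are hypothetical, and the columns are assumed to be stored as character because each one mixes text and numbers.

```r
# Hypothetical reproduction of the input table from the question.
# Both columns are character because each mixes text and numbers.
df_example <- data.frame(
  A = c("1", "Hello", "Who are you"),
  B = c("My name is Sam.", "2", "The water is green."),
  stringsAsFactors = FALSE
)

# The desired result: cells containing the keyword "is" become NA.
expected <- data.frame(
  A = c("1", "Hello", "Who are you"),
  B = c(NA, "2", NA),
  stringsAsFactors = FALSE
)
```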

2 Answers

2

Using grepl, you can replace the entries in a single column that contain the word "is" with NA, as Akrun has shown. To do this across all columns, you can first flatten the data frame into a vector, replace the strings containing "is" with NA, and then rebuild the data frame. Note that the rebuilt data frame gets default column names (X1, X2, ...) and every column ends up with the same type.

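k <- 150        # number of columns
n <- 1000 * k   # total number of cells (1000 rows)
# Simulated data: each cell is one of four values, two of them strings
df1 <- data.frame(matrix(sample(c("Hi, my name is Khal Drogo", "Hello Khal Drogo", 1, 2), replace = TRUE, size = n), ncol = k))

# Flatten all columns into one vector
dfvec <- as.vector(unlist(df1))
# Set every entry containing the whole word "is" to NA
dfvec[grepl("\\bis\\b", dfvec)] <- NA
# Rebuild the data frame with the original number of columns
df1 <- data.frame(matrix(dfvec, ncol = k))
head(df1[, 1:5])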

Which gives this:

> head(df1[,1:5])
                X1               X2               X3               X4               X5
1 Hello Khal Drogo             <NA>                1                2                1
2 Hello Khal Drogo                2             <NA>                1 Hello Khal Drogo
3 Hello Khal Drogo Hello Khal Drogo                1                1             <NA>
4             <NA>                1 Hello Khal Drogo                2                2
5                2 Hello Khal Drogo             <NA>                2                1
6                2 Hello Khal Drogo Hello Khal Drogo Hello Khal Drogo                2
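
If the original column names and classes need to be kept, a variation is to build a logical matrix of matching cells and index the data frame with it instead of rebuilding it. The following is a minimal sketch on a small hypothetical data frame (`df_small` and `hits` are illustrative names), not a benchmarked alternative for the 150k-row case.

```r
# Small hypothetical data frame modelled on the question's example
df_small <- data.frame(
  A = c("1", "Hello", "Who are you"),
  B = c("My name is Sam.", "2", "The water is green."),
  stringsAsFactors = FALSE
)

# One logical entry per cell: TRUE where the cell contains the whole word "is".
# sapply() returns a matrix here because the data frame has more than one row
# and all columns have the same length.
hits <- sapply(df_small, function(col) grepl("\\bis\\b", col))

# Indexing with the logical matrix sets the matching cells to NA in place,
# so column names and column classes are preserved.
df_small[hits] <- NA
```

With a single-row data frame `sapply()` would return a plain vector rather than a matrix, so this sketch assumes at least two rows.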

2

We can assign NA to the matching elements by creating a logical index with grepl

df1$B[grepl("\\bis\\b", df1$B)] <- NA

If "is" can also appear as a substring (for example inside "this"), then remove the word boundaries

df1$B[grepl("is", df1$B)] <- NA
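
The difference between the two patterns can be seen on a small vector. This is an illustrative sketch with made-up strings; the last two lines go beyond the answer and show `fixed = TRUE`, which may be useful when the keyword contains regex metacharacters such as a dot.

```r
x <- c("My name is Sam.", "This one", "island", "2")

# Whole-word match: only the standalone word "is" counts
whole_word <- grepl("\\bis\\b", x)

# Substring match: "is" inside "This" and "island" also counts
any_substring <- grepl("is", x)

# With a regex, "." matches any character, so "1.5" also matches "125".
# fixed = TRUE treats the keyword as a literal string instead.
as_regex   <- grepl("1.5", c("1.5", "125"))
as_literal <- grepl("1.5", c("1.5", "125"), fixed = TRUE)
```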

It is also possible that the OP wants to assign NA to all non-numeric elements. In that case, just convert the column to numeric (assuming it is of class character); the non-numeric strings become NA, with a coercion warning

df1$B <- as.numeric(df1$B)
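
As a small illustration of that coercion, here is a sketch on a hypothetical character vector `b` taken from the question's column B. `suppressWarnings()` is used only to silence the expected "NAs introduced by coercion" warning.

```r
b <- c("My name is Sam.", "2", "The water is green.")

# Strings that cannot be parsed as numbers turn into NA
b_num <- suppressWarnings(as.numeric(b))
```

Note that this drops every non-numeric cell, not just those containing the keyword, so it only fits if that is really what is wanted.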

Update

If there are multiple columns, then we can use lapply

colsofInterest <- 2:ncol(df1)
df1[colsofInterest] <- lapply(df1[colsofInterest],
         function(x) replace(x, grepl("\\bis\\b", x), NA))
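
The lapply approach can also be wrapped in a small helper so the pattern and the columns are passed in as arguments. The function name `blank_keyword` and the data frame `df_small` are hypothetical; the conversion of factor columns to character is an added assumption, useful if the data were read with strings as factors.

```r
# Hypothetical helper: set cells matching `pattern` to NA in the chosen columns
blank_keyword <- function(df, pattern, cols = seq_along(df)) {
  df[cols] <- lapply(df[cols], function(x) {
    # Assumption: factor columns are converted so they behave like strings
    if (is.factor(x)) x <- as.character(x)
    replace(x, grepl(pattern, x), NA)
  })
  df
}

df_small <- data.frame(
  id = 1:3,
  A = c("1", "Hello", "Who are you"),
  B = c("My name is Sam.", "2", "The water is green."),
  stringsAsFactors = FALSE
)

# Skip the first column, as in the answer's colsofInterest
res <- blank_keyword(df_small, "\\bis\\b", cols = 2:ncol(df_small))
```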

Or this can also be done with tidyverse

library(tidyverse)
# across() supersedes mutate_at()/funs() in dplyr >= 1.0.0
df1 %>%
     mutate(across(all_of(colsofInterest), ~ replace(.x, str_detect(.x, "\\bis\\b"), NA)))

2 Comments

Thanks for the help. Will try it out
Thanks for the updated code @akrun . Both methods work perfectly and give the correct output. Never thought getting this result would be so simple. Hope I will be able to help you in the future! :)
