Removing characters from multiple rows and multiple columns

Question

I have a dataset that contains 150k rows and 40 columns. The dataset consists of characters, strings as well as integers. I want to remove particular cells which contains a specific keyword/phrase that I want to remove. For example, If the initial dataset looks like

            A            B
 1)             1 | My name is Sam.
 2)         Hello | 2
 3)   Who are you | The water is green.

I want to remove the cell that contains the keyword "is". So that the final dataset will look like:-

            A            B
 1)             1 | NA
 2)         Hello | 2
 3)   Who are you | NA

The keyword can be in any format like a combination of strings or numbers or both.

rg255 · Accepted Answer · 2018-04-23 07:45:49Z

Using grepl you can replace the entries in a single column that contain the word "is" with NA as Akrun has shown. To make that occur over all columns, you can convert the dataframe to a vector first, subbing out the strings with "is", and then rebuilding the dataframe.

k <- 150
n <- 1000 * k
df1 <- data.frame(matrix(sample(c("Hi, my name is Khal Drogo", "Hello Khal Drogo", 1, 2), replace = T, size = n), ncol = k))

dfvec <- as.vector(unlist(df1))
dfvec[grepl("\\bis\\b", dfvec)] <- NA
df1 <- data.frame(matrix(dfvec, ncol = k))
head(df1[,1:5])

Which gives this:

> head(df1[,1:5])
                X1               X2               X3               X4               X5
1 Hello Khal Drogo             <NA>                1                2                1
2 Hello Khal Drogo                2             <NA>                1 Hello Khal Drogo
3 Hello Khal Drogo Hello Khal Drogo                1                1             <NA>
4             <NA>                1 Hello Khal Drogo                2                2
5                2 Hello Khal Drogo             <NA>                2                1
6                2 Hello Khal Drogo Hello Khal Drogo Hello Khal Drogo                2

akrun · Accepted Answer · 2018-04-23 08:57:41Z

2

We can assign it to NA by creating a logical index with grep

df1$B[grepl("\\bis\\b", df1$B)] <- NA

If is can be also a substring, then remove the word boundary

df1$B[grepl("is", df1$B)] <- NA

It is also possible that the OP wants to assign all non-numeric elements to NA. In that case, just converting the column to numeric (assuming it is character class)

df1$B <- as.numeric(df1$B)

Update

If there are multiple columns, then we can use lapply

colsofInterest <- 2:ncol(df1)
df1[colsofInterest] <- lapply(df1[colsofInterest],
         function(x) replace(x, grepl("\\bis\\b", x), NA))

Or this can be also done with tidyverse

library(tdivyerse)
df1 %>%
     mutate_at(vars(colsofInterest), funs(replace(., str_detect(., "\\bis\\b"), NA)))

edited Apr 23, 2018 at 8:57

answered Apr 23, 2018 at 7:19

akrun

891k38 gold badges590 silver badges700 bronze badges

2 Comments

Gothram Over a year ago

Thanks for the help. Will try it out

Gothram Over a year ago

Thanks for the updated code @akrun . Both methods work perfectly and give the correct output. Never thought getting this result would be so simple. Hope I will be able to help you in the future! :)

Collectives™ on Stack Overflow

Removing characters from multiple rows and multiple columns

2 Answers 2

Comments

Update

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

Update

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related