Test for multiple strings in multiple columns

Question

I've searched extensively on stack overflow without finding an answer to the following question: Im looking for a function that lets me test if some columns contain any of the specified strings.

# I'm looking for heart attacks
infarction <- c("b", "c")
# I'm also looking for strokes
stroke <- c("h", "i")

#sample data set
set.seed(1234)
dat <- data.frame(A = sample(letters[1:9],10,TRUE),
                  B = sample(letters[1:9],10,TRUE),
                  C = sample(letters[1:9],10,TRUE),
                  D = sample(letters[1:9],10,TRUE),
                  DATE = sample.int(10, size = 10, replace = FALSE))

# I've tried many things. Among them:
# first one using the dplyr package.

infarction = ifelse( (infarction  %in% dat[,c("A", "B", "C", "D")]), DATE, NA))

#excluded a few rows from the mutate...

#I've also tried
grep(paste(infarction,collapse="|"), dat[,1:4], value=TRUE), DATE, NA))

To sum up. I can get it to work if i only check in one column or only look after one of the strings. But I need to check if any of the strings is contained in any of the given columns, and in that case return the date value as a new variable.

Any help is greatly appreciated.

What is the output you are looking for? An array of row numbers? A boolean? — Pop
– Pop, Commented Oct 10, 2014 at 11:56
Given that I'm aiming to create a new variable from this condition, I think a logical output would be preferred. — Adam Robinsson
– Adam Robinsson, Commented Oct 10, 2014 at 13:30

Pop · Accepted Answer · 2014-10-10 12:10:05Z

3

If you just want to know if the columns contain the letters, you can apply to each column of your data.frame the %in% function:

apply(dat[,-5], 2, '%in%', x=stroke)

If you want to know the position of each letter of stroke in your columns, you can use the which function and apply it to each column:

apply(dat[,-5], 2, function(table,x) which(x==table), x=stroke[1])
apply(dat[,-5], 2, function(table,x) which(x==table), x=stroke[2])

or

sapply(stroke, function(y) apply(dat, 2, function(table,x) which(x==table), x=y), simplify=FALSE)

If you want to know the position of all letters of stroke, do this:

apply(dat[,-5], 2, function(x) unlist(sapply(stroke , function(i) which( i == x))))

answered Oct 10, 2014 at 12:10

Pop

12.5k5 gold badges57 silver badges71 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Adam Robinsson Over a year ago

Thanks Pop. I'm looking for an output where an additional variable is added and that variable is equal to the DATE variable for those rows with matching strings. I tried your code but I think I did something wrong; dat$infarction = ifelse((apply(dat[,-5], 2, '%in%', x=ami)), dat$DATE, NA)

konvas · Accepted Answer · 2014-10-10 12:57:00Z

2

It depends on what your desired output is. @Pop's answer is perfectly good I just want to add that your grep approach was on the right track except you can't give it a data frame in the x argument. You have to perform it column-wise. This is because grep will coerce the x argument to a character and you can see if you type as.character(dat[, 1:4]) that you are greping the wrong string. Instead, try for example

grepres <- lapply(dat[1:4], grep, pattern = paste(infarction,collapse="|"))

If you want to find which rows don't contain the pattern you can do for example

idx <- setdiff(1:nrow(dat), unlist(grepres))

and then

result <- rep(NA, nrow(dat))
result[idx] <- dat$DATE[idx]

answered Oct 10, 2014 at 12:57

konvas

14.4k2 gold badges43 silver badges46 bronze badges

Comments

rnso · Accepted Answer · 2014-10-10 15:01:05Z

1

Following may help:

> apply(dat, 1, function(x) ifelse(infarction %in% x[1:5],x[5],NA  ))
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] " 6" NA   " 3" NA   " 2" NA   " 7" NA   " 1" NA   
[2,] " 6" "10" " 3" NA   " 2" NA   " 7" " 9" NA   " 4"

In data.frame form:

> as.data.frame(t(apply(dat, 1, function(x) ifelse(infarction %in% x[1:5],x[5],NA  ))))

     V1   V2
1     6    6
2  <NA>   10
3     3    3
4  <NA> <NA>
5     2    2
6  <NA> <NA>
7     7    7
8  <NA>    9
9     1 <NA>
10 <NA>    4

For stroke:

> as.data.frame(t(apply(dat, 1, function(x) ifelse(stroke %in% x[1:5],x[5],NA  ))))

     V1   V2
1  <NA> <NA>
2  <NA> <NA>
3  <NA> <NA>
4  <NA>    5
5     2 <NA>
6     8 <NA>
7  <NA> <NA>
8  <NA>    9
9     1    1
10    4 <NA>

edited Oct 10, 2014 at 15:01

answered Oct 10, 2014 at 13:29

rnso

24.8k26 gold badges128 silver badges270 bronze badges

Collectives™ on Stack Overflow

Test for multiple strings in multiple columns

3 Answers 3

1 Comment

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

1 Comment

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related