Comparing and filtering across multiple columns using string matching in R

Question

I have a dataframe like so:

    TS    Device1.max   Device2.max   Device3.max       Device4.max
18:02:44     FALSE        FALSE          TRUE               FALSE
18:02:45     TRUE         TRUE          FALSE               FALSE
18:02:46     FALSE        FALSE         FALSE               TRUE
18:02:47     FALSE        FALSE         FALSE               FALSE
18:02:48     FALSE        FALSE         FALSE               FALSE
18:02:49     FALSE        FALSE         FALSE               FALSE
18:02:50     FALSE        FALSE         FALSE               FALSE
18:02:51     FALSE        FALSE         FALSE               FALSE
18:02:52     FALSE        FALSE         FALSE               TRUE
18:02:53     FALSE        TRUE          FALSE               FALSE
18:02:54     FALSE        FALSE         FALSE               FALSE

To get the true false columns I used the following code:

df$Device1.max = ifelse(df$Device1 == max(df$Device1), 'true','false')
df$Device2.max = ifelse(df$Device2 == max(df$Device2), 'true','false')
df$Device3.max = ifelse(df$Device3 == max(df$Device3), 'true','false')
df$Device4.max = ifelse(df$Device4 == max(df$Device4), 'true','false')

For simplicity I am only showing 4 Device columns. I have about a hundred device columns where I would like to do the comparison. It wont be feasible to specify the hundred columns in a hundred ifelse statements How do I compare using regex or specifying generic column name assuming all the device columns of interest will have some sort of name starting like device?

I then want to filter to or find the row where maximum Device.max columns satisfy the condition where it is TRUE within +/-1 row of it. Algorithmically, I would create an index column and filter to a dataframe where only the TRUE values are present. Then I would check how many columns have indices within 1 row of each other. In the above case rows 1,2 & 3 have 4 columns satisfying the true condition, whereas rows 9 & 10 have only 2 columns satisfying the condition. Therefore my expected output would be:

     TS      Device1.max    Device2.max   Device3.max         Device4.max
    18:02:44     FALSE        FALSE          TRUE               FALSE
    18:02:45     TRUE         TRUE          FALSE               FALSE
    18:02:46     FALSE        FALSE         FALSE               TRUE

However this method seems very iterative and inefficient. Is there a better way to do it leveraging dataframe functions in R?

Sebek · Accepted Answer · 2022-06-01 08:18:53Z

1

This code should answer the first TRUE/FALSE question

r <- c();
colum <- c();
for (colu in 2:ncol(example_table)){
  example_table[ ,colu] <- example_table[ ,colu]==max(example_table[ ,colu]) # returns True/False
  val <- which(example_table[ ,colu]==T) # searching for row indexes
  r <- append(r,val) # append row indexes
  colum <- append(colum,rep(colu,length(val))) # since one column can contain more than one True, repeat that column index and append it
}
true_values <- cbind(r,colum) # just a matrix-like output

out:

> example_table
     V1 V2 V3 V4
1 18:02  5  8  1
2 14:05  7  1  7
3 19:27  7  6  1

# After for:

> example_table
     V1    V2    V3    V4
1 18:02 FALSE  TRUE FALSE
2 14:05  TRUE FALSE  TRUE
3 19:27  TRUE FALSE FALSE

> true_values
     r colum
[1,] 2     2
[2,] 3     2
[3,] 1     3
[4,] 2     4

where r is the row index and colum is the column index which contains true values. Note that example_table[ ,colu]==max(example_table[ ,colu]) returns a TRUE/FALSE value, and note that colum <- append(colum,rep(colu,length(val))) is needed to avoid future dimensional problems too.

For the second question, you have now row indexes which contains TRUE values. You can then implement a code that when rows above and under the selected row contains a TRUE value (any() function should be a good one), select that row. Then subset() original data frame with this row indexes.

answered Jun 1, 2022 at 8:18

Sebek

1766 bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

thentangler Over a year ago

Thank you so much for that clever solution! I have a couple of questions: 1. Are the semicolons initializing r and colum mandatory? It seemed to work for me without them 2. Is is possible assign a different column name when doing the check? Like df[ ,colu_max] <- df[ ,colu]==max(df[ ,colu]) ?

thentangler Over a year ago

I was able to do it using the paste function in column names. thank You so much for getting me this far!

Sebek Over a year ago

1. No, semicolons are not necessary. 2. No, you can't change it until the loop has finished. You can change "colu" variable in loop, but must be changed inside all the elements. Once loop is done, you can, as you said, change columnames using colnames(your data frame) function.

Collectives™ on Stack Overflow

Comparing and filtering across multiple columns using string matching in R

1 Answer 1

3 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

3 Comments

Your Answer

Sign up or log in

Post as a guest

Related