filtering columns by regex in dataframe

Question

I have a large dataframe (3000+ columns) and I am trying to get a list of all column names that follow this pattern:

"stat.mineBlock.minecraft.123456stone"
"stat.mineBlock.minecraft.DFHFFBSBstone2"
"stat.mineBlock.minecraft.AAAstoneAAAA"

My code:

stoneCombined<-grep("^[stat.mineBlock.minecraft.][a-zA-Z0-9]*?[stone][a-zA-Z0-9]*?", colnames(ingame), ignore.case =T)

where ingame is the dataframe I am searching. However, my code returns a list of numbers instead of the dataframe column names (like those above) that I was expecting. Can someone tell me why?

After adding value=TRUE (Thanks to user227710):

I now get column names, but I get every column in my dataset, not just those that contain both stat.mineBlock.minecraft. and stone, which is what I was trying to get.

It would help to show an example of a valid reject string to compare.
– Pierre L, Commented May 21, 2015 at 20:37

user227710 · Accepted Answer · 2015-05-21 20:45:45Z

3

To return the column names, set value=TRUE as an additional argument to grep. The default is value=FALSE, so grep returns the indices of the matching column names.

help("grep") 
value   
if FALSE, a vector containing the (integer) indices of the matches determined by grep is returned, and if TRUE, a vector containing the matching elements themselves is returned.

grep("your regex pattern", colnames(ingame), value = TRUE, ignore.case = TRUE)
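
To make the difference concrete, here is a minimal sketch using a small, made-up vector of column names in place of colnames(ingame) (the names in cols are assumptions for illustration, not taken from the real data). It shows that the default call returns positions, while value = TRUE returns the names themselves.

```r
# Hypothetical column names standing in for colnames(ingame)
cols <- c("stat.mineBlock.minecraft.stone",
          "stat.useItem.minecraft.stone_pickaxe",
          "stat.mineBlock.minecraft.dirt")

# Default (value = FALSE): integer positions of the matching elements
idx <- grep("mineBlock", cols)

# value = TRUE: the matching elements themselves
hits <- grep("mineBlock", cols, value = TRUE)

# Indexing with the positions gives the same names as value = TRUE
same <- identical(cols[idx], hits)
```

The index form is handy when you want to subset the dataframe directly, for example ingame[, idx].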

edited May 21, 2015 at 20:45

answered May 21, 2015 at 20:33

user227710


4 Comments

user227710 Over a year ago

Please explain the reason for downvoting so that I can improve my answer.

Anton Over a year ago

In all fairness, this is OP's regexp. The answer in itself is correct as it solves the original problem.

Rilcon42 Over a year ago

@user227710, Thanks for your suggestion. I updated my post in response.

user227710 Over a year ago

@Rilcon42: No problem. Please post that as a separate question with a sample (at least 10 colnames) so that the solution can be generalized.

Lincoln Mullen · Accepted Answer · 2015-05-21 21:15:16Z

2

Here is a solution in dplyr:

library(dplyr)
your_df %>%
  select(starts_with("stat.mineBlock.minecraft"))

The more general way to match a column name to a regex is with matches() inside select(). See ?select for more information.
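
As a rough sketch of the matches() approach, the example below builds a tiny one-row data frame (your_df and its column names are made up for illustration) and keeps only the columns whose names start with the prefix and contain stone. It assumes dplyr is installed; the pattern is one possible choice, not the only suitable regex.

```r
library(dplyr)

# Hypothetical one-row data frame; check.names = FALSE keeps names verbatim
your_df <- data.frame(
  "stat.mineBlock.minecraft.123456stone" = 1,
  "stat.mineBlock.minecraft.DFHFFBSBwater2" = 2,
  "stat.craftItem.minecraft.stone" = 3,
  check.names = FALSE
)

# Escape the dots so they match literal "." rather than any character
stone_cols <- your_df %>%
  select(matches("^stat\\.mineBlock\\.minecraft\\..*stone"))

names(stone_cols)
```

Using .* before stone (rather than .+) also allows names where stone directly follows the prefix.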

answered May 21, 2015 at 21:15

Lincoln Mullen


4 Comments

Rilcon42 Over a year ago

I like the dplyr solution, but I hate to add another package. Is there a reason (speed, it is more R like, etc.) to use this over the accepted solution?

Wiktor Stribiżew Over a year ago

Just a comment: This does not check for stone.

Lincoln Mullen Over a year ago

@stribizhev If you had to check for stone, you could replace the line select(starts_with("stat.mineBlock.minecraft")) with select(matches("stat\\.mineBlock\\.minecraft\\..+stone.+")) or some other suitable regex.

Lincoln Mullen Over a year ago

@Rilcon42 The dplyr package provides two things. First, in my experience it is faster than base R for most data manipulation. Second, and more important, it provides a grammar of data manipulation that makes more sense to the human writing the code, so it is faster in that sense too. For instance, its verbs for selecting columns, filtering rows, grouping and summarizing datasets, and more together provide a whole set of tools for manipulating data.

Community · Accepted Answer · 2017-05-23 12:10:21Z

0

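My answer is based on this SO post. As for the regex, you were very close. The problem is that [] creates a character class, which matches a single character from the defined set, so [stat.mineBlock.minecraft.] matches just one of those characters rather than the whole prefix. That is the main reason your pattern was not working. The literal dots also need escaping as \\. so they do not match any character. I also recommend perl=TRUE, which makes R use the PCRE engine.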

So, here is my sample code:

df <- data.frame(
  "stat.mineBlock.minecraft.123456stone" = 1,
  "stat.mineBlock.minecraft.DFHFFBSBwater2" = 2,
  "stat.mineBlock.minecraft.DFHFFBSBwater3" = 3,
  "stat.mineBlock.minecraft.DFHFFBSBstone4" = 4
)
grep("^stat\\.mineBlock\\.minecraft\\.[a-zA-Z0-9]*?stone[a-zA-Z0-9]*?", colnames(df), value = TRUE, ignore.case = TRUE, perl = TRUE)

See IDEONE demo
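
To illustrate why the character-class version selects far too much, here is a small side-by-side sketch on made-up column names (the vector cols is an assumption for illustration). The loose pattern is the one from the question; the strict pattern is a simplified variant of the one above. The trailing [a-zA-Z0-9]*? is dropped because, with nothing anchored after it, it can match zero characters and does not change which names are selected.

```r
# Hypothetical column names for illustration
cols <- c("stat.mineBlock.minecraft.123456stone",
          "stat.mineBlock.minecraft.DFHFFBSBwater2",
          "stat.craftItem.minecraft.stick",
          "deaths")

# Pattern from the question: each [...] matches ONE character from its set,
# so any name starting with e.g. "s" followed by "t" already matches.
loose <- grep("^[stat.mineBlock.minecraft.][a-zA-Z0-9]*?[stone][a-zA-Z0-9]*?",
              cols, value = TRUE, ignore.case = TRUE)

# Literal prefix with escaped dots, then "stone" somewhere after it
strict <- grep("^stat\\.mineBlock\\.minecraft\\.[a-zA-Z0-9]*stone",
               cols, value = TRUE, ignore.case = TRUE, perl = TRUE)

length(loose)   # the three "stat." names all match
strict          # only the stone column matches
```

If the part after the prefix must consist only of letters and digits, one option is to append [a-zA-Z0-9]*$ to the strict pattern so the match is anchored at the end of the name.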

edited May 23, 2017 at 12:10

CommunityBot


answered May 21, 2015 at 21:38

Wiktor Stribiżew

