I have a large dataframe (3000+ columns) and I am trying to get a list of all column names that follow this pattern:

"stat.mineBlock.minecraft.123456stone"
"stat.mineBlock.minecraft.DFHFFBSBstone2"
"stat.mineBlock.minecraft.AAAstoneAAAA"

My code:

stoneCombined<-grep("^[stat.mineBlock.minecraft.][a-zA-Z0-9]*?[stone][a-zA-Z0-9]*?", colnames(ingame), ignore.case =T)

where ingame is the dataframe I am searching. However, my code returns a vector of numbers instead of the column names (like those above) that I was expecting. Can someone tell me why?

After adding value=TRUE (Thanks to user227710):

I now get column names, but I get every column in my dataset, not just those that contain both stat.mineBlock.minecraft. and stone, which is what I was trying to get.

    It would help to show an example of a valid reject string to compare. Commented May 21, 2015 at 20:37

3 Answers

To return the column names, set value=TRUE as an additional argument to grep. The default is value=FALSE, so grep returns the indices of the matching column names rather than the names themselves.

help("grep") 
value   
if FALSE, a vector containing the (integer) indices of the matches determined by grep is returned, and if TRUE, a vector containing the matching elements themselves is returned.

grep("your regex pattern", colnames(ingame),value=TRUE, ignore.case =T)
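
As a minimal sketch of the difference, the snippet below runs grep twice on a small vector of column names. The vector cols and the simple pattern "stone" are made up for illustration and are not the asker's actual data or regex.

```r
# Hypothetical column names, for illustration only
cols <- c("stat.mineBlock.minecraft.123456stone",
          "stat.useItem.minecraft.bow",
          "stat.mineBlock.minecraft.AAAstoneAAAA")

# Default (value = FALSE): integer positions of the matching elements
idx <- grep("stone", cols)

# value = TRUE: the matching elements themselves
nms <- grep("stone", cols, value = TRUE)

print(idx)
print(nms)

# Indexing the original vector with the positions gives the same names
print(identical(cols[idx], nms))
```

The indices are still useful when you want to subset the data frame directly, for example ingame[, idx].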

4 Comments

Please explain the reason for downvoting so that I can improve my answer.
In all fairness, this is OP's regexp. The answer in itself is correct as it solves the original problem.
@user227710, Thanks for your suggestion. I updated my post in response.
@Rilcon42: No problem. Please post that as another question and with sample example (with at least 10 colnames ) so that solution can be generalized.

Here is a solution in dplyr:

library(dplyr)
your_df %>%
  select(starts_with("stat.mineBlock.minecraft"))

The more general way to match a column name to a regex is with matches() inside select(). See ?select for more information.
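
Here is a rough sketch of the matches() approach, assuming dplyr is installed. The data frame demo_df and its column names are hypothetical stand-ins for your_df, and the regex is only one possible way to require both the prefix and stone.

```r
library(dplyr)

# Hypothetical data frame standing in for your_df
demo_df <- data.frame(
  "stat.mineBlock.minecraft.123456stone" = 1,
  "stat.mineBlock.minecraft.AAAstoneAAAA" = 2,
  "stat.mineBlock.minecraft.dirt" = 3,
  "stat.useItem.minecraft.stonePickaxe" = 4
)

# Keep only columns with the literal prefix followed somewhere by "stone".
# Dots are escaped so they match a literal period.
stone_cols <- demo_df %>%
  select(matches("^stat\\.mineBlock\\.minecraft\\..*stone")) %>%
  colnames()

print(stone_cols)
```

Note that matches() ignores case by default, so the pattern would also pick up a column containing Stone.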

4 Comments

I like the dplyr solution, but I hate to add another package. Is there a reason (speed, it is more R like, etc.) to use this over the accepted solution?
Just a comment: This does not check for stone.
@stribizhev If you had to check for stone, you could replace the line select(starts_with("stat.mineBlock.minecraft")) with select(matches("^stat\\.mineBlock\\.minecraft\\..*stone")) or some other suitable regex.
@Rilcon42 The dplyr package provides two things. First, in my experience for most data manipulation it is faster than base R. Second, and more important, it provides a grammar of data manipulation that makes more sense to the human writing the code, and so it is faster in that sense. For instance, selecting columns, filtering rows, grouping and summarizing datasets and more provide a whole set of tools for manipulating data.

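My answer is based on this SO post. Your regex was very close. The problem is that [] creates a character class, which matches a single character from the set inside the brackets. So [stat.mineBlock.minecraft.] matches one character such as s rather than the literal prefix, and [stone] matches any one of s, t, o, n or e. That is why every column matched. Remove the brackets and escape the dots (\\. in an R string) so that they match literal periods. I also suggest perl=TRUE, which makes grep use the PCRE engine.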

So, here is my sample code:

df <- data.frame(
  "stat.mineBlock.minecraft.123456stone" = 1,
  "stat.mineBlock.minecraft.DFHFFBSBwater2" = 2,
  "stat.mineBlock.minecraft.DFHFFBSBwater3" = 3,
  "stat.mineBlock.minecraft.DFHFFBSBstone4" = 4
)
# Literal prefix with escaped dots, then alphanumerics on either side of "stone"
grep("^stat\\.mineBlock\\.minecraft\\.[a-zA-Z0-9]*stone[a-zA-Z0-9]*$", colnames(df), value = TRUE, ignore.case = TRUE, perl = TRUE)

See IDEONE demo
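
To see the effect of the brackets, here is a small side-by-side sketch. The vector cols is made up for illustration, bracketed is the pattern from the question, and literal is a simplified version of the fix. Both are run with perl = TRUE, as in the answer above.

```r
# Hypothetical column names, for illustration only
cols <- c("stat.mineBlock.minecraft.123456stone",
          "stat.mineBlock.minecraft.DFHFFBSBwater2",
          "stat.useItem.minecraft.bow")

# Pattern from the question: each [...] matches just ONE character from its set
bracketed <- "^[stat.mineBlock.minecraft.][a-zA-Z0-9]*?[stone][a-zA-Z0-9]*?"

# Simplified fix: literal prefix with escaped dots, then a literal "stone"
literal <- "^stat\\.mineBlock\\.minecraft\\.[a-zA-Z0-9]*stone"

bracketed_hits <- grepl(bracketed, cols, ignore.case = TRUE, perl = TRUE)
literal_hits <- grepl(literal, cols, ignore.case = TRUE, perl = TRUE)

print(bracketed_hits)  # every name matches: "s" fits the first class, "t" fits [stone]
print(literal_hits)    # only the name with the prefix and "stone" matches
```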
